Project UCTF: An Open Research Program on Machine-Native AI Training Representations

This post is a follow-up to my earlier concept proposal on the Universal Compressed Training Format (UCTF).

The discussion that followed—especially the detailed technical feedback from @John6666—made me realize that the original concept combined several different research questions into a single proposal. Rather than trying to solve everything at once, I’ve decided to restructure the project into an open, hypothesis-driven research program.

This project is open to feedback, criticism, collaboration, and independent replication. Every stage will be documented publicly, regardless of whether the results support or challenge the original idea.

---

Core Research Question

Can a machine-native intermediate representation serve as an alternative training representation for multilingual AI models, reducing cross-lingual semantic redundancy while preserving the information required for downstream generation?

This is a hypothesis—not a conclusion.

The purpose of Project UCTF is to investigate this question through a sequence of independent, measurable experiments.

---

Research Philosophy

Instead of attempting to prove UCTF from the beginning, each stage will investigate one specific research question.

The goal is to let evidence shape the direction of the project.

Positive results, negative results, and unexpected findings will all be published.

If later experiments show that the hypothesis is incorrect or only partially correct, those results will still be considered valuable contributions.

---

Why a Multi-Paper Research Program?

The original concept covered multiple independent problems:

- multilingual semantic representations
- redundancy in multilingual corpora
- machine-native representations
- language-independent training
- efficient AI training

These deserve to be investigated separately.

By splitting the work into independent papers:

- Each paper has standalone value.
- Earlier work remains useful even if later stages fail.
- Community feedback can improve each stage before moving to the next.
- The project evolves based on evidence rather than assumptions.

---

Planned Research Roadmap

Paper 1

Measuring Semantic Redundancy in Multilingual Training Corpora

Research Question

How much semantic redundancy actually exists across multilingual training datasets?

Objective

Measure the problem rather than assume it.

Initial evaluation resources may include multilingual benchmark datasets such as FLORES+, together with multilingual embedding baselines. The exact datasets and models will be selected after further literature review and community feedback.

Deliverables

- Redundancy measurements
- Dataset statistics
- Visualizations
- Manual inspection of representative examples
- Open notebook and reproducible analysis
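
As one possible way to make the Paper 1 measurement concrete, the sketch below compares the cosine similarity of embeddings for aligned translations against a mismatched-sentence control. Everything in it is an assumption for illustration: the function names `cosine` and `redundancy_report`, the 0.9 threshold, and the toy two-dimensional vectors, which stand in for the output of whichever multilingual encoder is eventually selected. It is not the project's chosen metric.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def redundancy_report(embeddings, threshold=0.9):
    """Summarize cross-lingual similarity for sentence-aligned embeddings.

    embeddings: dict mapping a language code to a list of vectors, where
    index i in every language is a translation of the same sentence.
    threshold: hypothetical cut-off above which a pair counts as redundant.
    """
    langs = sorted(embeddings)
    n = len(embeddings[langs[0]])
    if n < 2 or any(len(embeddings[lang]) != n for lang in langs):
        raise ValueError("need at least 2 sentences, aligned across all languages")

    aligned, mismatched = [], []
    for a, b in combinations(langs, 2):
        for i in range(n):
            # Same sentence, different language.
            aligned.append(cosine(embeddings[a][i], embeddings[b][i]))
            # Control: different sentence, different language.
            j = (i + 1) % n
            mismatched.append(cosine(embeddings[a][i], embeddings[b][j]))

    return {
        "mean_aligned": sum(aligned) / len(aligned),
        "mean_mismatched": sum(mismatched) / len(mismatched),
        "fraction_above_threshold": sum(s >= threshold for s in aligned) / len(aligned),
    }


# Toy vectors only; real inputs would come from a multilingual encoder.
toy = {
    "en": [[1.0, 0.0], [0.0, 1.0]],
    "fr": [[1.0, 0.0], [0.0, 1.0]],
    "de": [[1.0, 0.0], [1.0, 0.0]],
}
print(redundancy_report(toy))
```

The mismatched control matters because a high aligned similarity means little if unrelated sentences score almost as high; the gap between the two means is the more informative number.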

---

Paper 2

Characterizing Universal and Language-Specific Knowledge

Research Question

Which information is universal across languages, and which information is inherently language-dependent?

Objective

Develop an empirical taxonomy separating:

- universal factual knowledge
- mathematical knowledge
- scientific knowledge
- language-dependent structure
- culture-dependent expressions
- idioms
- conversational nuances

This paper defines what any machine-native representation must preserve.

---

Paper 3

Design Requirements for a Machine-Native Intermediate Representation

Research Question

Given the findings from Papers 1 and 2, what should a UCTF-like representation contain?

Objective

Produce a formal design specification.

This stage focuses on defining requirements—not implementing them.

---

Paper 4

Prototype Machine-Native Representation

Research Question

Can a prototype intermediate representation satisfy the requirements identified in earlier stages?

Objective

Develop an experimental prototype and evaluate:

- semantic preservation
- compression behavior
- reconstruction quality
- degradation across multiple settings

The implementation approach will be chosen based on the evidence gathered during Papers 1–3 rather than fixed in advance.
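
To illustrate how two of the Paper 4 criteria might be quantified, here is a minimal sketch, assuming the prototype emits a byte string per group of parallel sentences and can decode it back to text. The helper names `compression_ratio` and `token_f1` are hypothetical, and token-overlap F1 is swapped in as a crude surface proxy for reconstruction quality; it does not measure semantic preservation, which would need a separate metric.

```python
from collections import Counter


def compression_ratio(original_texts, encoded):
    """UTF-8 bytes of all original sentences divided by bytes of the encoding.

    encoded: bytes produced by a hypothetical intermediate representation.
    """
    if not encoded:
        raise ValueError("encoded representation is empty")
    original_bytes = sum(len(text.encode("utf-8")) for text in original_texts)
    return original_bytes / len(encoded)


def token_f1(reference, reconstruction):
    """Whitespace-token F1 between a reference sentence and its reconstruction."""
    ref = Counter(reference.split())
    rec = Counter(reconstruction.split())
    overlap = sum((ref & rec).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(rec.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Placeholder inputs: the 7-byte payload is made up, not real prototype output.
parallel = ["The cat sleeps.", "Le chat dort."]
payload = b"\x01\x02\x03\x04\x05\x06\x07"
print(compression_ratio(parallel, payload))
print(token_f1("the cat sleeps", "the cat sits"))
```

Reporting both numbers together guards against a representation that compresses well only because it discards information needed for reconstruction.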

---

Paper 5

Initial Training Validation

Research Question

Can a model trained on a machine-native representation achieve downstream performance comparable to that of conventional multilingual training?

Objective

Perform the first controlled comparison between:

- conventional multilingual training
- machine-native representation training

Measurements may include:

- dataset size
- representation efficiency
- training time
- computational cost
- downstream performance
- observed limitations

Regardless of the outcome, all results will be published.
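
A controlled comparison of this kind could be recorded as one metrics object per training run, with the machine-native run reported relative to the conventional baseline. The sketch below is only an assumption about how that bookkeeping might look: the `RunMetrics` fields, their units, and the `relative_change` helper are hypothetical, and the numbers are made-up placeholders, not measurements.

```python
from dataclasses import asdict, dataclass


@dataclass
class RunMetrics:
    """Quantitative measurements for one training run (hypothetical fields)."""
    dataset_size_bytes: int
    training_hours: float
    compute_gpu_hours: float
    downstream_score: float


def relative_change(baseline, candidate):
    """Per-metric change of candidate relative to baseline (-0.5 means 50% lower).

    Assumes every baseline value is non-zero.
    """
    base, cand = asdict(baseline), asdict(candidate)
    return {name: (cand[name] - base[name]) / base[name] for name in base}


# Placeholder values for illustration only.
conventional = RunMetrics(1000, 10.0, 80.0, 0.50)
machine_native = RunMetrics(500, 8.0, 60.0, 0.45)
for name, change in relative_change(conventional, machine_native).items():
    print(f"{name}: {change:+.1%}")
```

Keeping cost and quality metrics in the same record makes it harder to report an efficiency gain without the accompanying change in downstream performance.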

---

Open Research Principles

This project follows several principles:

- Every experiment should answer one question.
- Conclusions should follow evidence.
- Negative results are valuable.
- All code, notebooks, and intermediate findings will be released publicly whenever practical.
- Community feedback will be incorporated before major milestones.
- Credit will always be given to contributors whose feedback shapes the project.

---

How You Can Contribute

Contributions are welcome at every stage.

Examples include:

- multilingual NLP expertise
- literature recommendations
- dataset suggestions
- methodology review
- experiment replication
- constructive criticism
- identifying prior work
- identifying flaws before implementation

Finding weaknesses early is just as valuable as proposing improvements.

---

Current Status

The project is currently entering the Paper 1 stage.

The immediate objective is not to build UCTF, but to investigate whether multilingual semantic redundancy exists at a scale that justifies further research.

Community suggestions regarding evaluation datasets, multilingual benchmarks, encoder baselines, and experimental methodology are highly appreciated.

---

Acknowledgment

Special thanks to @John6666 for providing detailed technical feedback on the original UCTF proposal.

Their comments significantly influenced the restructuring of this work into a staged, hypothesis-driven research program.

---

This is an open research effort.

The goal is not to prove UCTF right.

The goal is to investigate whether the underlying hypothesis is worth pursuing.

— K7007

#UCTF #MachineLearning #NLP #MultilingualAI #OpenResearch #RepresentationLearning

Source: discuss.huggingface.co (original article)
