Project UCTF: An Open Research Program on Machine-Native AI Training Representations
This post is a follow-up to my earlier concept proposal on the Universal Compressed Training Format (UCTF).
The discussion that followed—especially the detailed technical feedback from @John6666—made me realize that the original concept combined several different research questions into a single proposal. Rather than trying to solve everything at once, I’ve decided to restructure the project into an open, hypothesis-driven research program.
This project is open to feedback, criticism, collaboration, and independent replication. Every stage will be documented publicly, regardless of whether the results support or challenge the original idea.
---
Core Research Question
Can a machine-native intermediate representation serve as an alternative training representation for multilingual AI models, reducing cross-lingual semantic redundancy while preserving the information required for downstream generation?
This is a hypothesis—not a conclusion.
The purpose of Project UCTF is to investigate this question through a sequence of independent, measurable experiments.
---
Research Philosophy
Instead of attempting to prove UCTF from the beginning, each stage will investigate one specific research question.
The goal is to let evidence shape the direction of the project.
Positive results, negative results, and unexpected findings will all be published.
If later experiments show that the hypothesis is incorrect or only partially correct, those results will still be considered valuable contributions.
---
Why a Multi-Paper Research Program?
The original concept covered multiple independent problems:
- multilingual semantic representations
- redundancy in multilingual corpora
- machine-native representations
- language-independent training
- efficient AI training
These deserve to be investigated separately.
By splitting the work into independent papers:
- Each paper has standalone value.
- Earlier work remains useful even if later stages fail.
- Community feedback can improve each stage before moving to the next.
- The project evolves based on evidence rather than assumptions.
---
Planned Research Roadmap
Paper 1
Measuring Semantic Redundancy in Multilingual Training Corpora
Research Question
How much semantic redundancy actually exists across multilingual training datasets?
Objective
Measure the problem rather than assume it.
Initial evaluation resources may include multilingual benchmark datasets such as FLORES+, together with multilingual embedding baselines. The exact datasets and models will be selected after further literature review and community feedback.
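To make the measurement concrete, here is a minimal sketch of one possible redundancy score. It assumes each sentence has already been mapped to a vector by some multilingual encoder (none has been chosen yet), and it scores one sentence by the mean pairwise cosine similarity between its translations. The function names, the input format, and the toy vectors are all hypothetical; the vectors are not real encoder output.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cross_lingual_redundancy(embeddings_by_lang):
    """Mean pairwise cosine similarity across languages for one sentence.

    embeddings_by_lang: dict mapping a language code to the embedding of
    the same sentence in that language (hypothetical input format).
    """
    pairs = list(combinations(sorted(embeddings_by_lang), 2))
    if not pairs:
        raise ValueError("need at least two languages")
    total = sum(
        cosine(embeddings_by_lang[a], embeddings_by_lang[b]) for a, b in pairs
    )
    return total / len(pairs)


# Toy vectors standing in for encoder output; not real measurements.
toy = {
    "en": [1.0, 0.0, 0.0],
    "fr": [1.0, 0.0, 0.0],
    "de": [0.0, 1.0, 0.0],
}
score = cross_lingual_redundancy(toy)
```

A high score under a measure like this would only show that the chosen encoder places translations close together, which is one reason manual inspection of examples is listed among the deliverables.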
Deliverables
- Redundancy measurements
- Dataset statistics
- Visualizations
- Manual inspection of representative examples
- Open notebook and reproducible analysis
---
Paper 2
Characterizing Universal and Language-Specific Knowledge
Research Question
Which information is universal across languages, and which information is inherently language-dependent?
Objective
Develop an empirical taxonomy separating:
- universal factual knowledge
- mathematical knowledge
- scientific knowledge
- language-dependent structure
- culture-dependent expressions
- idioms
- conversational nuances
This paper defines what any machine-native representation must preserve.
---
Paper 3
Design Requirements for a Machine-Native Intermediate Representation
Research Question
Given the findings from Papers 1 and 2, what should a UCTF-like representation contain?
Objective
Produce a formal design specification.
This stage focuses on defining requirements—not implementing them.
---
Paper 4
Prototype Machine-Native Representation
Research Question
Can a prototype intermediate representation satisfy the requirements identified in earlier stages?
Objective
Develop an experimental prototype and evaluate:
- semantic preservation
- compression behavior
- reconstruction quality
- degradation across multiple settings
The implementation approach will be chosen based on the evidence gathered during Papers 1–3 rather than fixed in advance.
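As a rough illustration of how two of these criteria might be scored, the sketch below computes a size ratio and a round-trip similarity for a single text. It is only a harness under stated assumptions: `zlib` stands in for the not-yet-designed representation so that the code runs, the function names are hypothetical, and character-level similarity is a surface proxy that says nothing about semantic preservation.

```python
import zlib
from difflib import SequenceMatcher


def compression_ratio(original: str, encoded: bytes) -> float:
    """Encoded size divided by UTF-8 size of the original (lower is smaller)."""
    raw = original.encode("utf-8")
    if not raw:
        raise ValueError("original text is empty")
    return len(encoded) / len(raw)


def reconstruction_similarity(original: str, reconstructed: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means an exact round trip."""
    # autojunk=False avoids difflib's heuristic that ignores frequent
    # characters in sequences of 200 or more elements.
    matcher = SequenceMatcher(None, original, reconstructed, autojunk=False)
    return matcher.ratio()


# zlib is only a stand-in codec so the sketch runs; it is not UCTF.
text = "The same fact stated again. " * 20
encoded = zlib.compress(text.encode("utf-8"))
decoded = zlib.decompress(encoded).decode("utf-8")

ratio = compression_ratio(text, encoded)
similarity = reconstruction_similarity(text, decoded)
```

A lossless codec like the stand-in always reconstructs exactly, so the similarity score becomes informative only once a lossy or meaning-level representation is plugged in.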
---
Paper 5
Initial Training Validation
Research Question
Can a model trained on a machine-native representation achieve downstream performance comparable to that of a model trained with conventional multilingual training?
Objective
Perform the first controlled comparison between:
- conventional multilingual training
- machine-native representation training
Measurements may include:
- dataset size
- representation efficiency
- training time
- computational cost
- downstream performance
- observed limitations
Regardless of the outcome, all results will be published.
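One way to keep such a comparison honest is to record both runs in the same structure and report every metric, not only the favorable ones. The sketch below is an assumption about how that could look: the field names and units are hypothetical, and the numbers are placeholders, since no experiment has been run yet.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunMetrics:
    """Measurements for one training run (field names are hypothetical)."""
    dataset_size_gb: float
    training_hours: float
    compute_cost_gpu_hours: float
    downstream_score: float


def relative_change(baseline: RunMetrics, candidate: RunMetrics) -> dict:
    """Per-metric change of candidate relative to baseline, as a fraction."""
    changes = {}
    for f in fields(RunMetrics):
        base = getattr(baseline, f.name)
        cand = getattr(candidate, f.name)
        if base == 0:
            raise ValueError(f"baseline {f.name} is zero")
        changes[f.name] = (cand - base) / base
    return changes


# Placeholder numbers only; these are not experimental results.
baseline = RunMetrics(100.0, 10.0, 80.0, 0.50)
candidate = RunMetrics(50.0, 8.0, 60.0, 0.45)
delta = relative_change(baseline, candidate)
```

Reporting the full `delta` dictionary makes trade-offs visible, for example a smaller dataset paired with a lower downstream score.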
---
Open Research Principles
This project follows several principles:
- Every experiment should answer one question.
- Conclusions should follow evidence.
- Negative results are valuable.
- All code, notebooks, and intermediate findings will be released publicly whenever practical.
- Community feedback will be incorporated before major milestones.
- Credit will always be given to contributors whose feedback shapes the project.
---
How You Can Contribute
Contributions are welcome at every stage.
Examples include:
- multilingual NLP expertise
- literature recommendations
- dataset suggestions
- methodology review
- experiment replication
- constructive criticism
- identifying prior work
- identifying flaws before implementation
Finding weaknesses early is just as valuable as proposing improvements.
---
Current Status
The project is currently entering Paper 1.
The immediate objective is not to build UCTF, but to investigate whether multilingual semantic redundancy exists at a scale that justifies further research.
Community suggestions regarding evaluation datasets, multilingual benchmarks, encoder baselines, and experimental methodology are highly appreciated.
---
Acknowledgment
Special thanks to @John6666 for providing detailed technical feedback on the original UCTF proposal.
Their comments significantly influenced the restructuring of this work into a staged, hypothesis-driven research program.
---
This is an open research effort.
The goal is not to prove UCTF right.
The goal is to investigate whether the underlying hypothesis is worth pursuing.
— K7007
#UCTF #MachineLearning #NLP #MultilingualAI #OpenResearch #RepresentationLearning