Project UCTF: An Open Research Program on Machine-Native AI Training Representations

An open research program called Project UCTF has been launched to investigate whether a machine-native intermediate representation can serve as an alternative training representation for multilingual AI models, reducing cross-lingual semantic redundancy. The project will be structured as a multi-paper research program, with each stage investigating a specific research question and publishing results publicly regardless of outcome.

This post is a follow-up to my earlier concept proposal on the Universal Compressed Training Format (UCTF). The discussion that followed, especially the detailed technical feedback from @John6666 (https://discuss.huggingface.co/u/john6666), made me realize that the original concept combined several different research questions into a single proposal.

Rather than trying to solve everything at once, I’ve decided to restructure the project into an open, hypothesis-driven research program.

This project is open to feedback, criticism, collaboration, and independent replication. Every stage will be documented publicly, regardless of whether the results support or challenge the original idea.

---

Core Research Question

Can a machine-native intermediate representation serve as an alternative training representation for multilingual AI models, reducing cross-lingual semantic redundancy while preserving the information required for downstream generation?

This is a hypothesis, not a conclusion. The purpose of Project UCTF is to investigate this question through a sequence of independent, measurable experiments.

---

Research Philosophy

Instead of attempting to prove UCTF from the beginning, each stage will investigate one specific research question. The goal is to let evidence shape the direction of the project.

Positive results, negative results, and unexpected findings will all be published. If later experiments show that the hypothesis is incorrect or only partially correct, those results will still be considered valuable contributions.

---

Why a Multi-Paper Research Program?

The original concept covered multiple independent problems:

- multilingual semantic representations
- redundancy in multilingual corpora
- machine-native representations
- language-independent training
- efficient AI training

These deserve to be investigated separately.
By splitting the work into independent papers:

- Each paper has standalone value.
- Earlier work remains useful even if later stages fail.
- Community feedback can improve each stage before moving to the next.
- The project evolves based on evidence rather than assumptions.

---

Planned Research Roadmap

Paper 1: Measuring Semantic Redundancy in Multilingual Training Corpora

Research Question

How much semantic redundancy actually exists across multilingual training datasets?

Objective

Measure the problem rather than assume it. Initial evaluation resources may include multilingual benchmark datasets such as FLORES+, together with multilingual embedding baselines. The exact datasets and models will be selected after further literature review and community feedback.

Deliverables

- Redundancy measurements
- Dataset statistics
- Visualizations
- Manual inspection of representative examples
- Open notebook and reproducible analysis

---

Paper 2: Characterizing Universal and Language-Specific Knowledge

Research Question

Which information is universal across languages, and which information is inherently language-dependent?

Objective

Develop an empirical taxonomy separating:

- universal factual knowledge
- mathematical knowledge
- scientific knowledge
- language-dependent structure
- culture-dependent expressions
- idioms
- conversational nuances

This paper defines what any machine-native representation must preserve.

---

Paper 3: Design Requirements for a Machine-Native Intermediate Representation

Research Question

Given the findings from Papers 1 and 2, what should a UCTF-like representation contain?

Objective

Produce a formal design specification. This stage focuses on defining requirements, not implementing them.

---

Paper 4: Prototype Machine-Native Representation

Research Question

Can a prototype intermediate representation satisfy the requirements identified in earlier stages?
Objective

Develop an experimental prototype and evaluate:

- semantic preservation
- compression behavior
- reconstruction quality
- degradation across multiple settings

The implementation approach will be chosen based on the evidence gathered during Papers 1–3 rather than fixed in advance.

---

Paper 5: Initial Training Validation

Research Question

Can a model trained on a machine-native representation achieve downstream performance comparable to conventional multilingual training?

Objective

Perform the first controlled comparison between:

- conventional multilingual training
- machine-native representation training

Measurements may include:

- dataset size
- representation efficiency
- training time
- computational cost
- downstream performance
- observed limitations

Regardless of the outcome, all results will be published.

---

Open Research Principles

This project follows several principles:

- Every experiment should answer one question.
- Conclusions should follow evidence.
- Negative results are valuable.
- All code, notebooks, and intermediate findings will be released publicly whenever practical.
- Community feedback will be incorporated before major milestones.
- Credit will always be given to contributors whose feedback shapes the project.

---

How You Can Contribute

Contributions are welcome at every stage. Examples include:

- multilingual NLP expertise
- literature recommendations
- dataset suggestions
- methodology review
- experiment replication
- constructive criticism
- identifying prior work
- identifying flaws before implementation

Finding weaknesses early is just as valuable as proposing improvements.

---

Current Status

The project is currently entering Paper 1. The immediate objective is not to build UCTF, but to investigate whether multilingual semantic redundancy exists at a scale that justifies further research.
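To make the Paper 1 question more concrete, here is a minimal sketch of one possible way to put a number on cross-lingual redundancy. It is an illustration only, not the project's chosen method: the metric (mean cosine similarity of aligned sentence pairs minus that of misaligned pairs), the function names `cosine` and `redundancy_gap`, and the toy vectors are all assumptions. In a real run the vectors would come from a multilingual sentence encoder applied to parallel text, and both the encoder and the dataset are still undecided.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def redundancy_gap(src_vecs, tgt_vecs):
    """Compare aligned sentence pairs against misaligned ones.

    src_vecs[i] and tgt_vecs[i] are assumed to embed the same sentence
    in two different languages. Returns (mean aligned similarity,
    mean misaligned similarity, their difference).
    """
    n = len(src_vecs)
    if n != len(tgt_vecs) or n < 2:
        raise ValueError("need at least two aligned sentence pairs")
    aligned = [cosine(src_vecs[i], tgt_vecs[i]) for i in range(n)]
    misaligned = [
        cosine(src_vecs[i], tgt_vecs[j])
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    mean_aligned = sum(aligned) / len(aligned)
    mean_misaligned = sum(misaligned) / len(misaligned)
    return mean_aligned, mean_misaligned, mean_aligned - mean_misaligned


# Hypothetical toy embeddings standing in for encoder output
# (three "sentences" in two languages); not real data.
lang_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lang_b = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

mean_aligned, mean_misaligned, gap = redundancy_gap(lang_a, lang_b)
print(f"aligned={mean_aligned:.3f} misaligned={mean_misaligned:.3f} gap={gap:.3f}")
```

A large gap would only indicate that the chosen encoder places translations close together. Whether that counts as "semantic redundancy" in the sense the project needs is exactly what manual inspection and methodology review would have to settle.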
Community suggestions regarding evaluation datasets, multilingual benchmarks, encoder baselines, and experimental methodology are highly appreciated.

---

Acknowledgment

Special thanks to @John6666 (https://discuss.huggingface.co/u/john6666) for providing detailed technical feedback on the original UCTF proposal. Their comments significantly influenced the restructuring of this work into a staged, hypothesis-driven research program.

---

This is an open research effort. The goal is not to prove UCTF right. The goal is to investigate whether the underlying hypothesis is worth pursuing.

K7007

UCTF MachineLearning NLP MultilingualAI OpenResearch RepresentationLearning Research https://discuss.huggingface.co/c/research/7