Text-Conditional JEPA for Learning Semantically Rich Visual Representations

Source: machinelearning.apple.com · Published: May 7, 2026 · Topic: computer vision

Researchers Chen Huang, Xianhang Li, Vimal Thilak, Etai Littwin, and Josh Susskind have developed Text-Conditional JEPA (TC-JEPA), a visual self-supervised learning model that uses image captions to reduce prediction uncertainty in masked feature learning. By modulating predicted patch features through sparse cross-attention over text tokens, the approach produces more semantically meaningful representations and, the authors report, improves downstream performance and training stability. The authors also present TC-JEPA as a vision-language pretraining paradigm based solely on feature prediction, one that outperforms contrastive methods on diverse tasks, especially those requiring fine-grained visual understanding and reasoning.

Content type: paper · Published May 2026

Authors: Chen Huang, Xianhang Li, Vimal Thilak, Etai Littwin, Josh Susskind

Image-based Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However, because of the inherent visual uncertainty at masked positions, feature prediction remains challenging and may fail to learn semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA), which uses image captions to reduce the prediction uncertainty. Specifically, we modulate the predicted patch features using a fine-grained text conditioner that computes sparse cross-attention over input text tokens. With such conditioning, patch features become predictable as a function of text and are thus more semantically meaningful. We show that TC-JEPA improves downstream performance and training stability, with promising scaling properties. TC-JEPA also offers a new vision-language pretraining paradigm based on feature prediction only, outperforming contrastive methods on diverse tasks, especially those requiring fine-grained visual understanding and reasoning.
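
To make the conditioning step concrete, here is a minimal sketch of how predicted patch features might be modulated by sparse cross-attention over caption tokens. It is an illustration only, not the authors' implementation: the abstract does not say how sparsity is obtained or how the attended text is merged into the patch features, so the top-k token selection, the additive `gate` parameter, the absence of learned projections, and the function names `sparse_cross_attention` and `condition_patches` are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_attention(patch, text_tokens, top_k):
    """Attend from one patch feature to its top_k best-matching text tokens.

    Assumption: sparsity is modeled as top-k selection on scaled
    dot-product scores; the paper may use a different mechanism.
    Returns the attended text context and the kept attention weights.
    """
    dim = len(patch)
    scores = [
        sum(p * t for p, t in zip(patch, token)) / math.sqrt(dim)
        for token in text_tokens
    ]
    # Keep only the indices of the top_k highest-scoring text tokens.
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in keep])
    context = [
        sum(w * text_tokens[i][j] for w, i in zip(weights, keep))
        for j in range(dim)
    ]
    return context, dict(zip(keep, weights))


def condition_patches(predicted, text_tokens, top_k=2, gate=0.5):
    """Modulate each predicted patch feature with its text context.

    Assumption: modulation is a gated residual addition (hypothetical).
    """
    conditioned = []
    for patch in predicted:
        context, _ = sparse_cross_attention(patch, text_tokens, top_k)
        conditioned.append([p + gate * c for p, c in zip(patch, context)])
    return conditioned


# Toy inputs: two predicted patch features and three caption-token embeddings.
predicted_patches = [[1.0, 0.0], [0.0, 1.0]]
caption_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
conditioned_patches = condition_patches(predicted_patches, caption_tokens, top_k=1)
```

In this toy setup each patch attends only to the single caption token it matches best, so its feature is pushed toward that token. In a JEPA-style training loop, the conditioned features would then be compared against target-encoder features at the masked positions; that loss is not shown here.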
