
MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

Researchers introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses one unified encoder for both modalities and a single predictive objective applied within and across them. Their frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K and surpasses fully finetuned models on ESC-50 and FSD50K, results the authors attribute to cross-modal prediction.

Published Jun 25, 2026 · 1 min read

arXiv:2606.25225v1

Abstract: Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a natural next step, yet it remains challenging. Existing audio-visual self-supervised methods rely on modality-specific encoders and complex combinations of contrastive or reconstruction objectives, limiting cross-modal synergy and scalability. Joint Embedding Predictive Architectures (JEPAs) offer a simple, modality-agnostic alternative, but have to date been applied primarily to individual modalities. We introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities. Our approach uses only a single predictive objective, applied both within and across modalities. We show that cross-modal prediction is critical: without it, a shared encoder degrades below unimodal baselines; with it, each modality's representation benefits from the other. Our frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.
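The abstract describes the objective only at a high level, so the following is a rough illustrative sketch rather than the authors' method. It shows how one shared encoder and one embedding-space prediction loss could be applied within each modality and across the two. Everything concrete here is an assumption made for illustration: the function names (`joint_predictive_loss`, `prediction_loss`, `split`), the toy linear encoder and predictor, the separate target-branch weights, the mean-pooled context, the L1 distance, and the 50% mask ratio.

```python
import random

DIM = 4  # toy embedding width (hypothetical)


def linear(matrix, vec):
    """Apply a bias-free dense layer to one vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def encode(tokens, weights):
    """Unified encoder: the same weights embed audio and video tokens."""
    return [linear(weights, tok) for tok in tokens]


def mean(vectors):
    """Average a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def split(tokens, mask_ratio, rng):
    """Randomly split tokens (at least two) into visible context and masked targets."""
    idx = list(range(len(tokens)))
    rng.shuffle(idx)
    n_masked = min(len(tokens) - 1, max(1, int(len(tokens) * mask_ratio)))
    masked = [tokens[i] for i in idx[:n_masked]]
    visible = [tokens[i] for i in idx[n_masked:]]
    return visible, masked


def prediction_loss(context, targets, enc, target_enc, predictor):
    """L1 distance between a prediction made from context and the target embeddings."""
    # Simplification: one pooled prediction is compared against every target token.
    predicted = linear(predictor, mean(encode(context, enc)))
    target_emb = encode(targets, target_enc)
    errors = [abs(p - t) for emb in target_emb for p, t in zip(predicted, emb)]
    return sum(errors) / len(errors)


def joint_predictive_loss(audio, video, enc, target_enc, predictor, rng, mask_ratio=0.5):
    """Apply the same predictive loss within each modality and across modalities."""
    a_ctx, a_tgt = split(audio, mask_ratio, rng)
    v_ctx, v_tgt = split(video, mask_ratio, rng)
    args = (enc, target_enc, predictor)
    within = prediction_loss(a_ctx, a_tgt, *args) + prediction_loss(v_ctx, v_tgt, *args)
    cross = prediction_loss(a_ctx, v_tgt, *args) + prediction_loss(v_ctx, a_tgt, *args)
    return {"within": within, "cross": cross, "total": within + cross}


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Build a random matrix as nested lists."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


enc = rand_matrix(DIM, DIM)
target_enc = [row[:] for row in enc]  # assumed target branch, here a plain copy
predictor = rand_matrix(DIM, DIM)
audio = rand_matrix(6, DIM)  # 6 toy audio tokens
video = rand_matrix(8, DIM)  # 8 toy video tokens

losses = joint_predictive_loss(audio, video, enc, target_enc, predictor, rng)
print(losses)
```

Dropping the `cross` term from this sketch leaves two unimodal losses that merely share weights, which is the configuration the abstract reports as degrading below unimodal baselines.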
