MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning



Source: arxiv.org · Published: 2026-06-25T04:00Z · Topic: machine-learning

Researchers introduced MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single unified encoder and a single predictive objective. With a frozen ViT-g encoder, the model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K and surpasses fully finetuned models on ESC-50 and FSD50K. The authors attribute these gains to cross-modal prediction, without which the shared encoder falls below unimodal baselines.


arXiv:2606.25225v1 (announce type: new)

Abstract: Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a natural next step, yet it remains challenging. Existing audio-visual self-supervised methods rely on modality-specific encoders and complex combinations of contrastive or reconstruction objectives, limiting cross-modal synergy and scalability. Joint Embedding Predictive Architectures (JEPAs) offer a simple, modality-agnostic alternative, but have to date been applied primarily to individual modalities. We introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities. Our approach uses only a single predictive objective, applied both within and across modalities. We show that cross-modal prediction is critical: without it, a shared encoder degrades below unimodal baselines; with it, each modality's representation benefits from the other. Our frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.
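The abstract describes one shared encoder and one predictive objective applied within and across modalities. The toy sketch below illustrates that general idea only; it is not the authors' implementation. Everything concrete in it is an assumption made for illustration: the linear "encoder", the separate target encoder (a common choice in JEPA-style methods), mean pooling, the squared-error loss, and all function names (`predictive_term`, `joint_loss`) are hypothetical, and audio and video tokens are assumed to be already projected to the same dimension.

```python
import random

DIM = 4  # assumed shared token dimension for both modalities

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mean_pool(vectors):
    """Average a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def mse(a, b):
    """Mean squared error between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def predictive_term(context, targets, encoder, target_encoder, predictor):
    """One predictive term: predict target embeddings from context tokens."""
    # The same encoder is used regardless of which modality the tokens come from.
    ctx = mean_pool([matvec(encoder, t) for t in context])
    pred = matvec(predictor, ctx)
    # Assumption: targets are embedded by a separate target encoder.
    tgt = mean_pool([matvec(target_encoder, t) for t in targets])
    return mse(pred, tgt)

def joint_loss(audio_ctx, audio_tgt, video_ctx, video_tgt,
               encoder, target_encoder, predictor, use_cross=True):
    """Sum the same predictive term within and (optionally) across modalities."""
    args = (encoder, target_encoder, predictor)
    terms = {
        "audio->audio": predictive_term(audio_ctx, audio_tgt, *args),
        "video->video": predictive_term(video_ctx, video_tgt, *args),
    }
    if use_cross:
        terms["audio->video"] = predictive_term(audio_ctx, video_tgt, *args)
        terms["video->audio"] = predictive_term(video_ctx, audio_tgt, *args)
    return sum(terms.values()), terms

def rand_matrix(n, rng):
    return [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]

def rand_tokens(k, rng):
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(k)]

rng = random.Random(0)
encoder = rand_matrix(DIM, rng)
target_encoder = [row[:] for row in encoder]  # toy stand-in: a plain copy
predictor = rand_matrix(DIM, rng)

# Random stand-ins for visible (context) and masked (target) tokens.
audio_ctx, audio_tgt = rand_tokens(3, rng), rand_tokens(2, rng)
video_ctx, video_tgt = rand_tokens(3, rng), rand_tokens(2, rng)

full, full_terms = joint_loss(audio_ctx, audio_tgt, video_ctx, video_tgt,
                              encoder, target_encoder, predictor)
within_only, within_terms = joint_loss(audio_ctx, audio_tgt, video_ctx, video_tgt,
                                       encoder, target_encoder, predictor,
                                       use_cross=False)
print(sorted(full_terms))
print(sorted(within_terms))
```

The `use_cross` flag only mirrors the shape of the comparison the abstract mentions (with and without cross-modal prediction); this toy computes a loss value and performs no training, so it says nothing about the reported results.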

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.25225
