{"slug": "mjepa-a-simple-and-scalable-joint-embedding-predictive-architecture-for-audio", "title": "MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning", "summary": "Researchers introduced MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single unified encoder and a single predictive objective. The model outperforms prior frozen baselines by over 6.8 mAP on AudioSet-20K and surpasses fully finetuned models on ESC-50 and FSD50K, demonstrating the effectiveness of cross-modal prediction.", "body_md": "arXiv:2606.25225v1\nAbstract: Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a natural next step, yet it remains challenging. Existing audio-visual self-supervised methods rely on modality-specific encoders and complex combinations of contrastive or reconstruction objectives, limiting cross-modal synergy and scalability. Joint Embedding Predictive Architectures (JEPAs) offer a simple, modality-agnostic alternative, but have to date been applied primarily to individual modalities. We introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities. Our approach uses only a single predictive objective, applied both within and across modalities. We show that cross-modal prediction is critical: without it, a shared encoder degrades below unimodal baselines; with it, each modality's representation benefits from the other. 
Our frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.", "url": "https://wpnews.pro/news/mjepa-a-simple-and-scalable-joint-embedding-predictive-architecture-for-audio", "canonical_source": "https://arxiv.org/abs/2606.25225", "published_at": "2026-06-25 04:00:00+00:00", "updated_at": "2026-06-25 04:19:54.864294+00:00", "lang": "en", "topics": ["machine-learning", "computer-vision", "natural-language-processing"], "entities": ["MJEPA", "AudioSet-20K", "ESC-50", "FSD50K", "ViT-g"], "alternates": {"html": "https://wpnews.pro/news/mjepa-a-simple-and-scalable-joint-embedding-predictive-architecture-for-audio", "markdown": "https://wpnews.pro/news/mjepa-a-simple-and-scalable-joint-embedding-predictive-architecture-for-audio.md", "text": "https://wpnews.pro/news/mjepa-a-simple-and-scalable-joint-embedding-predictive-architecture-for-audio.txt", "jsonld": "https://wpnews.pro/news/mjepa-a-simple-and-scalable-joint-embedding-predictive-architecture-for-audio.jsonld"}}