The 90-year-old idea behind JEPA models: Canonical Correlation Analysis

Harold Hotelling's 1936 Canonical Correlation Analysis (CCA) is the conceptual foundation for modern Joint Embedding Predictive Architecture (JEPA) models, as JEPA implicitly performs a nonlinear generalization of CCA. The connection highlights that maximizing correlation in embedding space, a core idea in JEPA, originated with Hotelling, not recent AI researchers.

Introduction

“Concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions.”

This is the first sentence of the paper “Relations Between Two Sets of Variates” (Hotelling 1936) by statistician and economist Harold Hotelling. This paper introduced Canonical Correlation Analysis (CCA). In modern terminology, “CCA is used to find a common signal among two large matrices” (Bykhovskaya and Gorin 2025). In JEPA, the objective is the same, except that the second data matrix is simply a different view of the same data in the first dataset (e.g. via data augmentation, or spatial or temporal proximity). One of the recent papers to acknowledge a connection states, “JEPA-based models implicitly perform a non-linear generalization of Canonical Correlation Analysis” (Huang 2026).

CCA’s connection to JEPA is relevant to Schmidhuber’s debate on who invented JEPA (https://people.idsia.ch/~juergen/who-invented-jepa.html), which is directed at Yann LeCun. Personally, I think Hotelling deserves the credit for the idea of maximizing correlation in embedding space.

Of course, the CCA model has many differences from JEPA. For one, CCA does not enforce a shared encoder. But the biggest difference is that CCA is linear. Non-linear neural variants of CCA have been researched, with the earliest usage of the term “Deep CCA” being Andrew et al. (2013).

Connecting JEPA models back to their CCA roots is genuinely useful. For example, another Deep CCA paper (Benton et al. 2017) relaxed the assumption of two sets of variables to an arbitrary number, based on a generalization of CCA proposed in 1961 (Horst 1961). Conceivably, JEPAs could be expanded to handle more than two views as well.

CCA vs. JEPA Overview

CCA

Suppose we have zero-mean matrices \( X=(x_1,...,x_n)^T\in \mathbb R^{n\times d_x} \) and \( Y=(y_1,...,y_n)^T\in\mathbb R^{n\times d_y} \). Let \( k\leq \min(d_x,d_y,n) \), and let \( A\in \mathbb R^{d_x\times k} \) and \( B\in \mathbb R^{d_y\times k} \), so that \( XA=z_x\in\mathbb R^{n \times k} \) and \( YB=z_y\in\mathbb R^{n \times k} \). CCA solves the following maximization problem,

\[ \max_{A,B} \text{tr}\left(\frac{1}{n}z_x^Tz_y\right) \ \ \text{s.t.}\ \ \frac{1}{n}z_x^Tz_x=\frac{1}{n}z_y^Tz_y=I \]

This maximizes the trace of the cross-correlation matrix, while constraining the embedding dimensions to unit variance and zero covariance.

Similar to the equivalence between maximizing variance and minimizing prediction error in solving PCA, we have a relationship between the trace of the cross-correlation matrix and the embedding prediction error. Writing \( z_x^{(i)} \) and \( z_y^{(i)} \) for the \( i \)-th rows of \( z_x \) and \( z_y \),

\[ \frac{1}{n}\sum_{i=1}^n ||z_x^{(i)}-z_y^{(i)}||^2=\frac{1}{n}||z_x-z_y||_F^2= \frac{1}{n}\text{tr}(z_x^Tz_x) + \frac{1}{n}\text{tr}(z_y^Tz_y) - \frac{2}{n}\text{tr}(z_x^Tz_y) \]

And due to the whitening constraints, each of the first two terms equals \( \text{tr}(I)=k \), so

\[ \frac{1}{n}||z_x-z_y||_F^2=2k- \frac{2}{n}\text{tr}(z_x^Tz_y) \]

So maximizing the trace of the cross-correlation under the whitening constraints is equivalent to minimizing the mean squared error (MSE) between the two embeddings. Therefore we can write CCA as,

\[ \min_{A,B} \frac{1}{n}\sum_{i=1}^n ||z_x^{(i)}-z_y^{(i)}||^2\ \ \text{s.t.}\ \ \frac{1}{n}z_x^Tz_x=\frac{1}{n}z_y^Tz_y=I \]

JEPA

Adopting the previous notation, JEPA is constrained to \( d_x=d_y=d \) as a result of the joint embedding. In JEPA, we have the encoder \( f_\theta:\mathbb R^{d}\rightarrow \mathbb R^k \) and the predictor \( g_\varphi:\mathbb R^{k}\rightarrow \mathbb R^k \). Let \( z_x^{(i)}=g_\varphi(f_\theta(x_i)) \) and \( z_y^{(i)}=f_\theta(y_i) \). Then we solve,

\[ \min_{\theta,\varphi}\frac{1}{n} \sum_{i=1}^n ||z_x^{(i)}-z_y^{(i)}||^2 \]

Note the similarity in the objective function but the lack of whitening constraints. Without the whitening constraints, the objective is prone to representational and dimensional collapse.
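The relationship between the two objectives can be checked numerically. The sketch below is an illustration only, not code from any of the cited papers. It assumes NumPy, synthetic two-view data, and two hypothetical helpers, `inv_sqrt` and `linear_cca`. It solves linear CCA by whitening each view and taking an SVD of the whitened cross-covariance (one standard approach), then confirms the whitening constraints and the identity \( \text{MSE} = 2k - \frac{2}{n}\text{tr}(z_x^Tz_y) \). It also shows that a constant embedding drives the unconstrained prediction error to zero while its covariance is nowhere near the identity.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w ** -0.5) @ V.T

def linear_cca(X, Y, k):
    """Linear CCA for zero-mean X (n x d_x) and Y (n x d_y).

    Whitens each view, then takes the SVD of the whitened cross-covariance.
    Returns A (d_x x k), B (d_y x k) and the top-k canonical correlations.
    """
    n = X.shape[0]
    Wx = inv_sqrt(X.T @ X / n)
    Wy = inv_sqrt(Y.T @ Y / n)
    U, s, Vt = np.linalg.svd(Wx @ (X.T @ Y / n) @ Wy)
    return Wx @ U[:, :k], Wy @ Vt[:k].T, s[:k]

# Synthetic two-view data sharing a k-dimensional signal
# (an assumption made purely for illustration).
rng = np.random.default_rng(0)
n, d_x, d_y, k = 2000, 6, 5, 2
signal = rng.normal(size=(n, k))
X = signal @ rng.normal(size=(k, d_x)) + 0.5 * rng.normal(size=(n, d_x))
Y = signal @ rng.normal(size=(k, d_y)) + 0.5 * rng.normal(size=(n, d_y))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

A, B, rho = linear_cca(X, Y, k)
z_x, z_y = X @ A, Y @ B

# The whitening constraints hold by construction.
print("z_x whitened:", np.allclose(z_x.T @ z_x / n, np.eye(k)))
print("z_y whitened:", np.allclose(z_y.T @ z_y / n, np.eye(k)))

# Embedding prediction error versus the trace of the cross-correlation.
mse = np.mean(np.sum((z_x - z_y) ** 2, axis=1))
trace = np.trace(z_x.T @ z_y) / n
print("MSE:", mse, "| 2k - 2*trace:", 2 * k - 2 * trace)

# Collapse: with no constraints, a constant embedding gives zero error,
# but its covariance is the zero matrix rather than the identity.
z_const = np.tile(np.ones(k), (n, 1))
collapsed_loss = np.mean(np.sum((z_const - z_const) ** 2, axis=1))
collapsed_cov = np.cov(z_const, rowvar=False)
print("collapsed loss:", collapsed_loss)
print("collapsed covariance is zero:", np.allclose(collapsed_cov, 0.0))
```

The constant embedding attains a lower prediction error than the CCA solution, which is why the constraint (or a regularizer that plays the same role) is what makes the objective meaningful.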
For example, a trivial solution to the JEPA objective is \( z_x^{(i)}=z_y^{(i)}=c \) for every \( i \), where \( c \) is a constant vector: the prediction error is zero, yet the embeddings carry no information. As discussed in my previous blog post (../../posts/sigreg-sketched-isotropic-gaussian-regularization/), SIGReg (Balestriero and LeCun 2025) fixes this problem. What does it do? It encourages the embeddings \( z_x \) and \( z_y \) to have an isotropic (i.e. unit variance, uncorrelated) Gaussian distribution. As a result it encourages,

\[ \frac{1}{n}z_x^Tz_x=\frac{1}{n}z_y^Tz_y=I \]

which is the whitening constraint of CCA.

Conclusion

As I mentioned in the introduction, Schmidhuber has debated who invented JEPA (https://people.idsia.ch/~juergen/who-invented-jepa.html) and said this about LeCun,

“Dr. LeCun’s heavily promoted Joint Embedding Predictive Architecture (JEPA) is the heart of his new company. However, the core ideas are not original to LeCun. Instead, JEPA is essentially identical to our 1992 Predictability Maximization system.”

Schmidhuber references Yann LeCun’s response,

“JEPA is merely a name for a general concept. The question is, and has always been, how do you make it work (particularly how do you prevent it from collapsing), and how do you make it work at scale with SOTA results on non-toy problems. That’s the hard part. Ideas are a dime a dozen. Making them work is what the community will give you credit for.”

Do I agree with LeCun? Yes and no. Yes, because of course you will get credit for making things work, and ideas are indeed arguably “a dime a dozen”. No, because the thread of citations is important for progress. If important citations are missed, whether intentionally or not, the correct thing to do is just add them. We’re all only the better for doing so.

The connection that JEPA models have to CCA is informative. My opinion is that JEPA/Predictability Maximization models are architectural enhancements layered on top of CCA. Non-linearity is an enhancement.
Ultimately, these models all have the same objective function introduced by CCA: find the transformations that result in maximal correlation between sets of multidimensional data.

References

Andrew et al. 2013. Deep Canonical Correlation Analysis. International Conference on Machine Learning, 1247–55. https://proceedings.mlr.press/v28/andrew13.html.

Balestriero and LeCun. 2025. LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics. https://arxiv.org/abs/2511.08544.

Benton et al. 2017. Deep Generalized Canonical Correlation Analysis. https://arxiv.org/abs/1702.02519.

Bykhovskaya and Gorin. 2025. Canonical Correlation Analysis: Review. https://arxiv.org/abs/2411.15625.

Horst. 1961. Generalized Canonical Correlations and Their Application to Experimental Data. Journal of Clinical Psychology.

Hotelling. 1936. Relations Between Two Sets of Variates. Biometrika 28 (3/4): 321–77. http://www.jstor.org/stable/2333955.

Huang. 2026. VJEPA: Variational Joint Embedding Predictive Architectures as Probabilistic World Models. https://arxiv.org/abs/2601.14354.