Deriving SVD without even aiming at it

A mathematician explains how to derive the Singular Value Decomposition (SVD) from scratch by focusing on the motivation behind the concept, arguing that traditional math books often obscure intuitive understanding by presenting only the final formalized version.

Connections in Math: deriving the SVD from scratch Disclaimer: no AI was used to write this. Any errors, awkward sentences, and weird tangents are 100% organic, free-range, and human-made. How to actually learn anything? For a long time I have always been asking myself what it really means to learn or grasp something in math. I think this can be extended to other areas, but math is very often particularly distant from daily ordinary life when it comes to many of its concepts, and it’s hard for me at least to have an intuitive and natural understanding of it — the type of understanding that gives you some sort of foundational structure of the concepts and allows you to just reconstruct its details by yourself later in the future, with none or limited consultation. I have experienced a few times this feeling of having really captured some sort of general structure in some areas, that allows me to express myself or deduce facts only consulting what I currently understand about it. One good example is my first experience with Real Analysis. At first, it looked very confusing and difficult, but after a while you capture some sort of structure of the main concepts and you can mostly run the concepts yourself. But one thing took me too long to understand: traditional math books very often fail to give you a path that allows you to capture structure, or to grasp concepts in a more intuitive or natural way that would allow you to reconstruct it later . This is because most math books present math in its final formalized version. But they hide the path that was taken to get there: often messy, experimental, trial-and-error and tentative. This blog assumes you already have basic knowledge in math. It’s just not practical to teach everything from the ground up, but my goal is to write here some of the connections that helped me grasp concepts. Starting Somewhere I’ve chosen to start from Linear Algebra L.A . It’s one of the most applicable and accessible areas in math. When I was introduced to a central Linear Algebra concept called Singular Value Decomposition SVD , it took me time to really understand what it is, and I think most books fail to explain it in a simple way simply because they usually start from the final conclusion formally stated, which makes no sense to the reader. I think it’s critical to introduce motivation on why this was created in the first place. Most things in applied math came from a specific problem someone was trying to solve. Linear Algebra is a great place to start not only because it’s somehow simpler than more advanced non-linear areas, but also because it’s connected to so many other areas, such as calculus, information theory, image processing, machine learning, and many others. It’s some sort of basic or central area, although I think no matter where you start, if you continue to dig enough, you will reach connections to other areas — and in L.A this is even more evident. Let’s arrive naturally at SVD without even aiming at it. The matrix of a Linear Transformation can be written differently for the same L.T depending on the chosen input and output bases. To see that, take a deliberately simple linear map — stretch the -axis by 3, leave alone: In the standard basis , its matrix just reads off where the basis vectors land: Nothing could be clearer: “stretch one axis.” Now describe the same map in a skewed basis: choose a slanted, non-orthonormal basis: The matrix of the same transformation in this basis is : Same transformation. Same map on every actual vector. But now it looks like an arbitrary tangle of scaling, shearing, and sign flips — you’d never guess it was just “stretch by 3.” The complexity was never in the map. It was in the basis. Sometimes we can diagonalize a matrix, by just choosing a new basis that makes the matrix of diagonal this can only be done if admits an Eigen Vector basis, but we will leave this fact unproved for now . Of course, not all L.T allow for a basis of Eigen Vectors. In fact, some L.T are not even square e.g. input dimension is not the same as output dimension . We cannot even talk about Eigen Vectors for L.Ts that don’t output to the same dimensional space, because an Eigen Vector stays fixed up to a scale, so it obviously restricts input and output to the same dimension. When we can diagonalize a matrix, something interesting happens: we can clearly see that the Linear Transformation itself is basically a simple non-uniform scale, disguised as something more complicated due to the basis chosen to represent the matrix the opposite of the operation we did before, where we showed a simple scale can be represented in a more complicated matrix form . So we take the matrix of a diagonalizable L.T , and write: and then is similar to a diagonal matrix. So the potentially complicated is just a few numbers on the matrix’s diagonal and zeros everywhere else, up to a special change of basis matrix . This is perfect for storing in a computer and to perform matrix multiplication efficiently. We could convert our data to the reference frame of once, make millions of matrix multiplications using a diagonal matrix easy , then convert back to the original frame using . This simplification is possible because we could extract structure from the L.T . When do we fail to diagonalize a matrix? Well, if a diagonalizable matrix is just a disguised scale in some special coordinate frame, then we fail to diagonalize when the L.T is anything more than a disguised scale actually, this is equivalent to saying that cannot admit a basis of Eigen Vectors — try to prove that yourself . For example, if is a 2-d rotation, it does not have two independent Eigen Vectors it has no real eigenvectors at all; a rotation moves every direction in the plane . If is a 2-d shear, it surely cannot be diagonalized either, for a similar reason. So can we still extract structure from just any L.T in the same way we did with diagonalizable L.Ts? If we could, then we might get some advantages, as we did with diagonal matrices… Intuitively, we can just look at what happens to an -dimensional unit sphere through a generic L.T: it can be non-injective, that is, some directions might be annihilated collapsed to 0 — so a sphere would flatten into a disk ; it can scale other directions by an arbitrary value so a sphere or a disk would become an ellipsoid or ellipse ; it might rotate other directions by some arbitrary angle through arbitrary axes; it might also shear the sphere into other directions a sheared sphere is still a rotated ellipsoid . This simple fact can be written as below: where is a scalar — the stretch factor — and is a unit output direction. That’s it. We have detached and , and we don’t demand that they are Eigen Vectors anymore; they are just arbitrary vectors. The expression above introduces no new fact, it’s just what happens to any L.T, but it indeed captures our feeling that unit spheres are mapped to rotated ellipsoids in some way, if you see as scale factors and as the new rotated directions. Our feeling now is that a L.T might perform several types of operations rotation, shear, scale that, similarly to our diagonalization case, can possibly be decomposed and isolated, revealing some simpler structure in what initially seemed a complicated transformation. As we saw earlier, skewed bases are not the best to represent a L.T matrix, because they mix the coordinate basis vectors: each basis vector has a non-zero projection onto the others. The obvious next step here is to see what happens when the bases and are orthonormal. Can we always ask and to be orthonormal? We will see we can, and that’s an interesting and neat fact. Look at what we are asking. For the output vectors to be orthonormal: and at the same time we want the input vectors to be orthonormal: Now, something very interesting happens. Rewrite the inner product of the outputs by moving to the other side using : So far, nothing new — just rewriting the exact same expression. But now shows up explicitly, and it’s a symmetric matrix Take any potentially non-square matrix and form , and you will end up with a symmetric matrix. Now, a very important fact here is that a symmetric matrix always admits an orthonormal basis of Eigen Vectors this is the spectral theorem . So let us choose the to be exactly the orthonormal Eigen Vectors of . This immediately gives us 2 , the input orthonormality, for free. But here is the key point: this choice also gives us 1 , the output orthogonality, automatically. Watch what happens to the off-diagonal case . Since is an eigenvector of , we have , so: The outputs are orthogonal precisely because the inputs were orthogonal eigenvectors of . We did not have to demand the output orthogonality separately — it fell out of the input choice. And the lengths of those outputs are exactly the scale factors: , so , and normalizing gives the orthonormal output directions . So we could not, in general, find an orthonormal basis that itself diagonalizes — but we can always find one for the symmetric matrix , and that is exactly the input basis we were looking for. The rank of A and the rank of AᵀA Another remarkable fact is that the rank of is the same as the rank of , and we will need it soon, so let’s prove it first. Consider . By the rank–nullity theorem, where is the input dimension the kernel lives in the input space . Now notice that is since is and is , so it also maps , and the same rank–nullity theorem gives Both maps have the same domain dimension , so to prove the ranks are equal it is enough to prove the kernels are equal: Proof. 1 First, — the easy direction: 2 Now — the less trivial direction. At first it seems nothing stops a vector that is not in the kernel of from having land in the kernel of , in which case even though . But it turns out this cannot happen. Suppose . Then , so . Two cases: - either , so and we are done; - or , which would put a nonzero vector in both since is, by definition, an output of and . But this second case is impossible, because a subspace and its orthogonal complement share only the zero vector . So , i.e. . A simpler, self-contained way to prove this second direction is to use the positive-definiteness of the inner product directly: The last step is exactly positive-definiteness: only the zero vector has zero length. Either route works; the first connects to the four fundamental subspaces, the second goes straight to the inner product. Taking stock: what we managed to extract Before assembling anything, let’s pause and list what we now know about any linear map an matrix . We set out to do for a general map what diagonalization did for the nice ones — extract structure — and it turns out we can extract quite a lot: - Orthonormal frames on both sides. Because is symmetric, the spectral theorem hands us an orthonormal basis of input directions its eigenvectors . Pushing them through and normalizing gives an orthonormal basis of output directions . Both frames clean, no skew. - A rank that lines up. We proved . So exactly of the eigenvalues are nonzero, giving exactly nonzero scale factors . The other input directions are the kernel — sends them to zero. - A clean bijection in the middle. Those surviving input directions are linearly independent they are part of an orthonormal basis , and they map to linearly independent output directions, one for one: . On the directions that matter, is just a clean, invertible scaling between two orthonormal frames — everything else it annihilates. That last point is the heart of it. Stripped of the kernel, does nothing exotic: it takes perpendicular input directions to perpendicular output directions, stretching each by its own factor . The “complicated transformation” was, once again, a simple scaling in disguise — we just needed two orthonormal frames instead of one to see it. Collecting it into a factorization Now we collect these relations into a single matrix equation. Notice this is exactly the same relation we wrote down at the very start — only now we know the and can be chosen orthonormal, and the scale is . We have, one per surviving direction: with orthonormal in the input space and orthonormal in the output space. Put the as the columns of and the as the columns of . Applying to each column of is just , and the right-hand side — each scaled by — is with its columns scaled, which is exactly where the diagonal comes from: So . We are not collecting two matrices and then separately decomposing a third — the diagonal appears on its own , simply because scaling each column of by its factor is the same as multiplying on the right by a diagonal matrix. The per-direction scales collapse onto the diagonal for free. To isolate , multiply on the right by : Here we must be careful: has orthonormal columns , so , but is not the full identity — it is the orthogonal projection onto the span of the . That span is the row space of , and its complement is the kernel, which kills anyway. So projecting it away changes nothing, , and we arrive at three width- pieces: an orthonormal output frame , a diagonal of stretches , and an orthonormal input frame . The same “extract the structure” move we made for diagonalizable maps — now done for any linear map at all. The geometry behind it: the four subspaces We leaned on one fact in the rank proof — that — without really looking at it. It is worth pausing on, because it carries a lot of geometric insight, and it explains why the kernel directions could be dropped so cleanly when we built the thin SVD. A linear map splits each of its two spaces into an alive part and a dead part. In the input space , the alive part is the row space the directions actually acts on and the dead part is the kernel the directions sends to zero . In the output space , the alive part is the image the directions can reach and the dead part is the directions no input ever maps onto . And in both spaces, the alive and dead parts are orthogonal : Let me draw this in 2D, the simplest possible case. In each space we have just two perpendicular lines: one alive, one dead. Now here is the insight that makes the whole thing click. Take a vector sitting on the alive line of the input the row space . Apply : it lands on the alive line of the output the image , stretched by . Now apply to bring it back. Could kill it? It could only kill a vector that lies in — the dead output line. But our vector is on the alive output line, which is perpendicular to the dead one. A vector sitting entirely on the alive line has zero component along the dead line. So there is nothing for to annihilate — it carries the vector faithfully back to the input’s alive line. That is the whole reason the round trip preserves the alive directions: the dead direction is orthogonal to where the alive vectors live, so it cannot reach in and annihilate them. If the alive and dead parts were merely complementary but skew not perpendicular , an alive vector could carry a sneaky component along the dead direction and lose it on the way through. It is orthogonality — and only orthogonality — that keeps the alive subspace “pure”. This is exactly why, when we built the thin SVD, dropping the kernel columns changed nothing: the kept directions and the discarded ones sit at right angles, so the discard touches nothing that matters. A different way to write it: a sum of one-dimensional pieces So far we wrote the SVD as a product of three matrices, . But there is a second way to read exactly the same equation that I find more revealing. If you multiply it out, the product becomes a sum : Each term is an outer product : a column vector times a row vector, which produces a full matrix — but a very special one. It has rank 1 : every column is a multiple of , every row a multiple of . It is the simplest possible non-trivial matrix, a single “pattern”. So the formula says something striking: any matrix is a sum of rank-1 patterns , each scaled by its singular value . The map is built by stacking one-dimensional pieces on top of each other, from the strongest to the weakest . This also gives a clean second meaning to rank: it is the minimum number of rank-1 pieces you need to build the matrix. And it gives a meaning to the singular values that goes beyond “stretch factor”. Order them . A large means that direction carries a lot of the matrix — it is a strong, genuinely present pattern. A small means that direction barely contributes; the matrix is almost the same without it. A that is exactly zero means the direction contributes nothing at all it was never really there — it is the kernel . So the singular value measures how much that independent direction actually matters to the matrix as a whole. Throwing away the small pieces: compression Once you see the matrix as an ordered sum of rank-1 pieces, an idea is irresistible: what if we keep only the first few? Define the rank- truncation by keeping the top pieces and dropping the rest: This is a simpler matrix — fewer pieces, less to store — that approximates the original. How good is the approximation? This is the beautiful part. The Eckart–Young theorem says that is the best possible rank- approximation of : of all the matrices of rank , none is closer to in the least-squares / Frobenius sense . And the error you make is governed exactly by the singular values you threw away: So the reconstruction error is just the sum of the squared discarded singular values. If the small really are small, you can drop them and barely change the matrix — but you store far less. This is lossy compression , and SVD tells you precisely the best way to do it and exactly what it costs. Above, a single matrix think of it as an image rebuilt from its top layers. With you already see the dominant structure; by it is essentially there; the remaining dozens of pieces only add fine detail and noise. We are not storing the whole matrix — we are storing a handful of strong patterns and discarding the long tail of weak ones. Why does this work so well in practice? Because most real matrices are compressible : their singular-value spectrum decays fast. The spectrum above drops off a cliff after the first few values. That cliff is the matrix confessing that, although it is technically full of numbers, its real content lives in only a few directions. The rest is redundancy. The link to information and entropy This is where linear algebra quietly touches information theory, and it is the connection I find most satisfying. Think of the squared singular values as an “energy” carried by each direction. Normalize them so they sum to one, and you have a probability distribution over directions — how the matrix’s energy is spread across its independent patterns. Now the spectrum’s shape becomes a statement about information : - If a few are close to 1 and the rest near 0 a spiky distribution, low entropy , the matrix is highly redundant — almost all its content is in a few directions, and it compresses dramatically. - If the are all roughly equal a flat distribution, high entropy , the matrix is “incompressible” — every direction matters about the same, and there is nothing to throw away. The entropy of the singular-value distribution is, in this sense, a measure of how much genuine information the matrix carries versus how much is redundancy. Compressing by truncating small is exactly discarding the low-probability, low-information directions — the same move, in spirit, that lossy compression of any signal makes. The singular values do not just decompose a transformation; they rank its content by importance , and importance is information. Putting it to work We started by just trying to write a matrix in a simpler way, and we ended up with a tool that does a surprising amount: Compression. Keep the top rank-1 pieces and you get the best possible low-rank approximation, with a known, controllable error. This is the engine behind image compression and low-rank approximations everywhere. Finding the directions that matter in data. If the rows of are data points, the top singular directions are the directions of largest variation in the data — the directions along which the data actually spreads. This is Principal Component Analysis PCA : it is just the SVD of the centered data matrix, and is, up to a scale, the covariance matrix whose eigenvectors we were computing all along. Reducing dimensionality is then nothing but projecting onto the top few and dropping the weak ones — the same truncation as compression, applied to data. Understanding the map itself. The rank, the kernel, the image, the four subspaces — all of it can be read straight off the SVD: the rank is the number of nonzero , the kernel is the directions with , and the singular values tell you how the map stretches space. And underneath all of these is a single picture, the one we built piece by piece: every linear map takes an orthonormal set of input directions to an orthogonal set of output directions, stretching each by — a sphere becomes an ellipsoid. Compression is keeping the long axes of that ellipsoid. PCA is finding them in data. The four subspaces are the alive axes versus the flat ones. They are all the same fact, seen from different sides — which, more than the formula itself, is the thing I wanted to capture here.