# Deriving SVD without even aiming at it

> Source: <https://stillthinking.net/posts/connections-in-math-svd/>
> Published: 2026-06-28 14:21:17+00:00

# Connections in Math: deriving the SVD from scratch

Disclaimer: no AI was used to write this. Any errors, awkward sentences, and weird tangents are 100% organic, free-range, and human-made.

## How to actually learn anything?

For a long time I have always been asking myself what it really means to learn or grasp something in math. I think this can be extended to other areas, but math is very often particularly distant from daily ordinary life when it comes to many of its concepts, and it’s hard for me (at least) to have an intuitive and natural understanding of it — the type of understanding that gives you some sort of foundational structure of the concepts and allows you to just reconstruct its details by yourself later in the future, with none or limited consultation. I have experienced a few times this feeling of having really captured some sort of general structure in some areas, that allows me to express myself or deduce facts only consulting what I currently understand about it. One good example is my first experience with Real Analysis. At first, it looked very confusing and difficult, but after a while you capture some sort of structure of the main concepts and you can mostly run the concepts yourself.

But one thing took me too long to understand: traditional math books very often fail to give you a path that allows you to capture structure, or to grasp concepts in a more intuitive or natural way (that would allow you to reconstruct it later). This is because most math books present math in its final formalized version. But they hide the path that was taken to get there: often messy, experimental, trial-and-error and tentative.

This blog assumes you already have basic knowledge in math. It’s just not practical to teach everything from the ground up, but my goal is to write here some of the connections that helped me grasp concepts.

## Starting Somewhere

I’ve chosen to start from Linear Algebra (L.A). It’s one of the most applicable and accessible areas in math. When I was introduced to a central Linear Algebra concept called *Singular Value Decomposition* (SVD), it took me time to really understand what it is, and I think most books fail to explain it in a simple way simply because they usually start from the final conclusion formally stated, which makes no sense to the reader. I think it’s critical to introduce motivation on why this was created in the first place. Most things in applied math came from a specific problem someone was trying to solve. Linear Algebra is a great place to start not only because it’s somehow simpler than more advanced non-linear areas, but also because it’s connected to so many other areas, such as calculus, information theory, image processing, machine learning, and many others. It’s some sort of basic or central area, although I think no matter where you start, if you continue to dig enough, you will reach connections to other areas — and in L.A this is even more evident.

Let’s arrive naturally at SVD without even aiming at it. The *matrix* of a Linear Transformation can be written differently for the same L.T depending on the chosen input and output bases. To see that, take a deliberately simple linear map — stretch the -axis by 3, leave alone:

In the **standard basis**, its matrix just reads off where the basis vectors land:

Nothing could be clearer: “stretch one axis.”

Now describe the *same map* in a skewed basis: choose a slanted, non-orthonormal basis:

The matrix of the same transformation in this basis is :

Same transformation. Same map on every actual vector. But now it looks like an arbitrary tangle of scaling, shearing, and sign flips — you’d never guess it was just “stretch by 3.”

The complexity was never in the map. It was in the basis.

Sometimes we can *diagonalize* a matrix, by just choosing a new basis that makes the matrix of diagonal (this can only be done if admits an Eigen Vector basis, but we will leave this fact unproved for now). Of course, not all L.T allow for a basis of Eigen Vectors. In fact, some L.T are not even square (e.g. input dimension is not the same as output dimension). We cannot even talk about Eigen Vectors for L.Ts that don’t output to the same dimensional space, because an Eigen Vector stays fixed up to a scale, so it obviously restricts input and output to the same dimension.

When we **can** diagonalize a matrix, something interesting happens: we can clearly see that the Linear Transformation itself is basically a simple non-uniform scale, disguised as something more complicated due to the basis chosen to represent the matrix (the opposite of the operation we did before, where we showed a simple scale can be represented in a more complicated matrix form). So we take the matrix of a diagonalizable L.T , and write:

and then is *similar* to a diagonal matrix. So the potentially complicated is just a few numbers on the matrix’s diagonal and zeros everywhere else, up to a special change of basis matrix . This is perfect for storing in a computer and to perform matrix multiplication efficiently. We could convert our data to the reference frame of once, make millions of matrix multiplications using a diagonal matrix (easy), then convert back to the original frame using . This simplification is possible because we could extract *structure* from the L.T .

When do we fail to diagonalize a matrix? Well, if a diagonalizable matrix is just a disguised scale in some special coordinate frame, then we fail to diagonalize when the L.T is anything more than a disguised scale (actually, this is equivalent to saying that cannot admit a basis of Eigen Vectors — try to prove that yourself). For example, if is a 2-d rotation, it does not have two independent Eigen Vectors (it has no *real* eigenvectors at all; a rotation moves every direction in the plane!). If is a 2-d shear, it surely cannot be diagonalized either, for a similar reason. So can we still extract structure from just any L.T in the same way we did with diagonalizable L.Ts? If we could, then we might get some advantages, as we did with diagonal matrices…

Intuitively, we can just look at what happens to an -dimensional unit sphere through a generic L.T: it can be non-injective, that is, some directions might be annihilated (collapsed to 0 — so a sphere would flatten into a disk); it can scale other directions by an arbitrary value (so a sphere or a disk would become an *ellipsoid* or *ellipse*); it might rotate other directions by some arbitrary angle through arbitrary axes; it might also shear the sphere into other directions (a sheared sphere is still a rotated ellipsoid).

This simple fact can be written as below:

where is a *scalar* — the stretch factor — and is a unit output direction. That’s it. We have detached and , and we don’t demand that they are Eigen Vectors anymore; they are just arbitrary vectors. The expression above introduces no new fact, it’s just what happens to any L.T, but it indeed captures our feeling that unit spheres are mapped to rotated ellipsoids in some way, if you see as scale factors and as the new rotated directions. Our feeling now is that a L.T might perform several types of operations (rotation, shear, scale) that, similarly to our diagonalization case, can possibly be decomposed and isolated, revealing some simpler structure in what initially seemed a complicated transformation.

As we saw earlier, skewed bases are not the best to represent a L.T matrix, because they mix the coordinate basis vectors: each basis vector has a non-zero projection onto the others. The obvious next step here is to see what happens when the bases and are orthonormal. Can we always ask and to be orthonormal? We will see we can, and that’s an interesting and neat fact.

Look at what we are asking. For the **output** vectors to be orthonormal:

and at the same time we want the **input** vectors to be orthonormal:

Now, something very interesting happens. Rewrite the inner product of the outputs by moving to the other side (using ):

So far, nothing new — just rewriting the exact same expression. But now shows up explicitly, and it’s a symmetric matrix! Take any (potentially non-square) matrix and form , and you will end up with a symmetric matrix. Now, a very important fact here is that a symmetric matrix **always admits an orthonormal basis of Eigen Vectors** (this is the spectral theorem). So let us *choose* the to be exactly the orthonormal Eigen Vectors of . This immediately gives us (2), the input orthonormality, for free.

But here is the key point: this choice also gives us (1), the output orthogonality, automatically. Watch what happens to the off-diagonal case . Since is an eigenvector of , we have , so:

The outputs are orthogonal precisely because the inputs were orthogonal eigenvectors of . We did not have to demand the output orthogonality separately — it fell out of the input choice. And the lengths of those outputs are exactly the scale factors: , so , and normalizing gives the orthonormal output directions .

So we could not, in general, find an orthonormal basis that itself diagonalizes — but we *can* always find one for the symmetric matrix , and that is exactly the input basis we were looking for.

## The rank of A and the rank of AᵀA

Another remarkable fact is that the *rank* of is the **same** as the *rank* of , and we will need it soon, so let’s prove it first.

Consider . By the rank–nullity theorem,

where is the **input** dimension (the kernel lives in the input space ). Now notice that is (since is and is ), so it also maps , and the same rank–nullity theorem gives

Both maps have the *same* domain dimension , so to prove the ranks are equal it is enough to prove the kernels are equal:

**Proof.**

**(1)** First, — the easy direction:

**(2)** Now — the less trivial direction. At first it seems nothing stops a vector that is *not* in the kernel of from having land in the kernel of , in which case even though . But it turns out this cannot happen. Suppose . Then , so . Two cases:

- either , so and we are done;
- or , which would put a
*nonzero*vector in both (since is, by definition, an output of ) and .

But this second case is impossible, because

(a subspace and its orthogonal complement share only the zero vector). So , i.e. .

A simpler, self-contained way to prove this second direction is to use the positive-definiteness of the inner product directly:

The last step is exactly positive-definiteness: only the zero vector has zero length. Either route works; the first connects to the four fundamental subspaces, the second goes straight to the inner product.

## Taking stock: what we managed to extract

Before assembling anything, let’s pause and list what we now know about *any* linear map (an matrix). We set out to do for a general map what diagonalization did for the nice ones — extract structure — and it turns out we can extract quite a lot:

-
**Orthonormal frames on both sides.** Because is symmetric, the spectral theorem hands us an orthonormal basis of input directions (its eigenvectors). Pushing them through and normalizing gives an orthonormal basis of output directions . Both frames clean, no skew. -
**A rank that lines up.** We proved . So exactly of the eigenvalues are nonzero, giving exactly nonzero scale factors . The other input directions are the kernel — sends them to zero. -
**A clean bijection in the middle.** Those surviving input directions are linearly independent (they are part of an orthonormal basis), and they map to linearly independent output directions, one for one: . On the directions that matter, is just a clean, invertible scaling between two orthonormal frames — everything else it annihilates.

That last point is the heart of it. Stripped of the kernel, does nothing exotic: it takes perpendicular input directions to perpendicular output directions, stretching each by its own factor . The “complicated transformation” was, once again, a simple scaling in disguise — we just needed *two* orthonormal frames instead of one to see it.

## Collecting it into a factorization

Now we collect these relations into a single matrix equation. Notice this is exactly the same relation we wrote down at the very start — only now we know the and can be chosen orthonormal, and the scale is . We have, one per surviving direction:

with orthonormal in the input space and orthonormal in the output space. Put the as the columns of () and the as the columns of (). Applying to each column of is just , and the right-hand side — each scaled by — is with its columns scaled, which is exactly where the diagonal comes from:

So . We are not collecting two matrices and then separately decomposing a third — the diagonal *appears on its own*, simply because scaling each column of by its factor is the same as multiplying on the right by a diagonal matrix. The per-direction scales collapse onto the diagonal for free.

To isolate , multiply on the right by :

Here we must be careful: has orthonormal *columns*, so , but is **not** the full identity — it is the orthogonal projection onto the span of the . That span is the row space of , and its complement is the kernel, which kills anyway. So projecting it away changes nothing, , and we arrive at

three width- pieces: an orthonormal output frame , a diagonal of stretches , and an orthonormal input frame . The same “extract the structure” move we made for diagonalizable maps — now done for *any* linear map at all.

## The geometry behind it: the four subspaces

We leaned on one fact in the rank proof — that — without really looking at it. It is worth pausing on, because it carries a lot of geometric insight, and it explains *why* the kernel directions could be dropped so cleanly when we built the thin SVD.

A linear map splits *each* of its two spaces into an **alive** part and a **dead** part. In the input space , the alive part is the row space (the directions actually acts on) and the dead part is the kernel (the directions sends to zero). In the output space , the alive part is the image (the directions can reach) and the dead part is (the directions no input ever maps onto). And in both spaces, **the alive and dead parts are orthogonal**:

Let me draw this in 2D, the simplest possible case. In each space we have just two perpendicular lines: one alive, one dead. Now here is the insight that makes the whole thing click. Take a vector sitting on the alive line of the input (the row space). Apply : it lands on the alive line of the output (the image), stretched by . Now apply to bring it back. Could kill it? It could only kill a vector that lies in — the *dead* output line. But our vector is on the *alive* output line, which is **perpendicular** to the dead one. A vector sitting entirely on the alive line has *zero component* along the dead line. So there is nothing for to annihilate — it carries the vector faithfully back to the input’s alive line.

That is the whole reason the round trip preserves the alive directions: **the dead direction is orthogonal to where the alive vectors live, so it cannot reach in and annihilate them.** If the alive and dead parts were merely complementary but *skew* (not perpendicular), an alive vector could carry a sneaky component along the dead direction and lose it on the way through. It is orthogonality — and only orthogonality — that keeps the alive subspace “pure”. This is exactly why, when we built the thin SVD, dropping the kernel columns changed nothing: the kept directions and the discarded ones sit at right angles, so the discard touches nothing that matters.

## A different way to write it: a sum of one-dimensional pieces

So far we wrote the SVD as a product of three matrices, . But there is a second way to read exactly the same equation that I find more revealing. If you multiply it out, the product becomes a **sum**:

Each term is an *outer product*: a column vector times a row vector, which produces a full matrix — but a very special one. It has **rank 1**: every column is a multiple of , every row a multiple of . It is the simplest possible non-trivial matrix, a single “pattern”. So the formula says something striking: *any* matrix is a **sum of rank-1 patterns**, each scaled by its singular value . The map is built by stacking one-dimensional pieces on top of each other, from the strongest () to the weakest ().

This also gives a clean second meaning to rank: it is the minimum number of rank-1 pieces you need to build the matrix. And it gives a meaning to the singular values that goes beyond “stretch factor”. Order them . A **large** means that direction carries a lot of the matrix — it is a strong, genuinely present pattern. A **small** means that direction barely contributes; the matrix is *almost* the same without it. A that is exactly zero means the direction contributes nothing at all (it was never really there — it is the kernel). So the singular value measures *how much that independent direction actually matters* to the matrix as a whole.

## Throwing away the small pieces: compression

Once you see the matrix as an ordered sum of rank-1 pieces, an idea is irresistible: what if we keep only the first few? Define the **rank- truncation** by keeping the top pieces and dropping the rest:

This is a simpler matrix — fewer pieces, less to store — that *approximates* the original. How good is the approximation? This is the beautiful part. The **Eckart–Young theorem** says that is the *best possible* rank- approximation of : of all the matrices of rank , none is closer to (in the least-squares / Frobenius sense). And the error you make is governed exactly by the singular values you threw away:

So the reconstruction error is just the sum of the squared *discarded* singular values. If the small really are small, you can drop them and barely change the matrix — but you store far less. This is **lossy compression**, and SVD tells you precisely the best way to do it and exactly what it costs.

Above, a single matrix (think of it as an image) rebuilt from its top layers. With you already see the dominant structure; by it is essentially there; the remaining dozens of pieces only add fine detail and noise. We are not storing the whole matrix — we are storing a handful of strong patterns and discarding the long tail of weak ones.

Why does this work so well in practice? Because most real matrices are *compressible*: their singular-value spectrum decays fast.

The spectrum above drops off a cliff after the first few values. That cliff is the matrix confessing that, although it is technically full of numbers, its *real* content lives in only a few directions. The rest is redundancy.

### The link to information and entropy

This is where linear algebra quietly touches information theory, and it is the connection I find most satisfying. Think of the squared singular values as an “energy” carried by each direction. Normalize them so they sum to one,

and you have a *probability distribution* over directions — how the matrix’s energy is spread across its independent patterns. Now the spectrum’s shape becomes a statement about *information*:

- If a few are close to 1 and the rest near 0 (a spiky distribution, low entropy), the matrix is highly redundant — almost all its content is in a few directions, and it compresses dramatically.
- If the are all roughly equal (a flat distribution, high entropy), the matrix is “incompressible” — every direction matters about the same, and there is nothing to throw away.

The **entropy of the singular-value distribution** is, in this sense, a measure of how much genuine information the matrix carries versus how much is redundancy. Compressing by truncating small is exactly discarding the low-probability, low-information directions — the same move, in spirit, that lossy compression of any signal makes. The singular values do not just decompose a transformation; they *rank its content by importance*, and importance is information.

## Putting it to work

We started by just trying to write a matrix in a simpler way, and we ended up with a tool that does a surprising amount:

**Compression.** Keep the top rank-1 pieces and you get the best possible low-rank approximation, with a known, controllable error. This is the engine behind image compression and low-rank approximations everywhere.**Finding the directions that matter in data.** If the rows of are data points, the top singular directions are the directions of largest variation in the data — the directions along which the data actually spreads. This is**Principal Component Analysis**(PCA): it is just the SVD of the (centered) data matrix, and is, up to a scale, the covariance matrix whose eigenvectors we were computing all along. Reducing dimensionality is then nothing but projecting onto the top few and dropping the weak ones — the same truncation as compression, applied to data.**Understanding the map itself.** The rank, the kernel, the image, the four subspaces — all of it can be read straight off the SVD: the rank is the number of nonzero , the kernel is the directions with , and the singular values tell you how the map stretches space.

And underneath all of these is a single picture, the one we built piece by piece: *every linear map takes an orthonormal set of input directions to an orthogonal set of output directions, stretching each by — a sphere becomes an ellipsoid.* Compression is keeping the long axes of that ellipsoid. PCA is finding them in data. The four subspaces are the alive axes versus the flat ones. They are all the same fact, seen from different sides — which, more than the formula itself, is the thing I wanted to capture here.