Introduction to Conditional Flow Matching – Part I, Normalizing Flows

wpnews.pro

Generative AI is the study and application of algorithms to generate data following some distribution. The distribution could be of anything, in any modality: images of cats and dogs, samples of human speech, medical imaging data, natural language, etc. The accurate modeling of these distributions has shown great potential in creating commercial value as well as helping to address some fundamental societal problems: generative models are being used to design novel proteins for medicine, to reconstruct medical scans from shorter and safer acquisitions, and to give a synthetic voice back to people who have lost theirs. At their core, all generative techniques share the same goal: to approximate a complex generating distribution using only the samples we have from that distribution.

Over the years, a variety of powerful approaches to achieving this goal have emerged. Some famous examples of these techniques include Generative Adversarial Networks (GANs), autoregressive models, and diffusion models. Each of these methods, while mathematically rich, is grounded in concepts that are relatively straightforward to explain intuitively. GANs approximate the distribution by pitting two networks against each other: a generator, which turns random noise into candidate samples, and a discriminator, which is shown a mix of real training data and the generator's output and must judge which is which. The generator tries to fool the discriminator, the discriminator tries to catch the generator. Over time, both improve at their respective tasks, resulting in high quality samples from the generator that accurately represent the underlying training data distribution. Autoregressive models — of which the transformer-based large language models are the most famous example — break the problem of generating a whole sequence into a chain of much smaller ones: predict a distribution over just the next token given everything generated so far, sample from it, append, and repeat. Diffusion models corrupt training samples with noise in many small steps until nothing but noise remains, then train a model to undo the corruption one step at a time; generating means starting from fresh noise and running the learned denoiser.

A more recent arrival is Conditional Flow Matching (CFM), introduced in 2022 and already the engine behind several state-of-the-art image and video generators. In contrast to the techniques mentioned above, it is not so straightforward to give an intuition for what's happening in CFM. At a very high level, it is a way to transform a simple generating distribution into the target distribution by flowing through a continuous set of transformations, with a few tricks thrown in to make the modeling tractable. As that is a little too vague of an explanation for anyone's comfort, I've written this article. I aim to provide a thorough, step-by-step explanation of CFMs, offering intuition for each part of the journey.

Today we start with part I on normalizing flows, which constitute a method to construct complex distributions by transforming a simpler distribution through a series of invertible mappings.

Normalizing flows #

Before we dive in, it helps to state what we actually need from a generative model. We need two things. First, a way to sample new data points: generation is the whole point, and whatever machinery we build must ultimately let us draw fresh samples that behave as if they came from the data distribution. Second, a way to score how likely a given data point is under the model. This one is less obvious, but it is what makes training possible: if the model can tell us how probable each training example is, then "make the training data probable" becomes an objective we can optimize, and a model under which the real data is likely is precisely a model that has captured the distribution. Normalizing flows are built to deliver exactly these two abilities. They transform a simple distribution (easy to sample from, easy to score) through an invertible transformation, while keeping exact track of what that transformation does to the density. The rest of this article builds up that machinery from scratch.

Let's say we have a random variable (X) distributed according to a simple uniform distribution between 0 and 1: (X \sim p_x = U(0, 1)). It is a constant function, analytically defined on the domain ([0, 1]) as (p_x(x) = 1) and (0) elsewhere. Any sample between 0 and 1 is equally likely, and the PDF over the full domain integrates to 1.

But now let's say that after taking our sample (x), we transform it by some continuously differentiable invertible function (\phi(x)), with a continuously differentiable inverse, to get (y = \phi(x)). As an example, let's pick ( \phi(x) = 2x ). This induces by way of our transformation (\phi(x)) a new random variable (Y). Let us denote the induced density as (p_{y} \triangleq [\phi]_{ # } p_x ). What does the density (p_y) look like, such that (Y \sim p_y )?

Deriving the maths

Let's put our example to the side for a moment and try to derive a generic answer. We know that probability mass cannot disappear, i.e. the combined probability of all events should always be 1. Now consider the probability that a sample ( x ) falls in some interval ( [x_1, x_2] ). Every such sample is turned into exactly one ( y = \phi(x) ) inside the image of that interval — and because ( \phi ) is invertible, no sample from outside the interval can land in that image. So "( x ) falls in ( [x_1, x_2] )" and "( y ) falls in the image of ( [x_1, x_2] )" are one and the same event, and the same event must have the same probability on either side of the transformation. Because the transformation is continuous and invertible, it must also be strictly monotonic: a continuous function that ever "turned around" would take some value twice, and could then no longer be invertible. This means the image of the interval ( [x_1, x_2] ) is again an interval, ( [y_1, y_2] = [\phi(x_1), \phi(x_2)] ). Note that in the case of a monotonically decreasing transformation, the interval "switches direction". To keep the signs simple, let us assume for now that ( \phi ) is monotonically increasing, so that ( y_1 < y_2 ); we will return to the decreasing case in a moment. In the figure below, drag the red handles to move the interval, and use the toggle to switch between an increasing and a decreasing transformation — in the decreasing case, watch the image interval flip its ordering.

Writing the preservation of probability under the transformation analytically gives us

$$ \int_{x_1}^{x_2} p_x(x) \mathrm{d}x = \int_{y_1}^{y_2} p_y(y) \mathrm{d}y. $$

Now, importantly, the differentials in both sides of these equations are not equal: a small change in ( x ) does not produce an equal change in ( y ). The definition of the derivative tells us exactly how the two are related: a small step ( \Delta x ) produces a step ( \Delta y \approx \frac{\mathrm{d}y}{\mathrm{d}x} \Delta x ), and the approximation becomes exact in the infinitesimal limit out of which the integral is built. This is the substitution rule familiar from calculus:

$$ \mathrm{d}y = \frac{\mathrm{d}y}{\mathrm{d}x} \mathrm{d}x$$

In other words, an infinitesimal "nudge" in ( y ) space corresponds to a scaled infinitesimal "nudge" in ( x ) space, and the scale is equal exactly to the derivative of ( y ) with respect to ( x ) at that point.

We can use this to rewrite the integral of the density of our transformed samples ( p_y ) in the space of the original samples, i.e. in (x)-space:

$$ \int_{y_1}^{y_2} p_y(y) \mathrm{d}y = \int_{x_1}^{x_2} p_y(\phi(x)) \frac{\mathrm{d}\phi}{\mathrm{d}x}(x) \mathrm{d}x. $$

If we now substitute this back into the previous equation, we get

$$ \int_{x_1}^{x_2} p_x(x) \mathrm{d}x = \int_{x_1}^{x_2} p_y(\phi(x)) \frac{\mathrm{d}\phi}{\mathrm{d}x}(x) \mathrm{d}x. $$

Now, as this must hold for any choice of ( x_1, x_2 ), this must mean that the integrands themselves are equal:

$$ p_x(x) = p_y(\phi(x)) \frac{\mathrm{d}\phi}{\mathrm{d}x}(x). $$

What about the monotonically decreasing case we set aside earlier? Then ( y_1 = \phi(x_1) > y_2 = \phi(x_2) ): the image of ( [x_1, x_2] ) is the interval ( [y_2, y_1] ), and conservation of probability reads

$$ \int_{x_1}^{x_2} p_x(x) \mathrm{d}x = \int_{y_2}^{y_1} p_y(y) \mathrm{d}y. $$

Note the limits of integration on the right: the lower one must be the smaller number, ( y_2 ), for the integral to express a probability. Substituting ( y = \phi(x) ) as before, which sends the limit ( y_2 \mapsto x_2 ) and ( y_1 \mapsto x_1 ) and thereby flips the orientation of the integral, we get

$$ \int_{y_2}^{y_1} p_y(y) \mathrm{d}y = \int_{x_2}^{x_1} p_y(\phi(x)) \frac{\mathrm{d}\phi}{\mathrm{d}x}(x) \mathrm{d}x = -\int_{x_1}^{x_2} p_y(\phi(x)) \frac{\mathrm{d}\phi}{\mathrm{d}x}(x) \mathrm{d}x. $$

Equating integrands now gives ( p_x(x) = -p_y(\phi(x)) \frac{\mathrm{d}\phi}{\mathrm{d}x}(x) ). And since the derivative of a decreasing function is negative, ( -\frac{\mathrm{d}\phi}{\mathrm{d}x} = \left\lvert \frac{\mathrm{d}\phi}{\mathrm{d}x} \right\rvert ). Both cases therefore combine into a single equation by taking the absolute value of the derivative:

$$ p_x(x) = p_y(\phi(x)) \left\lvert\frac{\mathrm{d}\phi}{\mathrm{d}x}(x)\right\rvert. $$

Now, using the fact that ( \phi(x) ) is invertible, we know that for any choice of ( y ), we have ( x = \phi^{-1}(y) ), which gives us the change of variables rule in one dimension:

$$ p_y(y) = \frac{p_x(\phi^{-1}(y))}{\left\lvert\frac{\mathrm{d}\phi}{\mathrm{d}x}(\phi^{-1}(y))\right\rvert}. $$

Intuitively, the density at a point in ( y ) space is equal to the density of its corresponding point in ( x ) space corrected for how much the transformation has stretched the space. Using the relation of ( \frac{1}{\frac{\mathrm{d}\phi}{\mathrm{d}x}} = \frac{\mathrm{d}\phi^{-1}}{\mathrm{d}y} ) we arrive at an alternative formulation of the same rule:

$$ p_y(y) = p_x(\phi^{-1}(y)) \left\lvert\frac{\mathrm{d}\phi^{-1}}{\mathrm{d}y}(y)\right\rvert. $$

Our example

We now return to our example. We started with a uniform distribution, ( p_x(x) = 1 ) for (x \in [0, 1] ), and ( 0 ) elsewhere. We then applied a transformation of ( \phi(x) = 2x ), with an inverse of ( \phi^{-1}(y) = \frac{y}{2} ). Our new density can now be defined on the transformed interval ( [0, 2] ) as

$$ \begin{align} p_y(y) &= p_x(\phi^{-1}(y)) \left\lvert\frac{\mathrm{d}\phi^{-1}}{\mathrm{d}y}(y)\right\rvert \ &= 1 \cdot \frac{1}{2} = \frac{1}{2} \end{align} $$

and ( 0 ) elsewhere.

We have defined our first normalizing flow: starting from our source distribution (p_x) we applied a flow of (\phi) to arrive at a new distribution (p_y), where we took care to normalize the induced density. (The name is read in two ways in the literature: Rezende and Mohamed (2015) read it as we do here, with the change of variables rule ensuring that the induced density remains a valid, normalized one; Papamakarios et al. (2021) and Kobyzev et al. (2020) read it in the reverse direction, where the inverse flow "normalizes" complicated data into the simple base distribution, which in practice is usually a normal one.)

A slightly more complex example

In the above example, the stretching factor induced by the transformation was constant, independent of the sample ( x ). This does not have to be the case. Let's take a different transformation ( \phi(x) = x^2 ). Note that this function, although not invertible everywhere, is invertible on our source distribution's non-zero domain. So, for any transformed sample ( y ) we can say that (\phi^{-1}(y) = \sqrt{y} ). Applying our knowledge now, we can define our transformed density on the transformed interval ( (0, 1] ) as

$$ p_y(y) = p_x(\phi^{-1}(y)) \left\lvert\frac{\mathrm{d}\phi^{-1}}{\mathrm{d}y}(y)\right\rvert = 1 \cdot \frac{1}{2\sqrt{y}} = \frac{1}{2\sqrt{y}} $$

Using a simple starting distribution and a simple transformation, we've now managed to get a distribution that looks a lot more interesting. Note that the density grows without bound as ( y ) approaches 0 - that is perfectly fine for a density, as long as the total area under the curve still integrates to 1 (which it does).

The figure below opens with ( \phi(x) = x^2 ), but the exponent is yours to play with - slide ( k ) and watch the transformation and its induced density reshape together. Then press release samples: each dot is a genuine sample from the source distribution, carried through ( \phi ) along its actual path, and the histogram the dots build up converges onto exactly the density we derived. This is the generative picture in miniature - move the samples, and the density follows.

Generalizing to higher dimensions

So far we have concerned ourselves with distributions and transformations in (\mathbb{R}). However, all the above generalizes to higher dimensions — we just need to replace our one dimensional derivative with the full multidimensional Jacobian.

Quick recap of the Jacobian #

The Jacobian of a transformation contains all the partial derivatives of the output with respect to the input. If (\phi(\mathbf{x}) : \mathbb{R}^2 \rightarrow \mathbb{R}^2 ), then$$ J_\phi(\mathbf{x}) = \begin{bmatrix} \frac{\partial \phi_1}{\partial x_1}(\mathbf{x}) & \frac{\partial \phi_1}{\partial x_2}(\mathbf{x}) \ \frac{\partial \phi_2}{\partial x_1}(\mathbf{x}) & \frac{\partial \phi_2}{\partial x_2}(\mathbf{x}) \end{bmatrix} $$

The Jacobian generalizes the 1D derivative to higher dimensions, and gives the best possible linear approximation of the transformation at a given point. The interactive figure below illustrates this. The left panel shows what the transformation

$$ \phi(\mathbf{x}) = \begin{bmatrix} x_1 + t \sin(x_2) \ x_2 + t \sin(x_1) \end{bmatrix} $$

does to a square grid; the ( t ) slider controls how strong the distortion is. The right panel zooms in on the small square region marked on the left — drag the square to move it, and use the second slider to shrink it. Overlaid on the zoomed view is how the

Jacobian at the center of that square— a purely linear map — would transform the same region. The smaller you make the region, the harder it becomes to tell the true transformation from its linear stand-in: that is exactly the sense in which the Jacobian is the best linear approximation at a point.(One note on the visualization: as you zoom in, grid lines are drawn at ever finer spacing — otherwise there would soon be nothing left to see. The finer lines are echoed in the left panel so you can see where they sit in the full picture.)

Furthermore, remember that the determinant of a matrix quantifies how much the space is compressed or stretched by the transformation represented by that matrix. The determinant of the Jacobian is then telling us how space is stretched by the Jacobian at a given point, and thus how space is stretched by the original transformation in an infinitesimal region around that point (as in that region, the transformation is approximately linear).

The idea behind this visualization is from the inimitable Khan Academy, which has an excellent pair of videos showing the intuition behind the

[Jacobian]and its[determinant].

This means that, generalized to higher dimensions, the change of variables rule looks like this: Let (\phi(\mathbf{x}) : \mathbb{R}^d \rightarrow \mathbb{R}^d ) be a continuously differentiable invertible transformation, with a continuously differentiable inverse. Let (p_x) be the density function of some random variable ( X ). The density induced by applying the transformation ( \phi(\mathbf{x}) ) to samples from ( X ) can be written as

$$ \begin{align} p_y(\mathbf{y}) &= ([\phi]{ # } p_x)(\mathbf{y}) \ &= p_x(\phi^{-1}(\mathbf{y}))\left\lvert\det (J{\phi^{-1}}(\mathbf{y}))\right\rvert \ &= \frac{p_x(\phi^{-1}(\mathbf{y}))}{\left\lvert\det (J_{\phi}(\phi^{-1}(\mathbf{y})))\right\rvert} \end{align} $$

Pay close attention to where each Jacobian is evaluated: ( J_{\phi^{-1}} ) takes a point in ( y )-space, but ( J_{\phi} ) takes a point in ( x )-space - so in the last line it must be evaluated at ( \phi^{-1}(\mathbf{y}) ), the point in ( x )-space corresponding to ( \mathbf{y} ).

Combining multiple flows

We have seen how to transform a density ( p_x ) after applying a single transformation ( \phi(x) ) to the samples ( \mathbf{x} \in X ), which returns a new density ( p_y = [\phi]{ # }p_x). If we now apply a second transformation to the samples, we can transform the density again. In this way, we can apply a whole series of transformations (\phi_1, \phi_2, \ldots, \phi_K ) to the samples, and transform the density in step. Let's give the intermediate results names: starting from a sample ( \mathbf{x}0 ) from the source density ( p_0 = p_x ), each transformation produces ( \mathbf{x}i = \phi_i(\mathbf{x}{i-1}) ) with density ( p_i = [\phi_i]{ # } p{i-1} ), until we arrive at the final sample ( \mathbf{x}_K ). We denote the combined flow as ( \Phi = \phi_K \circ \ldots \circ \phi_2 \circ \phi_1 ). To find the density of the combined flow, we can apply the change of variables rule to the last transformation:

$$ p_K(\mathbf{x}K) = p{K-1}(\phi_K^{-1}(\mathbf{x}K)) \left\lvert \det J{\phi_K^{-1}}(\mathbf{x}_K) \right\rvert $$

and then expand ( p_{K-1} ) using the same rule, and so on, all the way down to ( p_0 ). The density of the combined flow thus multiplies together the stretching of space contributed by each individual transformation, each evaluated at its own intermediate point:

$$ p_K(\mathbf{x}K) = p_0(\Phi^{-1}(\mathbf{x}K))\prod{i=1}^K \left\lvert \det J{\phi_i^{-1}}(\mathbf{x}_{i}) \right\rvert $$

Training the flow

So now we know how to transform a given source density function when the samples from that density undergo some given transformation ( \phi ). In the context of generative AI however, the setup becomes more interesting when we know what the target distribution should be, and are interested to figure out which transformation maps the (easy to sample from) source distribution onto this target distribution.

Imagine we have a dataset ( \mathcal{D} ) of samples (\mathbf{x} \in \mathcal{D} \subseteq \mathbb{R}^2 ), that are distributed according to some target distribution ( p_t ). For illustration purposes, let's say that ( p_t ) is a normal distribution centered at ( \mu = [5, 5] ), with a covariance matrix of ( \Sigma = \begin{bmatrix} 2.25 & -1.75 \ -1.75 & 2.25 \end{bmatrix} ) (but note that in a real world scenario we would not have access to this generating distribution, only to the samples in ( \mathcal{D} \sim p_t )).

Starting from a known distribution ( p_x ) in the same space, we now want to find a flow ( \phi(\mathbf{x}) ) that maps samples from known ( p_x ) to unknown ( p_t ).

We are more or less free to choose our starting distribution. As after training we want to be able to sample from the distribution, we should pick a distribution for which sampling is straightforward. Let's pick as our starting distribution a normal distribution centered at the origin with unit covariance:

$$ p_x \triangleq \mathcal{N}(0, \mathbf{I}) $$

For our flow, we are free to pick any parameterized flow. One example is a general linear transformation followed by a translation:

$$ \phi(\mathbf{x}, [\mathbf{A}, \mathbf{b}]) \triangleq \mathbf{A} \mathbf{x} + \mathbf{b} $$

such that ( \mathbf{A} ) and ( \mathbf{b} ) are our trainable variables. As per the change of variables rule, the induced density can be written as

$$ \begin{align} p_y(\mathbf{y}) &= ([\phi]{ # } p_x)(\mathbf{y}) \ &= p_x(\phi^{-1}(\mathbf{y})) \left\lvert \det J{\phi^{-1}}(\mathbf{y}) \right\rvert \end{align} $$

Take a moment to appreciate what this equation lets us do, because it is the central trick of normalizing flows. To compute the density our model assigns to a data point ( \mathbf{y} ), we map the data point backwards into source space using ( \phi^{-1} ), score it under the simple source density, and correct for how much the transformation stretched the space. Sampling uses the flow in the forward direction, but training - which is all about scoring data points - only ever needs the inverse direction. This is why, in the code below, we will parameterize the inverse transformation directly.

To fill in this equation, we need to find a few quantities. First, the inverse of the transformation is

$$ \phi^{-1}(\mathbf{y}) = \mathbf{A}^{-1}(\mathbf{y} - \mathbf{b}) $$

Note that this inverse does not necessarily exist! If ( \mathbf{A} ) is singular, then the inverse matrix does not exist. We will see shortly that, conveniently, the training objective itself takes care of this. The Jacobian of the inverse is then

$$ J_{\phi^{-1}}(\mathbf{y}) = \begin{bmatrix} \frac{\partial \phi^{-1}_1}{\partial y_1}(\mathbf{y}) & \frac{\partial \phi^{-1}_1}{\partial y_2}(\mathbf{y}) \ \frac{\partial \phi^{-1}_2}{\partial y_1}(\mathbf{y}) & \frac{\partial \phi^{-1}_2}{\partial y_2}(\mathbf{y}) \end{bmatrix}$$

which is simply (\mathbf{A}^{-1}):

$$ \mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})} \begin{bmatrix} A_{22} & -A_{12} \ -A_{21} & A_{11} \end{bmatrix} $$

with

$$ \det(\mathbf{A}) = A_{11}A_{22} - A_{12}A_{21}. $$

The determinant of the inverse of a matrix is simply the inverse of the determinant, so that

$$ \begin{align} \det (J_{\phi^{-1}}(\mathbf{y})) &= \det (\mathbf{A}^{-1}) \ &= \frac{1}{\det (\mathbf{A})} \ &= \frac{1}{A_{11}A_{22} - A_{12}A_{21}} \end{align}$$

We can now write out the full induced density ( p_y(\mathbf{y}) ):

$$ \begin{align} p_y(\mathbf{y}) &= p_x(\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b})) \left\lvert \frac{1}{\det(\mathbf{A})} \right\rvert \ &= \frac{1}{2\pi} \exp (-\frac{1}{2} (\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b}))^T(\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b}))) \left\lvert \frac{1}{\det(\mathbf{A})} \right\rvert \end{align} $$

We can turn this into a log likelihood for easy optimization:

$$ \log p_y(\mathbf{y}) = (-\frac{1}{2} (\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b}))^T(\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b}))) - \log(2\pi) + \log \left\lvert \frac{1}{\det(\mathbf{A})} \right\rvert $$

Now, finally, we take the negative (as we are minimizing) and get rid of the constant to get to our per-sample loss function:

$$ \mathcal{L}(\mathbf{y}; [\mathbf{A}, \mathbf{b}]) = (\frac{1}{2} (\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b}))^T(\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b}))) - \log \left\lvert \frac{1}{\det(\mathbf{A})} \right\rvert $$

The full training objective is this loss averaged over all samples in our dataset. (For readers who like to anchor this in familiar terms: minimizing the average negative log likelihood is the same as minimizing the cross-entropy between the data distribution and our induced density - equivalently, up to a constant, their KL divergence.)

Notice also that this loss automatically resolves the invertibility worry from earlier. The quantity ( \frac{1}{\det(\mathbf{A})} ) in the second term is exactly ( \det(\mathbf{A}^{-1}) ): the determinant of the matrix we will actually be training. If gradient descent were to push ( \mathbf{A}^{-1} ) towards a singular matrix, this determinant would go to zero and the term ( -\log \left\lvert \det(\mathbf{A}^{-1}) \right\rvert ) would grow without bound. And that is just as it should be: a transformation that collapses space onto a lower-dimensional subspace assigns, in the limit, zero density to every data point, and the negative log likelihood punishes that with an infinite loss. To see why, run the map forwards: if ( \mathbf{A}^{-1} ) squashes the plane onto a line, its inverse — the generative direction — stretches that line back out across the whole plane, smearing the source distribution's fixed budget of probability mass over an ever larger area. Density is mass per unit area, and a finite mass spread over unbounded area leaves zero density at any individual point. Maximizing likelihood therefore keeps the learned transformation invertible all by itself, and we can always recover ( \mathbf{A} ) from ( \mathbf{A}^{-1} ) after training.

We now have all ingredients to plug into a gradient descent loop, to optimize for our parameters ( \mathbf{A}, \mathbf{b} ). Let's write this out in PyTorch (a complete runnable script is available here). The first class we'll need is a dataset that generates samples from our target distribution. In a real world scenario, you would most likely replace this with a dataset that loads samples from disk.

class TargetDataset(torch.utils.data.Dataset):
    def __init__(self, n_samples, mean, cov):
        super(TargetDataset, self).__init__()
        self.n_samples = n_samples
        self.mean = mean
        self.cov = cov

        self.samples = torch.tensor(np.random.multivariate_normal(self.mean, self.cov, self.n_samples), dtype=torch.float32)

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        return self.samples[idx]

Next, we want to represent our flow. As during training we care only about the inverted flow, we'll write a class to represent the inverted flow explicitly: its parameters are ( \mathbf{A}^{-1} ) (rather than ( \mathbf{A} )) and ( \mathbf{b} ). After training, we can invert ( \mathbf{A}^{-1} ) once more to recover the forward flow - the flow

method below.

class InverseLinearTransform(torch.nn.Module):
    def __init__(self):
        super(InverseLinearTransform, self).__init__()

        self.inv_A = torch.nn.Parameter(torch.eye(2))
        self.b = torch.nn.Parameter(torch.zeros(2))

    def det_jacobian(self):
        A_11 = self.inv_A[0, 0]
        A_12 = self.inv_A[0, 1]
        A_21 = self.inv_A[1, 0]
        A_22 = self.inv_A[1, 1]

        return A_11 * A_22 - A_12 * A_21

    def forward(self, y):
        return torch.matmul(self.inv_A, y - self.b)

    def flow(self, x):
        A = torch.linalg.inv(self.inv_A)
        return torch.matmul(A, x) + self.b

Finally, we can write our negative log likelihood loss function:

def nll_loss(y, inv_transform):
    inv = inv_transform(y)
    return 0.5 * torch.dot(inv, inv) - torch.log(torch.abs(inv_transform.det_jacobian()))

With all the components in place, we can write our setup and training loop:

mean = np.array([5, 5])
cov = np.array([[2.25, -1.75], [-1.75, 2.25]])
dataset = TargetDataset(1000, mean, cov)

inv_transform = InverseLinearTransform()
optimizer = torch.optim.SGD(inv_transform.parameters(), lr=0.001)

num_epochs = 100
for epoch in range(0, num_epochs):
    for sample in dataset:
        optimizer.zero_grad()

        negative_log_likelihood = nll_loss(sample, inv_transform)
        negative_log_likelihood.backward()

        optimizer.step()

Note that this is a very minimal setup; we are doing standard SGD with a fixed learning rate. This code is for illustration purposes, in reality one would be more careful about the training setup.

When training has successfully converged, the underlying transformation parameters ( \mathbf{A}, \mathbf{b} ) have been optimized so that samples from the source distribution passed through the flow will be approximately distributed as the samples from the target distribution. We can visualize this by plotting how the induced density changes over the course of training, as (\mathbf{A}) and (\mathbf{b}) get closer and closer to the correct target.

We see that in the first step, there's a very large update - the gradients are initially very large as the training data is exceedingly unlikely under the source distribution. After the first epoch, the density is updated more gradually, however pretty quickly the induced density settles on something that looks very similar to our target distribution.

What did the flow learn?

For this particular example we can say exactly what the trained flow should be, which gives us a nice way to check the result. An affine transformation of a Gaussian is again a Gaussian: if ( \mathbf{x} \sim \mathcal{N}(0, \mathbf{I}) ), then ( \mathbf{A}\mathbf{x} + \mathbf{b} ) is distributed as ( \mathcal{N}(\mathbf{b}, \mathbf{A}\mathbf{A}^T) ) (see e.g. the Matrix Cookbook, section 8.1.4, or verify it directly by plugging the Gaussian density and the affine inverse into the change of variables rule we derived above). Our model family is therefore exactly the family of Gaussians, and the loss is minimized when ( \mathbf{b} ) and ( \mathbf{A}\mathbf{A}^T ) equal the empirical mean and covariance of the training samples - which, for a large enough dataset, approach ( \mu ) and ( \Sigma ). The training run in the runnable script indeed ends up with

$$ \mathbf{b} \approx \begin{bmatrix} 5.02 \ 4.98 \end{bmatrix}, \qquad \mathbf{A}\mathbf{A}^T \approx \begin{bmatrix} 2.05 & -1.52 \ -1.52 & 1.96 \end{bmatrix} $$

close to the target values, and closer still to the empirical mean and covariance of the 1000 training samples, which are the true optimum of our objective.

Two things are worth noting here. First, the solution is not unique: for any rotation matrix ( \mathbf{Q} ) we have ( (\mathbf{A}\mathbf{Q})(\mathbf{A}\mathbf{Q})^T = \mathbf{A}\mathbf{A}^T ), so every rotated version of a solution induces exactly the same density and achieves exactly the same loss. Which one training finds depends on the initialization and the path gradient descent happens to take - indeed, the trained ( \mathbf{A} ) from the script is not symmetric, while the "canonical" solution ( \Sigma^{1/2} ) is. Many different flows produce the same distribution.

Second, this tells us in exactly which sense our flow is trivial: an affine flow can only ever turn a Gaussian into another Gaussian. Had our target distribution been anything more interesting - multimodal, curved, heavy-tailed - no amount of training would have gotten us there. That is the fundamental reason we need more expressive transformations, and it is where the next part of this series picks up.

Now that the flow has been trained, we can take any sample in the source space and transform it to the target space. In the figure below, click or drag on the source density (on the left) to see the corresponding point in the target density (on the right).

Next steps

We have seen in principle how we can train a parameterized flow to transform samples from a source distribution into samples that are distributed according to some target distribution. However, the example presented here is exceedingly simple. Our flow is not a composite of multiple flows - it is just a single step. Moreover, the flow type we trained (a linear transform) is more or less trivial. In the next part, we will look at invertible residual flows, which is a more generic type of non-linear flow representable by a neural network. We will also be combining multiple flows to be able to approximate more complex target distributions.

Looking further ahead, this is also where the title of the series starts to pay off. Once we are comfortable chaining many small transformations, a natural question is what happens if we make the chain infinitely fine: flowing through a continuum of infinitesimal transformations rather than a discrete series. That limit turns the flow into an ordinary differential equation, and learning its velocity field directly is exactly where (conditional) flow matching will emerge. The rough roadmap: normalizing flows (this part), more expressive and composite flows (part II), continuous-time flows, and finally conditional flow matching itself. Everything that follows builds on the change of variables machinery we developed here.