2024-10-16

Noise everywhere

As I continue my journey to understand the AI architectures shaping our future, I’ve come across diffusion models. It’s no surprise given that these models are the foundation behind most modern AI image generation systems. Inspired by the physical phenomenon of diffusion, as the name suggests, these models allow anyone—regardless of formal artistic training—to bring their ideas to life in minutes. The speed and accessibility they offer are truly remarkable.

This piece assumes a basic understanding of neural networks and statistics, and it aims to break down the underlying mechanisms that power diffusion models. We'll start with a brief look at the physics behind the concept, before diving deep into the model’s inner workings and exploring how we can further improve it with insights from more recent research. Happy reading!

Brownian Motion: Randomness in Motion

Brownian motion, also known as a Wiener process, refers to the random, unpredictable movement of particles suspended in a fluid, first observed by Robert Brown in 1827. This motion occurs due to the continuous bombardment of particles by molecules in the surrounding medium, leading to chaotic movement. Importantly, Brownian motion is not limited to physical systems; it is also widely used in mathematics, physics, and finance to model random processes.

Mathematical Formulation

Mathematically, Brownian motion is a real-valued, continuous-time stochastic process (B_{t})_{t \ge 0} characterized by:

  1. Initial condition: B_{0} = 0, meaning the process starts at zero.
  2. Independent increments: For any 0 \le t_{1} \le t_{2} \le \dots \le t_{n}, the increments B_{t_{2}} - B_{t_{1}}, \dots, B_{t_{n}} - B_{t_{n - 1}} are independent.
  3. Stationary increments: The distribution of an increment depends only on the length of the time interval, not on when it starts.
  4. Normal distribution of increments: For s > 0, the increment B_{t + s} - B_{t} is normally distributed with mean zero and variance s, i.e., B_{t + s} - B_{t} \sim \mathcal{N}(0, s).

The Wiener Process

The Wiener process is the standard mathematical model for Brownian motion, often described by the following stochastic differential equation (SDE):

dB_{t} = \sigma \, dW_{t}

where:

  • B_{t} is the position of the particle at time t,
  • \sigma is the diffusion coefficient, controlling how fast the particle diffuses,
  • W_{t} represents a standard Brownian motion or Wiener process.
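
As a quick illustration, here is a minimal NumPy sketch of the SDE above, assuming an Euler-Maruyama discretization and arbitrary values for \sigma, T, and the number of steps:

```python
import numpy as np

# Simulate dB_t = sigma * dW_t with an Euler-Maruyama discretization.
# sigma, T, and n_steps are arbitrary illustration values.
rng = np.random.default_rng(0)
sigma, T, n_steps = 1.0, 1.0, 1000
dt = T / n_steps

# Each increment is Gaussian with mean 0 and variance dt (independent, stationary).
increments = sigma * np.sqrt(dt) * rng.standard_normal(n_steps)
B = np.concatenate([[0.0], np.cumsum(increments)])  # B_0 = 0

# Over many sample paths, Var(B_T) concentrates around sigma**2 * T.
print(B[-1])
```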

In summary, Brownian motion represents randomness unfolding over time. It's a fundamental process of nature, governing everything from the movement of molecules in water to heat flow and even information spread in complex systems.

From Physics to Machine Learning: Diffusion in Data

Just as Brownian motion introduces randomness into the physical world, diffusion models introduce controlled randomness into data. This added noise gradually transforms structured data into a more chaotic form, eventually making the data distribution resemble a simple Gaussian distribution.

However, diffusion models don't stop at adding noise. The real magic happens in reverse: learning how to gradually remove that noise, step by step, to recover the original structure — much like observing how a chaotic system gradually returns to order. This process is mathematically modeled using stochastic differential equations (SDEs) similar to those that describe Brownian motion.

In a diffusion process, the forward process mirrors Brownian motion, progressively adding noise to data. The reverse process, however, learns to denoise and restore structure, similar to Langevin dynamics, where both deterministic and random forces are balanced to guide the system back towards equilibrium.

Langevin Dynamics: The Balance Between Determinism and Randomness

Langevin dynamics provides a more detailed model of particle movement, taking into account both deterministic forces (e.g., gravity or friction) and random thermal fluctuations. The motion of particles in a fluid is governed by the Langevin equation, which balances friction and randomness:

m \frac{d^{2}x(t)}{dt^{2}} = - \gamma \frac{dx(t)}{dt} + F(x) + \eta(t)

where:

  • m is the particle's mass and \gamma represents friction or damping,
  • F(x) is an external deterministic force (such as gravity),
  • \eta(t) is a random force representing noise (e.g., thermal fluctuations).

In the context of machine learning, Langevin dynamics offers a way to model the reverse process in diffusion models: how noise (randomness) is gradually removed from the data, allowing the original structure to reemerge.

In overdamped systems, where inertia is negligible, the Langevin equation reduces to a simpler form:

\frac{dx(t)}{dt} = - \frac{1}{\gamma}\nabla U(x) + \frac{\eta(t)}{\gamma}

where the noise term \eta(t) behaves much like the noise in a diffusion model, while the gradient \nabla U(x) acts as a force that drives the data back towards its original, structured state.
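
As a rough illustration, here is a small NumPy sketch of the overdamped update, assuming a simple quadratic potential U(x) = x^{2}/2 and illustrative values for the friction, temperature, and step size:

```python
import numpy as np

# Overdamped Langevin dynamics for a quadratic potential U(x) = 0.5 * x**2,
# so grad U(x) = x.  gamma, kT, dt, and n_steps are illustrative assumptions.
rng = np.random.default_rng(0)
gamma, kT, dt, n_steps = 1.0, 1.0, 1e-2, 5000

def grad_U(x):
    return x  # gradient of U(x) = 0.5 * x**2

x = 5.0  # start far from the minimum of U
for _ in range(n_steps):
    # deterministic drift toward low energy + a random thermal kick
    x += -(dt / gamma) * grad_U(x) + np.sqrt(2.0 * kT * dt / gamma) * rng.standard_normal()

# After many steps, x behaves like a sample from exp(-U(x)/kT) (a standard normal here).
print(x)
```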

Diffusion Models: Learning to Reverse Noise

In machine learning, diffusion models handle data by introducing randomness in a manner similar to Brownian motion. The forward process adds random noise to the data — just like the random fluctuations in physical systems — until the data becomes pure Gaussian noise. The reverse process, much like Langevin dynamics, learns to remove this noise step by step, guided by the underlying structure of the original data.

Mathematically, the forward process in a diffusion model is analogous to Brownian motion and is modeled using stochastic differential equations. The reverse process, which denoises the data, mirrors Langevin dynamics by balancing random noise with deterministic forces to recover the structured data.

Forward Diffusion Process

The forward diffusion process progressively adds Gaussian noise to the original data sample x_{0} over T discrete time steps, resulting in pure noise at x_{T}. This process can be visualized as:

x_{0} \rightarrow x_{1} \rightarrow \dots \rightarrow x_{T}

At each time step t, the model corrupts the sample x_{t - 1} by adding Gaussian noise, gradually transforming the data distribution into a simple Gaussian distribution. This transition is mathematically represented as:

x_{t} = \sqrt{\alpha_{t}}\,x_{t - 1} + \sqrt{1 - \alpha_{t}}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

where:

  • x_{t - 1} is the sample at the previous time step.
  • \epsilon is Gaussian noise drawn from \mathcal{N}(0, I), a normal distribution with zero mean and identity covariance.
  • \alpha_{t} = 1 - \beta_{t} is a decay factor, with \beta_{t} representing the noise variance at time t.
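
A minimal NumPy sketch of this single noising step, assuming a simple linear \beta_{t} schedule and a random array standing in for an image:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear noise schedule; the exact values are an assumption.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas

def forward_step(x_prev, t):
    """One forward step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(alphas[t]) * x_prev + np.sqrt(1.0 - alphas[t]) * eps

x = rng.standard_normal((32, 32))  # random array standing in for an image x_0
for t in range(T):
    x = forward_step(x, t)         # after T steps, x is close to pure Gaussian noise
```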

Marginalizing the Forward Process

The above equation describes the transition between two adjacent time steps, but we can also directly express x_{t} as a function of the initial data x_{0} and the cumulative noise by marginalizing over the previous steps. Repeatedly applying the transition equation gives:

x_{t} = \sqrt{\bar{\alpha}_{t}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon

where \bar{\alpha}_{t} = \prod^{t}_{i = 1} \alpha_{i} is the cumulative product of \alpha_{i} up to time step t.

This shows that x_{t} is a progressively noisier version of x_{0}, with the amount of noise increasing as t grows. The factor \bar{\alpha}_{t} controls how much of the original signal x_{0} is preserved.

As t \rightarrow T, \bar{\alpha}_{t} \rightarrow 0, and the distribution of x_{T} converges to pure Gaussian noise:

x_{T} \sim \mathcal{N}(0, I)

This characteristic of the forward diffusion process means that as T increases, the data approaches an isotropic Gaussian distribution.
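
The closed-form expression makes this easy to check numerically. Here is a small sketch, again assuming an illustrative linear schedule, that jumps directly from x_{0} to any x_{t}:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative schedule
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product of the alpha_i

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form, without iterating over steps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal((32, 32))
x_mid = q_sample(x0, 500)             # partially noised
x_T = q_sample(x0, T - 1)             # nearly pure noise
print(alpha_bar[-1])                  # close to 0, so x_T is approximately N(0, I)
```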

Forward Process Distribution

The conditional probability distribution for x_{t} given x_{t - 1} can be expressed as:

q(x_{t}|x_{t - 1}) = \mathcal{N}(x_{t};\, \sqrt{\alpha_{t}}\,x_{t - 1},\, (1 - \alpha_{t})I)

which confirms that the transition between consecutive time steps is Gaussian, with a mean scaled by \sqrt{\alpha_{t}} and a variance of 1 - \alpha_{t}.

Similarly, the marginal distribution of x_{t} given x_{0} is:

q(x_{t}|x_{0}) = \mathcal{N}(x_{t};\, \sqrt{\bar{\alpha}_{t}}\,x_{0},\, (1 - \bar{\alpha}_{t})I)

This shows that at any time step t, x_{t} is normally distributed around a scaled version of the original data x_{0} with variance 1 - \bar{\alpha}_{t}, driven by the cumulative noise.

Markov Chain Formalism

The forward diffusion process follows a Markov chain, where each step x_{t} depends only on the previous step x_{t - 1}, and not on any earlier states. This can be expressed as:

q(x_{t}|x_{t - 1})

This memoryless property is central to the model's tractability, as it allows the forward process to corrupt the data incrementally in small, manageable steps. Each transition is governed by a simple conditional probability distribution, enabling efficient computation and modeling of high-dimensional data.

The Markovian nature also ensures that the reverse process can efficiently recover the original data by progressively denoising, moving from xTx_{T} back to x0x_{0}, step by step.

Reverse Process

The reverse process in diffusion models is where data generation occurs. Starting from a noisy sample x_{T} (close to pure Gaussian noise), the model iteratively denoises the sample to recover the original data x_{0}.

Deriving the Reverse Process

The goal of the reverse process is to approximate the posterior distribution p_{\theta}(x_{t - 1}|x_{t}), i.e., to learn how to recover x_{t - 1} from x_{t}. Using Bayes' theorem, this reverse distribution can be derived from the forward process. Specifically, the reverse distribution is modeled as:

p_{\theta}(x_{t - 1}|x_{t}) = \mathcal{N}(x_{t - 1};\, \mu_{\theta}(x_{t}, t),\, \Sigma_{\theta}(x_{t}, t))

where \mu_{\theta}(x_{t}, t) and \Sigma_{\theta}(x_{t}, t) are parameters learned by the model. The fact that both the forward and reverse processes are Gaussian allows us to compute the mean \mu_{\theta}(x_{t}, t) and variance \Sigma_{\theta}(x_{t}, t) analytically.

For the forward process, recall:

q(x_{t}|x_{t - 1}) = \mathcal{N}(x_{t};\, \sqrt{1 - \beta_{t}}\,x_{t - 1},\, \beta_{t}I)

where \beta_{t} controls how much noise is added at each step.

By applying Bayes' theorem, the reverse distribution conditioned on the original data, p(x_{t - 1}|x_{t}, x_{0}), can be derived as:

p(x_{t - 1}|x_{t}, x_{0}) = \mathcal{N}(x_{t - 1};\, \tilde{\mu}_{t}(x_{t}, x_{0}),\, \tilde{\beta}_{t}I)

where \tilde{\mu}_{t}(x_{t}, x_{0}) is the mean for the reverse step, and \tilde{\beta}_{t} is the variance. The mean \tilde{\mu}_{t}(x_{t}, x_{0}) is given by:

\tilde{\mu}_{t}(x_{t}, x_{0}) = \frac{\sqrt{\bar{\alpha}_{t - 1}}\,\beta_{t}}{1 - \bar{\alpha}_{t}}\,x_{0} + \frac{\sqrt{\alpha_{t}}\,(1 - \bar{\alpha}_{t - 1})}{1 - \bar{\alpha}_{t}}\,x_{t}

where \bar{\alpha}_{t} is the cumulative product of \alpha_{s} = 1 - \beta_{s} over the time steps up to t, and the posterior variance is \tilde{\beta}_{t} = \frac{1 - \bar{\alpha}_{t - 1}}{1 - \bar{\alpha}_{t}}\,\beta_{t}.
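
A short sketch of these posterior quantities, computed from an assumed linear schedule, with the usual convention that \bar{\alpha}_{t - 1} = 1 at the first step:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                        # illustrative schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])  # alpha_bar_{t-1} = 1 when t = 0

def posterior_mean_variance(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for a given step t."""
    coef_x0 = np.sqrt(alpha_bar_prev[t]) * betas[t] / (1.0 - alpha_bar[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - alpha_bar_prev[t]) / (1.0 - alpha_bar[t])
    mean = coef_x0 * x0 + coef_xt * xt
    var = (1.0 - alpha_bar_prev[t]) / (1.0 - alpha_bar[t]) * betas[t]  # beta_tilde_t
    return mean, var

rng = np.random.default_rng(0)
mean, var = posterior_mean_variance(rng.standard_normal((32, 32)), rng.standard_normal((32, 32)), t=500)
```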

Handling Unknown x_{0}

Since x_{0} (the original data) is unknown during inference, we approximate it using the neural network's prediction \hat{x}_{0} = f_{\theta}(x_{t}, t), which estimates the original clean image. Substituting \hat{x}_{0} into the reverse mean equation gives:

\tilde{\mu}_{t}(x_{t}, \hat{x}_{0}) = \frac{\sqrt{\bar{\alpha}_{t - 1}}\,\beta_{t}}{1 - \bar{\alpha}_{t}}\,\hat{x}_{0} + \frac{\sqrt{\alpha_{t}}\,(1 - \bar{\alpha}_{t - 1})}{1 - \bar{\alpha}_{t}}\,x_{t}

Here, \hat{x}_{0} is the neural network's best estimate of the original data from the noisy input x_{t}.

Estimating Noise and Denoising

Instead of directly predicting \hat{x}_{0}, the neural network predicts the noise \epsilon_{\theta}(x_{t}, t) added during the forward process. Predicting the noise is advantageous because it simplifies the training objective.

Once the model predicts the noise, we can recover the clean data estimate \hat{x}_{0} using the following equation:

\hat{x}_{0}(x_{t}) = \frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(x_{t} - \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon_{\theta}(x_{t}, t)\right)

This equation shows that by estimating the noise at each step, the model progressively removes it, ultimately recovering the clean data x_{0}.
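
A small sketch of this reconstruction, where eps_model is a hypothetical placeholder for the trained noise predictor \epsilon_{\theta} and the schedule is again an illustrative assumption:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # illustrative schedule
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise predictor epsilon_theta(x_t, t)."""
    return np.zeros_like(xt)

def predict_x0(xt, t):
    """Invert the closed-form forward equation using the predicted noise."""
    eps_hat = eps_model(xt, t)
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
```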

Final Reverse Step Update

The final update for the reverse step combines both the predicted mean \tilde{\mu}_{t}(x_{t}, \hat{x}_{0}) and a stochastic component (Gaussian noise sampled from \mathcal{N}(0, I) and scaled by \sqrt{\beta_{t}}):

x_{t - 1} = \tilde{\mu}_{t}(x_{t}, \hat{x}_{0}(x_{t})) + \sqrt{\beta_{t}}\,z, \quad z \sim \mathcal{N}(0, I)

This equation highlights that the reverse process involves both:

  1. Deterministic denoising through the learned mean \tilde{\mu}_{t}(x_{t}, \hat{x}_{0}).
  2. Stochastic sampling, by adding Gaussian noise \sqrt{\beta_{t}}\,z with z \sim \mathcal{N}(0, I) (a short sketch of the full update follows the list).
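
Putting the pieces together, here is a sketch of one full reverse update, assuming an illustrative schedule, a placeholder eps_model standing in for \epsilon_{\theta}, and the common choice of \sqrt{\beta_{t}} as the noise scale:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)                        # illustrative schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise predictor epsilon_theta(x_t, t)."""
    return np.zeros_like(xt)

def ddpm_reverse_step(xt, t):
    # 1. Estimate the clean data from the predicted noise.
    x0_hat = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_model(xt, t)) / np.sqrt(alpha_bar[t])
    # 2. Deterministic denoising: posterior mean of q(x_{t-1} | x_t, x0_hat).
    coef_x0 = np.sqrt(alpha_bar_prev[t]) * betas[t] / (1.0 - alpha_bar[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - alpha_bar_prev[t]) / (1.0 - alpha_bar[t])
    mean = coef_x0 * x0_hat + coef_xt * xt
    # 3. Stochastic sampling: add scaled Gaussian noise (none at the final step t = 0).
    noise = rng.standard_normal(xt.shape) if t > 0 else 0.0
    return mean + np.sqrt(betas[t]) * noise

x = rng.standard_normal((32, 32))   # start from pure noise x_T
for t in reversed(range(T)):
    x = ddpm_reverse_step(x, t)
```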

Final Notes on Reverse Process

Since \beta_{t} and \bar{\alpha}_{t} are predefined in the forward process, the reverse process follows a mostly deterministic trajectory, except for the added Gaussian noise at each step. This stochastic noise ensures that the generated samples are diverse, even though the denoising path is largely guided by the neural network's predictions.

Training the Model

Training a diffusion model involves minimizing the difference between the true noise \epsilon and the noise predicted by the model, \epsilon_{\theta}(x_{t}, t). The loss function is derived from the variational lower bound (VLB), which optimizes the log-likelihood of the data by approximating the posterior distribution over the latent variables.

Variational Lower Bound (VLB)

The log-likelihood of the data x_{0} can be written as:

\log p_{\theta}(x_{0}) = \log \int p_{\theta}(x_{0}|x_{1})\,p_{\theta}(x_{1}|x_{2})\dots p_{\theta}(x_{T})\,dx_{1:T}

Direct optimization of this objective is intractable, so we maximize the variational lower bound (VLB) instead:

\log p_{\theta}(x_{0}) \ge \mathbb{E}_{q}\left[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_{0})}\right]

This bound decomposes into KL divergence terms that compare the forward-process posterior q(x_{t - 1}|x_{t}, x_{0}) with the learned reverse process p_{\theta}(x_{t - 1}|x_{t}), along with a reconstruction term at t = 0.

Loss Function

The simplified loss function for training becomes:

L(\theta) = \mathbb{E}_{t, x_{0}, \epsilon}\left[\lVert \epsilon - \epsilon_{\theta}(x_{t}, t) \rVert^{2}\right]

This objective encourages the model to accurately predict the noise \epsilon, so that during the reverse process, the noise can be effectively removed to recover the original data. The training process involves the following steps (a minimal sketch follows the list):

  1. Sample Data: Randomly sample x_{0} from the dataset.
  2. Add Noise: Apply the forward process to generate x_{t} from x_{0} by adding noise \epsilon \sim \mathcal{N}(0, I).
  3. Predict Noise: Feed x_{t} and the time step t to the model, which predicts the noise \epsilon_{\theta}(x_{t}, t).
  4. Loss Calculation: Compute the loss as the squared difference between the predicted and actual noise.
  5. Optimization: Update the model parameters using an optimizer like Adam to minimize the loss.
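
Here is a minimal PyTorch sketch of these five steps on toy 2-D data; the tiny MLP, the batch size, and the learning rate are illustrative placeholders (real diffusion models typically train a U-Net conditioned on t):

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # illustrative schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# Placeholder noise predictor; real diffusion models use a U-Net conditioned on t.
model = nn.Sequential(nn.Linear(2 + 1, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x0 = torch.randn(64, 2)                         # 1. sample data (toy 2-D points here)
    t = torch.randint(0, T, (64,))                  #    random time step per sample
    eps = torch.randn_like(x0)                      # 2. add noise via the closed-form forward process
    a_bar = alpha_bar[t].unsqueeze(1)
    xt = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    t_input = t.float().unsqueeze(1) / T
    eps_pred = model(torch.cat([xt, t_input], dim=1))  # 3. predict the noise from (x_t, t)
    loss = ((eps - eps_pred) ** 2).mean()           # 4. squared error between true and predicted noise
    optimizer.zero_grad()                           # 5. update the parameters with Adam
    loss.backward()
    optimizer.step()
```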

Sampling Problem in Diffusion Models

In the reverse process of the Denoising Diffusion Probabilistic Model (DDPM), each step is stochastic, with the model drawing a new random noise sample at each iteration. This significantly slows down inference, as the model typically requires 1000 steps to generate high-quality samples (as described in the original DDPM paper). While DDPM produces high-quality images and is theoretically sound, the time and computational cost make it inefficient for real-world applications. The large number of steps is necessary to ensure a smooth, gradual transition from noisy data back to the reconstructed image, which helps maintain sample quality and facilitates model learning.

Denoising Diffusion Implicit Models (DDIM): The Key Innovation

To address this issue, Denoising Diffusion Implicit Models (DDIM) were introduced, offering a deterministic approach to the reverse process. The key idea in DDIM is to convert the stochastic reverse process into a deterministic mapping, allowing the model to transform noisy data into clean data in fewer steps, without drawing fresh random noise at each step. This approach changes the model into a non-Markovian process, meaning that the next state x_{t - 1} depends not only on the current state x_{t} but also on the initial data x_{0} at time step 0.

Mathematical Explanation of DDIM sampling

Let's dive into the mathematics of how DDIM achieves this improvement. The core idea is to compute a deterministic trajectory that maps noisy data back to the original data without relying on randomness at each step.

Forward Process (Same as DDPM)

The forward process in DDIM is the same as in DDPM. The model starts by adding Gaussian noise to the data x_{0}, progressively moving towards pure noise x_{T} as t increases. This is described by:

x_{t} = \sqrt{\bar{\alpha}_{t}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon

where \epsilon \sim \mathcal{N}(0, I) is the added Gaussian noise, and \bar{\alpha}_{t} is the cumulative product of the noise schedule over time.

Reverse Process in DDIM (Deterministic)

Here is where DDIM introduces its key innovation. Unlike in DDPM, DDIM does not sample x_{t - 1} from a distribution. Instead, it uses a deterministic mapping to compute x_{t - 1} directly from x_{0} and the noise predicted by the model, \epsilon_{\theta}(x_{t}, t):

x_{t - 1} = \sqrt{\bar{\alpha}_{t - 1}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t - 1}}\,\epsilon_{\theta}(x_{t}, t)

where:

  • x_{t - 1} is the data at the next (earlier) time step.
  • \bar{\alpha}_{t - 1} is the cumulative product of the noise schedule at time t - 1.
  • \epsilon_{\theta}(x_{t}, t) is the noise predicted by the neural network at time step t.

Why is this deterministic?

In DDPM, each reverse step involves sampling from a Gaussian distribution, which introduces randomness and slows down the process. In contrast, DDIM directly computes x_{t - 1} using a deterministic mapping based on x_{0} and the predicted noise \epsilon_{\theta}(x_{t}, t), eliminating random sampling. This deterministic approach greatly speeds up the inference process.
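
A sketch of one such deterministic update, where eps_model is a hypothetical placeholder for \epsilon_{\theta} and x_{0} is replaced by its estimate \hat{x}_{0} computed from the predicted noise (this substitution is made explicit in the derivation below):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # illustrative schedule
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise predictor epsilon_theta(x_t, t)."""
    return np.zeros_like(xt)

def ddim_step(xt, t, t_prev):
    """Deterministic DDIM update from step t to an earlier step t_prev (eta = 0)."""
    eps_hat = eps_model(xt, t)
    # Estimate x_0 from the current sample and the predicted noise.
    x0_hat = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    a_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    # Re-noise the estimate to the level of t_prev, reusing eps_hat instead of fresh noise.
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
```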

Linking Back to the Original Data

One key difference in DDIM is the introduction of a dependence on x_{0} (the original data) at every reverse step. This makes the reverse process non-Markovian, meaning that each step depends on both the noisy sample x_{t} and the original data x_{0}. This allows DDIM to take larger steps during the reverse process, reducing the total number of steps required. In practice, DDIM reduces the number of reverse steps from 1000 (as in DDPM) to as few as 50 or 100, without sacrificing sample quality.


Derivation of DDIM Reverse Process

To understand the reverse process more clearly, let's derive the key equations.

Forward Process Recap

In the forward process, we can express x_{t} as a function of x_{0} and the noise \epsilon:

x_{t} = \sqrt{\bar{\alpha}_{t}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon

Rearranging this equation, we can express x_{0} as a function of x_{t} and the noise \epsilon:

x_{0} = \frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(x_{t} - \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon\right)

This shows us that once we have x_{t} and an estimate of the noise, we can deterministically recover x_{0}.

Reverse Process in DDIM

DDIM leverages this formula to compute each reverse step without introducing randomness. By incorporating the learned noise \epsilon_{\theta}(x_{t}, t), DDIM computes x_{t - 1} deterministically in the reverse direction:

x_{t - 1} = \sqrt{\bar{\alpha}_{t - 1}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t - 1}}\,\epsilon_{\theta}(x_{t}, t)

This direct computation removes the need for stochastic sampling, making the reverse process faster and more efficient.
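
Here is a sketch of the resulting sampler, assuming an illustrative schedule and a placeholder eps_model, running only 50 evenly spaced reverse steps instead of 1000:

```python
import numpy as np

rng = np.random.default_rng(0)
T, num_steps = 1000, 50               # 50 reverse steps instead of 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative schedule
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise predictor epsilon_theta(x_t, t)."""
    return np.zeros_like(xt)

# Evenly spaced subsequence of time steps from T - 1 down to 0.
timesteps = np.linspace(T - 1, 0, num_steps).astype(int)

x = rng.standard_normal((32, 32))     # start from pure noise x_T
for i, t in enumerate(timesteps):
    t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
    eps_hat = eps_model(x, t)
    x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    a_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
# x now approximates the fully denoised sample x_0.
```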


Why DDIM works with Fewer Steps

  • Non-Markovian Nature: DDIM allows the reverse process to depend on both x_{0} (the original data) and x_{t} (the noisy data at time step t), enabling the model to take larger steps without losing track of the original data. This reduces the total number of steps required.
  • Deterministic Path: By directly computing each reverse step without randomness, DDIM becomes more efficient, skipping unnecessary steps while maintaining high fidelity in the generated samples.

Applications and Benefits

  • Speed: DDIM can reduce the number of steps by an order of magnitude (e.g., from 1000 in DDPM to 100 or fewer steps in DDIM), significantly speeding up the sampling process. This makes it much more suitable for real-life or large-scale applications.
  • Quality: Despite using fewer steps, DDIM still maintains high-quality outputs, and sometimes even improves sample quality due to the smoother, deterministic trajectory through the data space.

Closing Thoughts

A lot has happened in the world while I was preparing this piece. Tesla unveiled its new humanoid robot and robotaxi concept—both just proofs of concept for now—but knowing how fast Elon Musk moves, I wouldn’t be surprised if the Optimus robot is commercialized in a few years. Even more impressive, though, was SpaceX’s successful mid-air catch of the Super Heavy booster using Mechazilla, marking a new era for reusable rockets. This brings humanity a step closer to Mars, even though it will still take years before a successful colony is set up. It’s hard not to be inspired by these kinds of "moonshots" aimed at building a better future. What strikes me most is that a man with no formal background in engineering made all of this happen. It’s a reminder that you can just do things—break through barriers and create what others might not even imagine.

Writing this reminds me of something Steve Jobs once said: "You tend to get told that the world is the way it is and that you should live your life inside the world, trying not to bash into the walls too much. But that’s a very limited life. Life can be much broader once you discover that everything around you was made up by people that were no smarter than you, and you can change it." It’s a reminder for me to work even harder toward building the future I want to see.

As for diffusion models, they’re an incredible piece of technology. They learn the latent representations of images from training data and reproduce those images from fully noised distributions, essentially recreating images from what looks like chaos. While there are valid ethical concerns, particularly around the use of artists’ work without consent and fears that these models could replace human creativity, I believe diffusion models are only going to get better—and faster. Just look at how quickly they solved issues like generating realistic hands. This opens up an exciting future where anyone can create art based on their ideas, sparking new waves of creativity across the board.

However, this also raises concerns about misuse and misinformation. With tools this powerful, discussions around regulation are necessary to ensure responsible use. We need an open dialogue between creators, policymakers, and society at large to find the balance between innovation and ethical responsibility.