2024-10-16

Noise everywhere

As I continue my journey to understand the AI architectures shaping our future, I’ve come across diffusion models. It’s no surprise given that these models are the foundation behind most modern AI image generation systems. Inspired by the physical phenomenon of diffusion, as the name suggests, these models allow anyone—regardless of formal artistic training—to bring their ideas to life in minutes. The speed and accessibility they offer are truly remarkable.

This piece assumes a basic understanding of neural networks and statistics, and it aims to break down the underlying mechanisms that power diffusion models. We'll start with a brief look at the physics behind the concept, before diving deep into the model’s inner workings and exploring how we can further improve it with insights from more recent research. Happy reading!

Brownian Motion: Randomness in Motion

Brownian motion, also known as a Wiener process, refers to the random, unpredictable movement of particles suspended in a fluid, first observed by Robert Brown in 1827. This motion occurs due to the continuous bombardment of particles by molecules in the surrounding medium, leading to chaotic movement. Importantly, Brownian motion is not limited to physical systems; it is also widely used in mathematics, physics, and finance to model random processes.

Mathematical Formulation

Mathematically, Brownian motion is a real-valued, continuous-time stochastic process (B_{t})_{t \ge 0} characterized by:

  1. Initial condition: B_{0} = 0, meaning the process starts at zero.
  2. Independent increments: For any 0 \le t_{1} \le t_{2} \le \dots \le t_{n}, the increments B_{t_{2}} - B_{t_{1}}, \dots, B_{t_{n}} - B_{t_{n - 1}} are independent.
  3. Stationary increments: The distribution of an increment depends only on the length of the time interval, not on when it starts.
  4. Normal distribution of increments: For s > 0, the increment B_{t + s} - B_{t} is normally distributed with mean zero and variance s, i.e., B_{t + s} - B_{t} \sim \mathcal{N}(0, s).

The Wiener Process

The Wiener process is the standard mathematical model for Brownian motion, often described by the following stochastic differential equation (SDE):

dB_{t} = \sigma \, dW_{t}

where:

  • B_{t} is the position of the particle at time t,
  • \sigma is the diffusion coefficient, controlling how fast the particle diffuses,
  • W_{t} represents a standard Brownian motion or Wiener process.
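
As a quick illustration, here is a minimal NumPy sketch of the SDE above, assuming an Euler-Maruyama discretization and arbitrary values for \sigma, T, and the number of steps:

```python
import numpy as np

# Simulate dB_t = sigma * dW_t with an Euler-Maruyama discretization.
# sigma, T, and n_steps are arbitrary illustration values.
rng = np.random.default_rng(0)
sigma, T, n_steps = 1.0, 1.0, 1000
dt = T / n_steps

# Each increment is Gaussian with mean 0 and variance dt (independent, stationary).
increments = sigma * np.sqrt(dt) * rng.standard_normal(n_steps)
B = np.concatenate([[0.0], np.cumsum(increments)])  # B_0 = 0

# Over many sample paths, Var(B_T) concentrates around sigma**2 * T.
print(B[-1])
```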

In summary, Brownian motion represents randomness unfolding over time. It's a fundamental process of nature, governing everything from the movement of molecules in water to heat flow and even information spread in complex systems.

From Physics to Machine Learning: Diffusion in Data

Just as Brownian motion introduces randomness into the physical world, diffusion models introduce controlled randomness into data. This added noise gradually transforms structured data into a more chaotic form, eventually making the data distribution resemble a simple Gaussian distribution.

However, diffusion models don't stop at adding noise. The real magic happens in reverse: learning how to gradually remove that noise, step by step, to recover the original structure — much like observing how a chaotic system gradually returns to order. This process is mathematically modeled using stochastic differential equations (SDEs) similar to those that describe Brownian motion.

In a diffusion process, the forward process mirrors Brownian motion, progressively adding noise to data. The reverse process, however, learns to denoise and restore structure, similar to Langevin dynamics, where both deterministic and random forces are balanced to guide the system back towards equilibrium.

Langevin Dynamics: The Balance Between Determinism and Randomness

Langevin dynamics provides a more detailed model of particle movement, taking into account both deterministic forces (e.g., gravity or friction) and random thermal fluctuations. The motion of particles in a fluid is governed by the Langevin equation, which balances friction and randomness:

m \frac{d^{2}x(t)}{dt^{2}} = - \gamma \frac{dx(t)}{dt} + F(x) + \eta(t)

where:

  • m is the particle's mass and \gamma represents friction or damping,
  • F(x) is an external deterministic force (such as gravity),
  • \eta(t) is a random force representing noise (e.g., thermal fluctuations).

In the context of machine learning, Langevin dynamics offers a way to model the reverse process in diffusion models: how noise (randomness) is gradually removed from the data, allowing the original structure to reemerge.

In overdamped systems, where inertia is negligible, the Langevin equation reduces to a simpler form:

\frac{dx(t)}{dt} = - \frac{1}{\gamma}\nabla U(x) + \frac{\eta(t)}{\gamma}

where the noise term \eta(t) behaves much like the noise in a diffusion model, while the gradient \nabla U(x) acts as a force that drives the data back towards its original, structured state.
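
As a rough illustration, here is a small NumPy sketch of the overdamped update, assuming a simple quadratic potential U(x) = x^{2}/2 and illustrative values for the friction, temperature, and step size:

```python
import numpy as np

# Overdamped Langevin dynamics for a quadratic potential U(x) = 0.5 * x**2,
# so grad U(x) = x.  gamma, kT, dt, and n_steps are illustrative assumptions.
rng = np.random.default_rng(0)
gamma, kT, dt, n_steps = 1.0, 1.0, 1e-2, 5000

def grad_U(x):
    return x  # gradient of U(x) = 0.5 * x**2

x = 5.0  # start far from the minimum of U
for _ in range(n_steps):
    # deterministic drift toward low energy + a random thermal kick
    x += -(dt / gamma) * grad_U(x) + np.sqrt(2.0 * kT * dt / gamma) * rng.standard_normal()

# After many steps, x behaves like a sample from exp(-U(x)/kT) (a standard normal here).
print(x)
```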

Diffusion Models: Learning to Reverse Noise

In machine learning, diffusion models handle data by introducing randomness in a manner similar to Brownian motion. The forward process adds random noise to the data — just like the random fluctuations in physical systems — until the data becomes pure Gaussian noise. The reverse process, much like Langevin dynamics, learns to remove this noise step by step, guided by the underlying structure of the original data.

Mathematically, the forward process in a diffusion model is analogous to Brownian motion and is modeled using stochastic differential equations. The reverse process, which denoises the data, mirrors Langevin dynamics by balancing random noise with deterministic forces to recover the structured data.

Forward Diffusion Process

The forward diffusion process progressively adds Gaussian noise to the original data sample x_{0} over T discrete time steps, resulting in pure noise at x_{T}. This process can be visualized as:

x_{0} \rightarrow x_{1} \rightarrow \dots \rightarrow x_{T}

At each time step t, the model corrupts the sample x_{t - 1} by adding Gaussian noise, gradually transforming the data distribution into a simple Gaussian distribution. This transition is mathematically represented as:

x_{t} = \sqrt{\alpha_{t}}\,x_{t - 1} + \sqrt{1 - \alpha_{t}}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

where:

  • x_{t - 1} is the sample at the previous time step.
  • \epsilon is Gaussian noise drawn from \mathcal{N}(0, I), a normal distribution with zero mean and identity covariance.
  • \alpha_{t} = 1 - \beta_{t} is a decay factor, with \beta_{t} representing the noise variance at time t.
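
A minimal NumPy sketch of this single noising step, assuming a simple linear \beta_{t} schedule and a random array standing in for an image:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear noise schedule; the exact values are an assumption.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas

def forward_step(x_prev, t):
    """One forward step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(alphas[t]) * x_prev + np.sqrt(1.0 - alphas[t]) * eps

x = rng.standard_normal((32, 32))  # random array standing in for an image x_0
for t in range(T):
    x = forward_step(x, t)         # after T steps, x is close to pure Gaussian noise
```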

Marginalizing the Forward Process

The above equation describes the transition between two adjacent time steps, but we can also directly express x_{t} as a function of the initial data x_{0} and the cumulative noise by marginalizing over the previous steps. Repeatedly applying the transition equation gives:

x_{t} = \sqrt{\bar{\alpha}_{t}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon

where \bar{\alpha}_{t} = \prod^{t}_{i = 1} \alpha_{i} is the cumulative product of \alpha_{i} up to time step t.

This shows that x_{t} is a progressively noisier version of x_{0}, with the amount of noise increasing as t grows. The factor \bar{\alpha}_{t} controls how much of the original signal x_{0} is preserved.

As t \rightarrow T, \bar{\alpha}_{t} \rightarrow 0, and the distribution of x_{T} converges to pure Gaussian noise:

x_{T} \sim \mathcal{N}(0, I)

This characteristic of the forward diffusion process means that as T increases, the data approaches an isotropic Gaussian distribution.
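
The closed-form expression makes this easy to check numerically. Here is a small sketch, again assuming an illustrative linear schedule, that jumps directly from x_{0} to any x_{t}:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative schedule
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product of the alpha_i

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form, without iterating over steps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal((32, 32))
x_mid = q_sample(x0, 500)             # partially noised
x_T = q_sample(x0, T - 1)             # nearly pure noise
print(alpha_bar[-1])                  # close to 0, so x_T is approximately N(0, I)
```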

Forward Process Distribution

The conditional probability distribution for x_{t} given x_{t - 1} can be expressed as:

q(x_{t}|x_{t - 1}) = \mathcal{N}(x_{t};\, \sqrt{\alpha_{t}}\,x_{t - 1},\, (1 - \alpha_{t})I)

which confirms that the transition between consecutive time steps is Gaussian, with a mean scaled by \sqrt{\alpha_{t}} and a variance of 1 - \alpha_{t}.

Similarly, the marginal distribution of x_{t} given x_{0} is:

q(x_{t}|x_{0}) = \mathcal{N}(x_{t};\, \sqrt{\bar{\alpha}_{t}}\,x_{0},\, (1 - \bar{\alpha}_{t})I)

This shows that at any time step t, x_{t} is normally distributed around a scaled version of the original data x_{0} with variance 1 - \bar{\alpha}_{t}, driven by the cumulative noise.

Markov Chain Formalism

The forward diffusion process follows a Markov chain, where each step x_{t} depends only on the previous step x_{t - 1}, and not on any earlier states. This can be expressed as:

q(x_{t}|x_{t - 1})

This memoryless property is central to the model's tractability, as it allows the forward process to corrupt the data incrementally in small, manageable steps. Each transition is governed by a simple conditional probability distribution, enabling efficient computation and modeling of high-dimensional data.

The Markovian nature also ensures that the reverse process can efficiently recover the original data by progressively denoising, moving from xTx_{T} back to x0x_{0}, step by step.

Reverse Process

The reverse process in diffusion models is where data generation occurs. Starting from a noisy sample x_{T} (close to pure Gaussian noise), the model iteratively denoises the sample to recover the original data x_{0}.

Deriving the Reverse Process

The goal of the reverse process is to approximate the posterior distribution p_{\theta}(x_{t - 1}|x_{t}), i.e., to learn how to recover x_{t - 1} from x_{t}. Using Bayes' theorem, this reverse distribution can be derived from the forward process. Specifically, the reverse distribution is modeled as:

p_{\theta}(x_{t - 1}|x_{t}) = \mathcal{N}(x_{t - 1};\, \mu_{\theta}(x_{t}, t),\, \Sigma_{\theta}(x_{t}, t))

where \mu_{\theta}(x_{t}, t) and \Sigma_{\theta}(x_{t}, t) are parameters learned by the model. The fact that both the forward and reverse processes are Gaussian allows us to compute the mean \mu_{\theta}(x_{t}, t) and variance \Sigma_{\theta}(x_{t}, t) analytically.

For the forward process, recall:

q(x_{t}|x_{t - 1}) = \mathcal{N}(x_{t};\, \sqrt{1 - \beta_{t}}\,x_{t - 1},\, \beta_{t}I)

where \beta_{t} controls how much noise is added at each step.

By applying Bayes' theorem, the reverse distribution conditioned on the original data, p(x_{t - 1}|x_{t}, x_{0}), can be derived as:

p(x_{t - 1}|x_{t}, x_{0}) = \mathcal{N}(x_{t - 1};\, \tilde{\mu}_{t}(x_{t}, x_{0}),\, \tilde{\beta}_{t}I)

where \tilde{\mu}_{t}(x_{t}, x_{0}) is the mean for the reverse step, and \tilde{\beta}_{t} is the variance. The mean \tilde{\mu}_{t}(x_{t}, x_{0}) is given by:

\tilde{\mu}_{t}(x_{t}, x_{0}) = \frac{\sqrt{\bar{\alpha}_{t - 1}}\,\beta_{t}}{1 - \bar{\alpha}_{t}}\,x_{0} + \frac{\sqrt{\alpha_{t}}\,(1 - \bar{\alpha}_{t - 1})}{1 - \bar{\alpha}_{t}}\,x_{t}

where \bar{\alpha}_{t} is the cumulative product of \alpha_{s} = 1 - \beta_{s} over the time steps up to t, and the posterior variance is \tilde{\beta}_{t} = \frac{1 - \bar{\alpha}_{t - 1}}{1 - \bar{\alpha}_{t}}\,\beta_{t}.
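
A short sketch of these posterior quantities, computed from an assumed linear schedule, with the usual convention that \bar{\alpha}_{t - 1} = 1 at the first step:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                        # illustrative schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])  # alpha_bar_{t-1} = 1 when t = 0

def posterior_mean_variance(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for a given step t."""
    coef_x0 = np.sqrt(alpha_bar_prev[t]) * betas[t] / (1.0 - alpha_bar[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - alpha_bar_prev[t]) / (1.0 - alpha_bar[t])
    mean = coef_x0 * x0 + coef_xt * xt
    var = (1.0 - alpha_bar_prev[t]) / (1.0 - alpha_bar[t]) * betas[t]  # beta_tilde_t
    return mean, var

rng = np.random.default_rng(0)
mean, var = posterior_mean_variance(rng.standard_normal((32, 32)), rng.standard_normal((32, 32)), t=500)
```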

Handling Unknown x_{0}

Since x_{0} (the original data) is unknown during inference, we approximate it using the neural network's prediction \hat{x}_{0} = f_{\theta}(x_{t}, t), which estimates the original clean image. Substituting \hat{x}_{0} into the reverse mean equation gives:

\tilde{\mu}_{t}(x_{t}, \hat{x}_{0}) = \frac{\sqrt{\bar{\alpha}_{t - 1}}\,\beta_{t}}{1 - \bar{\alpha}_{t}}\,\hat{x}_{0} + \frac{\sqrt{\alpha_{t}}\,(1 - \bar{\alpha}_{t - 1})}{1 - \bar{\alpha}_{t}}\,x_{t}

Here, \hat{x}_{0} is the neural network's best estimate of the original data from the noisy input x_{t}.

Estimating Noise and Denoising

Instead of directly predicting \hat{x}_{0}, the neural network predicts the noise \epsilon_{\theta}(x_{t}, t) added during the forward process. Predicting the noise is advantageous because it simplifies the training objective.

Once the model predicts the noise, we can recover the clean data estimate \hat{x}_{0} using the following equation:

\hat{x}_{0}(x_{t}) = \frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(x_{t} - \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon_{\theta}(x_{t}, t)\right)

This equation shows that by estimating the noise at each step, the model progressively removes it, ultimately recovering the clean data x_{0}.
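
A small sketch of this reconstruction, where eps_model is a hypothetical placeholder for the trained noise predictor \epsilon_{\theta} and the schedule is again an illustrative assumption:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # illustrative schedule
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise predictor epsilon_theta(x_t, t)."""
    return np.zeros_like(xt)

def predict_x0(xt, t):
    """Invert the closed-form forward equation using the predicted noise."""
    eps_hat = eps_model(xt, t)
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
```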

Final Reverse Step Update

The final update for the reverse step combines both the predicted mean \tilde{\mu}_{t}(x_{t}, \hat{x}_{0}) and a stochastic component (Gaussian noise sampled from \mathcal{N}(0, I) and scaled by \sqrt{\beta_{t}}):

x_{t - 1} = \tilde{\mu}_{t}(x_{t}, \hat{x}_{0}(x_{t})) + \sqrt{\beta_{t}}\,z, \quad z \sim \mathcal{N}(0, I)

This equation highlights that the reverse process involves both:

  1. Deterministic denoising through the learned mean \tilde{\mu}_{t}(x_{t}, \hat{x}_{0}).
  2. Stochastic sampling, by adding Gaussian noise \sqrt{\beta_{t}}\,z with z \sim \mathcal{N}(0, I) (a short sketch of the full update follows the list).
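
Putting the pieces together, here is a sketch of one full reverse update, assuming an illustrative schedule, a placeholder eps_model standing in for \epsilon_{\theta}, and the common choice of \sqrt{\beta_{t}} as the noise scale:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)                        # illustrative schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise predictor epsilon_theta(x_t, t)."""
    return np.zeros_like(xt)

def ddpm_reverse_step(xt, t):
    # 1. Estimate the clean data from the predicted noise.
    x0_hat = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_model(xt, t)) / np.sqrt(alpha_bar[t])
    # 2. Deterministic denoising: posterior mean of q(x_{t-1} | x_t, x0_hat).
    coef_x0 = np.sqrt(alpha_bar_prev[t]) * betas[t] / (1.0 - alpha_bar[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - alpha_bar_prev[t]) / (1.0 - alpha_bar[t])
    mean = coef_x0 * x0_hat + coef_xt * xt
    # 3. Stochastic sampling: add scaled Gaussian noise (none at the final step t = 0).
    noise = rng.standard_normal(xt.shape) if t > 0 else 0.0
    return mean + np.sqrt(betas[t]) * noise

x = rng.standard_normal((32, 32))   # start from pure noise x_T
for t in reversed(range(T)):
    x = ddpm_reverse_step(x, t)
```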

Final Notes on Reverse Process

Since \beta_{t} and \bar{\alpha}_{t} are predefined in the forward process, the reverse process follows a mostly deterministic trajectory, except for the added Gaussian noise at each step. This stochastic noise ensures that the generated samples are diverse, even though the denoising path is largely guided by the neural network's predictions.

Training the Model

Training a diffusion model involves minimizing the difference between the true noise \epsilon and the noise predicted by the model, \epsilon_{\theta}(x_{t}, t). The loss function is derived from the variational lower bound (VLB), which optimizes the log-likelihood of the data by approximating the posterior distribution over the latent variables.

Variational Lower Bound (VLB)

The log-likelihood of the data x_{0} can be written as:

\log p_{\theta}(x_{0}) = \log \int p_{\theta}(x_{0}|x_{1})\,p_{\theta}(x_{1}|x_{2})\dots p_{\theta}(x_{T})\,dx_{1:T}

Direct optimization of this objective is intractable, so we maximize the variational lower bound (VLB) instead:

\log p_{\theta}(x_{0}) \ge \mathbb{E}_{q}\left[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_{0})}\right]

This bound decomposes into KL divergence terms that compare the forward-process posterior q(x_{t - 1}|x_{t}, x_{0}) with the learned reverse process p_{\theta}(x_{t - 1}|x_{t}), along with a reconstruction term at t = 0.

Loss Function

The simplified loss function for training becomes:

L(\theta) = \mathbb{E}_{t, x_{0}, \epsilon}\left[\lVert \epsilon - \epsilon_{\theta}(x_{t}, t) \rVert^{2}\right]

This objective encourages the model to accurately predict the noise \epsilon, so that during the reverse process, the noise can be effectively removed to recover the original data. The training process involves the following steps (a minimal sketch follows the list):

  1. Sample Data: Randomly sample x_{0} from the dataset.
  2. Add Noise: Apply the forward process to generate x_{t} from x_{0} by adding noise \epsilon \sim \mathcal{N}(0, I).
  3. Predict Noise: Feed x_{t} and the time step t to the model, which predicts the noise \epsilon_{\theta}(x_{t}, t).
  4. Loss Calculation: Compute the loss as the squared difference between the predicted and actual noise.
  5. Optimization: Update the model parameters using an optimizer like Adam to minimize the loss.
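
Here is a minimal PyTorch sketch of these five steps on toy 2-D data; the tiny MLP, the batch size, and the learning rate are illustrative placeholders (real diffusion models typically train a U-Net conditioned on t):

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # illustrative schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# Placeholder noise predictor; real diffusion models use a U-Net conditioned on t.
model = nn.Sequential(nn.Linear(2 + 1, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x0 = torch.randn(64, 2)                         # 1. sample data (toy 2-D points here)
    t = torch.randint(0, T, (64,))                  #    random time step per sample
    eps = torch.randn_like(x0)                      # 2. add noise via the closed-form forward process
    a_bar = alpha_bar[t].unsqueeze(1)
    xt = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    t_input = t.float().unsqueeze(1) / T
    eps_pred = model(torch.cat([xt, t_input], dim=1))  # 3. predict the noise from (x_t, t)
    loss = ((eps - eps_pred) ** 2).mean()           # 4. squared error between true and predicted noise
    optimizer.zero_grad()                           # 5. update the parameters with Adam
    loss.backward()
    optimizer.step()
```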

Sampling Problem in Diffusion Models

In the reverse process of the Denoising Diffusion Probabilistic Model (DDPM), each step is stochastic, with the model drawing a new random noise sample at each iteration. This significantly slows down inference, as the model typically requires 1000 steps to generate high-quality samples (as described in the original DDPM paper). While DDPM produces high-quality images and is theoretically sound, the time and computational cost make it inefficient for real-world applications. The large number of steps is necessary to ensure a smooth, gradual transition from noisy data back to the reconstructed image, which helps maintain sample quality and facilitates model learning.

Denoising Diffusion Implicit Models (DDIM): The Key Innovation

To address this issue, Denoising Diffusion Implicit Models (DDIM) were introduced, offering a deterministic approach to the reverse process. The key idea in DDIM is to convert the stochastic reverse process into a deterministic mapping, allowing the model to transform noisy data into clean data in fewer steps, without drawing fresh random noise at each step. This approach changes the model into a non-Markovian process, meaning that the next state x_{t - 1} depends not only on the current state x_{t} but also on the initial data x_{0} at time step 0.

Mathematical Explanation of DDIM sampling

Let's dive into the mathematics of how DDIM achieves this improvement. The core idea is to compute a deterministic trajectory that maps noisy data back to the original data without relying on randomness at each step.

Forward Process (Same as DDPM)

The forward process in DDIM is the same as in DDPM. The model starts by adding Gaussian noise to the data x_{0}, progressively moving towards pure noise x_{T} as t increases. This is described by:

x_{t} = \sqrt{\bar{\alpha}_{t}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon

where \epsilon \sim \mathcal{N}(0, I) is the added Gaussian noise, and \bar{\alpha}_{t} is the cumulative product of the noise schedule over time.

Reverse Process in DDIM (Deterministic)

Here is where DDIM introduces its key innovation. Unlike in DDPM, DDIM does not sample x_{t - 1} from a distribution. Instead, it uses a deterministic mapping to compute x_{t - 1} directly from x_{0} and the noise predicted by the model, \epsilon_{\theta}(x_{t}, t):

x_{t - 1} = \sqrt{\bar{\alpha}_{t - 1}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t - 1}}\,\epsilon_{\theta}(x_{t}, t)

where:

  • x_{t - 1} is the data at the next (earlier) time step.
  • \bar{\alpha}_{t - 1} is the cumulative product of the noise schedule at time t - 1.
  • \epsilon_{\theta}(x_{t}, t) is the noise predicted by the neural network at time step t.

Why is this deterministic?

In DDPM, each reverse step involves sampling from a Gaussian distribution, which introduces randomness and slows down the process. In contrast, DDIM directly computes x_{t - 1} using a deterministic mapping based on x_{0} and the predicted noise \epsilon_{\theta}(x_{t}, t), eliminating random sampling. This deterministic approach greatly speeds up the inference process.
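
A sketch of one such deterministic update, where eps_model is a hypothetical placeholder for \epsilon_{\theta} and x_{0} is replaced by its estimate \hat{x}_{0} computed from the predicted noise (this substitution is made explicit in the derivation below):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # illustrative schedule
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise predictor epsilon_theta(x_t, t)."""
    return np.zeros_like(xt)

def ddim_step(xt, t, t_prev):
    """Deterministic DDIM update from step t to an earlier step t_prev (eta = 0)."""
    eps_hat = eps_model(xt, t)
    # Estimate x_0 from the current sample and the predicted noise.
    x0_hat = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    a_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    # Re-noise the estimate to the level of t_prev, reusing eps_hat instead of fresh noise.
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
```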

Linking Back to the Original Data

One key difference in DDIM is the introduction of a dependence on x_{0} (the original data) at every reverse step. This makes the reverse process non-Markovian, meaning that each step depends on both the noisy sample x_{t} and the original data x_{0}. This allows DDIM to take larger steps during the reverse process, reducing the total number of steps required. In practice, DDIM reduces the number of reverse steps from 1000 (as in DDPM) to as few as 50 or 100, without sacrificing sample quality.


Derivation of DDIM Reverse Process

To understand the reverse process more clearly, let's derive the key equations.

Forward Process Recap

In the forward process, we can express x_{t} as a function of x_{0} and the noise \epsilon:

x_{t} = \sqrt{\bar{\alpha}_{t}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon

Rearranging this equation, we can express x_{0} as a function of x_{t} and the noise \epsilon:

x_{0} = \frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(x_{t} - \sqrt{1 - \bar{\alpha}_{t}}\,\epsilon\right)

This shows us that once we have x_{t} and an estimate of the noise, we can deterministically recover x_{0}.

Reverse Process in DDIM

DDIM leverages this formula to compute each reverse step without introducing randomness. By incorporating the learned noise \epsilon_{\theta}(x_{t}, t), DDIM computes x_{t - 1} deterministically in the reverse direction:

x_{t - 1} = \sqrt{\bar{\alpha}_{t - 1}}\,x_{0} + \sqrt{1 - \bar{\alpha}_{t - 1}}\,\epsilon_{\theta}(x_{t}, t)

This direct computation removes the need for stochastic sampling, making the reverse process faster and more efficient.
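
Here is a sketch of the resulting sampler, assuming an illustrative schedule and a placeholder eps_model, running only 50 evenly spaced reverse steps instead of 1000:

```python
import numpy as np

rng = np.random.default_rng(0)
T, num_steps = 1000, 50               # 50 reverse steps instead of 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative schedule
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise predictor epsilon_theta(x_t, t)."""
    return np.zeros_like(xt)

# Evenly spaced subsequence of time steps from T - 1 down to 0.
timesteps = np.linspace(T - 1, 0, num_steps).astype(int)

x = rng.standard_normal((32, 32))     # start from pure noise x_T
for i, t in enumerate(timesteps):
    t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
    eps_hat = eps_model(x, t)
    x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    a_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
# x now approximates the fully denoised sample x_0.
```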


Why DDIM works with Fewer Steps

  • Non-Markovian Nature: DDIM allows the reverse process to depend on both x_{0} (the original data) and x_{t} (the noisy data at time step t), enabling the model to take larger steps without losing track of the original data. This reduces the total number of steps required.
  • Deterministic Path: By directly computing each reverse step without randomness, DDIM becomes more efficient, skipping unnecessary steps while maintaining high fidelity in the generated samples.

Applications and Benefits

  • Speed: DDIM can reduce the number of steps by an order of magnitude (e.g., from 1000 in DDPM to 100 or fewer steps in DDIM), significantly speeding up the sampling process. This makes it much more suitable for real-life or large-scale applications.
  • Quality: Despite using fewer steps, DDIM still maintains high-quality outputs, and sometimes even improves sample quality due to the smoother, deterministic trajectory through the data space.

Closing Thoughts

A lot has happened in the world while I was preparing this piece. Tesla unveiled its new humanoid robot and robotaxi concept—both just proofs of concept for now—but knowing how fast Elon Musk moves, I wouldn’t be surprised if the Optimus robot is commercialized in a few years. Even more impressive, though, was SpaceX’s successful mid-air catch of the Super Heavy booster using Mechazilla, marking a new era for reusable rockets. This brings humanity a step closer to Mars, even though it will still take years before a successful colony is set up. It’s hard not to be inspired by these kinds of "moonshots" aimed at building a better future. What strikes me most is that a man with no formal background in engineering made all of this happen. It’s a reminder that you can just do things—break through barriers and create what others might not even imagine.

Writing this reminds me of something Steve Jobs once said: "You tend to get told that the world is the way it is and that you should live your life inside the world, trying not to bash into the walls too much. But that’s a very limited life. Life can be much broader once you discover that everything around you was made up by people that were no smarter than you, and you can change it." It’s a reminder for me to work even harder toward building the future I want to see.

As for diffusion models, they’re an incredible piece of technology. They learn the latent representations of images from training data and reproduce those images from fully noised distributions, essentially recreating images from what looks like chaos. While there are valid ethical concerns, particularly around the use of artists’ work without consent and fears that these models could replace human creativity, I believe diffusion models are only going to get better—and faster. Just look at how quickly they solved issues like generating realistic hands. This opens up an exciting future where anyone can create art based on their ideas, sparking new waves of creativity across the board.

However, this also raises concerns about misuse and misinformation. With tools this powerful, discussions around regulation are necessary to ensure responsible use. We need an open dialogue between creators, policymakers, and society at large to find the balance between innovation and ethical responsibility.