2024-10-16
Noise everywhere
As I continue my journey to understand the AI architectures shaping our future, I’ve come across diffusion models. It’s no surprise given that these models are the foundation behind most modern AI image generation systems. Inspired by the physical phenomenon of diffusion, as the name suggests, these models allow anyone—regardless of formal artistic training—to bring their ideas to life in minutes. The speed and accessibility they offer are truly remarkable.
This piece assumes a basic understanding of neural networks and statistics, and it aims to break down the underlying mechanisms that power diffusion models. We'll start with a brief look at the physics behind the concept before diving deep into the model's inner workings and exploring how we can improve it further with insights from more recent research. Happy reading!
Brownian Motion: Randomness in Motion
Brownian motion, also known as a Wiener process, refers to the random, unpredictable movement of particles suspended in a fluid, first observed by Robert Brown in 1827. This motion occurs due to the continuous bombardment of the particles by molecules in the surrounding medium, leading to chaotic movement. Importantly, Brownian motion is not limited to physical systems; it is also widely used in mathematics, physics, and finance to model random processes.
Mathematical Formulation
Mathematically, Brownian motion is a real-valued, continuous-time stochastic process characterized by:
- Initial condition: $W_0 = 0$, meaning the process starts at zero.
- Independent increments: For any $0 \le t_1 < t_2 < \dots < t_n$, the increments $W_{t_2} - W_{t_1}, \dots, W_{t_n} - W_{t_{n-1}}$ are independent.
- Stationary increments: The increment over time depends only on the interval length: $W_{t+s} - W_s \sim \mathcal{N}(0, t)$ (a normal distribution).
- Normal distribution of increments: For $0 \le s < t$, the increment $W_t - W_s$ is normally distributed with mean zero and variance $t - s$, i.e.,

$$W_t - W_s \sim \mathcal{N}(0,\, t - s)$$
The Wiener Process
The Wiener process is the standard mathematical model for Brownian motion, and the motion of a diffusing particle is often described by the following stochastic differential equation (SDE):

$$dX_t = \sqrt{2D}\, dW_t$$

where:
- $X_t$ is the position of the particle at time $t$,
- $D$ is the diffusion coefficient, controlling how fast the particles diffuse,
- $W_t$ represents a standard Brownian motion, or Wiener process.
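To make this concrete, here is a minimal NumPy sketch that simulates the SDE above with the Euler-Maruyama scheme; it is my own illustration, and the diffusion coefficient `D`, step size `dt`, and path count are arbitrary choices.

```python
import numpy as np

def simulate_brownian_motion(n_paths=2000, n_steps=1000, dt=0.01, D=1.0, seed=0):
    """Euler-Maruyama simulation of dX_t = sqrt(2D) dW_t with X_0 = 0."""
    rng = np.random.default_rng(seed)
    # Independent Gaussian increments, each with mean 0 and variance 2*D*dt.
    increments = rng.normal(0.0, np.sqrt(2 * D * dt), size=(n_paths, n_steps))
    # Each trajectory is the running sum of its own increments.
    return np.cumsum(increments, axis=1)

paths = simulate_brownian_motion()
# The endpoint variance should be close to 2*D*T = 2 * 1.0 * 10 = 20,
# which is the stationary-increment property in action.
print(paths[:, -1].var())
```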
In summary, Brownian motion represents randomness unfolding over time. It's a fundamental process of nature, governing everything from the movement of molecules in water to heat flow and even information spread in complex systems.
From Physics to Machine Learning: Diffusion in Data
Just as Brownian motion introduces randomness into the physical world, diffusion models introduce controlled randomness into data. This added noise gradually transforms structured data into a more chaotic form, eventually making the data distribution resemble a simple Gaussian distribution.
However, diffusion models don't stop at adding noise. The real magic happens in reverse: learning how to gradually remove that noise, step by step, to recover the original structure, much like watching a chaotic system gradually return to order. This process is mathematically modeled using stochastic differential equations (SDEs) similar to those that describe Brownian motion.
In a diffusion process, the forward process mirrors Brownian motion, progressively adding noise to data. The reverse process, however, learns to denoise and restore structure, similar to Langevin dynamics, where both deterministic and random forces are balanced to guide the system back towards equilibrium.
Langevin Dynamics: The Balance Between Determinism and Randomness
Langevin dynamics provides a more detailed model of particle movement, taking both deterministic forces (e.g., gravity or friction) and random thermal fluctuations into account. The motion of a particle in a fluid is governed by the Langevin equation, which balances friction and randomness:

$$m \frac{d^2 x}{dt^2} = -\gamma \frac{dx}{dt} + F(x) + \eta(t)$$

where:
- $-\gamma \frac{dx}{dt}$ represents friction or damping, with $\gamma$ the friction coefficient,
- $F(x)$ is an external deterministic force (such as gravity),
- $\eta(t)$ is a random force representing noise (e.g., thermal fluctuations).
In the context of machine learning, Langevin dynamics offers a way to model the reverse process in diffusion models: how noise (randomness) is gradually removed from the data, allowing the original structure to reemerge.
In overdamped systems, where inertia is negligible, the Langevin equation reduces to a simpler form:

$$\frac{dx}{dt} = -\frac{1}{\gamma} \nabla U(x) + \sqrt{\frac{2 k_B T}{\gamma}}\, \eta(t)$$

where the noise term $\eta(t)$ behaves much like the noise in a diffusion model, while the gradient $-\nabla U(x)$ acts as a force that drives the data back towards its original, structured state.
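Discretizing this overdamped equation with $U(x) = -\log p(x)$ gives the Langevin sampling update used throughout score-based generative modeling: a small step along the score $\nabla_x \log p(x)$ plus scaled Gaussian noise. The sketch below is a toy illustration of that update on a 1-D Gaussian whose score is known in closed form; the step size `eps` and iteration count are arbitrary choices.

```python
import numpy as np

def score_gaussian(x, mu=0.0, sigma=1.0):
    """Score (gradient of the log-density) of a 1-D Gaussian N(mu, sigma^2)."""
    return -(x - mu) / sigma**2

def langevin_sample(n_samples=5000, n_iters=500, eps=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, size=n_samples)  # start far from equilibrium
    for _ in range(n_iters):
        # Deterministic drift toward high-density regions plus random kicks.
        x = x + 0.5 * eps * score_gaussian(x) + np.sqrt(eps) * rng.normal(size=n_samples)
    return x

samples = langevin_sample()
print(samples.mean(), samples.std())  # should land close to 0 and 1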
Diffusion Models: Learning to Reverse Noise
In machine learning, diffusion models handle data by introducing randomness in a manner similar to Brownian motion. The forward process adds random noise to the data, just like the random fluctuations in physical systems, until the data becomes pure Gaussian noise. The reverse process, much like Langevin dynamics, learns to remove this noise step by step, guided by the underlying structure of the original data.
Mathematically, the forward process in a diffusion model is analogous to Brownian motion and is modeled using stochastic differential equations. The reverse process, which denoises the data, mirrors Langevin dynamics by balancing random noise with deterministic forces to recover the structured data.
Forward Diffusion Process
The forward diffusion process progressively adds Gaussian noise to the original data sample $x_0$ over $T$ discrete time steps, resulting in pure noise at $t = T$. This process can be visualized as:

$$x_0 \rightarrow x_1 \rightarrow x_2 \rightarrow \dots \rightarrow x_T$$

At each time step $t$, the model corrupts the sample by adding Gaussian noise, gradually transforming the data distribution into a simple Gaussian distribution. This transition is mathematically represented as:

$$x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon$$

where:
- $x_{t-1}$ is the sample at the previous time step.
- $\epsilon$ is Gaussian noise drawn from $\mathcal{N}(0, I)$, a normal distribution with zero mean and identity covariance.
- $\sqrt{1 - \beta_t}$ is a decay factor, with $\beta_t$ representing the noise variance at time step $t$.
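As a sanity check on the notation, a single forward step is only a couple of lines of PyTorch. Here `x_prev` stands for $x_{t-1}$ and `beta_t` is one entry of a hypothetical noise schedule; this is an illustrative sketch rather than a reference implementation.

```python
import torch

def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """One forward diffusion step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    eps = torch.randn_like(x_prev)             # eps ~ N(0, I)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * eps

x0 = torch.randn(4, 3, 32, 32)                 # a batch of toy "images"
x1 = forward_step(x0, beta_t=1e-4)             # small beta_t: only a little noise is added
```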
Marginalizing the Forward Process
The above equation describes the transition between two adjacent time steps, but we can also directly express $x_t$ as a function of the initial data $x_0$ and the cumulative noise by marginalizing over the previous steps. Repeatedly applying the transition equation gives:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is the cumulative product of $\alpha_s = 1 - \beta_s$ up to time step $t$.

This shows that $x_t$ is a progressively noisier version of $x_0$, with the amount of noise increasing as $t$ grows. The factor $\sqrt{\bar{\alpha}_t}$ controls how much of the original signal is preserved.

As $t \to T$, $\bar{\alpha}_t \to 0$, and the distribution of $x_T$ converges to pure Gaussian noise:

$$x_T \sim \mathcal{N}(0, I)$$

This characteristic of the forward diffusion process means that as $t$ increases, the data approaches an isotropic Gaussian distribution.
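In code, this closed form means we can jump from $x_0$ to any $x_t$ in one shot instead of looping over steps. Below is a small PyTorch sketch with a linear $\beta$ schedule from $10^{-4}$ to $0.02$ over 1000 steps (the values used in the original DDPM paper, though any monotone schedule works); `q_sample` is my own helper name, not a library function.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative products alpha_bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)    # pick the schedule value for each sample in the batch
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(8, 3, 32, 32)                 # toy image batch
t = torch.randint(0, T, (8,))                  # a random time step per sample
xt = q_sample(x0, t, torch.randn_like(x0))
```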
Forward Process Distribution
The conditional probability distribution of $x_t$ given $x_{t-1}$ can be expressed as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\right)$$

which confirms that the transition between consecutive time steps is Gaussian, with a mean scaled by $\sqrt{1 - \beta_t}$ and a variance of $\beta_t I$.

Similarly, the marginal distribution of $x_t$ given $x_0$ is:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t) I\right)$$

This shows that at any time step $t$, $x_t$ is normally distributed around a scaled version of the original data, $\sqrt{\bar{\alpha}_t}\, x_0$, with variance $(1 - \bar{\alpha}_t) I$, driven by the cumulative noise.
Markov Chain Formalism
The forward diffusion process follows a Markov chain, where each step $x_t$ depends only on the previous step $x_{t-1}$, and not on any earlier states. This can be expressed as:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
This memoryless property is central to the model's tractability, as it allows the forward process to corrupt the data incrementally in small, manageable steps. Each transition is governed by a simple conditional probability distribution, enabling efficient computation and modeling of high-dimensional data.
The Markovian nature also ensures that the reverse process can efficiently recover the original data by progressively denoising, moving from $x_T$ back to $x_0$, step by step.
Reverse Process
The reverse process in diffusion models is where data generation occurs. Starting from a noisy sample $x_T$ (close to pure Gaussian noise), the model iteratively denoises the sample to recover the original data $x_0$.
Deriving the Reverse Process
The goal of the reverse process is to approximate the posterior distribution $q(x_{t-1} \mid x_t)$, i.e., to learn how to recover $x_{t-1}$ from $x_t$. Using Bayes' theorem, this reverse distribution can be derived from the forward process. Specifically, the reverse distribution is modeled as:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)$$

where $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ are parameters learned by the model. The fact that both the forward and reverse processes are Gaussian allows us to compute the mean and variance analytically.

For the forward process, recall:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\right)$$

where $\beta_t$ controls how much noise is added at each step.

By applying Bayes' theorem, the reverse distribution conditioned on $x_0$ can be derived as:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde{\mu}_t(x_t, x_0),\; \tilde{\beta}_t I\right)$$

where $\tilde{\mu}_t(x_t, x_0)$ is the mean for the reverse step, and $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$ is the variance. The mean is given by:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,\left(1 - \bar{\alpha}_{t-1}\right)}{1 - \bar{\alpha}_t}\, x_t$$

where $\bar{\alpha}_{t-1}$ is the cumulative product of $\alpha_s = 1 - \beta_s$ over the time steps up to $t - 1$.
Handling the Unknown $x_0$
Since $x_0$ (the original data) is unknown during inference, we approximate it using the neural network's prediction $\hat{x}_0$, which estimates the original clean image. Substituting $\hat{x}_0$ into the reverse mean equation gives:

$$\mu_\theta(x_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, \hat{x}_0 + \frac{\sqrt{\alpha_t}\,\left(1 - \bar{\alpha}_{t-1}\right)}{1 - \bar{\alpha}_t}\, x_t$$

Here, $\hat{x}_0$ is the neural network's best estimate of the original data from the noisy input $x_t$.
Estimating Noise and Denoising
Instead of directly predicting $x_0$, the neural network predicts the noise $\epsilon_\theta(x_t, t)$ added during the forward process. Predicting the noise is advantageous because it simplifies the training objective.

Once the model predicts the noise, we can recover the clean data estimate $\hat{x}_0$ using the following equation:

$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$$

This equation shows that by estimating the noise at each step, the model progressively removes it, ultimately recovering the clean data $x_0$.
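This inversion is a one-liner once the noise schedule is in hand. The helper below is a hypothetical name of my own, reusing the `alpha_bars` tensor from the forward-process sketch and treating `eps_pred` as whatever the network predicted.

```python
import torch

def predict_x0(xt: torch.Tensor, t: torch.Tensor, eps_pred: torch.Tensor,
               alpha_bars: torch.Tensor) -> torch.Tensor:
    """Invert the forward marginal: x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t)."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return (xt - (1.0 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()
```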
Final Reverse Step Update
The final update for the reverse step combines both the predicted mean $\mu_\theta(x_t, t)$ and a stochastic component (Gaussian noise sampled from $\mathcal{N}(0, I)$):

$$x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$

This equation highlights that the reverse process involves both:
- Deterministic denoising through the learned mean $\mu_\theta(x_t, t)$.
- Stochastic sampling, by adding Gaussian noise $\sigma_t z$, with $z$ sampled from $\mathcal{N}(0, I)$.
Final Notes on Reverse Process
Since the noise schedule $\beta_t$ and the derived quantities $\bar{\alpha}_t$ and $\sigma_t$ are predefined in the forward process, the reverse process follows a mostly deterministic trajectory, except for the added Gaussian noise at each step. This stochastic noise ensures that the generated samples are diverse, even though the denoising path is largely guided by the neural network's predictions.
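Putting the pieces together, a full DDPM sampling loop looks roughly like the sketch below. It assumes a noise-prediction network `model(x_t, t)` (a placeholder, any architecture will do) and uses $\sigma_t = \sqrt{\beta_t}$, one of the two variance choices discussed in the DDPM paper; the mean is written in the equivalent form with the noise prediction substituted in.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Full DDPM ancestral sampling loop, starting from pure Gaussian noise."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                  # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x, t_batch)                        # predicted noise eps_theta(x_t, t)
        # Learned mean mu_theta, with the noise prediction substituted in.
        coef = (1.0 - alphas[t]) / (1.0 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps_pred) / alphas[t].sqrt()
        if t > 0:
            sigma_t = betas[t].sqrt()                       # one common choice for sigma_t
            x = mean + sigma_t * torch.randn_like(x)        # fresh noise at every step except the last
        else:
            x = mean
    return x
```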
Training the Model
Training a diffusion model involves minimizing the difference between the true noise $\epsilon$ and the noise predicted by the model $\epsilon_\theta(x_t, t)$. The loss function is derived from the variational lower bound (VLB), which optimizes the log-likelihood of the data by approximating the posterior distribution over the latent variables.
Variational Lower Bound (VLB)
The log-likelihood of the data can be written as:

$$\log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\, dx_{1:T}$$

Direct optimization of this objective is intractable, so we maximize the variational lower bound (VLB) instead:

$$\log p_\theta(x_0) \ge \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$

This bound is decomposed into KL divergence terms, comparing the true forward posterior $q(x_{t-1} \mid x_t, x_0)$ with the learned reverse process $p_\theta(x_{t-1} \mid x_t)$, along with a reconstruction term $-\log p_\theta(x_0 \mid x_1)$ at the final denoising step.
Loss Function
The simplified loss function for training becomes:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]$$

This objective encourages the model to accurately predict the noise $\epsilon$, so that during the reverse process, the noise can be effectively removed to recover the original data. The training process involves the following steps, sketched in code after the list:
- Sample Data: Randomly sample $x_0$ from the dataset.
- Add Noise: Apply the forward process to generate $x_t$ from $x_0$ by adding noise $\epsilon$.
- Predict Noise: Feed $x_t$ and the time step $t$ to the model, which predicts the noise $\epsilon_\theta(x_t, t)$.
- Loss Calculation: Compute the loss as the squared difference between the predicted and actual noise.
- Optimization: Update the model parameters using an optimizer like Adam to minimize the loss.
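These steps translate almost line-for-line into code. The sketch below assumes a hypothetical noise-prediction network `model(x_t, t)` (for example a small U-Net) and reuses the `alpha_bars` schedule from the earlier sketch; it implements the simplified objective rather than the full VLB.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alpha_bars, T=1000):
    """One optimization step on the simplified objective ||eps - eps_theta(x_t, t)||^2."""
    t = torch.randint(0, T, (x0.shape[0],))                      # random time step per image
    noise = torch.randn_like(x0)                                 # the "true" noise eps
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise        # forward process in closed form
    eps_pred = model(xt, t)                                      # model predicts the added noise
    loss = F.mse_loss(eps_pred, noise)                           # squared error between true and predicted noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a real training run this function would be called for every batch from a dataloader, with the schedule and optimizer (e.g., Adam) set up once beforehand.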
Sampling Problem in Diffusion Models
In the reverse process of the Denoising Diffusion Probabilistic Model (DDPM), each step is stochastic, with the model generating a new random noise sample at each iteration. This significantly slows down inference, as the model typically requires 1000 steps to generate high-quality samples (as described in the original DDPM paper). While DDPM produces high-quality images and is theoretically sound, the time and computational cost make it inefficient for real-world applications. The large number of steps is necessary to ensure a smooth, gradual transition from noisy data back to the reconstructed image, which helps maintain sample quality and facilitates model learning.
Denoising Diffusion Implicit Models (DDIM): The Key Innovation
To address this issue, Denoising Diffusion Implicit Models (DDIM) were introduced, offering a deterministic approach to the reverse process. The key idea in DDIM is to convert the stochastic reverse process into a deterministic mapping, allowing the model to transform noisy data into clear data in fewer steps, without injecting new random noise at each step. This approach changes the nature of the model to a non-Markovian process, meaning that the next state depends not only on the current state but also on the initial data $x_0$.
Mathematical Explanation of DDIM Sampling
Let's dive into the mathematics of how DDIM achieves this improvement. The core idea is to compute a deterministic trajectory that maps noisy data back to the original data without relying on randomness at each step.
Forward Process (Same as DDPM)
The forward process in DDIM is the same as in DDPM. The model starts by adding Gaussian noise to the data $x_0$, progressively moving towards pure noise as $t$ increases. This is described by:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$ is the added Gaussian noise and $\bar{\alpha}_t$ is the cumulative product of the noise schedule over time.
Reverse Process in DDIM (Deterministic)
Here is where DDIM introduces its key innovation. Unlike DDPM, DDIM does not sample from a distribution. Instead, it uses a deterministic mapping to compute $x_{t-1}$ directly from $x_t$ and the noise predicted by the model $\epsilon_\theta(x_t, t)$:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t), \qquad \hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$$

where:
- $x_{t-1}$ is the data at the next (earlier) time step.
- $\bar{\alpha}_{t-1}$ is the cumulative product of the noise schedule at time step $t - 1$.
- $\epsilon_\theta(x_t, t)$ is the noise predicted by the neural network at time step $t$.
Why is this deterministic?
In DDPM, each reverse step involves sampling from a Gaussian distribution, which introduces randomness and slows down the process. In contrast, DDIM directly computes $x_{t-1}$ using a deterministic mapping based on $x_t$ and the predicted noise $\epsilon_\theta(x_t, t)$, eliminating random sampling. This deterministic approach greatly speeds up the inference process.
Linking Back to the Original Data
One key difference in DDIM is the introduction of a dependence on $x_0$ (the original data, through its estimate $\hat{x}_0$) at every reverse step. This makes the reverse process non-Markovian, meaning that each step depends on both the noisy sample $x_t$ and the original data $x_0$. This allows DDIM to take larger steps during the reverse process, reducing the total number of steps required. In practice, DDIM reduces the number of reverse steps from 1000 (as in DDPM) to as few as 50 or 100, with little loss in sample quality.
Derivation of DDIM Reverse Process
To understand the reverse process more clearly, let's derive the key equations.
Forward Process Recap
In the forward process, we can express $x_t$ as a function of $x_0$ and the noise $\epsilon$:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Rearranging this equation, we can express $x_0$ as a function of $x_t$ and $\epsilon$:

$$x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon}{\sqrt{\bar{\alpha}_t}}$$

This shows us that once we have $x_t$ and the noise $\epsilon$, we can deterministically recover $x_0$.
Reverse Process in DDIM
DDIM leverages this formula to compute each reverse step without introducing randomness. By incorporating the learned noise estimate $\epsilon_\theta(x_t, t)$, DDIM computes $x_{t-1}$ deterministically in the reverse direction:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left(\frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t)$$
This direct computation removes the need for stochastic sampling, making the reverse process faster and more efficient.
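A minimal deterministic DDIM sampler, under the same assumptions as the earlier sketches (a placeholder noise-prediction `model` and the precomputed `alpha_bars`), might look like this; `ddim_steps` selects an evenly spaced subset of the original 1000 time steps.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bars, ddim_steps=50):
    """Deterministic DDIM sampling on an evenly spaced subset of time steps."""
    T = len(alpha_bars)
    timesteps = torch.linspace(T - 1, 0, ddim_steps).long()      # e.g. 999, ..., 0
    x = torch.randn(shape)                                       # start from pure noise
    for i, t in enumerate(timesteps):
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)
        eps_pred = model(x, t_batch)
        a_bar_t = alpha_bars[t]
        # Estimate the clean image, then jump to the previous kept time step.
        x0_hat = (x - (1.0 - a_bar_t).sqrt() * eps_pred) / a_bar_t.sqrt()
        if i + 1 < len(timesteps):
            a_bar_prev = alpha_bars[timesteps[i + 1]]
            x = a_bar_prev.sqrt() * x0_hat + (1.0 - a_bar_prev).sqrt() * eps_pred
        else:
            x = x0_hat                                           # final step returns the clean estimate
    return x
```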
Why DDIM Works with Fewer Steps
- Non-Markovian Nature: DDIM allows the reverse process to depend on both $x_0$ (the original data, via its estimate $\hat{x}_0$) and $x_t$ (the noisy data at time step $t$), enabling the model to take larger steps without losing track of the original data. This reduces the total number of steps required.
- Deterministic Path: By directly computing each reverse step without randomness, DDIM becomes more efficient, skipping unnecessary steps while maintaining high fidelity in the generated samples.
Applications and Benefits
- Speed: DDIM can reduce the number of steps by an order of magnitude (e.g., from 1000 in DDPM to 100 or fewer steps in DDIM), significantly speeding up the sampling process. This makes it much more suitable for real-life or large-scale applications.
- Quality: Despite using fewer steps, DDIM still maintains high-quality outputs, and sometimes even improves sample quality due to the smoother, deterministic trajectory through the data space.
Closing Thoughts
A lot has happened in the world while I was preparing this piece. Tesla unveiled its new humanoid robot and robotaxi concept—both just proofs of concept for now—but knowing how fast Elon Musk moves, I wouldn’t be surprised if the Optimus robot is commercialized in a few years. Even more impressive, though, was SpaceX’s successful mid-air catch of the Super Heavy booster using Mechazilla, marking a new era for reusable rockets. This brings humanity a step closer to Mars, even though it will still take years before a successful colony is set up. It’s hard not to be inspired by these kinds of "moonshots" aimed at building a better future. What strikes me most is that a man with no formal background in engineering made all of this happen. It’s a reminder that you can just do things—break through barriers and create what others might not even imagine.
Writing this reminds me of something Steve Jobs once said: "You tend to get told that the world is the way it is and that you should live your life inside the world, trying not to bash into the walls too much. But that’s a very limited life. Life can be much broader once you discover that everything around you was made up by people that were no smarter than you, and you can change it." It’s a reminder for me to work even harder toward building the future I want to see.
As for diffusion models, they’re an incredible piece of technology. They learn the latent representations of images from training data and reproduce those images from fully noised distributions, essentially recreating images from what looks like chaos. While there are valid ethical concerns, particularly around the use of artists’ work without consent and fears that these models could replace human creativity, I believe diffusion models are only going to get better—and faster. Just look at how quickly they solved issues like generating realistic hands. This opens up an exciting future where anyone can create art based on their ideas, sparking new waves of creativity across the board.
However, this also raises concerns about misuse and misinformation. With tools this powerful, discussions around regulation are necessary to ensure responsible use. We need an open dialogue between creators, policymakers, and society at large to find the balance between innovation and ethical responsibility.