Non-Gaussian Latent Diffusion Models
- Author: Alex Rosen
The following is a summary of a paper I wrote, exploring non-Gaussian noise for diffusion models.
Introduction
Latent Diffusion Models (LDMs) [1] have emerged as a prominent class of generative models for synthesizing high-quality data across domains including images, audio, and text. At the heart of LDMs lies the diffusion process [2], a mechanism typically governed by a Markov chain that gradually transforms data into noisier representations until it is approximately isotropic Gaussian noise [2]. The reverse of this process, which iteratively denoises the data, is used for generation.
Non-Gaussian noise in diffusion models (DMs) may capture a broader range of data distributions than Gaussian noise alone. This could be particularly relevant when training data exhibits heavy-tailed or multimodal characteristics. Other types of noise may also help overcome limitations associated with Gaussian noise, such as the tendency to smooth out sharp features or details. This matters in applications like image and audio generation, where preserving fine-grained detail is essential for high-quality outputs.
Background
Diffusion
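Suppose we begin with an element $x_0$ of our training set. Over $T$ steps, we define a Markov chain producing noisier and noisier samples $x_1, \dots, x_T$ using a (typically Gaussian) distribution $q$. We learn or specify a sequence of coefficients $\beta_1, \dots, \beta_T$ called a Beta schedule and calculate, in the Gaussian case,

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right) \tag{1}$$

for $t = 1, \dots, T$ to gradually add noise to $x_0$. We formulate this problem with the goal of finding the reverse distribution $q(x_{t-1} \mid x_t)$ to reproduce $x_0$ from random noise.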
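Applying Bayes' rule to obtain the reverse distribution directly is computationally intractable. Instead, since it can be shown that $q(x_{t-1} \mid x_t)$ is Gaussian given that $q(x_t \mid x_{t-1})$ is Gaussian [2], we can approximate the reverse distribution by learning the parameters $\mu_\theta, \Sigma_\theta$ of $p_\theta$ through some neural network with weights $\theta$. That is,

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \tag{2}$$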
To aid the learning process, we also make the variance of $q$ at each step a function of the Beta schedule (called a variance schedule [2]), giving our neural network more information about the distribution which produced $x_t$.
Several modern approaches [3], [4], [5] skip explicit parameterization and use a U-Net (a dimension-preserving CNN) [6] to predict the noise itself, sometimes using an autoencoder to denoise in a latent space [1]. Let's take this a step further: what patterns emerge when we employ different kinds of noise?
DDPMs
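Diffusion in a latent space (LDMs) amounts to relatively self-explanatory dimensionality reduction, so we focus on understanding denoising diffusion probabilistic models (DDPMs) for arbitrary distributions instead. For shorthand, call $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Now, for some distribution $D$, we can write

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim D \tag{3}$$

Given the noise $\epsilon$ at timestep $t$, DDPMs aim to predict a function $\epsilon_\theta$ that takes coupled values $x_t$ and $t$ to some denoising space [4], [5]. Here $\theta$ denotes the weights of this denoising function, so that we can later introduce an autoencoder with its own parameters $\phi$. This leaves us with

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) \tag{4}$$

and so we predict

$$x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z \tag{5}$$

for variance schedule $\sigma_t$ and $z \sim D$. Previously, this was for $z \sim \mathcal{N}(0, I)$ and $\sigma_t$ the standard deviation of the forward distribution $q$, typically $\sqrt{\beta_t}$ [4].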
Methodology
Noise Comparisons
The difficulty in comparing DMs with different noise distributions is ensuring that each one adds a comparable amount of noise; under arbitrary variance schedules, it's unclear whether $x_T$ will be 'pure noise' for all distributions. To deal with this, we build on the work on variance schedulers in the first diffusion paper [2], generalizing linear and quadratic variance schedulers to arbitrary distributions (with finite mean and variance).
Theorem 1.
Let be two probability distributions with mean zero. If is a monotonic sequence for the variance schedules and of and defined over compact sets, then
exists as for any metric that induces the same topology as the Euclidean norm.
Proof. By the equivalence of norms, it suffices to prove the Euclidean case. Since
for a ratio of coefficients , and have asymptotically equivalent Beta schedules up to a constant factor. Then, substituting into (1), we have an upper bound on the quotient in (6) for every . So, the monotonicity of and the compactness of the domains of both variance schedulers are enough for the limit to exist. Note that even though we only desire non-decreasing variance schedules, the sequence above may not be non-decreasing. Introducing a more general metric allows for broader interpretations of what it means for outputs to have similar noise.
So, with a large number of timesteps, we expect similar noise for an arbitrary collection of distributions with (usually constant, linear or quadratic) variance schedules that satisfy the assumptions of Theorem 1. Implementations should apply normalization so (6) evaluates to one.
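One simple way to read the normalization step is to standardize every noise family to zero mean and unit variance before it enters the diffusion process. The sketch below does this for the five families compared later in this post. The function name `sample_unit_noise` and the specific parameter choices are illustrative assumptions, not the paper's implementation; in particular, centering the Exponential distribution by subtracting its mean is an assumption.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def sample_unit_noise(name, shape, rng):
    """Draw noise with zero mean and unit variance from the named family.

    Hypothetical helper: the parameter choices are one way to normalize.
    """
    if name == "gaussian":
        return rng.standard_normal(shape)
    if name == "laplace":
        # Variance is 2 * scale**2, so scale = 1/sqrt(2) gives unit variance.
        return rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2.0), size=shape)
    if name == "uniform":
        # Uniform on [-a, a] has variance a**2 / 3, so a = sqrt(3).
        a = np.sqrt(3.0)
        return rng.uniform(-a, a, size=shape)
    if name == "gumbel":
        # Variance is (pi * scale)**2 / 6 and mean is loc + gamma * scale.
        scale = np.sqrt(6.0) / np.pi
        return rng.gumbel(loc=-EULER_GAMMA * scale, scale=scale, size=shape)
    if name == "exponential":
        # Exponential(1) has mean 1 and variance 1; subtract the mean (assumption).
        return rng.exponential(scale=1.0, size=shape) - 1.0
    raise ValueError(f"unknown noise family: {name}")


rng = np.random.default_rng(0)
for family in ["gaussian", "laplace", "uniform", "gumbel", "exponential"]:
    draw = sample_unit_noise(family, (100_000,), rng)
    print(f"{family:12s} mean={draw.mean():+.3f} var={draw.var():.3f}")
```

Matching the first two moments does not make the distributions identical (the centered Exponential stays one-sided and skewed, for example), but it puts them on a common scale for a shared variance schedule.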
Architecture
Non-Gaussian latent diffusion models (NGLDMs) perform DDPM-style diffusion using a U-Net. However, we now sample noise from an arbitrary distribution with zero mean and finite variance, then train the U-Net in a latent space with a pre-trained autoencoder (an encoder and a decoder, whose parameters are separate from the U-Net weights).
Algorithms 1 and 2 are enhancements of the DDPM training and sampling algorithms; the objective function is identical.
We immediately encode the U-Net input. Then, we uniformly sample a timestep $t$ to generate noise for. Using Theorem 1, we make the variance schedule a function of a polynomial Beta schedule, and recommend that this polynomial be of low degree to match the intuition behind adding noise gradually to our latent input. Finally, we use mean squared error loss to minimize the distance between the U-Net's noise prediction and the noise added to our input in a single step (3).
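The training step described above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the paper's Algorithm 1: it assumes the DDPM-style closed-form noising with cumulative products of `1 - beta`, and `encode`, `predict_noise`, and `sample_noise` are hypothetical stand-ins for the pre-trained encoder, the U-Net, and the chosen unit-variance noise distribution.

```python
import numpy as np


def training_loss(x0, encode, predict_noise, alpha_bar, sample_noise, rng):
    """One stochastic evaluation of the noise-prediction loss."""
    z0 = encode(x0)                          # encode the input immediately
    t = rng.integers(0, len(alpha_bar))      # uniformly sampled timestep (0-indexed)
    eps = sample_noise(z0.shape, rng)        # zero-mean, unit-variance noise
    # Closed-form noising of the latent in a single step (assumed form).
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = predict_noise(zt, t)           # stand-in for the U-Net
    return float(np.mean((eps - eps_hat) ** 2))  # mean squared error


# Toy usage with placeholder components.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # illustrative linear Beta schedule
alpha_bar = np.cumprod(1.0 - betas)
loss = training_loss(
    x0=rng.standard_normal((4, 8)),
    encode=lambda x: x,                                  # identity "encoder"
    predict_noise=lambda zt, t: np.zeros_like(zt),       # untrained "U-Net"
    alpha_bar=alpha_bar,
    sample_noise=lambda shape, r: r.laplace(0.0, 1.0 / np.sqrt(2.0), size=shape),
    rng=rng,
)
print(loss)
```

In a real setup the loss would be backpropagated through the U-Net only, with the autoencoder kept frozen.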
The sampling algorithm repeats the prediction (5) for all timesteps. We then return the decoded prediction at timestep zero. It's important to note that at timestep zero, we apply the constraint that no fresh noise is added, since our predictions should be nearing 'full-denoise' at this point.
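The sampling loop can be sketched the same way. This is a hedged illustration rather than the paper's Algorithm 2: it assumes the DDPM-style update with the square root of the Beta schedule as the variance schedule, and `predict_noise` and `sample_noise` are hypothetical stand-ins for the trained U-Net and the chosen noise distribution. Decoding the returned latent is left to the autoencoder.

```python
import numpy as np


def sample_latent(predict_noise, betas, shape, sample_noise, rng):
    """Run the reverse process from pure noise down to timestep zero."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    sigmas = np.sqrt(betas)                  # assumed variance schedule
    z_t = sample_noise(shape, rng)           # start from pure noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = predict_noise(z_t, t)
        mean = (z_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        # No fresh noise is added on the final step.
        fresh = sample_noise(shape, rng) if t > 0 else np.zeros(shape)
        z_t = mean + sigmas[t] * fresh
    return z_t                               # a decoder would map this to data space


rng = np.random.default_rng(0)
latent = sample_latent(
    predict_noise=lambda z, t: np.zeros_like(z),     # untrained placeholder
    betas=np.linspace(1e-4, 0.02, 50),
    shape=(2, 16),
    sample_noise=lambda shape, r: r.standard_normal(shape),
    rng=rng,
)
print(latent.shape)
```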
Autoencoder
For high-dimensional datasets, one should use a pre-trained autoencoder to avoid learning a good latent representation of inputs and a successful denoising process simultaneously. The choice of autoencoder should be made on a case-by-case basis to ensure necessary information is retained in the latent space. For example, image data should be taken from a (batch size) × (height) × (width) × (color channels) tensor to a tensor of (batch size) × (new dimension) × (color channels).
Experiments and Evaluation
FID Score
To measure the similarity between our original and generated images, we use the Fréchet Inception Distance (FID) score [7]

$$\mathrm{FID} = \lVert \mu_1 - \mu_2 \rVert_2^2 + \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_1 \Sigma_2 \right)^{1/2} \right)$$

where $(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$ are the feature-wise mean and covariance matrices of the original and generated images, respectively. We use InceptionV3 [8] as a feature extractor and evaluate FID scores according to the mean and covariance matrices of the feature vectors from the final layer of this model. This has become standard for recent evaluations of DMs, with lower scores signifying a generative process that does a better job emulating the test set.
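For concreteness, the Fréchet distance between two sets of feature vectors can be computed with NumPy alone. The sketch below assumes the features have already been extracted (for example by InceptionV3) and are passed as arrays with one row per image. The function name `frechet_distance` is illustrative. The trace of the matrix square root is obtained from the eigenvalues of the covariance product, which is one of several ways to evaluate that term.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.standard_normal((500, 5))         # placeholder "features"
print(frechet_distance(real, real))          # approximately zero for identical sets
print(frechet_distance(real, real + 3.0))    # approximately 9 * 5 from the mean shift
```

With only 1000 images per set, as in the evaluation here, the covariance estimate for high-dimensional features is noisy, so scores are best compared against each other rather than against published values.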
Setup
The U-Net implementation is characterized by double convolutional layers and self-attention mechanisms. Positional encoding is integrated to maintain temporal context in the diffusion sequence. Siren activations [9] are used on account of their results for inpainting.
Since it's computationally infeasible to perform a grid search for the optimal variance schedule for each of the five distributions in Table 1, each distribution has fixed variance 1. We use hyperparameters optimal for Gaussian noise; a total number of timesteps, and a linear Beta schedule from to [4]. Each model was trained with a batch size of 10 over 200 epochs.
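A polynomial Beta schedule of the kind described here can be sketched as follows. The helper name `polynomial_betas` and its parameterization are illustrative assumptions. The endpoints `1e-4` and `0.02` and the 1000 timesteps are the values published for DDPM [4], used here only as an assumed example of a Gaussian-tuned linear schedule.

```python
import numpy as np


def polynomial_betas(num_steps, beta_start, beta_end, degree=1):
    """Beta schedule rising from beta_start to beta_end.

    degree=1 gives a linear schedule and degree=2 a quadratic one
    (one possible parameterization among several).
    """
    ramp = np.linspace(0.0, 1.0, num_steps) ** degree
    return beta_start + (beta_end - beta_start) * ramp


betas = polynomial_betas(1000, 1e-4, 0.02)   # DDPM's published linear schedule
alpha_bar = np.cumprod(1.0 - betas)          # fraction of signal left at each step
print(alpha_bar[-1])                         # close to zero: almost no signal remains
```

The final value of `alpha_bar` is a quick check that the last timestep is close to pure noise for unit-variance noise of any family.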
Results
Each model was trained on the CIFAR-10 dataset (10 classes with 6000 images per class), aided by an autoencoder that mapped these images to a lower-dimensional latent space with small reconstruction error. Sampling is not class-conditional, meaning the results shown are from an entirely unsupervised process.
Table 1 shows the FID scores at epoch 195 for all five distributions chosen, and Figure 2 shows the FID scores for the Gaussian, Laplace, Uniform and Gumbel distributions. The FID scores for the Exponential distribution were separated into Figure 3 since they are too large to be included on the same scale; this was somewhat expected due to the nature of the family of Exponential distributions. FID scores were calculated using 1000 randomly generated images and 1000 randomly sampled training images, per epoch, per distribution.
The numerical rankings shown reflect the qualitative results above. Gaussian samples appear the most realistic, while Laplace samples seem to produce smeary, but sharper, recognizable visual forms. Uniform and Gumbel samples are less coherent, and the Exponential samples are nearly black.
Takeaways
Determining the best type of noise for a DM using a validation set is computationally expensive for state-of-the-art architectures. Regardless, these results demonstrate the importance of lending equal attention to a multitude of noise distributions.
At first glance, it may seem that Table 1 shows that Gaussian distributions should always be used for DDPM-related diffusion processes in a latent space. But since we use a Beta schedule and number of timesteps fine-tuned for the Gaussian distribution, this in fact points to the considerable promise of non-Gaussian sampling for DMs: after normalization, these distributions perform surprisingly well on a hyperparameter set that was not tuned for them. The Laplace distribution, in particular, seems to be an excellent candidate for further research. Going forward, if suitable resources are dedicated to diffusion with certain non-Gaussian distributions, there may be strong empirical results that match or exceed the performance of Gaussian noise.