Non-Gaussian Latent Diffusion Models

The following is a summary of a paper I wrote, exploring non-Gaussian noise for diffusion models.

Introduction

Latent Diffusion Models (LDMs) [1] have emerged as a prominent class of generative models, offering a novel approach to synthesizing high-quality data across various domains, including images, audio, and text. At the heart of LDMs lies the concept of a diffusion process [2], a mechanism typically governed by a Markov chain that gradually transforms data into noisier and noisier representations, eventually approaching approximately isotropic Gaussian noise [2]. The reverse of this process, which involves iteratively denoising the data, is used for generation.

Non-Gaussian noise in diffusion models (DMs) can capture a broader range of data distributions that might not be well represented by Gaussian noise alone. This could be particularly relevant in cases where training data exhibits heavy-tailed or multimodal characteristics. Using different types of noise can help overcome limitations associated with Gaussian noise, such as the tendency to smooth out sharp features or details. This is crucial in applications like image and audio generation, where preserving fine-grained details is essential for high-quality outputs.

Background

Diffusion

Suppose we begin with an element of our training set $x^{(0)}$. Over $T$ steps, we define a Markov chain producing noisier and noisier samples $x^{(0)},\ldots,x^{(T)}$ using a (typically Gaussian) distribution $q(x^{(t)}|x^{(t-1)})$. We learn or specify a sequence of coefficients called a Beta schedule $\{\beta_t\}_{t=1}^T$ and calculate

$$x^{(t)}_i=\sqrt{1-\beta_t}\,x^{(t-1)}_i+\sqrt{\beta_t}\,\epsilon_t\tag{1}$$

for $\epsilon_t \sim q$ to gradually add noise to $x^{(0)}$. We formulate this problem with the goal of finding the reverse distribution $p(x^{(t-1)}|x^{(t)})$ to reproduce $x^{(0)}$ from random noise.
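As a concrete sketch of (1), the helper below performs one noising step; `forward_step` and `noise_dist` are illustrative names (not the paper's code), and the default Gaussian sampler can be swapped for any zero-mean, unit-variance distribution.

```python
import torch

def forward_step(x_prev, beta_t, noise_dist=None):
    """One forward diffusion step, Eq. (1):
    x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    if noise_dist is None:
        eps = torch.randn_like(x_prev)      # default: eps ~ N(0, I)
    else:
        eps = noise_dist(x_prev.shape)      # e.g. Laplace, uniform, Gumbel, ...
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * eps
```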

Applying Bayes' rule directly is computationally intractable. Instead, since it can be shown that $p$ is Gaussian given that $q$ is Gaussian [2], we can approximate the reverse distribution $p$ by learning the parameters of $p_{\theta}$ through some neural network $f=(f_{\mu},f_{\Sigma})$. That is,

$$p_{\theta}(x^{(t-1)}|x^{(t)})\triangleq\mathcal{N}\!\left(x^{(t-1)};f_{\mu}(x^{(t)},t),f_{\Sigma}(x^{(t)},t)\right)\tag{2}$$

To aid the learning process, we also make the variance of $q$ at each step $t$ a function of the Beta schedule (called a variance schedule [2]), giving our neural network more information about the distribution which produced $x^{(t)}$.

Several modern approaches [3], [4], [5] skip explicit parameterization and use a U-Net (a dimension-preserving CNN) [6] to predict the noise $\epsilon$ itself, sometimes utilizing an autoencoder to denoise in a latent space [1]. Let's take this a step further: what patterns emerge when we employ different kinds of noise?

DDPMs

Diffusion in a latent space (as in LDMs) amounts to running the same process after dimensionality reduction, so we focus on understanding DDPMs for arbitrary distributions instead. For shorthand, let $\alpha_t=1-\beta_t$ and $\overline{\alpha}_t=\prod_{i=1}^t \alpha_i$. Now, for some distribution $\mathcal{D}$, we can write

$$q(x^{(t)}|x^{(0)})=\mathcal{D}\!\left(x^{(t)};\sqrt{\overline{\alpha}_t}\,x^{(0)},(1-\overline{\alpha}_t)\Bbb{I}\right)\tag{3}$$
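In practice, Eq. (3) lets us noise $x^{(0)}$ to any timestep in one shot. A minimal sketch, assuming a precomputed `alpha_bar` tensor and a hypothetical sampler `sample_noise` for $\mathcal{D}$:

```python
import torch

def diffuse_to_t(x0, t, alpha_bar, sample_noise):
    """Noise x0 directly to timestep t via Eq. (3):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ D."""
    eps = sample_noise(x0.shape)                     # zero-mean, unit-variance noise
    x_t = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps
    return x_t, eps                                  # return eps as the regression target
```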

Given the noise $\epsilon$ at timestep $t$, DDPMs aim to predict a function $\epsilon_{\theta_2}$ that takes the coupled values $\sqrt{\overline{\alpha}_t}\,x^{(0)}+\sqrt{1-\overline{\alpha}_t}\,\epsilon$ and $t$ to some denoising space [4], [5]. Here $\theta_2$ denotes the weights of this denoising function, anticipating an autoencoder with parameters $\theta_1$ introduced later. This leaves us with

$$f_{\mu}(x^{(t)},t)=\frac{1}{\sqrt{\alpha_t}}\left( x^{(t)}-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\,\epsilon_{\theta_2}(x^{(t)},t) \right)\tag{4}$$

and so we predict

$$\hat{x}^{(t-1)}=\frac{1}{\sqrt{\alpha_t}}\left( x^{(t)}-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\,\epsilon_{\theta_2}(x^{(t)},t) \right) + \nu_{\mathcal{D}}(\beta_t,t)\, z\tag{5}$$

for variance schedule $\nu_{\mathcal{D}}(\beta_t,t)$ and $z\sim \mathcal{D}$. Previously, this term was $\sigma_t z$ for $z\sim \mathcal{N}(0,1)$ and the standard deviation $\sigma_t$ of the forward distribution, typically $\sigma_t=\sqrt{\beta_t}$ [4].
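A single reverse step following (4)–(5) might look like the sketch below; `model`, `nu`, and `sample_noise` are placeholder names for the noise predictor $\epsilon_{\theta_2}$, the variance schedule $\nu_{\mathcal{D}}$, and a sampler for $\mathcal{D}$.

```python
import torch

@torch.no_grad()
def reverse_step(model, x_t, t, beta, alpha, alpha_bar, nu, sample_noise):
    """One denoising step of Eq. (5) for an arbitrary noise distribution D."""
    eps_hat = model(x_t, t)                                        # predicted noise
    mean = (x_t - beta[t] / (1.0 - alpha_bar[t]).sqrt() * eps_hat) / alpha[t].sqrt()
    z = sample_noise(x_t.shape) if t > 0 else torch.zeros_like(x_t)
    return mean + nu(beta[t], t) * z                               # nu_D(beta_t, t) * z
```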

Methodology

Noise Comparisons

The difficulty in comparing DMs with different noise distributions is ensuring a fair amount of noise is added in each case; under arbitrary variance schedules, it's unclear whether $x^{(T)}$ will be 'pure noise' for all distributions. To deal with this, we build on the work on variance schedulers in the original diffusion paper [2], generalizing linear and quadratic variance schedulers to arbitrary distributions (with finite mean and variance).

Theorem 1.

Let $q_1,q_2:\Bbb{R}\mapsto \Bbb{R}$ be two probability distributions with mean zero. If $\{\tfrac{\nu_1(t)}{\nu_2(t)}\}_{t=0}^T$ is a monotonic sequence for the variance schedules $\nu_1$ and $\nu_2$ of $q_1$ and $q_2$ defined over compact sets, then

$$\Bbb{E}_{q_1}\!\left[d\!\left(x^{(0)}-x^{(T)}_{q_1}\right)\right]\Big/ \Bbb{E}_{q_2}\!\left[d\!\left(x^{(0)}-x^{(T)}_{q_2}\right)\right]\tag{6}$$

exists as $T\to\infty$ for any metric $d$ that induces the same topology as the Euclidean norm.

Proof. By the equivalence of norms, it suffices to prove the Euclidean case. Since

$$\mathrm{Var}_{q_1}\!\left(x_{q_1}^{(T)}\right)\Big/\mathrm{Var}_{q_2}\!\left(x_{q_2}^{(T)}\right)\to c\tag{7}$$

for a ratio of coefficients $c\in \mathbb{R}$, $q_1$ and $q_2$ have asymptotically equivalent Beta schedules up to a constant factor. Then, substituting $\mathbb{E}[x^{(T)}]=0$ into (1), we have an upper bound on the quotient in (6) for every $T$. So, the monotonicity of $\{\tfrac{\nu_1(t)}{\nu_2(t)}\}_{t=0}^T$ and the compactness of the domains of both variance schedulers are enough for the limit to exist. Note that even though we only desire non-decreasing variance schedules, the sequence above may not be. Introducing a more general metric allows for broader interpretations of what it means for outputs to have similar noise.

So, with a large number of timesteps, we expect similar noise for an arbitrary collection of distributions with (usually constant, linear or quadratic) variance schedules that satisfy the assumptions of Theorem 1. Implementations should apply normalization so (6) evaluates to one.
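One simple way to satisfy this in code is to standardize every draw from $\mathcal{D}$ before it enters (1) or (3); the helper below is a sketch of that idea, with `sampler` standing in for any zero-mean noise source.

```python
import torch

def normalized_noise(sampler, shape):
    """Draw noise from an arbitrary sampler and rescale to zero mean and unit
    variance, so the terminal noise magnitude is comparable across distributions."""
    eps = sampler(shape)
    eps = eps - eps.mean()                      # center (approximately) at zero
    return eps / eps.std().clamp_min(1e-8)      # rescale to unit variance
```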

Architecture

NGLDMs perform DDPM-style diffusion using a U-Net. But we now sample from an arbitrary distribution $\mathcal{D}$ with zero mean and finite variance, then train the U-Net in a latent space with a pre-trained autoencoder $\mathscr{E}_{\theta_1},\mathscr{D}_{\theta_1}$ (encoder and decoder with parameters $\theta_1$).

[Algorithm 1: Training]
[Algorithm 2: Sampling]

Algorithms 1 and 2 are enhancements of the DDPM training and sampling algorithms; the objective function is identical.

We immediately encode $x\overset{\mathscr{E}_{\theta_1}}{\to}$ U-Net input. Then, we uniformly sample a timestep to generate noise for. Using Theorem 1, we make $\nu_{\mathcal{D}}$ a function of a polynomial Beta schedule, and recommend that this polynomial be of low degree to match the intuition of adding noise gradually to our latent input. Finally, we use mean squared error loss to minimize the distance between $\epsilon_{\theta_2}$ and the noise added to our input in a single step (3).
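Putting the pieces together, a training step might look like the following sketch; the encoder, U-Net, and noise-sampler names are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

def train_step(unet, encoder, x, T, alpha_bar, sample_noise, optimizer):
    """One NGLDM-style training step: encode, noise via Eq. (3), regress on the noise."""
    z0 = encoder(x)                                      # latent representation
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = sample_noise(z0.shape)                         # eps ~ D, zero mean, unit variance
    a = alpha_bar[t].view(-1, *([1] * (z0.dim() - 1)))   # broadcast over latent dims
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps         # noised latent
    loss = F.mse_loss(unet(z_t, t), eps)                 # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```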

The sampling algorithm repeats the prediction (5) for all $T$ timesteps. We then return the decoded prediction at timestep zero. It's important to note that at timestep zero we apply the constraint $\nu_{\mathcal{D}}(\beta_0,0)\approx 0$, since our predictions should be nearing a 'full denoise' at this point.
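A sampling loop in the same spirit (again a sketch, assuming the schedules and samplers from above):

```python
import torch

@torch.no_grad()
def sample(unet, decoder, shape, T, beta, alpha, alpha_bar, nu, sample_noise):
    """Run Eq. (5) from t = T-1 down to 0, then decode; nu(beta_0, 0) should be ~0."""
    z = sample_noise(shape)                               # start from pure D-noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = unet(z, t_batch)
        z = (z - beta[t] / (1.0 - alpha_bar[t]).sqrt() * eps_hat) / alpha[t].sqrt()
        if t > 0:
            z = z + nu(beta[t], t) * sample_noise(shape)  # no extra noise at t = 0
    return decoder(z)
```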

Autoencoder

For high-dimensional datasets, one should use a pre-trained autoencoder to avoid having to learn a good latent representation of the inputs and a successful denoising process simultaneously. The choice of autoencoder should be handled on a case-by-case basis to ensure the necessary information is retained in the latent space. For example, image data should be taken from a (batch size) $\times$ (height) $\times$ (width) $\times$ (color channels) tensor to a tensor of (batch size) $\times$ (new dimension) $\times$ (color channels).
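For the CIFAR-10 shapes used later ($32\times 32\times 3$ down to $24\times 24\times 3$), a deliberately tiny convolutional pair illustrates the shape contract; this is only a sketch of the shapes, not a trained autoencoder, and note PyTorch's channels-first layout.

```python
import torch
import torch.nn as nn

# Shape illustration only: (batch, 3, 32, 32) -> (batch, 3, 24, 24) latents and back.
encoder = nn.Conv2d(3, 3, kernel_size=9)            # 32 - 9 + 1 = 24
decoder = nn.ConvTranspose2d(3, 3, kernel_size=9)   # 24 + 9 - 1 = 32

x = torch.randn(10, 3, 32, 32)
z = encoder(x)          # torch.Size([10, 3, 24, 24])
x_hat = decoder(z)      # torch.Size([10, 3, 32, 32])
```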

Experiments and Evaluation

FID Score

To measure the similarity between our original and generated images, we use the Fréchet Inception Distance (FID) score [7]

$$FID=\lVert\mu_1-\mu_2\rVert^2_2+\mathrm{tr}\!\left(\Sigma_1+\Sigma_2-2\sqrt{\Sigma_1 \Sigma_2}\right)$$

where $\mu_1,\mu_2$ and $\Sigma_1,\Sigma_2$ are the feature-wise means and covariance matrices of the original and generated images, respectively. We use InceptionV3 [8] as a feature extractor and evaluate FID scores according to the mean and covariance matrices of the feature vectors from the final layer of this model. This has become standard for recent evaluations of DMs, with lower scores signifying a generative process that does a better job emulating the test set.
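Given two batches of InceptionV3 feature vectors, the FID above can be computed with a few NumPy/SciPy calls; this is a generic sketch, not the exact evaluation script.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """FID between two (N, d) arrays of feature vectors."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)          # matrix square root of Sigma1 * Sigma2
    if np.iscomplexobj(covmean):              # discard tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```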

Setup

The U-Net implementation is characterized by double convolutional layers and self-attention mechanisms. Positional encoding is integrated to maintain temporal context in the diffusion sequence. Siren activations [9] are used on account of their results for inpainting.

Since it's computationally infeasible to perform a grid search for the optimal variance schedule for each of the five distributions in Table 1, each distribution has fixed variance 1. We use hyperparameters optimal for Gaussian noise: a total of $T=1000$ timesteps and a linear Beta schedule from $\beta_1=10^{-4}$ to $\beta_T=0.02$ [4]. Each model was trained with a batch size of 10 over 200 epochs.
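For reference, the schedule described above amounts to a few lines (a sketch of the stated hyperparameters, with $\alpha_t$ and $\overline{\alpha}_t$ precomputed for Eqs. (3)–(5)):

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)      # linear Beta schedule, beta_1 to beta_T
alpha = 1.0 - beta                        # alpha_t = 1 - beta_t
alpha_bar = torch.cumprod(alpha, dim=0)   # cumulative product up to t
```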

Results

[Table 1: FID scores at epoch 195 for each noise distribution]

Each model was trained on the CIFAR-10 dataset (10 classes with 6000 $32\times 32$ images per class), aided by an autoencoder that took these images from a $10\times 32\times 32\times 3$-dimensional space to a $10\times 24\times 24\times 3$-dimensional space with small reconstruction error. Sampling is not class-conditional, meaning the results shown are from an entirely unsupervised process.

[Figure 2: FID scores by epoch for the Gaussian, Laplace, Uniform, and Gumbel distributions]
[Figure 3: FID scores by epoch for the Exponential distribution]

Table 1 shows the FID scores at epoch 195 for all five distributions chosen, and Figure 2 shows the FID scores for the Gaussian, Laplace, Uniform and Gumbel distributions. The FID scores for the Exponential distribution were separated into Figure 3 since they are too large to be included on the same scale; this was somewhat expected due to the nature of the family of Exponential distributions. FID scores were calculated using 1000 randomly generated images and 1000 randomly sampled training images, per epoch, per distribution.

[Figure: sample generation progress for each distribution]

The numerical rankings reflect the qualitative results above. Gaussian samples appear the most realistic, while Laplace samples seem to produce smeary but sharper, recognizable visual forms. Uniform and Gumbel samples are less coherent, and the Exponential samples are nearly black.

Takeaways

Determining the best type of noise for a DM using a validation set is computationally expensive for state-of-the-art architectures. Regardless, these results demonstrate the importance of lending equal attention to a multitude of noise distributions.

[Figures: generated samples using Gaussian and Laplacian noise]

At first glance, Table 1 may seem to show that Gaussian distributions should always be used for DDPM-style diffusion in a latent space. But since we use a Beta schedule and number of timesteps fine-tuned for the Gaussian distribution, it in fact alludes to the considerable promise of non-Gaussian sampling for DMs: after normalization, these distributions perform surprisingly well on an arbitrary hyperparameter set. The Laplace distribution, in particular, seems to be an excellent candidate for further research. Going forward, if suitable resources are dedicated to diffusion with certain non-Gaussian distributions, there may be strong empirical results that match or exceed the performance of Gaussian noise.

References

[1]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models,” CoRR, vol. abs/2112.10752, 2021, Available: https://arxiv.org/abs/2112.10752
[2]
J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics,” CoRR, vol. abs/1503.03585, 2015, Available: http://arxiv.org/abs/1503.03585
[3]
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-Based Generative Modeling through Stochastic Differential Equations,” CoRR, vol. abs/2011.13456, 2020, Available: https://arxiv.org/abs/2011.13456
[4]
J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” CoRR, vol. abs/2006.11239, 2020, Available: https://arxiv.org/abs/2006.11239
[5]
A. Nichol and P. Dhariwal, “Improved Denoising Diffusion Probabilistic Models,” CoRR, vol. abs/2102.09672, 2021, Available: https://arxiv.org/abs/2102.09672
[6]
O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” CoRR, vol. abs/1505.04597, 2015, Available: http://arxiv.org/abs/1505.04597
[7]
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter, “GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium,” CoRR, vol. abs/1706.08500, 2017, Available: http://arxiv.org/abs/1706.08500
[8]
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” CoRR, vol. abs/1512.00567, 2015, Available: http://arxiv.org/abs/1512.00567
[9]
V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein, “Implicit Neural Representations with Periodic Activation Functions,” CoRR, vol. abs/2006.09661, 2020, Available: https://arxiv.org/abs/2006.09661