Non-Gaussian Latent Diffusion Models

The following is a summary of a paper I wrote, exploring non-Gaussian noise for diffusion models.

Introduction

Latent Diffusion Models (LDMs) [1] have emerged as a prominent class of generative models, offering a novel approach to synthesizing high-quality data across various domains, including images, audio, and text. At the heart of LDMs lies the concept of a diffusion process [2], a mechanism typically governed by a Markov chain that gradually transforms data into noisier and noisier representations, eventually approaching approximately isotropic Gaussian noise [2]. The reverse of this process, which involves iteratively denoising the data, is used for generation.

Non-Gaussian noise in diffusion models (DMs) can capture a broader range of data distributions that might not be well represented by Gaussian noise alone. This could be particularly relevant in cases where training data exhibits heavy-tailed or multimodal characteristics. Using different types of noise can help overcome limitations associated with Gaussian noise, such as the tendency to smooth out sharp features or details. This is crucial in applications like image and audio generation, where preserving fine-grained details is essential for high-quality outputs.

Background

Diffusion

Suppose we begin with an element of our training set $x^{(0)}$. Over $T$ steps, we define a Markov chain producing noisier and noisier samples $x^{(0)},\ldots,x^{(T)}$ using a (typically Gaussian) distribution $q(x^{(t)}|x^{(t-1)})$. We learn or specify a sequence of coefficients called a Beta schedule $\{\beta_t\}_{t=1}^T$ and calculate

$$x^{(t)}_i=\sqrt{1-\beta_t}\,x^{(t-1)}_i+\sqrt{\beta_t}\,\epsilon_t\tag{1}$$

for $\epsilon_t \sim q$ to gradually add noise to $x^{(0)}$. We formulate this problem with the goal of finding the reverse distribution $p(x^{(t-1)}|x^{(t)})$ to reproduce $x^{(0)}$ from random noise.
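As a concrete sketch of (1), the helper below performs one noising step; `forward_step` and `noise_dist` are illustrative names (not the paper's code), and the default Gaussian sampler can be swapped for any zero-mean, unit-variance distribution.

```python
import torch

def forward_step(x_prev, beta_t, noise_dist=None):
    """One forward diffusion step, Eq. (1):
    x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    if noise_dist is None:
        eps = torch.randn_like(x_prev)      # default: eps ~ N(0, I)
    else:
        eps = noise_dist(x_prev.shape)      # e.g. Laplace, uniform, Gumbel, ...
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * eps
```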

Applying Bayes' rule directly is computationally intractable. Instead, since it can be shown that $p$ is Gaussian given that $q$ is Gaussian [2], we can approximate the reverse distribution $p$ by learning the parameters of $p_{\theta}$ through some neural network $f=(f_{\mu},f_{\Sigma})$. That is,

$$p_{\theta}(x^{(t-1)}|x^{(t)})\triangleq\mathcal{N}\!\left(x^{(t-1)};f_{\mu}(x^{(t)},t),f_{\Sigma}(x^{(t)},t)\right)\tag{2}$$

To aid the learning process, we also make the variance of $q$ at each step $t$ a function of the Beta schedule (called a variance schedule [2]), giving our neural network more information about the distribution which produced $x^{(t)}$.

Several modern approaches [3], [4], [5] skip explicit parameterization and use a U-Net (a dimension-preserving CNN) [6] to predict the noise $\epsilon$ itself, sometimes utilizing an autoencoder to denoise in a latent space [1]. Let's take this a step further: what patterns emerge when we employ different kinds of noise?

DDPMs

Diffusion in a latent space (as in LDMs) amounts to running the same process after dimensionality reduction, so we focus on understanding DDPMs for arbitrary distributions instead. For shorthand, let $\alpha_t=1-\beta_t$ and $\overline{\alpha}_t=\prod_{i=1}^t \alpha_i$. Now, for some distribution $\mathcal{D}$, we can write

$$q(x^{(t)}|x^{(0)})=\mathcal{D}\!\left(x^{(t)};\sqrt{\overline{\alpha}_t}\,x^{(0)},(1-\overline{\alpha}_t)\Bbb{I}\right)\tag{3}$$
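In practice, Eq. (3) lets us noise $x^{(0)}$ to any timestep in one shot. A minimal sketch, assuming a precomputed `alpha_bar` tensor and a hypothetical sampler `sample_noise` for $\mathcal{D}$:

```python
import torch

def diffuse_to_t(x0, t, alpha_bar, sample_noise):
    """Noise x0 directly to timestep t via Eq. (3):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ D."""
    eps = sample_noise(x0.shape)                     # zero-mean, unit-variance noise
    x_t = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps
    return x_t, eps                                  # return eps as the regression target
```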

Given the noise $\epsilon$ at timestep $t$, DDPMs aim to predict a function $\epsilon_{\theta_2}$ that takes the coupled values $\sqrt{\overline{\alpha}_t}\,x^{(0)}+\sqrt{1-\overline{\alpha}_t}\,\epsilon$ and $t$ to some denoising space [4], [5]. Here $\theta_2$ denotes the weights of this denoising function, anticipating an autoencoder with parameters $\theta_1$ introduced later. This leaves us with

$$f_{\mu}(x^{(t)},t)=\frac{1}{\sqrt{\alpha_t}}\left( x^{(t)}-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\,\epsilon_{\theta_2}(x^{(t)},t) \right)\tag{4}$$

and so we predict

$$\hat{x}^{(t-1)}=\frac{1}{\sqrt{\alpha_t}}\left( x^{(t)}-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\,\epsilon_{\theta_2}(x^{(t)},t) \right) + \nu_{\mathcal{D}}(\beta_t,t)\, z\tag{5}$$

for variance schedule $\nu_{\mathcal{D}}(\beta_t,t)$ and $z\sim \mathcal{D}$. Previously, this term was $\sigma_t z$ for $z\sim \mathcal{N}(0,1)$ and the standard deviation $\sigma_t$ of the forward distribution, typically $\sigma_t=\sqrt{\beta_t}$ [4].
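A single reverse step following (4)–(5) might look like the sketch below; `model`, `nu`, and `sample_noise` are placeholder names for the noise predictor $\epsilon_{\theta_2}$, the variance schedule $\nu_{\mathcal{D}}$, and a sampler for $\mathcal{D}$.

```python
import torch

@torch.no_grad()
def reverse_step(model, x_t, t, beta, alpha, alpha_bar, nu, sample_noise):
    """One denoising step of Eq. (5) for an arbitrary noise distribution D."""
    eps_hat = model(x_t, t)                                        # predicted noise
    mean = (x_t - beta[t] / (1.0 - alpha_bar[t]).sqrt() * eps_hat) / alpha[t].sqrt()
    z = sample_noise(x_t.shape) if t > 0 else torch.zeros_like(x_t)
    return mean + nu(beta[t], t) * z                               # nu_D(beta_t, t) * z
```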

Methodology

Noise Comparisons

The difficulty in comparing DMs with different noise distributions is ensuring a fair amount of noise is added in each case; under arbitrary variance schedules, it's unclear whether $x^{(T)}$ will be 'pure noise' for all distributions. To deal with this, we build on the work on variance schedulers in the original diffusion paper [2], generalizing linear and quadratic variance schedulers to arbitrary distributions (with finite mean and variance).

Theorem 1.

Let $q_1,q_2:\Bbb{R}\mapsto \Bbb{R}$ be two probability distributions with mean zero. If $\{\tfrac{\nu_1(t)}{\nu_2(t)}\}_{t=0}^T$ is a monotonic sequence for the variance schedules $\nu_1$ and $\nu_2$ of $q_1$ and $q_2$ defined over compact sets, then

$$\Bbb{E}_{q_1}\!\left[d\!\left(x^{(0)}-x^{(T)}_{q_1}\right)\right]\Big/ \Bbb{E}_{q_2}\!\left[d\!\left(x^{(0)}-x^{(T)}_{q_2}\right)\right]\tag{6}$$

exists as $T\to\infty$ for any metric $d$ that induces the same topology as the Euclidean norm.

Proof. By the equivalence of norms, it suffices to prove the Euclidean case. Since

$$\mathrm{Var}_{q_1}\!\left(x_{q_1}^{(T)}\right)\Big/\mathrm{Var}_{q_2}\!\left(x_{q_2}^{(T)}\right)\to c\tag{7}$$

for a ratio of coefficients $c\in \mathbb{R}$, $q_1$ and $q_2$ have asymptotically equivalent Beta schedules up to a constant factor. Then, substituting $\mathbb{E}[x^{(T)}]=0$ into (1), we have an upper bound on the quotient in (6) for every $T$. So, the monotonicity of $\{\tfrac{\nu_1(t)}{\nu_2(t)}\}_{t=0}^T$ and the compactness of the domains of both variance schedulers are enough for the limit to exist. Note that even though we only desire non-decreasing variance schedules, the sequence above may not be. Introducing a more general metric allows for broader interpretations of what it means for outputs to have similar noise.

So, with a large number of timesteps, we expect similar noise for an arbitrary collection of distributions with (usually constant, linear or quadratic) variance schedules that satisfy the assumptions of Theorem 1. Implementations should apply normalization so (6) evaluates to one.
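One simple way to satisfy this in code is to standardize every draw from $\mathcal{D}$ before it enters (1) or (3); the helper below is a sketch of that idea, with `sampler` standing in for any zero-mean noise source.

```python
import torch

def normalized_noise(sampler, shape):
    """Draw noise from an arbitrary sampler and rescale to zero mean and unit
    variance, so the terminal noise magnitude is comparable across distributions."""
    eps = sampler(shape)
    eps = eps - eps.mean()                      # center (approximately) at zero
    return eps / eps.std().clamp_min(1e-8)      # rescale to unit variance
```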

Architecture

NGLDMs perform DDPM-style diffusion using a U-Net. But we now sample from an arbitrary distribution $\mathcal{D}$ with zero mean and finite variance, then train the U-Net in a latent space with a pre-trained autoencoder $\mathscr{E}_{\theta_1},\mathscr{D}_{\theta_1}$ (encoder and decoder with parameters $\theta_1$).

[Algorithm 1: Training]
[Algorithm 2: Sampling]

Algorithms 1 and 2 are enhancements of the DDPM training and sampling algorithms; the objective function is identical.

We immediately encode $x\overset{\mathscr{E}_{\theta_1}}{\to}$ U-Net input. Then, we uniformly sample a timestep to generate noise for. Using Theorem 1, we make $\nu_{\mathcal{D}}$ a function of a polynomial Beta schedule, and recommend that this polynomial be of low degree to match the intuition of adding noise gradually to our latent input. Finally, we use mean squared error loss to minimize the distance between $\epsilon_{\theta_2}$ and the noise added to our input in a single step (3).
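Putting the pieces together, a training step might look like the following sketch; the encoder, U-Net, and noise-sampler names are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

def train_step(unet, encoder, x, T, alpha_bar, sample_noise, optimizer):
    """One NGLDM-style training step: encode, noise via Eq. (3), regress on the noise."""
    z0 = encoder(x)                                      # latent representation
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = sample_noise(z0.shape)                         # eps ~ D, zero mean, unit variance
    a = alpha_bar[t].view(-1, *([1] * (z0.dim() - 1)))   # broadcast over latent dims
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps         # noised latent
    loss = F.mse_loss(unet(z_t, t), eps)                 # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```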

The sampling algorithm repeats the prediction (5) for all $T$ timesteps. We then return the decoded prediction at timestep zero. It's important to note that at timestep zero we apply the constraint $\nu_{\mathcal{D}}(\beta_0,0)\approx 0$, since our predictions should be nearing a 'full denoise' at this point.
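A sampling loop in the same spirit (again a sketch, assuming the schedules and samplers from above):

```python
import torch

@torch.no_grad()
def sample(unet, decoder, shape, T, beta, alpha, alpha_bar, nu, sample_noise):
    """Run Eq. (5) from t = T-1 down to 0, then decode; nu(beta_0, 0) should be ~0."""
    z = sample_noise(shape)                               # start from pure D-noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = unet(z, t_batch)
        z = (z - beta[t] / (1.0 - alpha_bar[t]).sqrt() * eps_hat) / alpha[t].sqrt()
        if t > 0:
            z = z + nu(beta[t], t) * sample_noise(shape)  # no extra noise at t = 0
    return decoder(z)
```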

Autoencoder

For high-dimensional datasets, one should use a pre-trained autoencoder to avoid having to learn a good latent representation of the inputs and a successful denoising process simultaneously. The choice of autoencoder should be handled on a case-by-case basis to ensure the necessary information is retained in the latent space. For example, image data should be taken from a (batch size) $\times$ (height) $\times$ (width) $\times$ (color channels) tensor to a tensor of (batch size) $\times$ (new dimension) $\times$ (color channels).
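For the CIFAR-10 shapes used later ($32\times 32\times 3$ down to $24\times 24\times 3$), a deliberately tiny convolutional pair illustrates the shape contract; this is only a sketch of the shapes, not a trained autoencoder, and note PyTorch's channels-first layout.

```python
import torch
import torch.nn as nn

# Shape illustration only: (batch, 3, 32, 32) -> (batch, 3, 24, 24) latents and back.
encoder = nn.Conv2d(3, 3, kernel_size=9)            # 32 - 9 + 1 = 24
decoder = nn.ConvTranspose2d(3, 3, kernel_size=9)   # 24 + 9 - 1 = 32

x = torch.randn(10, 3, 32, 32)
z = encoder(x)          # torch.Size([10, 3, 24, 24])
x_hat = decoder(z)      # torch.Size([10, 3, 32, 32])
```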

Experiments and Evaluation

FID Score

To measure the similarity between our original and generated images, we use the Fréchet Inception Distance (FID) score [7]

$$FID=\lVert\mu_1-\mu_2\rVert^2_2+\mathrm{tr}\!\left(\Sigma_1+\Sigma_2-2\sqrt{\Sigma_1 \Sigma_2}\right)$$

where $\mu_1,\mu_2$ and $\Sigma_1,\Sigma_2$ are the feature-wise means and covariance matrices of the original and generated images, respectively. We use InceptionV3 [8] as a feature extractor and evaluate FID scores according to the mean and covariance matrices of the feature vectors from the final layer of this model. This has become standard for recent evaluations of DMs, with lower scores signifying a generative process that does a better job emulating the test set.
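Given two batches of InceptionV3 feature vectors, the FID above can be computed with a few NumPy/SciPy calls; this is a generic sketch, not the exact evaluation script.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """FID between two (N, d) arrays of feature vectors."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)          # matrix square root of Sigma1 * Sigma2
    if np.iscomplexobj(covmean):              # discard tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```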

Setup

The U-Net implementation is characterized by double convolutional layers and self-attention mechanisms. Positional encoding is integrated to maintain temporal context in the diffusion sequence. Siren activations [9] are used on account of their results for inpainting.

Since it's computationally infeasible to perform a grid search for the optimal variance schedule for each of the five distributions in Table 1, each distribution has fixed variance 1. We use hyperparameters optimal for Gaussian noise: a total of $T=1000$ timesteps and a linear Beta schedule from $\beta_1=10^{-4}$ to $\beta_T=0.02$ [4]. Each model was trained with a batch size of 10 over 200 epochs.
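For reference, the schedule described above amounts to a few lines (a sketch of the stated hyperparameters, with $\alpha_t$ and $\overline{\alpha}_t$ precomputed for Eqs. (3)–(5)):

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)      # linear Beta schedule, beta_1 to beta_T
alpha = 1.0 - beta                        # alpha_t = 1 - beta_t
alpha_bar = torch.cumprod(alpha, dim=0)   # cumulative product up to t
```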

Results

[Table 1: FID scores at epoch 195 for each noise distribution]

Each model was trained on the CIFAR-10 dataset (10 classes with 6000 $32\times 32$ images per class), aided by an autoencoder that took these images from a $10\times 32\times 32\times 3$-dimensional space to a $10\times 24\times 24\times 3$-dimensional space with small reconstruction error. Sampling is not class-conditional, meaning the results shown are from an entirely unsupervised process.

[Figure 2: FID scores by epoch for the Gaussian, Laplace, Uniform, and Gumbel distributions]
[Figure 3: FID scores by epoch for the Exponential distribution]

Table 1 shows the FID scores at epoch 195 for all five distributions chosen, and Figure 2 shows the FID scores for the Gaussian, Laplace, Uniform and Gumbel distributions. The FID scores for the Exponential distribution were separated into Figure 3 since they are too large to be included on the same scale; this was somewhat expected due to the nature of the family of Exponential distributions. FID scores were calculated using 1000 randomly generated images and 1000 randomly sampled training images, per epoch, per distribution.

[Figure: sample generation progress for each distribution]

The numerical rankings reflect the qualitative results above. Gaussian samples appear the most realistic, while Laplace samples seem to produce smeary but sharper, recognizable visual forms. Uniform and Gumbel samples are less coherent, and the Exponential samples are nearly black.

Takeaways

Determining the best type of noise for a DM using a validation set is computationally expensive for state-of-the-art architectures. Regardless, these results demonstrate the importance of lending equal attention to a multitude of noise distributions.

[Figures: generated samples using Gaussian and Laplacian noise]

At first glance, Table 1 may seem to show that Gaussian distributions should always be used for DDPM-style diffusion in a latent space. But since we use a Beta schedule and number of timesteps fine-tuned for the Gaussian distribution, it in fact alludes to the considerable promise of non-Gaussian sampling for DMs: after normalization, these distributions perform surprisingly well on an arbitrary hyperparameter set. The Laplace distribution, in particular, seems to be an excellent candidate for further research. Going forward, if suitable resources are dedicated to diffusion with certain non-Gaussian distributions, there may be strong empirical results that match or exceed the performance of Gaussian noise.

References

[1]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models,” CoRR, vol. abs/2112.10752, 2021, Available: https://arxiv.org/abs/2112.10752
[2]
J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics,” CoRR, vol. abs/1503.03585, 2015, Available: http://arxiv.org/abs/1503.03585
[3]
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-Based Generative Modeling through Stochastic Differential Equations,” CoRR, vol. abs/2011.13456, 2020, Available: https://arxiv.org/abs/2011.13456
[4]
J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” CoRR, vol. abs/2006.11239, 2020, Available: https://arxiv.org/abs/2006.11239
[5]
A. Nichol and P. Dhariwal, “Improved Denoising Diffusion Probabilistic Models,” CoRR, vol. abs/2102.09672, 2021, Available: https://arxiv.org/abs/2102.09672
[6]
O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” CoRR, vol. abs/1505.04597, 2015, Available: http://arxiv.org/abs/1505.04597
[7]
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter, “GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium,” CoRR, vol. abs/1706.08500, 2017, Available: http://arxiv.org/abs/1706.08500
[8]
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” CoRR, vol. abs/1512.00567, 2015, Available: http://arxiv.org/abs/1512.00567
[9]
V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein, “Implicit Neural Representations with Periodic Activation Functions,” CoRR, vol. abs/2006.09661, 2020, Available: https://arxiv.org/abs/2006.09661