Generative Models Under Geometric Constraints

Author: Alex Rosen

Introduction

Diffusion models, beginning with DDPM [1], were originally inspired by random walks in non-equilibrium thermodynamics: we define a process that gradually moves a training sample toward noise, which we call an equilibrium state [2]. This perspective unveiled a new framework for training a generative model: learn to reverse the process, moving from the equilibrium state back to a sample resembling the dataset.

It turns out that this point of view has many deep and intuitive properties, from the absence of joint training objectives to a reverse random walk that encourages diverse samples. Still, one desirable property remains elusive: how do we bake constraints into the reverse process or the final sample? How might we sample a configuration of objects in a room with a guarantee that they don't overlap? My work attempts to pave a path forward for diffusion-style algorithms that facilitate sampling under geometric constraints.

Background on diffusion

DDPMs belong to a class of generative models whose sampling process moves backwards through a finite first-order Markov chain, starting from isotropic Gaussian noise at equilibrium. Given the previous element $x_{t-1}$ in the chain (starting from a training sample $x_0$), we define the forward process for $x_t$ as

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\,x_{t-1},\; \beta_t \mathbb{I}\right)$$

which we can sample directly from $x_0$ as

$$q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\overline{\alpha}_t}\,x_0,\; (1-\overline{\alpha}_t)\mathbb{I}\right)$$

for $\alpha_t = 1 - \beta_t$ and $\overline{\alpha}_t = \prod_{i=1}^t \alpha_i$. This forward process is repeated for $T$ steps (usually 1000), and we choose $\{\beta_t\}_{t=1}^T$ so that the equilibrium $x_T$ is approximately standard normal.
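As a concrete sketch of the closed-form forward sampling (the linear schedule endpoints and the helper names `make_alpha_bar` and `q_sample` are my own choices, not fixed by the method):

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative products alpha-bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    return betas, np.cumprod(alphas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form, without iterating the chain."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
```

Because $\overline{\alpha}_t$ decays toward zero, $x_t$ for large $t$ is dominated by the Gaussian term, which is exactly the "approximately standard normal" equilibrium above.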

At inference time, we start with a sample $x_T$ and repeatedly sample $x_{t-1}$ given $x_t$ until we reach $x_0$. Using the reparameterization trick, we can write

$$x_0 = \frac{1}{\sqrt{\overline{\alpha}_t}}\left(x_t - \sqrt{1-\overline{\alpha}_t}\,\epsilon\right)$$

for $\epsilon \sim \mathcal{N}(0, \mathbb{I})$, allowing us to show that the mean of the reverse posterior

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\,q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$

is

$$\mu = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\,\epsilon\right)$$

So we train a network to predict this $\epsilon$ given $x_t$ and $t$, and sample $x_{t-1}$ as this $\mu$ plus small additive Gaussian noise whose scale depends on $t$.
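A minimal sketch of that reverse step, assuming the linear schedule from before and using $\sqrt{\beta_t}$ as the per-step noise scale (one common choice; the function names are mine):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear variance schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def reverse_mean(x_t, eps_pred, t):
    """mu = (x_t - (1 - alpha_t) / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t),
    with 1 - alpha_t written as beta_t."""
    return (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])

def ddpm_step(x_t, eps_pred, t, rng):
    """One reverse step: the posterior mean, plus noise except at t == 0."""
    mu = reverse_mean(x_t, eps_pred, t)
    if t == 0:
        return mu
    return mu + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```

In practice `eps_pred` comes from the trained network evaluated at $(x_t, t)$; here it is just an input so the arithmetic stands alone.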

With the success of this method came many subsequent alterations, such as speeding up the reverse process [3], reasoning about more suitable variance schedules [4], and performing the diffusion process in a latent space [5]. Relevant follow-up work also includes cold diffusion [6], which generalizes the idea of diffusion to non-Gaussian or even deterministic equilibria. A much simpler algorithm, cold diffusion learns a restoration operator $r_\theta$ that predicts $x_0$ back from a degradation $x_t = d(x_0, t)$:

$$x_0 \approx r_\theta(d(x_0, t))$$

To sample an $x_0$, we start with $x_T$, predict $\hat{x}_0 = r_\theta(x_T, T)$, run this prediction through the forward process to produce $x_{t-1} = d(\hat{x}_0, t-1)$, and repeat until timestep zero.
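That sampling loop can be sketched in a few lines; `restore` and `degrade` stand in for the learned $r_\theta$ and the chosen degradation $d$, which depend entirely on the application:

```python
def cold_sample(x_T, restore, degrade, T):
    """Cold-diffusion sampling: predict x0, re-degrade to level t - 1, repeat.

    restore(x, t) plays the role of r_theta, degrade(x, t) the role of d;
    both are placeholders supplied by the caller in this sketch.
    """
    x_t = x_T
    for t in range(T, 0, -1):
        x0_hat = restore(x_t, t)       # predicted clean sample
        x_t = degrade(x0_hat, t - 1)   # push it back one degradation level
    return x_t                         # degrade(., 0) is the identity
```

With a perfect restorer the loop recovers the clean sample exactly, which is a useful sanity check when wiring up a new degradation.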

Incorporating constraints into the reverse process or the final generation is typically problem-specific, resulting in a variety of different methods, often approximations. My algorithm is an attempt at solving problems with geometric constraints in greater generality.

Proposed Method

The formulations of diffusion in the previous section rely on an intermediate prediction of $x_0$, whether explicit as in cold diffusion or implicit through the mean of a posterior as in DDPM. In order to account for constraints during sampling or constraints in the final output, we need some process involving these constraints during training, which we can introduce by predicting one timestep back directly, as opposed to an intermediate prediction or sampling process through $x_0$.

My method begins with removing the Markov chain from the diffusion process and replacing it with linear interpolation between equilibrium noise and training examples. At training time, we sample a timestep uniformly from $\{0,\ldots,T\}$, a training example $x_0$, and an equilibrium sample $\epsilon \sim q$ from any deterministic or non-deterministic distribution $q$. Lying on the line segment between $\epsilon$ and $x_0$, we define

$$x_t = \left(1 - \frac{t}{T}\right)x_0 + \frac{t}{T}\,\epsilon$$

which approaches $\epsilon$ as $t \to T$ and $x_0$ as $t \to 0$. To devise a sampling algorithm that obtains $x_{t-1}$ given $x_t$, we can compute $x_{t-1}$ and $x_t$ using the same source of randomness $\epsilon$, and directly predict $x_{t-1}$ without conditioning on $x_0$.

Now that we have a method to directly predict $x_{t-1}$, we can rephrase this prediction to incorporate constraints: instead of the model learning $x_{t-1}$ itself, suppose it learns initial velocity vectors to apply to the points in space $x_t$. Then we can feed these velocity vectors and initial positions $x_t$ to a simulation, run the simulation, and record the final positions as our prediction for $x_{t-1}$. That is, for a differentiable physics engine $D(x, v)$ taking initial positions $x$ and velocities $v$ to final positions, a model $M$, and some loss $\mathcal{L}$ between $\hat{x}_{t-1}$ and $x_{t-1}$, we substitute the gradient descent step on

$$\nabla \mathcal{L}(x_{t-1}, M(x_t, t))$$

for a step on

$$\nabla \mathcal{L}(x_{t-1}, D(x_t, M(x_t, t)))$$

This introduction of $D$ allows the reverse process to move through space under constraints such as object-to-object collisions, object-to-scene collisions, deformations, and background vector fields.
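To see that gradients really do flow through such a substitution, here is a toy PyTorch sketch. The clamp-at-a-floor "engine" and the linear "model" are stand-ins I invented (the model here even ignores $t$); a real $D$ would be a full differentiable physics engine:

```python
import torch

torch.manual_seed(0)

def engine(x, v):
    """Toy differentiable 'engine' D: free motion for unit time, with a
    hard floor at 0 enforced by clamping (a stand-in for real physics)."""
    return torch.clamp(x + v, min=0.0)

model = torch.nn.Linear(2, 2)            # stand-in for M(x_t, t); ignores t
x_t = torch.rand(8, 2)
x_tm1 = torch.rand(8, 2)

pred = engine(x_t, model(x_t))           # D(x_t, M(x_t, t))
loss = torch.nn.functional.mse_loss(pred, x_tm1)
loss.backward()                          # gradients reach M through D
```

The point is only that `loss.backward()` populates `model.weight.grad` even though the engine sits between the model's output and the loss; that is all the substitution above requires of $D$.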

At inference time, this algorithm remains similar to DDPM and cold diffusion: we sample $x_T \sim q$ and repeatedly compute $x_{t-1} = D(x_t, M(x_t, t))$ down to timestep zero.

Applications

Current focus: The applications I aim to focus on in a first publication are:

  1. 3D point cloud generation with spheres instead of points. The goal is to match the generation quality and diversity of current point-based methods while guaranteeing that spheres don't overlap. This application would make my algorithm easy to understand and would be a straightforward way to demonstrate that it holds up against DDPM.
  2. Layout planning: like the previous application, but with more complicated shapes and geometry.
  3. Solving jigsaw or Voronoi-diagram-related puzzles.
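To make the non-overlap constraint of application 1 concrete, here is a plain projection heuristic for equal-radius spheres. It is only an illustration of the constraint itself; in the proposed method this role is played by the differentiable engine $D$, and nothing about this particular heuristic is part of the algorithm:

```python
import numpy as np

def separate_spheres(centers, radius, iters=200):
    """Iteratively push overlapping equal-radius spheres apart until no
    pair intersects: each overlapping pair moves half the overlap along
    the axis joining the two centers."""
    c = centers.astype(float).copy()
    for _ in range(iters):
        moved = False
        for i in range(len(c)):
            for j in range(i + 1, len(c)):
                d = c[j] - c[i]
                dist = np.linalg.norm(d)
                overlap = 2.0 * radius - dist
                if overlap > 0.0:
                    step = 0.5 * overlap * d / (dist + 1e-12)
                    c[i] -= step
                    c[j] += step
                    moved = True
        if not moved:
            break
    return c
```

A physics engine enforces the same condition dynamically via collision response; the heuristic above just shows what "spheres don't overlap" means as a postcondition.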

Denoising in the real world: If performing inference in the real world rather than in simulation, the velocity vectors $M(x_t, t)$ would be applied to the objects at positions $x_t$, and $x_{t-1}$ would be measured from the scene after some fixed period of time. The differentiable physics engine $D$ may also be used to introduce small sources of randomness during training-time simulations, ensuring final positions are not fully determined by $x_T$ and that the learned velocities are robust against discrepancies between simulation and the real world.

Image generation under geometric constraints: I've thought a lot about using this method to gain greater control over what diversity means in the context of image generation.

[Figure: Gaussian]

Figure 1 shows Figure 3 mapped to a normalized HSV color space, and Figure 2 shows the result of an iterative separation algorithm that transforms the points until nearly all of them are 0.1 apart. Figure 4 shows the separated points mapped back to the pixel domain. Obviously, Figure 4 looks terrible, but what's interesting is that the image now appears diverse in color: notice the changes in the blues of the sky and water. With the right choice of separation amount and a large enough dataset, the algorithm I propose could perhaps go beyond current notions of diversity for generative models (outputs that look different or seem to cover an entire distribution) and introduce diversity in terms of the color spectrum across an image, or in ratios of frequency components (using geometry in the Fourier domain to produce samples ranging from mostly low-frequency to mostly high-frequency content). This is something I would come back to after a first publication.

References

[1]
J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” CoRR, vol. abs/2006.11239, 2020, Available: https://arxiv.org/abs/2006.11239
[2]
J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics,” CoRR, vol. abs/1503.03585, 2015, Available: http://arxiv.org/abs/1503.03585
[3]
J. Song, C. Meng, and S. Ermon, “Denoising Diffusion Implicit Models,” CoRR, vol. abs/2010.02502, 2020, Available: https://arxiv.org/abs/2010.02502
[4]
A. Nichol and P. Dhariwal, “Improved Denoising Diffusion Probabilistic Models,” CoRR, vol. abs/2102.09672, 2021, Available: https://arxiv.org/abs/2102.09672
[5]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models,” CoRR, vol. abs/2112.10752, 2021, Available: https://arxiv.org/abs/2112.10752
[6]
A. Bansal et al., Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise. 2022. Available: https://arxiv.org/abs/2208.09392