Denoising Diffusion From the Perspective of Two-State Markov Chains and Linear Dynamical Systems


In this note, I introduce denoising diffusion models, the state of the art behind image generation platforms such as Stable Diffusion and DALL-E.

Denoising diffusion is an approach to sampling developed in the context of generative modeling in the machine learning community by Jascha Sohl-Dickstein's group, according to a quick Google search. And that's about the only credit-giving I have energy for in this post.

The problem of sampling is to draw samples from some probability distribution $p^{data}(x)$. One instance of the problem that arises in applications is when one does not know the expression for $p^{data}(x)$, but has samples from this distribution. For example, you have Italian jobbed a number of Van Gogh paintings. And now you want to create your own Van Gogh style paintings. Suppose the variable $x$ lives in the set of paintings. The goal is to draw new samples from $p^{data}(x)$ based on the available samples, and try as best as possible not to generate crap like what I used to turn in for art class.

For the sake of simplicity, we consider the case where we want to sample from a set with two elements $\mathcal{S} =\{ A,B\}$ according to a probability distribution $ p^{data} = \left [p^{data}_A ~~~p^{data}_B\right ] \in \mathbb{R}^2$ such that $p^{data}_A + p^{data}_B = 1$, $p^{data}_A ,p^{data}_B \geq 0$. For example, $A$ could represent Van Gogh paintings, and $B$ could represent my failed art projects and doodles.

Say we don't know what $p^{data}_A$ and $p^{data}_B$ are, but we have samples $X^{data}_i$ which take values in $\mathcal{S} =\{ A,B\}$. The index $i$ ranges from $1$ to $N$; that is, we have $N$ samples.

One way to sample from $p^{data}$ is to construct a continuous time Markov chain (CTMC) $X(t)$, a random variable parameterized by continuous time $t\in [0,\infty)$ taking values in the set $\mathcal{S} = \{ A,B\}$. A CTMC is defined by two rates $u_{AB} \geq 0$ and $u_{BA} \geq 0$, which govern the transitions between the states in $\mathcal{S}$. The evolution of the random variable $X(t)$ over a small time increment $h$ is given by the probabilities \[\mathbb{P}(X(t +h) = B \mid  X(t) = A) \approx u_{AB}h \]\[\mathbb{P}(X(t +h) = A \mid  X(t) = B) \approx u_{BA}h \]

 


The probability of finding the CTMC $X(t)$ in either state is given by two scalar-valued quantities $x_A(t), x_B(t)$:
\[\mathbb{P}(X(t) = A)  = x_A(t)\]\[\mathbb{P}(X(t) = B)  = x_B(t) \]
These quantities evolve according to an ordinary differential equation (ODE) known as the Kolmogorov forward equation (KFE).
\begin{align}
\label{eq:lde}
\dot{x}_A(t) = -u_{AB}x_A(t) + u_{BA}x_B(t)  \\
\dot{x}_B(t) = u_{AB}x_A(t) - u_{BA}x_B(t)
\end{align}

One can write this as a linear differential equation

\begin{align}
\begin{bmatrix}
\dot{x}_A(t) \\ \dot{x}_B(t)
\end{bmatrix}
= \begin{bmatrix} -u_{AB} &  u_{BA} \\
 u_{AB} &  -u_{BA}\end{bmatrix}
 \begin{bmatrix}
x_A(t) \\ x_B(t)
\end{bmatrix}
\end{align}
The long term behavior of the CTMC can be inferred by looking at the matrix
\begin{align}
A
= \begin{bmatrix} -u_{AB} &  u_{BA} \\
 u_{AB} &  -u_{BA}\end{bmatrix}
\end{align}
This matrix has the eigenvalues
\begin{align}
\{\lambda_1 , \lambda_2\} = \{0,-u_{AB} - u_{BA} \}
\end{align}
The eigenvalue $\lambda_1= 0$ corresponds to the eigenvector
\begin{align}
e^1 = x^{eq}=
\begin{bmatrix}
 \frac{u_{BA}}{u_{AB} + u_{BA}} \\ \frac{u_{AB}}{u_{AB} + u_{BA}}
\end{bmatrix}
\end{align}
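As a sanity check, a few lines of NumPy confirm the eigenvalues and the equilibrium eigenvector (a minimal sketch; the rates $0.3$ and $0.7$ are arbitrary illustrative choices):

```python
import numpy as np

# Arbitrary illustrative rates.
u_AB, u_BA = 0.3, 0.7

# Rate matrix of the Kolmogorov forward equation.
A = np.array([[-u_AB,  u_BA],
              [ u_AB, -u_BA]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                       # 0 and -(u_AB + u_BA) = -1.0

# The eigenvector for eigenvalue 0, rescaled to sum to 1, is x_eq.
v = eigvecs[:, np.argmin(np.abs(eigvals))]
print(v / v.sum())                   # [u_BA, u_AB] / (u_AB + u_BA) = [0.7, 0.3]
```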

Since $x_A(0) + x_B(0)=1$, it follows from linear ODE theory that
\[
\|x(t)- x^{eq}\| \leq Me^{\lambda_2t}\|x(0) - x^{eq}\|
\]for some constant $M$ that is one of the most ignored constants in dynamical systems theory. I would never want to be $M$.
Anyway, so, as $ t \rightarrow \infty$, the CTMC $X(t)$ has the distribution $x^{eq}$. This is the idea behind sampling using CTMCs: use the long-term behavior of the ODE to sample from $\mathcal{S}$ according to $p^{data}$. Ominous Inception music.


Numerically simulating a two-state CTMC involves the following steps (a Python sketch follows the list):

  1. Initialize $X(0) \in \mathcal{S}$ and iterate over $t \in \{0, dt ,...,T\}$.
  2. If $X(t) = A$, jump to $B$ with probability $u_{AB}dt$; if $X(t) = B$, jump to $A$ with probability $u_{BA}dt$.
  3. Call your parents.
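Here is a minimal Python sketch of steps 1 and 2, assuming a small step size $dt$; the rates, $dt$, and horizon $T$ are placeholder choices, and step 3 is left to the reader:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ctmc(u_AB, u_BA, X0, T=20.0, dt=0.01):
    """Simulate a two-state CTMC on {'A', 'B'} with small Euler-style jumps."""
    X = X0
    for _ in range(int(T / dt)):
        if X == "A" and rng.random() < u_AB * dt:
            X = "B"                  # jump A -> B with probability u_AB * dt
        elif X == "B" and rng.random() < u_BA * dt:
            X = "A"                  # jump B -> A with probability u_BA * dt
    return X

print(simulate_ctmc(u_AB=0.3, u_BA=0.7, X0="A"))
```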

Classical Approach to Sampling from $p^{data}$ Using CTMCs

The classical approach to sampling from $p^{data}$ using CTMCs is to choose the rates $u_{AB}$ and $u_{BA}$ such that $x^{eq} = p^{data}$, and then simulate the CTMC by generating random numbers at increments of time $h$ so that the variable $X(t)$ transitions between the states $A$ and $B$ with the probabilities given above.

Therefore, in order to ensure that $X(t)$ is distributed according to $p^{data}_A$ and $p^{data}_B$ as $t \rightarrow \infty$, we just have to make the choice
\[\frac{p^{data}_A}{p^{data}_B} = \frac{u_{BA}}{u_{AB}}\]
Since we don't have access to $p^{data}$, we have to approximate it. Let $N$, $N_A$, and $N_B$ be the total number of samples, the number of samples equal to $A$, and the number of samples equal to $B$, respectively. Then, according to the law of large numbers,
\[ \lim_{N \rightarrow \infty} \frac{N_A}{N} =  p^{data}_A  ,~~~~\lim_{N \rightarrow \infty} \frac{N_B}{N} =  p^{data}_B\]
Therefore, we can choose
\[u_{AB} =  C \frac{N_B}{N} \approx C p^{data}_B ~~~u_{BA} =C \frac{N_A}{N} \approx C p^{data}_A \]
where $C>0$ is some upper bound on $u_{AB}, u_{BA}$, included for numerical purposes.
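In code, this choice of rates takes a couple of lines (the sample list below is a made-up placeholder):

```python
C = 1.0                                # upper bound on the rates
samples = ["A"] * 90 + ["B"] * 10      # hypothetical data: 90 samples of A, 10 of B
N = len(samples)
u_AB = C * samples.count("B") / N      # approx C * p_B: rate of A -> B jumps
u_BA = C * samples.count("A") / N      # approx C * p_A: rate of B -> A jumps
```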

Then the idea is that we can initialize $X(t)$ according to some easy to sample distribution, known as the noise distribution $p^{noise}$. For example,
\begin{align}
p^{noise} = x(0)
=\begin{bmatrix}
0.5 \\ 0.5
\end{bmatrix}
\end{align}
Hence one can sample from $\mathcal{S}$ initially by using a coin flip. In the continuous case, the noise distribution is usually chosen to be a Gaussian or uniform distribution.

Then, by the long-term behavior of the Kolmogorov ODE, we can sample from the distribution $p^{data}$ by running the Markov chain $X(t)$, since as $t \rightarrow \infty$ we have
\[\mathbb{P}(X(t) = A)  = x^{eq}_A \approx p^{data}_A\]\[\mathbb{P}(X(t) = B)  = x^{eq}_B \approx p^{data}_B\]
Therefore, we can generate multiple samples from $p^{data}$ just by setting $X(0) = A$ or $X(0) = B$ using a coin flip, and then running the Markov chain for a long enough time.

To summarize, the classical approach to sampling from $p^{data}$ looks like this (a sketch in Python follows the list):

  1. Initialize $X(0)$ by sampling according to the noise distribution $p^{noise}$.
  2. Run the Markov chain $X(t)$ with rates $u_{AB} = C \frac{N_B}{N}, ~~ u_{BA} = C \frac{N_A}{N}$.
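A sketch of this procedure, vectorized over a population of chains (the chain count, horizon, and step size are placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def classical_sampler(N_A, N_B, n_chains=1000, C=1.0, T=20.0, dt=0.01):
    """Approximate samples from p_data by running a population of CTMCs."""
    N = N_A + N_B
    u_AB = C * N_B / N                 # rate A -> B, approx C * p_B
    u_BA = C * N_A / N                 # rate B -> A, approx C * p_A
    # Step 1: coin-flip initialization according to p_noise = [0.5, 0.5].
    X = rng.random(n_chains) < 0.5     # True = state A, False = state B
    # Step 2: run all chains forward in time.
    for _ in range(int(T / dt)):
        jump_AB = X & (rng.random(n_chains) < u_AB * dt)
        jump_BA = ~X & (rng.random(n_chains) < u_BA * dt)
        X = X ^ jump_AB ^ jump_BA      # flip exactly the chains that jump
    return X

X = classical_sampler(N_A=90, N_B=10)
print("empirical p_A:", X.mean())      # should be close to N_A / N = 0.9
```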

Suppose 

\begin{align}
p^{data} =
\begin{bmatrix}
1 \\ 0
\end{bmatrix}
\end{align}

Then, by running the CTMC and comparing with the solution of the Kolmogorov ODE, we get the following result in Python.
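A minimal version of such an experiment might look as follows: the chains use the jump rule from before, and the KFE is solved via the matrix exponential $x(t) = e^{At}x(0)$:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

C, dt, T, n = 1.0, 0.01, 10.0, 100
u_AB, u_BA = C * 0.0, C * 1.0          # classical rates for p_data = [1, 0]

# Population of n chains, initialized by coin flips (True = state A).
X = rng.random(n) < 0.5
steps = int(T / dt)
frac_A = np.empty(steps)
for k in range(steps):
    frac_A[k] = X.mean()               # empirical fraction of chains in A
    jump_AB = X & (rng.random(n) < u_AB * dt)
    jump_BA = ~X & (rng.random(n) < u_BA * dt)
    X = X ^ jump_AB ^ jump_BA

# KFE solution x(t) = expm(A t) x(0) on the same time grid.
A = np.array([[-u_AB,  u_BA],
              [ u_AB, -u_BA]])
x0 = np.array([0.5, 0.5])              # noise distribution
x_A = np.array([(expm(A * k * dt) @ x0)[0] for k in range(steps)])

print(frac_A[-1], x_A[-1])             # both approach p_data_A = 1
```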

The plot shows the evolution of a population of $N=100$ CTMCs under this choice of $u_{AB}, u_{BA}$. The average number of CTMCs in states $A$ and $B$ is compared to the solution of the Kolmogorov forward equation.


Now, for whatever reason, this is not the most satisfactory way to sample from $p^{data}$. So, let's look at the "modern" way of sampling from $p^{data}$ and then later I will provide some intuition as to why this new way might be better.

Remark: Note that given the setting, this is a really silly way to sample from $\mathcal{S}$. If we are able to execute step 2 of simulating the Markov chain, making the CTMC jump between states $A$ and $B$, then in fact we can use the same jump mechanism to trivially sample from $\mathcal{S}$ directly. However, when the set $\mathcal{S}$ is continuous, this shortcut does not work. Therefore, one should use this approach only to gain intuition about what happens in the continuous-variable setting. In the continuous case, the CTMC is replaced by the Langevin equation. This is probably also the better way to do it when $\mathcal{S}$ has a much larger number of elements. But I am not an expert in this setting.


Denoising Diffusion Approach


Note that in the classical approach, the CTMC is initialized at the noise distribution $p^{noise}$ and the equilibrium distribution is designed to be approximately equal to $p^{data}$. In denoising diffusion we flip this and, instead of one CTMC, use two CTMCs that in conjunction achieve the same effect: a forward CTMC and a "reverse" CTMC. The forward process adds noise to the system, and the reverse process "denoises" it.


First, the forward process. In this case, we set $X(0) \in \lbrace X^{data}_1,...,X^{data}_N\rbrace$; that is, $X(0)$ is just taken from the existing sample set. Next, we choose $u_{AB}$ and $u_{BA}$ such that $x^{eq} = p^{noise}$. This is straightforward: for example, we can pick\[u_{AB} =  0.5C  = C p^{noise}_B ~~~u_{BA} = 0.5C  = C p^{noise}_A \]
Therefore, if we evolve the Markov chain $X(t)$ for long enough, the process "adds noise" and eventually $X(t)$ is in $A$ or $B$ with equal probability. We do this for each initialization $X_i(0) = X^{data}_i$.


Let the probability distribution in the forward phase be denoted by
\begin{align}
x^f(t) =
\begin{bmatrix}
x^f_A(t) \\ x^f_B(t)
\end{bmatrix}
\end{align}
Then we know that
\begin{align}
\lim_{t \rightarrow \infty}x^f(t) =
\begin{bmatrix}
0.5 \\ 0.5
\end{bmatrix}
\end{align}
Therefore, if $T$ is large enough, $x^f(T) \approx p^{noise}$.


Next, we define the "reverse" process. Our goal is to construct another Markov chain $X^r(t)$ that has $p^{noise}$ as its initial distribution and $x^f(T-t)$ as its probability distribution at time $t$. This is the key: if we can generate the probability density $x^f(T-t)$, we can go back from $p^{noise}$ to $p^{data}$. Can we construct such a Markov chain?
Turns out, yes!

Consider the differential equation satisfied by the time reversal $x^r(t) = x^f(T-t)$. It is easy to see that
\begin{align}
\dot{x}^r_A(t) = u_{AB}x^r_A(t) - u_{BA}x^r_B(t) \nonumber \\
\dot{x}^r_B(t) = -u_{AB}x^r_A(t) + u_{BA}x^r_B(t) \nonumber
\end{align}
Does this ODE correspond to a Markov chain? At first glance, the signs are all wrong: the coefficient of the term $x^r_A(t)$ in the first equation needs to be negative for it to denote the rate of transition from state $A$ to $B$, and similarly for $x^r_B(t)$. However, we can rewrite the equation so that it has the right signs.
\begin{align}
\dot{x}^r_A(t) = -\frac{ u_{BA}x^r_B(t)}{x^r_A(t)}x^r_A(t) + \frac{u_{AB}x^r_A(t)}{x^r_B(t)}x^r_B(t)  \nonumber \\
\dot{x}^r_B(t) =  \frac{ u_{BA}x^r_B(t)}{x^r_A(t)}x^r_A(t)  - \frac{u_{AB}x^r_A(t)}{x^r_B(t)}x^r_B(t)  \nonumber
\end{align}
Therefore, this equation can be seen as the Kolmogorov forward equation of a Markov chain with time-varying transition rates
\begin{align}
u^{r}_{AB}(t) =\frac{ u_{BA}x^r_B(t)}{x^r_A(t)} ~~ u^{r}_{BA}(t) = \frac{u_{AB}x^r_A(t)}{x^r_B(t)}
\end{align}
To run this Markov chain, we need $x^r(t)$ in order to evaluate the "scores"
\begin{align}
{\rm score} = \begin{bmatrix}
\frac{x^r_B(t)}{x^r_A(t)} \\ \frac{x^r_A(t)}{x^r_B(t)}
\end{bmatrix}
\end{align}
Of course, we don't have access to this quantity. But we can use the fact that $x^r(t) = x^f(T-t)$ to approximate it from the evolution of the Markov chains in the forward process, via the approximation
\begin{align}
x^r(t) = x^f(T-t) \approx \begin{bmatrix} \frac{N^f_A(T-t)}{N} \\ \frac{N^f_B(T-t)}{N}\end{bmatrix}
\end{align}
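In code, the reverse rates at time $t$ then come directly from the recorded forward fractions at time $T-t$; in this sketch, `frac_A_forward` is assumed to hold the trajectory of $N^f_A(t)/N$ from the forward runs:

```python
def reverse_rates(frac_A_forward, k, u_AB, u_BA, eps=1e-6):
    """Reverse-chain rates at step k from the forward occupancy fractions.

    frac_A_forward[j] approximates x^f_A(j * dt), so reading the list
    backwards gives x^r(t) = x^f(T - t). eps guards against dividing by zero.
    """
    xA = frac_A_forward[-1 - k]        # x^r_A at step k
    xB = 1.0 - xA
    u_r_AB = u_BA * xB / max(xA, eps)  # u^r_AB(t) = u_BA * x^r_B(t) / x^r_A(t)
    u_r_BA = u_AB * xA / max(xB, eps)  # u^r_BA(t) = u_AB * x^r_A(t) / x^r_B(t)
    return u_r_AB, u_r_BA
```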
Therefore, to finally sample from $p^{data}$, we can run the following algorithm (an end-to-end sketch follows the list):

  1. Run the forward process $X^f(t)$ $N$ times, once from each data sample $X^f_i(0) = X^{data}_i$.
  2. Record the fractions $\frac{N^f_A(t)}{N}$ and $\frac{N^f_B(t)}{N}$ over time.
  3. Initialize $X^r(0)$ by sampling according to the noise distribution $p^{noise}$.
  4. Run the reverse Markov chain $X^r(t)$ with rates $u^{r}_{AB}(t) = \frac{0.5\, C\, N^f_B(T-t)}{N^f_A(T-t)}, ~~ u^{r}_{BA}(t) = \frac{0.5\, C\, N^f_A(T-t)}{N^f_B(T-t)}$.
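Here is an end-to-end sketch of these four steps, with the data taken to be all $A$ (i.e. $p^{data} = [1 ~~ 0]^\top$, matching the experiment above); the probability clamping and the $\epsilon$ guard are numerical safeguards, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

C, dt, T, n = 1.0, 0.01, 10.0, 1000
steps = int(T / dt)
u_AB = u_BA = 0.5 * C                  # forward rates, so that x_eq = p_noise

# Steps 1-2: forward noising runs from the data (here: every sample is A).
X = np.ones(n, dtype=bool)             # True = state A, i.e. p_data = [1, 0]
frac_A = np.empty(steps + 1)
for k in range(steps):
    frac_A[k] = X.mean()               # record N^f_A(t) / N
    jump_AB = X & (rng.random(n) < u_AB * dt)
    jump_BA = ~X & (rng.random(n) < u_BA * dt)
    X = X ^ jump_AB ^ jump_BA
frac_A[steps] = X.mean()               # close to 0.5 for large enough T

# Step 3: initialize the reverse chains from p_noise by coin flips.
Y = rng.random(n) < 0.5

# Step 4: run the reverse chains with time-varying rates.
eps = 1e-6
for k in range(steps):
    xA = frac_A[steps - k]             # x^r_A(t) = x^f_A(T - t)
    xB = 1.0 - xA
    u_r_AB = u_BA * xB / max(xA, eps)
    u_r_BA = u_AB * xA / max(xB, eps)
    jump_AB = Y & (rng.random(n) < min(u_r_AB * dt, 1.0))
    jump_BA = ~Y & (rng.random(n) < min(u_r_BA * dt, 1.0))
    Y = Y ^ jump_AB ^ jump_BA

print("fraction of reverse chains in A:", Y.mean())   # should approach 1
```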

Visualizations of the solutions for the forward process and the reverse process are shown below.

The plot shows the evolution of a population of $N=100$ CTMCs according to the choice of $u_{AB}, u_{BA}$ in the forward phase of the denoising approach.


The plot shows the evolution of a population of $N=100$ CTMCs according to the choice of $u_{AB}, u_{BA}$ in the reverse phase of the denoising approach. The states eventually converge to state $A$, as desired.
 

Why Denoising Diffusion?

It is not completely clear why the method works so well in practice. Clearly, in this toy example it's hard to make the case that one approach is better than the other. But in high dimensions the denoising-based approach seems to work better.

This two-state Markov chain model might be a good playground for drawing theoretical insights. For example, if one assumes some bound on the rates, $0 \leq u_{AB}, u_{BA} \leq C$, the rate at which the Markov chain converges to its stationary distribution is bounded by $|\lambda_2| \leq C$. This quantity depends on the data distribution $p^{data}$. In the worst case, when $p^{data}$ has only $1$ or $0$ as its entries, we get $|\lambda_2| = 0.5 C$.

On the other hand, using the denoising diffusion approach, the rate of convergence instead depends on the choice of the noise distribution. If $T$ is large enough, the forward process always has a rate of convergence of $|\lambda_2| = C$, because the rate of convergence depends on $p^{noise}$ instead of $p^{data}$. You can see this difference in the plots above: in the classical approach, you need a larger time horizon $T$ to sample from the same distribution, as the behavior of the solutions of the differential equations shows. This argument can be made even stronger when the elements of $\mathcal{S}$ are associated with a graph. This reasoning is certainly very half-baked, because the score can take arbitrarily large values; it seems like cheating that we are bounding the rates in the classical method but not in the latter. Anyway, that's my two cents. Peace out.



 


 


