Diffused about Diffusion Models?

Nikhil Rasiwasia
6 min read · Nov 25, 2022


Ramping up in Diffusion-based Image Generation Models

Image generated by Dall-E 2 with the prompt “Diffused about Diffusion Models art”

I was a diffusion noob three weeks ago, but given the buzz, I wanted to jump on the diffusion train (and I feel I have managed to). The pace at which new developments are happening in the diffusion-based image generation (DbIG) space is mind-boggling, and it is hard to know where to start. In this post, I share my journey, which might be useful to others who want to build a strong foundation for understanding the world of diffusion models (DMs), including the math.

Note1: I will deliberately not discuss any of the techniques in detail, but instead chart a path from one paper to the next. I believe there is already an overdose of blogs/videos/papers that explain the various techniques. On the other hand, I did not find any blog that guides you in building a strong foundation in DbIG.

Note2: It took me about 3 weeks of dedicated effort to start from the fundamentals and build up from there. If you want a deep understanding, do dedicate around 2 weeks of your time, especially if you are unfamiliar with the math of Variational Auto Encoders and want to get an intuitive feel for the DM math.

Let’s begin.

Step-1: Early Diffusion Model

Deep Unsupervised Learning using Nonequilibrium Thermodynamics [2015] — This is the first paper that introduced the ideas behind ‘Diffusion probabilistic models’. While the paper is an easy read if you skip over the math, understanding the math requires familiarity with Variational Inference. I would recommend getting familiar with Variational Auto Encoders (VAEs) before following the math.

Variational Auto Encoders [Optional]: Although not a requirement for understanding diffusion models, a good understanding of VAEs helps in grasping the basic units of the diffusion process, and the math behind it.
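If VAEs are new to you, the two pieces of machinery worth internalizing (they reappear almost verbatim in the DM derivations) are the reparameterization trick and the closed-form KL term of the ELBO. A minimal numpy sketch (my own toy shapes, not from any particular codebase):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """VAE reparameterization trick: draw z ~ N(mu, sigma^2) as
    z = mu + sigma * eps, so gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), the regularizer
    in the VAE objective (ELBO)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu, log_var = np.zeros(8), np.zeros(8)
z = reparameterize(mu, log_var, rng)  # a sample from N(0, I)
```

When mu = 0 and log_var = 0 the KL term is exactly zero — the encoder's posterior already matches the prior.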

Alternate Interpretation [Optional]: Generative Modeling by Estimating Gradients of the Data Distribution [2019] is an alternate path to image generation that leads to the same end process as DMs. In the authors’ words: “We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching”.

Step-2: DDPM

DDPM: Denoising Diffusion Probabilistic Models [2020] — This is what started the craze around DMs for image generation.
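The core trick that makes DDPM training practical is that the forward (noising) process can be sampled in closed form at any timestep — no need to simulate the whole chain. A sketch of that closed form, using the linear noise schedule from the paper (the network is then simply trained to predict eps with an MSE loss):

```python
import numpy as np

# Linear noise schedule as in the DDPM paper: T = 1000 steps,
# beta growing from 1e-4 to 0.02.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product, shrinks toward 0

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))  # stand-in for a normalized image
xt, eps = q_sample(x0, t=T - 1, rng=rng)
# By t = T-1, alpha_bar is tiny, so x_t is almost pure Gaussian noise.
```

This is why generation can start from pure noise: the last step of the forward process has destroyed essentially all of the signal.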

Going Deeper into DDPM:

Going Deeper into Score Models [Optional]:

Step-3: Other Basics: U-net, Time-step Encoding, DDIM

U-Net: DDPM first used the U-Net architecture for DMs, which I think is as important as the diffusion process itself in helping generate high-quality images. Although understanding U-Net is not required for understanding the diffusion process, it is critical for following more advanced works (timestep encoding, text conditioning).

Time-step Encoding: Since DDPM uses the same U-Net model for all steps of diffusion de-noising, it is critical to feed the time-step into the U-Net. DDPM does this with the Transformer sinusoidal position embedding from Attention Is All You Need [2017]; later works inject conditioning through adaptive normalization layers in the style of A Style-Based Generator Architecture for Generative Adversarial Networks [2018]. One can also learn the details by reading the DDPM code.
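To make the embedding concrete, here is my sketch of the Transformer-style sinusoidal timestep embedding: half sines and half cosines at geometrically spaced frequencies, producing one fixed vector per timestep that gets added inside the U-Net blocks.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar diffusion timestep t into a
    dim-dimensional vector: [sin(t * f_0), ..., cos(t * f_0), ...],
    with frequencies f_i spaced geometrically from 1 down to ~1/10000."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(t=50, dim=128)  # one 128-d vector for step 50
```

Because the mapping is deterministic and smooth in t, nearby timesteps get nearby embeddings, which is what lets a single network behave differently at different noise levels.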

DDIM: Denoising Diffusion Implicit Models [Oct 2020] — A popular alternative sampling strategy for DMs, coming from the score-based literature; it makes sampling deterministic and allows skipping steps.
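My reading of the deterministic DDIM update (the eta = 0 case, in DDPM's alpha_bar notation) boils down to two lines: recover the model's current estimate of x_0, then re-noise it to an earlier timestep while reusing the predicted noise instead of drawing fresh randomness — which is exactly what permits jumping across timesteps.

```python
import numpy as np

def ddim_step(xt, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0).
    eps_pred is the U-Net's noise prediction at timestep t."""
    # Invert the closed-form forward process to estimate x_0.
    x0_pred = (xt - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise the estimate to the (less noisy) previous timestep,
    # reusing eps_pred rather than sampling new noise.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
```

A sanity check: if the network predicted the noise exactly and we jump all the way to alpha_bar = 1 (no noise), the step returns x_0 itself.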

Step-4: DM Being Established as the Default Choice for Image Generation

Step-5: Other Improvements

Step-6: Diffusion Model goes Mainstream

Three papers made diffusion models front-page material.

Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models [Dec 2021] — Open-sourced their code, which helped democratize DMs. Reduced computational cost by running diffusion in a compressed latent space, with conditioning via cross-attention. For understanding Stable Diffusion in detail, see The Illustrated Stable Diffusion.
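To make the “conditioning via cross-attention” part concrete, here is a single-head sketch (the shapes are toy values of my own, not Stable Diffusion’s actual dimensions): queries come from the image latents, keys and values from the text-encoder embeddings, so each spatial position can attend to the words of the prompt.

```python
import numpy as np

def cross_attention(latents, text_emb, Wq, Wk, Wv):
    """Single-head cross-attention: queries from image latents,
    keys/values from text embeddings."""
    Q, K, V = latents @ Wq, text_emb @ Wk, text_emb @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Softmax over text tokens, computed stably.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V  # each latent is now a mixture of text values

rng = np.random.default_rng(0)
latents = rng.standard_normal((64, 32))   # e.g. an 8x8 latent grid, flattened
text_emb = rng.standard_normal((77, 48))  # e.g. 77 text tokens
Wq, Wk, Wv = (rng.standard_normal(s) for s in [(32, 16), (48, 16), (48, 16)])
out = cross_attention(latents, text_emb, Wq, Wk, Wv)  # shape (64, 16)
```

The key design point: the text enters only through K and V, so the same U-Net machinery works unchanged whatever the conditioning signal is.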

Dall-E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents [Apr 2022] — Not open-source, but has an online demo. Added an additional step of conditioning on CLIP image embeddings, with a prior that converts CLIP text embeddings to image embeddings.

Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [May 2022] — Paper by Google, with the following modifications: use of text-only embeddings (T5), thresholded guidance, and cascaded models.

Step-7: Other Popular Diffusion papers for Image Generation till around Oct 2022 [Optional]

Finally, while DMs are capturing a bigger mind share for image generation, there are non-DM-based models with equally good results (e.g., Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors by FAIR).

That’s it, folks. Happy Diffusing.

I really enjoyed this magical journey of creating an image out of Big Bang radiation. If you know of a paper/blog/video that helped you get on board the diffusion train, please do share it with me.

Acknowledgements: I would like to sincerely thank Sen He, Jerry Wu and Tao Xiang for helping me in this exploration and pointing me in the right directions from time to time.

Final Note: I built this knowledge in a short amount of time, so there could be some errors in my understanding. Please let me know if anything I said here is factually incorrect.