Diffused about Diffusion Models?
Ramping Up on Diffusion-Based Image Generation Models
I was a diffusion noob three weeks ago, but given the buzz, I wanted to jump on the diffusion train (which I feel I now have). The pace at which new developments are happening in the diffusion-based image generation (DbIG) space is mind-boggling, and it is hard to know where to start. In this post, I share my journey, which might be useful to others who want to build a strong fundamental base for understanding the world of diffusion models (DM), including the math.
Note 1: I will deliberately not discuss any of the techniques in detail, but instead chart a path from one paper to the next. There is already an overdose of blogs/videos/papers covering the individual techniques; what I could not find is a blog that guides you in building a strong foundation in DbIG.
Note 2: It took me about three weeks of dedicated effort to start from the fundamentals and build up from the ground. If you want a deep understanding, do dedicate around two weeks of your time, especially if you are unfamiliar with the math of Variational Autoencoders and want to develop an intuitive feel for DM math.
Let’s begin.
Step-1: Early Diffusion Model
Deep Unsupervised Learning using Nonequilibrium Thermodynamics [2015] — This is the first paper to introduce the idea of 'diffusion probabilistic models'. While the paper is an easy read if you skip over the math, understanding the math requires familiarity with variational inference. I recommend getting familiar with Variational Autoencoders (VAE) first to follow the math.
Variational Autoencoders [Optional]: Although not a requirement for understanding diffusion models, a good grasp of VAEs helps in understanding the basic units of the diffusion process and the math behind it (a minimal sketch follows this list).
- Tutorials: An Introduction to Variational Autoencoders, Tutorial on Variational Autoencoders
- Papers: Auto-Encoding Variational Bayes
- Code: Variational Autoencoder with Pytorch, LATENT SPACES (Part-2): A Simple Guide to Variational Autoencoders
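To make the VAE objective concrete, here is a minimal PyTorch sketch of the negative ELBO (reconstruction term plus KL term) with the reparameterization trick. All names are illustrative, and the Gaussian encoder / Bernoulli decoder choice is an assumption made for simplicity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE: Gaussian encoder q(z|x), Bernoulli decoder p(x|z)."""
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)  # outputs [mu, log_var]
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        x_logits = self.dec(z)
        # Negative ELBO = reconstruction loss + KL(q(z|x) || N(0, I))
        recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return (recon + kl) / x.size(0)
```

The diffusion math in the papers above is, at its core, this same ELBO derivation repeated over many small noising steps.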
Alternate Interpretation [Optional]: Generative Modeling by Estimating Gradients of the Data Distribution [2019] is an alternative path to image generation that leads to the same end process as DMs. In the authors' words: "We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching" (sketched below).
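As a rough illustration of what "samples produced via Langevin dynamics" means, here is a minimal sketch; `score_fn` is a hypothetical stand-in for a learned score model approximating the gradient of the log data density:

```python
import torch

def langevin_sample(score_fn, x, n_steps=100, step_size=1e-4):
    """Vanilla (unadjusted) Langevin dynamics:
    x <- x + (step_size / 2) * score(x) + sqrt(step_size) * noise,
    where score_fn(x) approximates grad_x log p(x)."""
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        x = x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * noise
    return x
```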
Step-2: DDPM
DDPM: Denoising Diffusion Probabilistic Models [2020] — This is the paper that started the craze around DMs for image generation (a minimal training sketch follows the resources below).
Going Deeper into DDPM:
- Explanation of DDPM Paper — What are Diffusion Models? [Blog], Introduction to Diffusion Models for Machine Learning [Blog]
- The Math — Diffusion Models | Paper Explanation | Math Explained [YouTube], a video that covers the math in detail. Very useful for getting step-by-step insight into the math [Highly Recommended]
- Code — I still had some lingering confusion, which was cleared up by following the code and re-implementing DM using Diffusion Models | PyTorch Implementation [YouTube], Diffusion-Models-pytorch [GitHub], Diffusion models from scratch in PyTorch [YouTube]
- Understanding equivalence of DDPM and Score based generation — Generative Modelling by Estimating Gradients of the Data Distribution [Blog]
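To tie the above together, here is a minimal sketch of the DDPM forward process and the simplified epsilon-prediction training loss. The linear beta schedule matches the DDPM paper; `model(x_t, t)` is a hypothetical stand-in for the noise-predicting U-Net:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear schedule from DDPM
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative alpha-bar_t

def ddpm_loss(model, x0):
    """Simplified DDPM objective: predict the noise added at a random step t."""
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    ab = alphas_bar[t].view(-1, *[1] * (x0.dim() - 1))
    # Closed-form forward process: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return F.mse_loss(model(x_t, t), eps)
```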
Going Deeper into Score Models [Optional]:
- Improved Techniques for Training Score-Based Generative Models [2020]
- Score-Based Generative Modeling through Stochastic Differential Equations [2020]
Step-3: Other Basics: U-net, Time-step Encoding, DDIM
U-Net: DDPM first used the U-Net architecture for DM, which I think is as important as the diffusion process itself in producing high-quality images. Understanding U-Net is not required for understanding the diffusion process, but if you want to follow more advanced work (timestep encoding, text conditioning), it is critical to know how U-Net works (a toy sketch follows this list).
- U-Net: Convolutional Networks for Biomedical Image Segmentation [2015] — The U-Net Paper
- Fully Convolutional Networks for Semantic Segmentation [2014] — FCN paper which is the inspiration for U-Net
- Understanding U-Net in detail — Understanding U-Net architecture and building it from scratch [YouTube]
- De-convolutions — A guide to convolution arithmetic for deep learning, Up-sampling with Transposed Convolution, Deconvolution and Checkerboard Artifacts
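For intuition, here is a toy one-level "U-Net" in PyTorch, stripped down to the one idea that matters most here: the skip connection that concatenates encoder features into the decoder. This is an illustrative sketch, not the actual DDPM architecture (which adds residual blocks, attention, and many more levels):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy one-level U-Net: encoder, bottleneck, decoder, plus a skip
    connection that concatenates encoder features into the decoder."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc = nn.Conv2d(3, ch, 3, padding=1)
        self.down = nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1)          # downsample
        self.up = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)   # upsample
        self.dec = nn.Conv2d(ch * 2, 3, 3, padding=1)  # sees skip + upsampled features

    def forward(self, x):
        h1 = torch.relu(self.enc(x))    # encoder features (kept for the skip)
        h2 = torch.relu(self.down(h1))  # bottleneck at half resolution
        u = torch.relu(self.up(h2))     # back to input resolution
        return self.dec(torch.cat([u, h1], dim=1))  # skip connection
```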
Time-step Encoding: Since DDPM uses the same U-Net model for every de-noising step, it is critical to feed the time-step into the model. DDPM does this with the Transformer's sinusoidal position embedding; later work (Diffusion Models Beat GANs) conditions the normalization layers on it, in the spirit of A Style-Based Generator Architecture for Generative Adversarial Networks [2018]. The DDPM code is a good place to learn the details (a sketch of the embedding follows).
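For reference, a sketch of the Transformer-style sinusoidal embedding as it appears in DDPM-style codebases; the output is typically passed through a small MLP and added to the feature maps inside each residual block (exact details vary by implementation):

```python
import math
import torch

def timestep_embedding(t, dim=128):
    """Sinusoidal embedding of a batch of integer diffusion steps t.
    Returns a (batch, dim) tensor of interleaved sin/cos features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```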
DDIM: Denoising Diffusion Implicit Models [Oct 2020] — A popular alternative sampling strategy for DMs, rooted in the score-based literature, that enables much faster, deterministic sampling (sketched below).
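The deterministic (eta = 0) DDIM update is compact enough to sketch directly: predict x0 from the current sample and the model's noise estimate, then jump to the previous (possibly sub-sampled) timestep. Here `ab_t`/`ab_prev` are the cumulative alpha-bar values, assumed broadcastable against `x_t`:

```python
import torch

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0)."""
    # Invert the forward process to estimate the clean image x0
    x0_pred = (x_t - (1.0 - ab_t).sqrt() * eps_pred) / ab_t.sqrt()
    # Re-noise x0 directly to the previous (sub-sampled) timestep
    return ab_prev.sqrt() * x0_pred + (1.0 - ab_prev).sqrt() * eps_pred
```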
Step-4: DM Being Established as the Default Choice for Image Generation
- Improved Denoising Diffusion Probabilistic Models [Feb 2021] — Improvements to DDPM.
- Diffusion Models Beat GANs on Image Synthesis [May 2021] — Further improvements over IDDPM. This paper also introduced 'classifier guidance' (sketched below) to improve generation quality and provide a way to control the output. I believe this is what set the baseline for the follow-up work on DbIG.
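Here is a sketch of classifier guidance in epsilon-space, following the recipe in the Diffusion Models Beat GANs paper: nudge the predicted noise with the gradient of a noise-aware classifier's log-probability for the target class. `classifier(x, t)` is a hypothetical stand-in returning class logits, and `ab_t` is the cumulative alpha-bar at step t, assumed broadcastable:

```python
import torch

def classifier_guided_eps(classifier, x_t, t, y, eps, ab_t, scale=1.0):
    """Shift the predicted noise toward samples the classifier labels as y."""
    x_in = x_t.detach().requires_grad_(True)
    log_p = torch.log_softmax(classifier(x_in, t), dim=-1)
    selected = log_p[torch.arange(len(y)), y].sum()
    grad = torch.autograd.grad(selected, x_in)[0]  # grad_x log p(y | x_t)
    return eps - (1.0 - ab_t).sqrt() * scale * grad
```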
Step-5: Other Improvements
- Classifier-Free Diffusion Guidance [July 2022] — Improved results by conditioning the U-Net model directly and training with 'dropout'-style conditioning (randomly dropping the condition). This is an alternative to classifier guidance, which requires training a separate noise-aware image classifier (see the sketch after this list).
- Pseudo Numerical Methods for Diffusion Models on Manifolds [Sept 2021] — Improves sampling speed.
- Image Super-Resolution via Iterative Refinement [Apr 2021] — Not text-to-image generation itself, but key to understanding later image-conditioned DMs and the cascading used to improve image resolution.
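As promised above, a sketch of classifier-free guidance at sampling time: run the model once with the conditioning and once without (the "null" conditioning that was randomly dropped in during training), then extrapolate. The `model(x, t, cond)` signature and the use of `None` as the null condition are assumptions for illustration:

```python
import torch

def cfg_eps(model, x_t, t, cond, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    eps_cond = model(x_t, t, cond)     # conditional pass
    eps_uncond = model(x_t, t, None)   # unconditional (null-condition) pass
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```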
Step-6: Diffusion Models go Mainstream
Three papers made diffusion models front-page material.
Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models [Dec 2021] — Open-sourced their code, which helped democratize DMs. Reduced computational cost by running diffusion in a compressed latent space, and introduced conditioning via cross-attention. To understand Stable Diffusion in detail, see The Illustrated Stable Diffusion. A minimal usage sketch follows.
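Thanks to the open-sourcing, trying Stable Diffusion takes only a few lines with Hugging Face's diffusers library. A minimal usage sketch (the checkpoint ID and fp16/CUDA settings are assumptions; adjust to your setup):

```python
# pip install diffusers transformers
import torch
from diffusers import StableDiffusionPipeline

# The model ID is one public Stable Diffusion checkpoint; others work too.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```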
Dall-E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents [Apr 2022] — Not open-source, but available via an online demo. Added an extra step of conditioning on CLIP image embeddings, with a 'prior' model that converts CLIP text embeddings into image embeddings.
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [May 2022] — Paper by Google, with the following modifications: text-only embeddings from a frozen language model (T5), dynamic (thresholded) guidance, and a cascaded model. Dynamic thresholding is sketched below.
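Imagen's dynamic thresholding is simple enough to sketch: at each sampling step, clip the predicted x0 to a per-sample percentile of its absolute values and rescale, which prevents the pixel saturation caused by large guidance weights. A minimal version, assuming `x0_pred` has a leading batch dimension:

```python
import torch

def dynamic_threshold(x0_pred, p=0.995):
    """Clip predicted x0 to its p-th percentile of absolute values,
    then rescale back into [-1, 1] (Imagen-style dynamic thresholding)."""
    s = torch.quantile(x0_pred.abs().flatten(1), p, dim=1)
    s = torch.clamp(s, min=1.0).view(-1, *[1] * (x0_pred.dim() - 1))
    return torch.clamp(x0_pred, -s, s) / s
```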
Step-7: Other Popular Diffusion Papers for Image Generation up to around Oct 2022 [Optional]
- SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations [Aug 2021]
- Palette: Image-to-Image Diffusion Models [Nov 2021]
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [Dec 2021]
- Semantic Image Synthesis via Diffusion Models [June 2022]
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion [Aug 2022] [Textual Inversion]
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation [Aug 2022]
- Prompt-to-Prompt Image Editing with Cross Attention Control [Aug 2022]
- Imagic: Text-Based Real Image Editing with Diffusion Models [Oct 2022]
- MagicMix: Semantic Mixing with Diffusion Models [Oct 2022]
Finally, while DMs are taking a bigger share of the image-generation mindshare, there are non-DM models with equally good results (e.g., Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors by FAIR).
That's it, folks. Happy Diffusing.
I really enjoyed this magical journey of creating an image out of Big Bang radiation. If there is a paper/blog/video that helped you get onboarded to the diffusion train, please do share it with me.
Acknowledgements: I would like to sincerely thank Sen He, Jerry Wu and Tao Xiang for helping me in this exploration and pointing me in the right directions from time to time.
Final Note: I have built this knowledge in a short amount of time, so there could be some errors in my understanding. Please let me know if anything I said here is factually incorrect.