Project 5

Fun With Diffusion Models!

cover

Part A: The Power of Diffusion Models!

Part 0: Setup

The seed I chose is 80808. Here are the images I generated for this part with 20 inference steps:

original_generated_imgs

The quality of the images seems pretty consistent across different text prompts.

Here is the result of picking different numbers of inference steps:

100_steps

Part 1: Sampling Loops

1.1 Implementing the Forward Process

I implemented the forward process according to the spec.

forward_noising_camp
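For concreteness, the computation is just the usual forward-noising equation. Here's a minimal sketch (the `alphas_cumprod` tensor is whatever cumulative-product noise schedule the model provides; the names are mine):

```python
import torch

def forward_noise(im, t, alphas_cumprod):
    """Noise a clean image to timestep t:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return torch.sqrt(abar_t) * im + torch.sqrt(1.0 - abar_t) * eps
```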

1.2 Classical Denoising

I tried to denoise the noised image with Gaussian filtering. This worked about as well as expected: the blur suppresses the noise, but it washes out real image detail along with it.

classical_denoising_camp
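For reference, the classical denoiser is just a Gaussian blur; something like the snippet below, where the kernel size and sigma are illustrative rather than the exact values I used:

```python
import torchvision.transforms.functional as TF

def classical_denoise(noisy_im, kernel_size=5, sigma=2.0):
    # Low-pass filtering suppresses high-frequency noise,
    # but it also blurs away high-frequency image detail.
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```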

1.3 One-Step Denoising

I implemented the one-step denoising process. I noticed that single-step denoising on an image with a lot of noise changed the content of the image.

single_step_denoising_camp
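The clean-image estimate just inverts the forward-noising equation, using the model's noise prediction in place of the true noise. A sketch (here `eps` stands in for whatever the UNet predicts; the function name is mine):

```python
import torch

def one_step_denoise(x_t, t, eps, alphas_cumprod):
    """Solve x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for x_0."""
    abar_t = alphas_cumprod[t]
    return (x_t - torch.sqrt(1.0 - abar_t) * eps) / torch.sqrt(abar_t)
```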

1.4 Iterative Denoising

I implemented the iterative denoising process according to the spec. Here's a gif created from every frame of the denoising process.

iterative_denoising_camp
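Each step of the loop moves from a noisier timestep t to a less noisy one t'. Here's a sketch of the single update I used, with the added-variance term left as a parameter (names are mine, not from any library):

```python
import torch

def iterative_denoise_step(x_t, t, t_prev, eps, alphas_cumprod, v_sigma=0.0):
    """One strided denoising step from timestep t to t_prev (t_prev < t).
    `eps` is the model's noise estimate at (x_t, t)."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev
    beta_t = 1.0 - alpha_t
    # Current estimate of the clean image.
    x0_hat = (x_t - torch.sqrt(1.0 - abar_t) * eps) / torch.sqrt(abar_t)
    return ((torch.sqrt(abar_prev) * beta_t / (1.0 - abar_t)) * x0_hat
            + (torch.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)) * x_t
            + v_sigma)
```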

And here's a comparison of the iterative denoising and other denoising methods:

iter_denoise_comparison

1.5 Diffusion Model Sampling

By applying the iterative denoising steps that I defined in part 1.4 to pure noise, I was able to generate new images.

generated_imgs_1 generated_imgs_2 generated_imgs_3 generated_imgs_4 generated_imgs_5

1.6 Classifier-Free Guidance (CFG)

I implemented classifier-free guidance according to the spec.

generated_cfg_1 generated_cfg_2 generated_cfg_3 generated_cfg_4 generated_cfg_5
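The CFG combination itself is tiny; the real work is running the UNet twice per step, once with the text prompt and once with the null prompt. A sketch (the γ value here is just illustrative):

```python
def cfg_noise_estimate(eps_cond, eps_uncond, gamma=7.0):
    """Classifier-free guidance: extrapolate past the conditional estimate.
    gamma = 1 recovers plain conditional sampling; gamma > 1 strengthens
    the effect of the prompt."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```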

1.7 Image-to-image Translation

I applied various amounts of noise to the test image to get the following results.

image_to_image_translation
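The procedure, for this and the examples below, is just the forward process from 1.1 followed by the iterative denoising loop from 1.4, started partway through the schedule. A sketch, with the two earlier pieces passed in as callables:

```python
def sdedit(im, i_start, timesteps, forward_noise_fn, denoise_loop_fn):
    """Image-to-image translation: noise a real image up to timesteps[i_start],
    then run the usual iterative denoising from that point. The more noise we
    add, the more freedom the model has to change the image."""
    t = timesteps[i_start]
    x_t = forward_noise_fn(im, t)          # partially noise the input image
    return denoise_loop_fn(x_t, i_start)   # denoise back to a clean image
```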

Here is the result of applying different amounts of noise to Cheems.

image_to_image_translation

And here are the results for a picture I found on a cool Wikipedia page about ecdysis.

image_to_image_translation

1.7.1 Editing Hand-Drawn and Web Images

The image I chose from the web is Nyan Cat.

image_to_image_translation

Here are the results of editing two images I drew.

image_to_image_translation
image_to_image_translation

1.7.2 Inpainting

Here is the inpainted image of the Campanile.

image_to_image_translation
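The only change to the denoising loop for inpainting is that, after every step, the pixels outside the mask are forced back to an appropriately noised copy of the original image. A sketch of that forcing step (names are mine):

```python
def inpaint_force(x_t, x_orig, mask, t, forward_noise_fn):
    """Keep generated content only where mask == 1; everywhere else, replace
    it with the original image noised to the current timestep so it matches
    the noise level of the rest of the loop."""
    return mask * x_t + (1.0 - mask) * forward_noise_fn(x_orig, t)
```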

I wondered how the diffusion model would fill in Nyan Cat's rainbow trail. I was sort of disappointed.

image_to_image_translation

I also wondered how the diffusion model would fill in Aphex Twin's face; I was similarly disappointed.

image_to_image_translation

1.7.3 Text-Conditional Image-to-image Translation

I just turned everything into a rocket.

image_to_image_translation
image_to_image_translation
image_to_image_translation

1.8 Visual Anagrams

I was pleasantly surprised by how well this part worked. Here is my result for a visual anagram where, in one orientation, "an oil painting of people around a campfire" is displayed and, when flipped, "an oil painting of an old man" is displayed.

image_to_image_translation
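The idea is to average two noise estimates: one computed on the image as-is with the first prompt, and one computed on the flipped image with the second prompt (flipped back before averaging). Roughly, with `noise_fn` standing in for the (CFG'd) UNet prediction:

```python
import torch

def anagram_noise_estimate(x_t, t, prompt_1, prompt_2, noise_fn):
    """Visual anagram: one prompt governs the upright image, the other
    governs the upside-down image. Flipping is over the height axis."""
    eps_1 = noise_fn(x_t, t, prompt_1)
    eps_2 = torch.flip(noise_fn(torch.flip(x_t, dims=[-2]), t, prompt_2), dims=[-2])
    return (eps_1 + eps_2) / 2.0
```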

Here are my other results:

image_to_image_translation
image_to_image_translation

(NOTE: I'm well aware that the left image is not a pencil, but I plan to keep it this way because I think it's funny.)


1.9 Hybrid Images

Although I got odd results, I'm pretty sure I implemented this part correctly. To back that up, I'll walk through my steps (a short code sketch of the same procedure follows the list).

  1. Given my two prompts, \(p_1\) and \(p_2\), I found noise estimates \(\epsilon_1(x_t, t, p_1)\) and \(\epsilon_2(x_t, t, p_2)\).

  2. I convolved \(\epsilon_1\) and \(\epsilon_2 \) with a Gaussian kernel (\(k=33, \sigma = 2\), as recommended) to get the low-frequency components, \(\epsilon_1^{(LF)}\) and \(\epsilon_2^{(LF)}\).

  3. I got the high-frequency component \(\epsilon_2^{(HF)} = \epsilon_2 - \epsilon_2^{(LF)}\).

  4. I added \(\epsilon_1^{(LF)}\) and \(\epsilon_2^{(HF)}\) together to get the hybrid noise estimate \(\epsilon_{\text{hybrid}}\).

  5. I used \( \epsilon_{\text{hybrid}} \) to denoise the image at time \(t\).
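Here is the same procedure (steps 2-4 above) as a short sketch, using torchvision's Gaussian blur as the low-pass filter:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(eps_1, eps_2, kernel_size=33, sigma=2.0):
    """Combine the low frequencies of eps_1 with the high frequencies of eps_2."""
    eps_1_lf = TF.gaussian_blur(eps_1, kernel_size=kernel_size, sigma=sigma)
    eps_2_lf = TF.gaussian_blur(eps_2, kernel_size=kernel_size, sigma=sigma)
    eps_2_hf = eps_2 - eps_2_lf
    return eps_1_lf + eps_2_hf
```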

Here is the result I got for the skull and waterfall:

hybrid_imgs

Here are the other results I got:

hybrid_imgs hybrid_imgs hybrid_imgs

Part B: Diffusion Models from Scratch!

Part 1: Training a Single-Step Denoising UNet

1.1 Implementing the UNet

I'm not sure what to show for this part. But I did it!
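For what it's worth, here is roughly the shape of the thing, written as a much-simplified sketch rather than the exact block structure from the spec (channel counts and layer choices here are illustrative):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> GELU, keeping spatial size (a simplified stand-in
    for the spec's building blocks)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

class TinyUNet(nn.Module):
    """Rough shape of the denoiser: downsample twice, pass through a bottleneck,
    upsample twice with skip connections from the matching encoder levels."""
    def __init__(self, in_ch=1, hidden=64):
        super().__init__()
        self.enc1 = ConvBlock(in_ch, hidden)
        self.enc2 = ConvBlock(hidden, hidden * 2)
        self.pool = nn.AvgPool2d(2)
        self.mid = ConvBlock(hidden * 2, hidden * 2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec2 = ConvBlock(hidden * 4, hidden)     # concat with enc2 skip
        self.dec1 = ConvBlock(hidden * 2, hidden)     # concat with enc1 skip
        self.out = nn.Conv2d(hidden, in_ch, kernel_size=3, padding=1)

    def forward(self, x):
        s1 = self.enc1(x)                 # 28x28
        s2 = self.enc2(self.pool(s1))     # 14x14
        m = self.mid(self.pool(s2))       # 7x7
        d2 = self.dec2(torch.cat([self.up(m), s2], dim=1))   # 14x14
        d1 = self.dec1(torch.cat([self.up(d2), s1], dim=1))  # 28x28
        return self.out(d1)
```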


1.2 Using the UNet to Train a Denoiser

Given a clean image, \(x\), I noised it to get \(z = x + \sigma \epsilon\), where \(\epsilon \sim \mathcal{N}(0, 1)\), and \(\sigma \in \{0.0, 0.2, 0.4, 0.6, 0.8, 1.0\}\).

noisy_imgs

Using this, I created a dataloader for training that outputs batches of \((x, z)\) pairs with \(\sigma = 0.5\).

noisy_imgs
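A sketch of that dataloader, with the noising done on the fly inside the dataset (the data path and batch size here are illustrative):

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms

class NoisyMNIST(Dataset):
    """Yields (clean, noisy) pairs z = x + sigma * eps with a fixed sigma."""
    def __init__(self, root="./data", train=True, sigma=0.5):
        self.base = datasets.MNIST(root, train=train, download=True,
                                   transform=transforms.ToTensor())
        self.sigma = sigma

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, _ = self.base[idx]                     # labels are unused for the denoiser
        z = x + self.sigma * torch.randn_like(x)  # z = x + sigma * eps
        return x, z

loader = DataLoader(NoisyMNIST(), batch_size=256, shuffle=True)
```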

1.2.1 Training

I trained the UNet for 5 epochs, as specified, and plotted the training loss.

noisy_imgs
I sampled after 1 epoch.
noisy_imgs
And after 6 epochs.
noisy_imgs

Pretty good!
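For completeness, the training loop itself is a plain L2 denoising objective; a sketch (the optimizer settings here are illustrative):

```python
import torch
import torch.nn as nn

def train_denoiser(model, loader, epochs=5, lr=1e-4, device="cpu"):
    """Push model(z) toward the clean image x with an MSE loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    losses = []
    model.to(device).train()
    for _ in range(epochs):
        for x, z in loader:
            x, z = x.to(device), z.to(device)
            loss = loss_fn(model(z), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())
    return losses
```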


1.2.2 Out-of-Distribution Testing

Then I tried denoising images noised with \(\sigma \neq 0.5\).

noisy_imgs

Still, not bad!

Part 2: Training a Diffusion Model

2.1 Adding Time Conditioning to UNet

I followed the spec and implemented time conditioning. As the spec recommended, I used a hidden dimension of 64 and a batch size of 128. For optimization, I used Adam with a learning rate of 0.001 and an exponential learning rate scheduler with \(\gamma = 0.9\).
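The conditioning itself is a small MLP on the normalized timestep whose output gets broadcast into intermediate feature maps. A sketch of the block and of where it plugs in (the exact injection points are a simplification of what I did):

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps the (normalized) timestep to a per-channel vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        return self.net(t)

# Schematically, inside the UNet's forward pass:
#   t = t / num_timesteps                  # normalize to [0, 1]
#   emb = fc_block(t.view(-1, 1))          # shape (B, C)
#   h = h + emb.view(-1, C, 1, 1)          # broadcast-add into a feature map
```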


2.2 Training the UNet

Here is the training loss for the time-conditioned UNet over 20 epochs of training.
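The procedure behind that curve is standard DDPM training: pick a random timestep, noise the clean digit to that timestep, and regress the UNet's output onto the injected noise. A sketch, using the hyperparameters from 2.1 (T = 300 and a precomputed `alphas_cumprod` tensor, indexable from 1 to T, are assumptions on my part):

```python
import torch
import torch.nn as nn

def train_time_conditioned(model, loader, alphas_cumprod, num_ts=300,
                           epochs=20, lr=1e-3, gamma=0.9, device="cpu"):
    """Train the time-conditioned UNet to predict the noise added to x_0."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=gamma)
    loss_fn = nn.MSELoss()
    abars = alphas_cumprod.to(device)
    model.to(device).train()
    for _ in range(epochs):
        for x0, _ in loader:
            x0 = x0.to(device)
            t = torch.randint(1, num_ts + 1, (x0.shape[0],), device=device)
            abar = abars[t].view(-1, 1, 1, 1)
            eps = torch.randn_like(x0)
            xt = torch.sqrt(abar) * x0 + torch.sqrt(1.0 - abar) * eps
            loss = loss_fn(model(xt, t.float() / num_ts), eps)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()  # decay the learning rate once per epoch
    return model
```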


2.3 Sampling from the UNet

Here are my results after 5 and 20 epochs of training.
time_cond_samples time_cond_samples_2
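For reference, sampling starts from pure noise and walks the schedule backwards, adding fresh noise at every step except the last. A sketch (the schedule tensors `alphas`, `betas`, and `alphas_cumprod` are assumed precomputed and indexable from 0 to T):

```python
import torch

@torch.no_grad()
def sample(model, alphas, betas, alphas_cumprod, num_ts=300,
           shape=(16, 1, 28, 28), device="cpu"):
    """Ancestral sampling with the time-conditioned UNet, t = T, ..., 1."""
    x = torch.randn(shape, device=device)
    for t in range(num_ts, 0, -1):
        t_vec = torch.full((shape[0],), float(t), device=device)
        eps = model(x, t_vec / num_ts)                       # predicted noise
        abar, abar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        x0_hat = (x - torch.sqrt(1.0 - abar) * eps) / torch.sqrt(abar)
        x = ((torch.sqrt(abar_prev) * betas[t] / (1.0 - abar)) * x0_hat
             + (torch.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar)) * x)
        if t > 1:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # no noise at the last step
    return x
```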

2.4 Adding Class-Conditioning to UNet

I implemented class conditioning according to the spec. I used the same hyperparameters as in part 2.2 and got the following training loss over 20 epochs.
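As a side note on the mechanism itself: the class signal is a one-hot vector that gets dropped (zeroed out) some fraction of the time during training, so the same network also learns the unconditional estimate needed for CFG at sampling time. A sketch of the conditioning input (the dropout probability here is an assumption):

```python
import torch
import torch.nn.functional as F

def make_class_cond(labels, num_classes=10, p_uncond=0.1):
    """One-hot class vectors with CFG dropout: with probability p_uncond a
    sample's conditioning is zeroed, so the model also learns the
    unconditional denoising task."""
    c = F.one_hot(labels, num_classes).float()
    drop = (torch.rand(labels.shape[0], 1, device=labels.device) < p_uncond).float()
    return c * (1.0 - drop)

# Schematically, inside the UNet the class vector modulates a feature map
# multiplicatively while the timestep embedding is added:
#   h = fc_class(c).view(-1, C, 1, 1) * h + fc_time(t).view(-1, C, 1, 1)
```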

2.5 Sampling from the Class-Conditioned UNet

Here are my results after 5 and 20 epochs of training.

class_cond_samples class_cond_samples_2

Frankly, the sampled results do not look as good as I would have wanted them to. My handwriting is also pretty bad, so maybe it's not my place to be so critical.