CS180 Project 5

Using a Diffusion Model to Generate Images

0. Sampling

Here are some samples generated by the DeepFloyd IF model, along with the captions used to generate them. NOTE: In this project, all images were generated with random seed 1725. (These three images may have been generated with seed 181. Sorry!)
Caption: an oil painting of a snowy mountain village
Caption: a man wearing a hat
Caption: a rocket ship
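For reference, a minimal sketch of loading and sampling from the stage-1 model through the Hugging Face diffusers library might look like the following; the specific arguments (fp16 variant, device handling) are assumptions about a typical setup rather than the exact code used for these samples.

    import torch
    from diffusers import DiffusionPipeline

    # Load the stage-1 DeepFloyd IF model (64x64 outputs); the fp16 settings
    # are illustrative and require a GPU plus acceptance of the model license.
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
    ).to("cuda")

    generator = torch.manual_seed(1725)   # the project-wide random seed
    images = stage_1(
        "an oil painting of a snowy mountain village", generator=generator
    ).images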

1. Adding Noise

In this step we add increasing levels of noise to an image of the Berkeley Campanile, and parameterize the noise level with an integer ranging from 0 to 1000, following the original paper. We'll see later what the model is able to do about this.
Noise level 0:
Noise level 250:
Noise level 500:
Noise level 750:
A noise level of 1000 corresponds to a completely random image, where the pixel values are sampled from a Gaussian.
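Concretely, the noising follows the standard forward-process formula x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps drawn from a standard Gaussian. A minimal sketch, assuming a precomputed 1-D tensor alphas_cumprod holding the cumulative products alpha_bar_t:

    import torch

    def forward_noise(x0, t, alphas_cumprod):
        """Noise a clean image x0 to integer timestep t.

        x0:             clean image tensor (values roughly in [-1, 1])
        t:              integer noise level in [0, T)
        alphas_cumprod: 1-D tensor of cumulative alpha products, length T
        """
        eps = torch.randn_like(x0)                  # standard Gaussian noise
        a_bar = alphas_cumprod[t]
        x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps
        return x_t, eps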

2. Classical Denoising

Here's an attempt at denoising using a Gaussian filter, the way we learned to for project 2: Noise level 250:
Noise level 500:
Noise level 750:
As you can see, the results aren't very good, and the blurring doesn't, in general, remove the staticky look of the noise.
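For what it's worth, the classical attempt amounts to nothing more than a Gaussian blur; a sketch using torchvision, where the kernel size and sigma are illustrative values tuned by eye per noise level:

    from torchvision.transforms import functional as TF

    # noisy_image is a (C, H, W) tensor at some noise level; blurring trades
    # static for blur but cannot recover the lost detail.
    denoised_classical = TF.gaussian_blur(noisy_image, kernel_size=5, sigma=2.0)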

3. Denoising with a Diffusion Model (in one step)

Here we use a diffusion model to predict the noise, which effectively allows us to predict the original image. The original image is found on the right. Noise level 250:
Noise level 500:
Noise level 750:
As you can see, the reconstruction is a lot more natural and image-like. (Of course the reconstruction can't be perfect because information is lost to the noising, so to make things image-like the model necessarily has to hallucinate some details.)
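The one-step estimate just inverts the forward-process formula from section 1 using the model's noise prediction. A minimal sketch, where noise_model(x_t, t) is assumed to return the predicted noise:

    import torch

    def one_step_denoise(noise_model, x_t, t, alphas_cumprod):
        """Estimate the clean image x0 from a noisy image x_t in one step."""
        eps_hat = noise_model(x_t, t)               # model's noise prediction
        a_bar = alphas_cumprod[t]
        # Invert x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
        return (x_t - torch.sqrt(1 - a_bar) * eps_hat) / torch.sqrt(a_bar)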

4. Denoising with a Diffusion Model (in multiple steps)

Here we instead take advantage of the time-conditioning input to diffusion models, which allows us to iteratively denoise the image. This is a lot more computationally expensive, but the result looks nicer because the model has more compute available to recreate fine features. Here are some of the intermediate results, followed by the final reconstruction:
And at the same noise level, here are the one-shot model attempt, the classical Gaussian filter attempt, and the original image:
As described, the multi-step process creates a much better-looking image, though with this much noise it had to hallucinate a lot, so we actually have a picture of a substantially different tower now.
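Roughly, the iterative version repeats the one-step estimate along a strided list of timesteps, each time blending the clean-image estimate with the current noisy image via the DDPM posterior mean. A sketch under those assumptions (the small amount of fresh noise added at intermediate steps is omitted for brevity):

    import torch

    def iterative_denoise(noise_model, x_t, strided_timesteps, alphas_cumprod):
        """Denoise x_t by stepping through a strided list of timesteps.

        strided_timesteps is assumed to run from the noisiest level down to 0,
        e.g. [990, 960, ..., 30, 0].
        """
        for i in range(len(strided_timesteps) - 1):
            t, t_next = strided_timesteps[i], strided_timesteps[i + 1]
            a_bar, a_bar_next = alphas_cumprod[t], alphas_cumprod[t_next]
            alpha = a_bar / a_bar_next
            beta = 1 - alpha

            eps_hat = noise_model(x_t, t)                       # predicted noise
            x0_hat = (x_t - torch.sqrt(1 - a_bar) * eps_hat) / torch.sqrt(a_bar)

            # Posterior mean: a blend of the clean estimate and the current
            # noisy image (added-noise term omitted here).
            x_t = (torch.sqrt(a_bar_next) * beta / (1 - a_bar) * x0_hat
                   + torch.sqrt(alpha) * (1 - a_bar_next) / (1 - a_bar) * x_t)
        return x_t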

5. Sampling With a Diffusion Model

Given that the model removes noise from an image, what happens if we feed in an image of pure noise? (Well, we can generate a new image from scratch.) This particular model actually also has a text-conditioning feature, which allows us to input a prompt to condition the generation on. For these generations, we gave the generic prompt "a high quality image".
These things look vaguely like real images, but they are very blurry and have weird artifacts.
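Sampling from scratch is then just the iterative denoiser sketched in section 4 started from pure Gaussian noise; the 64x64x3 shape here is an assumption about the stage-1 model's resolution:

    import torch

    # Start from pure noise at the highest noise level and denoise all the way.
    x_T = torch.randn(1, 3, 64, 64)
    sample = iterative_denoise(noise_model, x_T, strided_timesteps, alphas_cumprod)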

6. Classifier-Free Guidance

The text conditioning affects the noise estimate output by the model. If we compare this estimate to the estimate from a baseline text prompt (literally "", the empty string), we get a measure of what the text conditioning does. If we simply increase the weight of the text conditioning, we can increase its effect, and improve the quality of the image. Here is the same process as the above, but with the text conditioning weight increased by a factor of 5:
These images are clearly way better, and look like actual images.
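In formula form, classifier-free guidance replaces the noise estimate with eps = eps_uncond + gamma * (eps_cond - eps_uncond), where gamma = 5 here. A sketch, assuming noise_model(x_t, t, emb) returns a noise estimate conditioned on a text embedding:

    def cfg_noise_estimate(noise_model, x_t, t, cond_emb, uncond_emb, gamma=5.0):
        """Classifier-free guidance: amplify the effect of the text prompt."""
        eps_cond = noise_model(x_t, t, cond_emb)      # conditioned on the prompt
        eps_uncond = noise_model(x_t, t, uncond_emb)  # conditioned on ""
        return eps_uncond + gamma * (eps_cond - eps_uncond)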

7. Image-to-Image Translation

In section 4 above, we saw that when we added noise to an image and then had the model denoise it, we got a slightly different image than the original. We can do this intentionally to create images of varying degrees of similarity to a given image, for fun. Here we do this with the picture of the Campanile again. On a rough scale from 1 to 33 in degree of similarity to the original, here are images of similarity 1, 3, 5, 7, 10, and 20:
Now we repeat this with this cool penguin, as well as some custom hand-drawn images:
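These image-to-image edits are just the forward-noise and iterative-denoise sketches from earlier chained together; going by the similarity scale described above, a larger starting index means less added noise and therefore more similarity to the original:

    def sdedit(noise_model, x0, i_start, strided_timesteps, alphas_cumprod):
        """Noise x0 partway (to strided_timesteps[i_start]) and denoise from there."""
        t_start = strided_timesteps[i_start]
        x_t, _ = forward_noise(x0, t_start, alphas_cumprod)
        return iterative_denoise(noise_model, x_t,
                                 strided_timesteps[i_start:], alphas_cumprod)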
We can also inpaint a specific region of an image, by just noising the image and then constraining the model's reconstruction to only change the region. For example, here we decide to inpaint just the top part of the Campanile:
Here, we overwrite the right side of the image to reinterpret this person's stance:
Here's a slightly ridiculous output of the model which puts the wrong kind of face on this person:
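The inpainting constraint itself is a one-liner applied after every denoising step, using the forward-noising helper from the section 1 sketch: keep the model's output inside the mask and reset everything outside it to a freshly noised copy of the original.

    def inpaint_constraint(x_t, x_orig, mask, t, alphas_cumprod):
        """Only the region where mask == 1 is allowed to change."""
        x_orig_t, _ = forward_noise(x_orig, t, alphas_cumprod)
        return mask * x_t + (1 - mask) * x_orig_t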
Finally, we can also prompt the inpaint to steer the changes to the image in a specific way. Here's the photo of the Campanile again, but with the inpaint steered to replace the tower with a rocket. As before, we can change the degree of noising to affect the amount of required similarity to the original image.
Here's a cat I edited to look like a dog.
Here's a guy I edited to have a hat on.

8. Visual Anagrams

We can pull all sorts of tricks with the steering of the generation. In this section, in each iteration of the denoising, we estimate the noise for the image according to one prompt, then flip the image upside down and estimate the noise according to another prompt, and combine the two estimates. When this works, the result is one of those fun illusions where the image looks like something right side up, but something else upside down. This is an old man right side up, but people around a campfire upside down:
This is a skull right side up, but a hipster barista upside down:
This is a pair of cabins in a snowy village right side up, but a pair of dogs upside down:
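Here is a sketch of the per-step noise combination behind these anagrams, again assuming noise_model(x_t, t, emb) returns a text-conditioned noise estimate; the two estimates are averaged, with the upside-down one flipped back first:

    import torch

    def anagram_noise_estimate(noise_model, x_t, t, emb_upright, emb_flipped):
        """One prompt steers the upright image, the other steers it upside down."""
        eps_up = noise_model(x_t, t, emb_upright)
        eps_down = noise_model(torch.flip(x_t, dims=[-2]), t, emb_flipped)
        return (eps_up + torch.flip(eps_down, dims=[-2])) / 2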

9. Hybrid Images

Likewise, since we generate the images by having the model predict the noise, we can manipulate the noise with high/low pass filters to create images which have (sort of) been steered according to different prompts for different frequencies.

Here is something that is a waterfall close up, but a skull from a distance:

Here is something which is a bunch of people surrounding a fire close up, but a dog from a distance:

Here is something which is a rocket ship close up, but a pencil from a distance:

(I really like this one, because the way of fitting the constraints of both prompts is a little clever: the body of the pencil is the exhaust of the ship, and the ship is just the graphite.)

This effect isn't always as simple as our interpretation. For example, whether at low or high pass, it's not very easy to make a dog look like a man with this number of pixels, and the opposing prompts don't do any "planning" to jointly fit the constraints. So any attempt to get "a dog close up, but a man from a distance" creates these cursed dog-men.

This concludes the messing around with the model.
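As a final aside on Part A, here is a rough sketch of the frequency-split noise combination behind these hybrid images; gaussian_blur stands in for the low-pass filter, and the kernel size and sigma are illustrative rather than the values actually used:

    from torchvision.transforms import functional as TF

    def hybrid_noise_estimate(noise_model, x_t, t, emb_far, emb_near,
                              kernel_size=33, sigma=2.0):
        """Low frequencies follow the 'from a distance' prompt; high
        frequencies follow the 'close up' prompt."""
        eps_far = noise_model(x_t, t, emb_far)
        eps_near = noise_model(x_t, t, emb_near)
        low = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
        high = eps_near - TF.gaussian_blur(eps_near, kernel_size=kernel_size,
                                           sigma=sigma)
        return low + high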

Part B: Training a Diffusion Model

In this section of the project, we train a diffusion model to generate images of handwritten digits, based on the classic MNIST dataset. The training procedure is pretty straightforward. From the description of the task above, note that all it really takes to create training data of a generic type is to apply Gaussian noise to existing data. (Text constraints and such are a little more complicated.) To start with, here is a sample of the training data, as well as the way the datapoints look at varying levels of noise:

First, we train a U-net to predict the noise in the images in one shot. (Since this task was easy, the batch size was really large and the number of batches it took to converge was small.) Here is the training loss:

It should be noted that the training data for this particular model consisted only of data that was noised at sigma = 0.5. Here are some sample denoises from the model after training for 1 epoch:

and for 5 epochs:

We also try to use this model to denoise images that were noised at a different sigma. Surprisingly, it works really well:
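For concreteness, here is a rough sketch of the kind of training loop described above; unet is assumed to be the project's 28x28 single-channel U-net, and the batch size and learning rate are illustrative rather than the values actually used:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    sigma = 0.5
    loader = DataLoader(
        datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor()),
        batch_size=256, shuffle=True)
    optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
    criterion = nn.MSELoss()

    for epoch in range(5):
        for x, _ in loader:
            eps = torch.randn_like(x)
            z = x + sigma * eps                # noise the clean digits
            loss = criterion(unet(z), eps)     # predict the added noise
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()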

Time Conditioning

But this model, which is intrinsically one-shot, can't be used to perform the iterative denoising process that we used in part A. Instead, we need to bake some context into the model that tells it approximately how much noise was added to the image. This is called time-conditioning, and we basically achieve it by injecting an embedding of the noise level (the timestep) into the model. As instructed, since this was a much harder task, we trained the model for more epochs and with a different learning rate schedule. Here is the training loss for a model that was time-conditioned in this way:

And here are some sample (single-iteration) attempts from the model to denoise:

Finally, here are some denoised samples from the model after training for 5 epochs:

and for 20 epochs:
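The injection itself can be as simple as a tiny MLP that maps the normalized timestep to a per-channel offset added onto an intermediate U-net activation; a sketch of such a block (where exactly it plugs into the U-net follows the project spec and isn't shown here):

    import torch
    import torch.nn as nn

    class FCBlock(nn.Module):
        """Embed a scalar conditioning signal as a (B, C, 1, 1) tensor that
        can be added (broadcast) onto a convolutional feature map."""
        def __init__(self, out_channels):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(1, out_channels), nn.GELU(),
                                     nn.Linear(out_channels, out_channels))

        def forward(self, t):
            # t: (B,) float tensor, timestep normalized to [0, 1]
            return self.net(t[:, None])[:, :, None, None]

Inside the U-net's forward pass this would be used as something like h = h + self.t_block(t) at one or two of the decoding stages.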

Class Conditioning

We can condition the model on a class label as well, by just injecting an embedding of the one-hot label into the model. In addition, we sometimes inject an embedding of the zero vector, so that the model can still generate the images without being conditioned. The fact that the model has a "null" embedding also allows us to do the classifier-free guidance trick, the way we did in part A. Here is the training loss for this model:

Here are some sample denoises from the model:

And here are some denoised samples from the model after training for 5 epochs:

and for 20 epochs:
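The class conditioning is the same idea as the time conditioning, with a one-hot vector in place of the timestep, plus the occasional dropout to the all-zero "null" vector mentioned above; a sketch (the 10% dropout probability is illustrative):

    import torch
    import torch.nn.functional as F

    def class_condition(labels, num_classes=10, p_uncond=0.1):
        """One-hot class vectors, randomly zeroed out so the model also
        learns to generate unconditionally (enabling classifier-free
        guidance at sampling time)."""
        c = F.one_hot(labels, num_classes).float()
        drop = (torch.rand(labels.shape[0], 1) < p_uncond).float()
        return c * (1 - drop)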