Introduction
By applying modern state-of-the-art techniques, stable diffusion models make it possible to generate images and audio. Stable Diffusion works by modifying input data, guided by text input, to generate new, creative output data. In this article, we will see how to generate new images from a given input image by employing a depth-to-image diffusion model on the PyTorch backend with a Hugging Face pipeline. We are using Hugging Face because they have made an easy-to-use diffusion pipeline available.
Learn More: Hugging Face Transformers Pipeline Functions
Learning Objectives
- Understand the concept of Stable Diffusion and its application in generating images and audio using modern state-of-the-art techniques.
- Gain knowledge of the key components and techniques involved in Stable Diffusion, such as latent diffusion models, denoising autoencoders, variational autoencoders, U-Net blocks, and text encoders.
- Explore common applications of diffusion models, including text-to-image, text-to-video, and text-to-3D conversions.
- Learn how to set up the environment for Stable Diffusion, including utilizing GPU and installing necessary libraries and dependencies.
- Develop practical skills in applying Stable Diffusion by loading and diffusing images, creating text prompts to guide the output, adjusting diffusion levels, and understanding the limitations and challenges associated with diffusion models.
This article was published as a part of the Data Science Blogathon.
Table of contents
- Introduction
- What is Stable Diffusion?
- The Concepts of Stable Diffusion
- Common Applications of Diffusion
- Setting Up Environment
- Importing Dependencies
- Instantiating the Pre-trained Diffusers
- Preparing Image Data
- Loading Image
- Creating Text Prompts
- Creating Negative Prompts
- Adjusting Diffusion Level
- Limitations of Diffusion Models
- Conclusion
- Frequently Asked Questions
What is Stable Diffusion?
Stable Diffusion models function as latent diffusion models: deep generative neural networks that learn the latent structure of the input by modeling how the data attributes diffuse through the latent space. They are considered stable because we guide the results using original images, text, and so on; an unstable diffusion, by contrast, would be unpredictable.
The Concepts of Stable Diffusion
Stable Diffusion uses a latent diffusion model (LDM), a probabilistic model. These models are trained like other deep learning models, but the objective here is to remove successive applications of Gaussian noise, noise whose probability density function equals the normal distribution, from the training images. We achieve this through a sequence of denoising autoencoders (DAEs). A DAE changes the standard autoencoder's reconstruction criterion by adding a noise process to its input and learning to recover the clean signal.
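To make the idea of progressively applied Gaussian noise concrete, here is a minimal, self-contained sketch of a single forward diffusion step. The schedule values are illustrative assumptions, not the ones Stable Diffusion actually uses.
# Forward diffusion step: mix a clean image x0 with Gaussian noise
# x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
import torch
def add_noise(x0: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    noise = torch.randn_like(x0)  # Gaussian noise ~ N(0, 1)
    return (alpha_bar_t ** 0.5) * x0 + ((1 - alpha_bar_t) ** 0.5) * noise
x0 = torch.rand(3, 64, 64)           # a toy "image" tensor
slightly_noisy = add_noise(x0, 0.9)  # early step: mostly image
mostly_noise = add_noise(x0, 0.1)    # late step: mostly noise
The denoising autoencoders are trained to invert exactly this kind of corruption, one step at a time.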
In more detail, Stable Diffusion consists of three essential parts. First is the variational autoencoder (VAE), which, in simple terms, is an artificial neural network that performs as a probabilistic graphical model. Next is the U-Net block, a convolutional neural network (CNN) originally developed for image segmentation. Last is the text encoder, handled by a trained CLIP ViT-L/14 text encoder, which transforms the text prompts into an embedding space.
The VAE encoder compresses the image from pixel space into a lower-dimensional latent space, where the image diffusion is carried out. This keeps the image from losing detail when the result is later decoded back into pixels.
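To illustrate this compression (a sketch of ours, not part of the original walkthrough; the model ID and subfolder are assumptions based on the standard diffusers repository layout), we can load only the VAE and inspect the latent shape:
# Inspecting the VAE's spatial compression
import torch
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", subfolder="vae"
)
# A dummy batch with one 512x512 RGB image scaled to [-1, 1]
pixels = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()
print(latents.shape)  # torch.Size([1, 4, 64, 64]): ~48x fewer values than pixel space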
Common Applications of Diffusion
Let us quickly look at three common areas where diffusion models can be applied:
- Text-to-Image: This approach uses no input image, only a piece of text (a "prompt"), to generate related images.
- Text-to-Video: Diffusion models are used for generating videos out of text prompts. Current research uses this in media for interesting feats like creating online ad videos, explaining concepts, and producing short animations, music videos, and more.
- Text-to-3D: This stable diffusion approach converts input text to 3D images.
Applying diffusers can help generate free, plagiarism-free images, providing content for your projects, materials, and even marketing brands. Instead of hiring a painter or photographer, you can generate your own images; instead of a voice-over artist, you can create your own unique audio. Now let's look at image-to-image generation.
Also Read: Bring Doodles to Life: Meta Open-Sources AI Model
Setting Up Environment
This task, like other image and graphics processing, requires a GPU and a good development environment. Make sure you have a GPU available if you want to follow along with this project. We can use Google Colab, since it provides a suitable environment and a free GPU. Follow the steps below to enable the available GPU:
- Go to the Runtime tab in the top menu.
- After selecting Runtime, click the Change Runtime Type option.
- Then select GPU as a hardware accelerator from the drop-down option.
You can find all the code on GitHub.
Importing Dependencies
Using the pipeline from Hugging Face involves several dependencies. We will start by importing them into our project environment.
Installing Libraries
Some libraries are not preinstalled in Colab, so we need to install them before we can import from them.
# Installing required libraries
%pip install --quiet --upgrade diffusers transformers scipy ftfy
%pip install --quiet --upgrade accelerate
Let us explain the installations above. SciPy and ftfy are standard Python libraries we employ for everyday Python tasks, while diffusers, transformers, and accelerate are the major new libraries. We explain the first two below.
Diffusers: A library made available by Hugging Face for accessing well-trained diffusion models that generate images. We will use it to access our pipeline and other packages.
Transformers: A library containing tools and APIs that help us avoid training models from scratch and thereby cut training costs.
# PyTorch backend
import torch
# For downloading images over the internet
import requests
# Pillow (PIL) library for image processing
from PIL import Image
# Hugging Face pipeline
from diffusers import StableDiffusionDepth2ImgPipeline
StableDiffusionDepth2ImgPipeline is the pipeline class that reduces our code: all we need to do is pass an image and a prompt describing our expectations.
Instantiating the Pre-trained Diffusers
Next, we simply create an instance of the pre-trained diffuser we imported above and move it to our GPU, which PyTorch exposes as "cuda".
# Creating a variable instance of the pipeline
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-depth",
torch_dtype=torch.float16,
)
# Assigning to GPU
pipe.to("cuda")
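If you are unsure whether a GPU is available, a common pattern (our suggestion, not a requirement of the pipeline) is to pick the device dynamically:
# Fall back to CPU when no GPU is present; float16 weights are intended
# for GPU use, so a CPU run may need torch_dtype=torch.float32 instead
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipe.to(device)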
Preparing Image Data
Let's define a function to help us fetch images from URLs. You can skip this step if you want to try an image you have locally; just mount your drive in Colab.
# Accessing images from the web
import urllib.parse as parse
import os
import requests

# Verify URL
def check_url(string):
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except:
        return False
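A quick sanity check (with illustrative values) shows how it behaves:
# Strings with a scheme, host, and path pass; local paths do not
print(check_url("https://example.com/image.jpg"))  # True
print(check_url("images/tomatoes.png"))            # False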
We can define another function to use the check_url function for loading an image.
# Load an image
def load_image(image_path):
if check_url(image_path):
return Image.open(requests.get(image_path, stream=True).raw)
elif os.path.exists(image_path):
return Image.open(image_path)
Loading Image
Now, we need an image to diffuse into another image. You can use your own photo; in this example, we use an online image for convenience. Feel free to substitute your own URL or images.
# Loading an image URL
img = load_image("https://img.freepik.com/free-photo/stacked-tomatoes_1353-262.jpg?w=740&t=st=1683821147~exp=1683821747~hmac=708f16371d1e158d76c8ea5e8b9790fb68dc75009750b8328e17c21f16d36468")
# Displaying the Image
img
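If your source image is much larger than the model's 512×512 training resolution, you can optionally shrink it first; the target size below is our assumption, not a requirement:
# Shrink large images toward the training resolution, keeping the
# aspect ratio (thumbnail resizes in place and only ever downsizes)
img.thumbnail((512, 512))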
Creating Text Prompts
Now that we have a usable image, let's demonstrate some diffusion feats on it. To achieve this, we pass prompts along with the pictures. These are sets of text with keywords describing what we expect from the diffusion. Instead of generating a random new image, we can use prompts to guide the model's output.
Note that we set the strength to 0.7, a moderate value, and the negative_prompt to None. We will look at both more closely later.
# Setting Image prompt
prompt = "Some sliced tomatoes mixed"
# Assigning to pipeline
pipe(prompt=prompt, image=img, negative_prompt=None, strength=0.7).images[0]
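If you want to keep the result rather than just display it, you can save the returned image; the filename here is illustrative:
# Generate and save the diffused image to disk
result = pipe(prompt=prompt, image=img, negative_prompt=None, strength=0.7)
result.images[0].save("diffused_tomatoes.png")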
Now we can repeat this process on new images. The method remains the same:
- Load the image to be diffused.
- Create a text description of the target image.
You can create some examples on your own.
Creating Negative Prompts
Another approach is to create a negative prompt to counter the intended output, which makes the pipeline more flexible. We do this by passing a negative prompt through the negative_prompt parameter.
# Loading an image URL
img = load_image("https://img.freepik.com/free-photo/stacked-tomatoes_1353-262.jpg?w=740&t=st=1683821147~exp=1683821747~hmac=708f16371d1e158d76c8ea5e8b9790fb68dc75009750b8328e17c21f16d36468")
# Displaying the Image
img
# Setting Image prompt
prompt = ""
n_prompt = "rot, bad, decayed, wrinkled"
# Assigning to pipeline
pipe(prompt=prompt, image=img, negative_prompt=n_prompt, strength=0.7).images[0]
Adjusting Diffusion Level
You may ask how to alter how much the new image changes from the original. We can achieve this by changing the strength level. Let us observe the effect of different strength levels on the previous image.
At strength = 0.1
# Setting Image prompt
prompt = ""
n_prompt = "rot, bad, decayed, wrinkled"
# Assigning to pipeline
pipe(prompt=prompt, image=img, negative_prompt=n_prompt, strength=0.1).images[0]
At strength = 0.4
# Setting Image prompt
prompt = ""
n_prompt = "rot, bad, decayed, wrinkled"
# Assigning to pipeline
pipe(prompt=prompt, image=img, negative_prompt=n_prompt, strength=0.4).images[0]
At strength = 1.0
# Setting Image prompt
prompt = ""
n_prompt = "rot, bad,decayed, wrinkled"
# Assigning to pipeline
pipe(prompt=prompt, image=img, negative_prompt=n_prompt, strength=1.0).images[0]
The strength variable makes it possible to control how strongly the diffusion affects the newly generated image, which makes the pipeline more flexible and adjustable.
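To compare several strengths side by side without repeating the call, a simple loop works; the file names are illustrative:
# Sweep a few strength values and save each result for comparison
for strength in (0.1, 0.4, 1.0):
    out = pipe(prompt="", image=img, negative_prompt=n_prompt,
               strength=strength).images[0]
    out.save(f"tomatoes_strength_{strength}.png")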
Limitations of Diffusion Models
Before we call it a wrap on Stable Diffusion, we must understand that these pipelines come with some limitations and challenges. Every new technology has some issues at first.
- The stable diffusion model was trained on images with 512×512 resolution, so when we generate new images at dimensions higher than 512×512, the image quality tends to degrade. Newer versions of the Stable Diffusion model attempt to address this and can natively generate images at 768×768 resolution, but as long as there is a maximum resolution, use cases like printing large banners and flyers will remain limited.
- The model was trained on data from the LAION database, provided by LAION, a non-profit organization that offers datasets, tools, and models for research purposes. In practice, the model trained on this data struggles to render human limbs and faces in rich detail.
- Stable Diffusion on a CPU can run in a feasible time, ranging from a few seconds to a few minutes, which removes the need for a high-end computing environment. Things only become demanding when the pipeline is customized, which can require more RAM and processing power, while the out-of-the-box pipeline stays less complex.
- Lastly, there is the issue of legal rights. Because these models require vast collections of images and datasets to learn and perform well, the practice can easily run into legal trouble. One instance is the January 2023 copyright-infringement lawsuit filed by three artists against Stability AI, Midjourney, and DeviantArt. There can therefore be limits on freely building on these images.
Conclusion
In conclusion, while the concept of diffusers is cutting-edge, the Hugging Face pipeline makes it easy to integrate into our projects with simple, very direct code under the hood. Using prompts on the images makes it possible to steer the diffusion toward the picture we imagine. Additionally, the strength variable is another critical parameter: it controls the level of diffusion. We have seen how to generate new images from images.
Key Takeaways
- By applying state-of-the-art techniques, stable diffusion models generate images and audio.
- Typical applications of Diffusion include Text-to-Image, Text-to-Video, and Text-to-3D.
- StableDiffusionDepth2ImgPipeline is the pipeline class that reduces our code, so we only need to pass an image and a prompt describing our expectations.
Learn More: Pytorch | Getting Started With Pytorch
Frequently Asked Questions
Q1. What is the Stable Diffusion method?
A. The Stable Diffusion method is a technique used in machine learning for generating realistic and high-quality synthetic images. It leverages diffusion processes to progressively refine noisy images into coherent and visually appealing samples.
Q2. Is Stable Diffusion free to use?
A. Stable Diffusion methods, such as diffusion models, are available as open-source implementations. They can be accessed and used for free on various platforms, including GitHub and other machine learning libraries.
Q3. What is an example of a Stable Diffusion technique?
A. An example of a Stable Diffusion technique is diffusion models with denoising priors. This approach involves iteratively updating an initial noisy image by applying a series of transformations, resulting in a smoother and clearer output.
Q4. Which Stable Diffusion model is best?
A. The best Stable Diffusion model choice depends on the specific task and dataset. Different models, such as deep diffusion models or variants like DALL-E, offer different capabilities and performance levels.