By Jing Gu (Zibai)
Image generation with generative AI has developed rapidly in recent years. These models create images from natural-language descriptions and are widely used in industries such as fashion, architecture, animation, advertising, and gaming.
Stable Diffusion WebUI is one of the most popular GitHub projects that uses generative AI for image creation. It encodes the text prompt with ClipText (CLIP's text encoder), then uses a UNet and a scheduler to perform diffusion in the latent space, and finally uses the autoencoder's decoder to turn the denoised latents from the second step into an image.
Stable Diffusion Pipeline
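To make these three stages concrete, here is a minimal, illustrative sketch (not how the WebUI is implemented internally, and not part of the steps below) showing how they map onto the components of a diffusers StableDiffusionPipeline:
from diffusers import StableDiffusionPipeline

# The three stages above, exposed as components of the diffusers pipeline.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
text_encoder = pipe.text_encoder             # ClipText: prompt -> text embeddings
unet, scheduler = pipe.unet, pipe.scheduler  # iterative denoising in latent space
vae = pipe.vae                               # autoencoder decoder: latents -> RGB image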
The main challenge with the Stable Diffusion model is slow image generation. To address this, Stable Diffusion employs several techniques to speed up generation and bring it closer to real time. In particular, it uses an encoder to compress an image from 3×512×512 to a 4×64×64 latent representation, which greatly reduces the amount of computation: performing diffusion in this latent space lowers computational complexity while preserving image quality. Even so, generating an image from a complex prompt takes about 4 seconds on a GPU, which is still too slow for many consumer applications.
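The compression ratio can be verified directly with the diffusers AutoencoderKL. A minimal sketch, assuming the same runwayml/stable-diffusion-v1-5 checkpoint (illustrative only, not part of the steps below):
import torch
from diffusers import AutoencoderKL

# Encode a 3x512x512 image into a 4x64x64 latent, then decode it back.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
image = torch.randn(1, 3, 512, 512)                   # random stand-in for an RGB image
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # shape: [1, 4, 64, 64]
    decoded = vae.decode(latents).sample              # shape: [1, 3, 512, 512]
print(latents.shape, decoded.shape)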
TensorRT, provided by NVIDIA, is a high-performance deep learning inference framework. It combines an optimizing compiler with a runtime to deliver low latency and high throughput for latency-sensitive applications, and it can optimize almost all deep neural networks, including CNNs, RNNs, and Transformers. Its main optimizations include:
• Reduced precision: supports mixed-precision inference with FP32, TF32, FP16, and INT8
• Memory bandwidth optimization: fuses layers and tensors to make better use of GPU memory and bandwidth
• Kernel auto-tuning: automatically selects the best kernels and algorithms for the target GPU
• Dynamic tensor memory: allocates tensor memory dynamically to improve memory reuse
• Multi-stream execution: scales to process multiple input streams in parallel
• Time fusion: optimizes recurrent neural networks over time steps
The following figure shows the basic TensorRT workflow, which is divided into a build phase and a runtime phase; a minimal sketch of both phases follows the figure.
TensorRT Pipeline
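As an illustration of the two phases, here is a minimal sketch using the TensorRT 8.x Python API (model.onnx and model.plan are hypothetical file names; the Stable Diffusion example later in this article delegates this work to a custom diffusers pipeline):
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Build phase: parse an ONNX model and compile it into a serialized engine (.plan file).
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable reduced precision
serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)

# Runtime phase: deserialize the engine and create an execution context for inference.
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(serialized_engine)
context = engine.create_execution_context()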
The cloud-native AI suite is a Container Service for Kubernetes (ACK) offering provided by Alibaba Cloud, utilizing cloud-native AI technologies and products to assist enterprises in deploying cloud-native AI systems swiftly and efficiently.
This article explains how to leverage TensorRT to speed up image generation in Stable Diffusion using the Alibaba Cloud ACK cloud-native AI suite.
1. Enter the following commands in the notebook that you created to install the required dependencies.
!pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
!pip install --upgrade "torch <2.0.0"
!pip install --upgrade "tensorrt>=8.6"
!pip install --upgrade "accelerate" "diffusers==0.21.4" "transformers"
!pip install --extra-index-url https://pypi.ngc.nvidia.com --upgrade "onnx-graphsurgeon" "onnxruntime" "polygraphy"
!pip install polygraphy==0.47.1 -i https://pypi.ngc.nvidia.com
2. Download the model.
import diffusers
import torch
import tensorrt
from diffusers.pipelines.stable_diffusion import StableDiffusionPipeline
from diffusers import DDIMScheduler
# By default, the model is downloaded from Hugging Face. If your machine cannot access Hugging Face, you can use a local model instead.
# If you use a local model, replace runwayml/stable-diffusion-v1-5 with the path to the local model.
model_path = "runwayml/stable-diffusion-v1-5"
scheduler = DDIMScheduler.from_pretrained(model_path, subfolder="scheduler")
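Optionally (this is not part of the original steps), you can run the stock fp16 pipeline once to record a baseline latency to compare against the TensorRT-accelerated run in step 4:
# Baseline (non-TensorRT) pipeline for comparison; assumes the imports and variables above.
pipe = StableDiffusionPipeline.from_pretrained(
    model_path, torch_dtype=torch.float16, scheduler=scheduler
).to("cuda")
baseline_image = pipe("a beautiful ship floating in the clouds").images[0]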
3. Use TensorRT to generate a serialized network (the internal representation of the TensorRT compute graph).
# Use a custom pipeline.
pipe_trt = StableDiffusionPipeline.from_pretrained(
    model_path,
    custom_pipeline="stable_diffusion_tensorrt_txt2img",
    revision='fp16',
    torch_dtype=torch.float16,
    scheduler=scheduler,
)
# Set the cache folder.
# An engine folder is generated under the cache folder and contains the clip.plan, unet.plan, and vae.plan files. Generating the plan files for the first time on an A10 GPU takes about 35 minutes.
pipe_trt.set_cached_folder(model_path, revision='fp16')
pipe_trt = pipe_trt.to("cuda")
4. Use the compiled model for inference.
# Generate an image.
prompt = "A beautiful ship is floating in the clouds, unreal engine, cozy indoor lighting, artstation, detailed, digital painting, cinematic"
neg_prompt = "ugly"
import time
start_time = time.time()
image = pipe_trt(prompt, negative_prompt=neg_prompt).images[0]
end_time = time.time()
print("time: "+str(round(end_time-start_time, 2))+"s")
display(image)
Generating a single image takes 2.31 seconds.
The performance test is based on the lambda-diffusers project on GitHub, using one prompt, a batch size of 50, and 100 repetitions. The GPU is an NVIDIA A10, and the corresponding ECS instance type is ecs.gn7i-c8g1.2xlarge.
The experimental results indicate that enabling xformers and TensorRT optimization reduces the average image generation time in Stable Diffusion by 44.7% and decreases GPU memory usage by 37.6%.
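For reference, here is a rough sketch of a timing loop in the spirit of this benchmark (the actual lambda-diffusers harness and batch settings differ; the function name is illustrative):
import time

def average_latency(pipe, prompt, runs=100):
    pipe(prompt)  # warm-up run to exclude one-time compilation and caching costs
    total = 0.0
    for _ in range(runs):
        start = time.time()
        pipe(prompt)
        total += time.time() - start
    return total / runs

print(f"average latency: {average_latency(pipe_trt, prompt):.2f}s")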