
Best practices for QwQ-32B inference on edge cloud

Updated at: 2025-03-28 05:40

This topic provides a detailed introduction to the features and key metrics of the QwQ-32B model, best practices for edge cloud deployment, and steps for setting up a testing environment. It offers a comprehensive guide to help you quickly understand model features, deployment requirements, and performance optimization methods. This enables you to efficiently deploy and use the model in an edge cloud environment, improving inference efficiency and reducing costs.

QwQ-32B model

Model introduction

The QwQ-32B model is an open-source reasoning model built on Qwen2.5-32B and further trained with reinforcement learning, which significantly improves its reasoning capabilities. Its core benchmarks in mathematics and code (AIME 24/25, LiveCodeBench) and general benchmarks (IFEval, LiveBench, and others) reach the level of the full DeepSeek-R1. It achieves a breakthrough of "small parameters, high performance" on reasoning tasks, providing a more cost-effective alternative to expensive large-model deployments.

Scenarios

The QwQ-32B model is suitable for mathematical and logical reasoning, long document processing, code generation, and similar scenarios, and it also performs well in Chinese knowledge Q&A and multi-round conversation. Typical inference scenarios are classified as follows:

Inference scenario type | Average input length (Tokens) | Average output length (Tokens) | Typical application cases
Mathematical logical reasoning | 0.5K-1.5K | 0.8K-3.6K | MATH problem solving, LSAT logical problem analysis
Knowledge Q&A | 1K-4K | 0.2K-1K | MMLU knowledge assessment, medical consultation
Multi-round conversation system | 2K-8K | 0.5K-2K | Customer service dialogue, psychological consultation
Long document processing | 8K-16K | 1K-4K | Paper summary, legal document analysis
Code generation/debugging | 0.3K-2K | 1K-5K | Function implementation, error fixing
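To see which bucket your own workload falls into, you can count prompt tokens with the model's tokenizer. The following is only a minimal sketch; it assumes the transformers package is available (it is installed later in this topic as a vLLM dependency) and that the model weights have been downloaded to /data/Qwen/QwQ-32B as described in the environment setup below:

# Count the tokens of a sample prompt with the QwQ-32B tokenizer
python3 - <<'EOF'
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/data/Qwen/QwQ-32B")
prompt = "Summarize the key obligations of each party in the attached contract."
print(len(tokenizer(prompt)["input_ids"]), "input tokens")
EOF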

Key metrics for model inference

Metric | Definition
Model precision | The numerical precision used for model weights and computation. Lower-precision versions occupy less video memory and cost fewer resources, but may lose accuracy on complex tasks.
Concurrency | The number of user requests processed simultaneously. Higher concurrency means greater business capacity, but it also increases video memory usage and video memory bandwidth consumption.
Input length | The number of tokens in the user-provided prompt, which directly affects video memory usage. Long inputs increase TTFT.
Output length | The number of tokens in the response generated by the model, which directly affects video memory usage. Excessively long outputs may be truncated or cause out-of-memory (OOM) errors.
TTFT (Time To First Token) | The time from when a request is submitted to when the first output token is received. High first-token latency makes the service feel sluggish. It is recommended to keep TTFT below 1 s and not let it exceed 2 s.
TPOT (Time Per Output Token) | The average time required to generate each output token (excluding the first token), reflecting how well generation speed matches reading speed. It is recommended to keep TPOT below 50 ms and not let it exceed 100 ms.
Single-route throughput | The output token rate per route (Tokens/s). Low single-route throughput leads to a poor user experience. It is recommended to keep it in the range of 10 Tokens/s to 30 Tokens/s.
Video memory usage rate | The percentage of video memory in use at runtime, consisting of model parameters, KV cache, and intermediate activations. High video memory usage (for example, >95%) easily triggers OOM and directly affects service availability.
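Two quick rules of thumb tie these metrics together: single-route throughput is roughly the inverse of TPOT, and most video memory is consumed by the model weights themselves. The following back-of-the-envelope check is only a rough sketch (it uses the nominal parameter count and standard FP16/INT4 byte sizes, not measured values):

# Per-route decode speed is roughly 1000 / TPOT (ms)
python3 -c 'tpot_ms = 67.4; print(round(1000 / tpot_ms, 1), "tokens/s")'   # about 14.8 tokens/s

# Approximate weight footprint of a 32B-parameter model at different precisions
python3 -c 'p = 32e9; print("FP16: %.0f GiB, INT4: %.0f GiB" % (p * 2 / 2**30, p * 0.5 / 2**30))'

The roughly 60 GiB of FP16 weights is why a dual-card 48 GB instance is recommended for FP16, while the roughly 15 GiB of INT4 weights plus KV cache can be spread across five 12 GB cards.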

Best practices for deploying the QwQ-32B model on edge cloud

Edge cloud provides heterogeneous computing resources in multiple specifications on widely distributed nodes to meet the computing power requirements of different scenarios. Single-card video memory ranges from 12 GB to 48 GB. The recommended configurations and inference performance for deploying QwQ-32B at different precisions on edge cloud are as follows:

  • For QwQ-32B at FP16 precision, a dual-card instance with 48 GB of video memory per card is recommended.

    • The 48 GB dual-card instance is provided as a virtual machine with the following resource configuration:

      Environment parameter | Value
      CPU | 96 cores
      Memory | 384 GB
      GPU | NVIDIA 48 GB * 2
      Operating system | Ubuntu 22.04
      Docker version | 26.1.3
      GPU driver | Driver Version: 570.124.06, CUDA Version: 12.4
      Inference framework | vLLM 0.7.2

    • Performance in specific scenarios

      Scenario type | Input length | Output length | Concurrency | Single-route throughput (Tokens/s) | TTFT (s) | TPOT (ms) | Video memory usage rate
      Mathematical logical reasoning & code generation | 1K | 4K | 4 | 14.5 | 0.6 | 67.4 | 95%
      Mathematical logical reasoning & code generation | 1K | 4K | 8 | 13.3 | 1.6 | 71.3 | 95%
      Knowledge Q&A | 4K | 1K | 2 | 14.2 | 1.8 | 68.6 | 95%
      Knowledge Q&A | 4K | 1K | 4 | 13 | 2.7 | 72.7 | 95%
      Multi-round conversation system & long document processing | 4K | 4K | 2 | 14.64 | 1.7 | 71.5 | 95%
      Multi-round conversation system & long document processing | 4K | 4K | 4 | 13.6 | 2.9 | 82.3 | 95%

      • Mathematical logical reasoning & code generation scenario:

        This scenario is characterized by short inputs and long outputs, with input lengths of 0.3K-2K and output lengths of 0.8K-5K.

        At a concurrency of 4, single-route throughput approaches 15 Tokens/s and TTFT stays under 1 s, giving the best balance between user experience and cost. At a concurrency of 8, the higher TTFT slightly degrades the user experience but remains acceptable. To further reduce cost per request, you can increase the concurrency.

      • Knowledge Q&A scenario:

        This scenario is characterized by long inputs and short outputs, with input lengths of 1K-4K and output lengths of 0.2K-1K.

        A single instance works best at a concurrency of 2. When the concurrency rises to 4, TTFT exceeds 2 s; even after accounting for network latency, the impact on user experience is still acceptable.

      • Multi-round conversation system & long document processing scenario:

        This scenario is characterized by long inputs and long outputs, with input lengths of 2K-16K and output lengths of 1K-4K.

        Longer inputs not only consume more video memory but also significantly increase TTFT. A single instance works best at a concurrency of 2. Control the input length and concurrency according to your actual business requirements.

  • For QwQ-32B at INT4 precision, a five-card instance with 12 GB of video memory per card is recommended.

    • The 12 GB five-card instance is provided as a bare metal server with the following resource configuration:

      Environment parameter | Value
      CPU | 24 cores × 2, 3.0-4.0 GHz
      Memory | 256 GB
      GPU | NVIDIA 12 GB * 5
      Operating system | Ubuntu 20.04
      Docker version | 28.0.1
      GPU driver | Driver Version: 570.124.06, CUDA Version: 12.4
      Inference framework | vLLM 0.7.2

    • Performance in specific scenarios

      The 12 GB five-card instance delivers sufficient single-route throughput for both single-route and multi-route concurrency. However, because single-card video memory is limited, TTFT is less than ideal. It is recommended to deploy mathematical logical reasoning and code generation workloads on this configuration. For scenarios with longer inputs, such as knowledge Q&A, multi-round conversation, and long document processing, use a 48 GB dual-card instance instead.

      Scenario type | Input length | Output length | Concurrency | Single-route throughput (Tokens/s) | TTFT (s) | TPOT (ms) | Video memory usage rate
      Mathematical logical reasoning & code generation | 1K | 4K | 2 | 37 | 1.3 | 26.4 | 96.5%
      Mathematical logical reasoning & code generation | 1K | 4K | 4 | 32.5 | 1.7 | 28.7 | 96.5%
      Mathematical logical reasoning & code generation | 1K | 4K | 8 | 24.6 | 3.5 | 61.5 | 96.5%
      Knowledge Q&A | 4K | 1K | 1 | 33.5 | 4.7 | 25.1 | 96.5%
      Multi-round conversation system & long document processing | 4K | 4K | 1 | 35.8 | 4.7 | 26.6 | 96.5%
      Multi-round conversation system & long document processing | 8K | 4K | 1 | 21.9 | 9.3 | 43.3 | 96.5%

      • Mathematical logical reasoning & code generation scenario

        At a concurrency of 2, single-route throughput reaches 37 Tokens/s with a TTFT of 1.3 s, giving the best balance between user experience and cost. At a concurrency of 8, the impact on user experience is significant; if you want a more cost-effective setup, increase the concurrency to 4 instead.

      • Knowledge Q&A, multi-round conversation system, and long document processing scenarios

        Because the long inputs occupy a large amount of video memory, TTFT approaches 5 s even at a concurrency of 1, which makes this configuration unsuitable for production use in these scenarios. It can still be used to set up PoC environments.

Setting up the testing environment

Creating and initializing a 48GB video memory dual-card instance

Creating an instance through the console

  1. Log on to the ENS console.

  2. In the left-side navigation pane, choose Computing & Image > Instance.

  3. On the Instance page, click Create Instance. For details about the parameters of an ENS instance, see Create Instance.

    1. Configure the instance as needed. The recommended configuration is as follows:

      Page | Parameter | Reference value
      Basic configuration | Billing method | Subscription
      Basic configuration | Instance type | X86 Computing, NVIDIA 48GB * 2 (for detailed specifications, consult your account manager)
      Basic configuration | Image | Ubuntu, ubuntu_22_04_x64_20G_alibase_20240926
      Network and storage | Network | Self-built network
      Network and storage | System disk | Ultra disk, 80 GB or more
      Network and storage | Data disk | Ultra disk, 1 TB or more
      System settings | Password settings | Password/key pair

    2. Confirm the order.

      After you complete the system settings, click Confirm Order in the lower-right corner. The system configures the instance according to your settings and displays the price. After you complete the payment, a success message appears and you can return to the ENS console.

      You can find the new instance in the instance list of the ENS console. When the instance status is Running, the instance is ready to use.

Creating an instance through OpenAPI

You can also create an instance by calling the OpenAPI. You can quickly create an instance with OpenAPI in the Alibaba Cloud Developer Portal.

The following sample request parameters are for reference; adjust them as needed:

{
  "InstanceType": "ens.gnxxxx",
  "InstanceChargeType": "PrePaid",
  "ImageId": "ubuntu_22_04_x64_20G_alibase_20240926",
  "ScheduleAreaLevel": "Region",
  "EnsRegionId": "cn-your-ens-region",
  "Password": "your-password",
  "InternetChargeType": "95BandwidthByMonth",
  "SystemDisk": {
    "Size": 80,
    "Category": "cloud_efficiency"
  },
  "DataDisk": [
    {
      "Category": "cloud_efficiency",
      "Size": 1024
    }
  ],
  "InternetMaxBandwidthOut": 5000,
  "Amount": 1,
  "NetWorkId": "n-xxxxxxxxxxxxxxx",
  "VSwitchId": "vsw-xxxxxxxxxxxxxxx",
  "InstanceName": "test",
  "HostName": "test",
  "PublicIpIdentification": true,
  "InstanceChargeStrategy": "instance"
}
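If you have the Alibaba Cloud CLI (aliyun) installed and configured, you can also submit the same parameters from the command line. The following is only a rough sketch under that assumption: the parameter names mirror the ENS RunInstances and DescribeInstances operations shown above, the disk parameters are omitted for brevity, and the region, network IDs, instance type, and password are placeholders that you must replace with your own values.

# Create the instance by calling the ENS RunInstances operation (replace all placeholders)
aliyun ens RunInstances \
  --InstanceType ens.gnxxxx \
  --InstanceChargeType PrePaid \
  --ImageId ubuntu_22_04_x64_20G_alibase_20240926 \
  --ScheduleAreaLevel Region \
  --EnsRegionId cn-your-ens-region \
  --Password your-password \
  --InternetChargeType 95BandwidthByMonth \
  --InternetMaxBandwidthOut 5000 \
  --Amount 1 \
  --NetWorkId n-xxxxxxxxxxxxxxx \
  --VSwitchId vsw-xxxxxxxxxxxxxxx \
  --InstanceName test \
  --HostName test \
  --PublicIpIdentification true \
  --InstanceChargeStrategy instance

# Check that the new instance reaches the Running state
aliyun ens DescribeInstances --InstanceName test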

Instance login and disk initialization

Instance login

You can refer to Connect to an instance to log in to the instance.
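If the instance has a public IP address and password login enabled, a direct SSH connection also works; for example (the IP address below is a placeholder):

# Log on as root with the password you set when creating the instance
ssh root@<public-IP-of-your-instance>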

Disk initialization

  1. Expand the root partition.

    After the instance is created or resized, expand the root partition online; no restart is required.

    # Install cloud environment toolkit
    sudo apt-get update
    sudo apt-get install -y cloud-guest-utils
    
    # Ensure GPT partition tool sgdisk exists
    type sgdisk || sudo apt-get install -y gdisk
    
    # Expand physical partition
    sudo LC_ALL=en_US.UTF-8 growpart /dev/vda 3
    
    # Adjust file system size
    sudo resize2fs /dev/vda3
    
    # Verify expansion result
    df -h


  2. Mount the data disk.

    Format the data disk and mount it. The following commands are for reference; adjust them as needed.

    # Identify new disk
    lsblk
    
    # Format directly without partitioning
    sudo mkfs -t ext4 /dev/vdb
    
    # Configure mounting
    sudo mkdir /data
    echo "UUID=$(sudo blkid -s UUID -o value /dev/vdb) /data ext4 defaults,nofail 0 0" | sudo tee -a /etc/fstab
    
    # Verify
    sudo mount -a
    df -hT /data
    
    # Grant your user ownership of the mount point
    sudo chown $USER:$USER /data


    Note

    If you want to create an image based on this instance, first delete the data disk entry that was added to the /etc/fstab file (the line ending with ext4 defaults,nofail 0 0). If you do not delete it, instances created from your image may fail to start.
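    A minimal sketch of removing that entry before creating the image (it matches the /data mount line added above):

    # Remove the /data mount entry from /etc/fstab before creating an image
    sudo sed -i '/\/data ext4 defaults,nofail/d' /etc/fstab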

Installing the vllm inference environment

Installing CUDA

You can refer to CUDA Toolkit 12.4 Downloads | NVIDIA Developer to complete the installation of CUDA.

# Download the CUDA Toolkit installer
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_570.124.06_linux.run
chmod +x cuda_12.4.0_570.124.06_linux.run

# This step takes a while and requires interaction with the installer's text UI
sudo sh cuda_12.4.0_570.124.06_linux.run

# Add the CUDA environment variables to ~/.bashrc (adjust the path if you installed a different version)
vim ~/.bashrc
export PATH="$PATH:/usr/local/cuda-12.4/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"
source ~/.bashrc

# Verify the installation
nvcc -V
nvidia-smi

Auxiliary software installation (optional)

uv is a convenient Python virtual environment and dependency management tool, well suited to hosts that need to run multiple models. You can refer to Installation | uv (astral.sh) to complete the installation of uv.

# Install uv, default installation in ~/.local/bin/
curl -LsSf https://astral.sh/uv/install.sh | sh

# Edit ~/.bashrc
export PATH="$PATH:~/.local/bin"

source ~/.bashrc

# Create a clean venv environment
uv venv myenv --python 3.12 --seed
source myenv/bin/activate

If the CUDA environment variables you set earlier become unavailable after you activate the uv environment (for example, nvcc or nvidia-smi cannot be found), edit myenv/bin/activate and append the following lines after its export PATH statement:

vim myenv/bin/activate
export PATH="$PATH:/usr/local/cuda-12.4/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"

# Install vllm and modelscope
uv pip install vllm==0.7.2
uv pip install modelscope

# GPU monitoring tool, you can also use the default nvidia-smi
uv pip install nvitop
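A quick check that the inference environment is usable inside the virtual environment:

# Confirm that vllm imports correctly and that both GPUs are visible
python3 -c "import torch, vllm; print(vllm.__version__, torch.cuda.device_count())"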

Downloading QwQ-32B model and VLLM benchmark script

# Download the model. Download it to the data disk /data to avoid running out of space
mkdir -p /data/Qwen/QwQ-32B
cd /data/Qwen/QwQ-32B
modelscope download --model Qwen/QwQ-32B --local_dir .

# Download the dataset (optional)
wget https://www.modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split/resolve/master/ShareGPT_V3_unfiltered_cleaned_split.json

# Install git (if git is not available)
sudo apt update
sudo apt install -y git

# Clone the vLLM repository, which includes the benchmark scripts
git clone https://github.com/vllm-project/vllm.git
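Optionally, confirm that the download completed; the BF16 weights of QwQ-32B take roughly 60 GB on disk (approximate figure):

# Check the size and contents of the model directory
du -sh /data/Qwen/QwQ-32B
ls /data/Qwen/QwQ-32B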

Online testing

Starting the vllm server

vllm serve /data/Qwen/QwQ-32B/ \
  --host 127.0.0.1 \
  --port 8080 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --served-model-name qw \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --max-num-batched-tokens 8192 \
  --max-model-len 8192 \
  --enable-prefix-caching
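After the server reports that it is ready, you can send a quick smoke-test request to the OpenAI-compatible endpoint exposed by vLLM (the model name matches --served-model-name):

# Smoke-test the server with a single chat completion request
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qw",
        "messages": [{"role": "user", "content": "What is 17 * 23?"}],
        "max_tokens": 256
      }'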

Starting the test

python3 ./vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --served-model-name qw \
  --model /data/Qwen/QwQ-32B \
  --dataset-name random \
  --random-input 1024 \
  --random-output 4096 \
  --random-range-ratio 1 \
  --max-concurrency 4 \
  --num-prompts 10 \
  --host 127.0.0.1 \
  --port 8080 \
  --save-result \
  --result-dir /data/logs/ \
  --result-filename QwQ-32B-4-1-4.log
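To reproduce several rows of the performance tables above, you can sweep the client-side concurrency in a loop. This is only a sketch; the result file names and the scaling of --num-prompts with concurrency are arbitrary choices:

# Run the 1K-input / 4K-output workload at several concurrency levels
mkdir -p /data/logs
for c in 2 4 8; do
  python3 ./vllm/benchmarks/benchmark_serving.py \
    --backend vllm --served-model-name qw --model /data/Qwen/QwQ-32B \
    --dataset-name random --random-input 1024 --random-output 4096 \
    --random-range-ratio 1 --max-concurrency $c --num-prompts $((c * 10)) \
    --host 127.0.0.1 --port 8080 \
    --save-result --result-dir /data/logs/ --result-filename QwQ-32B-1K-4K-c$c.log
done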

Test completion

When the benchmark finishes, it prints a summary of the results to the console and saves them to the file specified by --result-filename in the /data/logs/ directory.
