This topic introduces the features and key metrics of the QwQ-32B model, best practices for deploying it on the edge cloud, and the steps for setting up a testing environment. It helps you quickly understand the model's capabilities, deployment requirements, and performance optimization methods, so that you can deploy and use the model efficiently in an edge cloud environment while improving inference efficiency and reducing costs.
QwQ-32B model
Model introduction
The QwQ-32B model is an open-source reasoning model trained on top of Qwen2.5-32B, and it significantly improves reasoning capabilities through reinforcement learning. Its core metrics on mathematics and code benchmarks (AIME 24/25, LiveCodeBench) and on general benchmarks (IFEval, LiveBench, and others) reach the level of the full DeepSeek-R1. It achieves a "small parameters, high performance" breakthrough on reasoning tasks, providing a more cost-effective option than high-cost large-model deployments.
Scenarios
The QwQ-32B model is suitable for mathematical logical reasoning, long document processing, code generation, and similar scenarios. It also performs well in Chinese knowledge Q&A and multi-round conversation scenarios. The typical inference scenarios are classified as follows:
| Inference scenario type | Average input length (tokens) | Average output length (tokens) | Typical application cases |
| --- | --- | --- | --- |
| Mathematical logical reasoning | 0.5K-1.5K | 0.8K-3.6K | MATH problem solving, LSAT logical problem analysis |
| Knowledge Q&A | 1K-4K | 0.2K-1K | MMLU knowledge assessment, medical consultation |
| Multi-round conversation system | 2K-8K | 0.5K-2K | Customer service dialogue, psychological consultation |
| Long document processing | 8K-16K | 1K-4K | Paper summarization, legal document analysis |
| Code generation/debugging | 0.3K-2K | 1K-5K | Function implementation, error fixing |
Key metrics for model inference
| Metric | Definition |
| --- | --- |
| Model precision | The numerical precision used for model weights and calculations. Lower-precision versions of the model occupy less video memory and cost fewer resources, but their accuracy on complex tasks decreases. |
| Concurrency | The number of user requests processed simultaneously. Higher concurrency means greater business capacity, but it also increases video memory usage and video memory bandwidth consumption. |
| Input length | The number of tokens in the user-provided prompt, which directly affects video memory usage. A long input increases TTFT. |
| Output length | The number of tokens in the response text generated by the model, which directly affects video memory usage. An excessively long output may be truncated or cause out-of-memory (OOM) errors. |
| TTFT (Time To First Token) | The time from when a user request is submitted to when the first output token is received. High first-token latency makes the service feel sluggish. It is recommended to keep TTFT below 1s and not let it exceed 2s. |
| TPOT (Time Per Output Token) | The average time required to generate each output token (excluding the first token), reflecting how well the generation speed matches the reading experience. It is recommended to keep TPOT below 50ms and not let it exceed 100ms. |
| Single-route throughput | The rate of output tokens per route (tokens/s). Low single-route throughput results in a poor user experience. It is recommended to keep it within the range of 10 tokens/s to 30 tokens/s. |
| Video memory usage rate | The percentage of video memory in use at runtime. Video memory usage consists of model parameters, the KV cache, and intermediate activations. High video memory usage (for example, above 95%) easily triggers OOM errors, directly affecting service availability. |
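The latency and throughput metrics above can be computed directly from per-request timing. The following minimal Python sketch shows the arithmetic, assuming you record the request submit time and the arrival time of every output token; all function and variable names are illustrative only.

```python
# Minimal sketch of how TTFT, TPOT, and single-route throughput are derived
# from per-request timestamps. Plug in timing data collected by your client.

def latency_metrics(submit_ts: float, token_ts: list[float]) -> dict:
    """submit_ts: request submit time (s); token_ts: arrival time of each output token (s)."""
    if not token_ts:
        raise ValueError("no output tokens recorded")
    ttft = token_ts[0] - submit_ts                      # Time To First Token (s)
    total = token_ts[-1] - submit_ts                    # end-to-end latency (s)
    n = len(token_ts)
    # TPOT excludes the first token: average gap over the remaining tokens.
    tpot = (token_ts[-1] - token_ts[0]) / (n - 1) if n > 1 else 0.0
    throughput = n / total                              # single-route output rate (tokens/s)
    return {"ttft_s": ttft, "tpot_ms": tpot * 1000, "tokens_per_s": throughput}

# Example: 1000 tokens generated at 70 ms intervals after a 0.6 s first-token wait
tokens = [0.6 + 0.07 * i for i in range(1000)]
print(latency_metrics(0.0, tokens))   # ttft = 0.6 s, tpot = 70 ms, about 14 tokens/s
```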
Best practices for deploying the QwQ-32B model on the edge cloud
The edge cloud provides heterogeneous computing resources in multiple specifications on widely distributed nodes to meet the computing power requirements of different scenarios, with single-card video memory ranging from 12 GB to 48 GB. The recommended configurations and the measured inference performance for deploying QwQ-32B at different precisions on the edge cloud are as follows:
For QwQ-32B at FP16 precision, a 48 GB video memory dual-card instance is recommended
The 48 GB video memory dual-card instance is provided as a virtual machine, with the following resource configuration:
| Environment parameter | Value |
| --- | --- |
| CPU | 96 cores |
| Memory | 384 GB |
| GPU | NVIDIA 48 GB × 2 |
| Operating system | Ubuntu 22.04 |
| Docker version | 26.1.3 |
| GPU driver | Driver Version: 570.124.06; CUDA Version: 12.4 |
| Inference framework | vLLM 0.7.2 |
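To see why FP16 weights call for two 48 GB cards while INT4 fits on five 12 GB cards, a rough video memory estimate helps. The following Python sketch is an approximation only: the parameter count and architecture figures (64 layers, 8 KV heads, head dimension 128) are assumptions taken from the public Qwen2.5-32B model card, and the estimate ignores activations and framework overhead.

```python
# Rough video-memory estimate for serving QwQ-32B. The architecture numbers
# below are assumptions from the public Qwen2.5-32B model card; real usage
# also includes activations and framework overhead, which this ignores.

PARAMS       = 32.5e9   # approximate total parameter count
NUM_LAYERS   = 64
NUM_KV_HEADS = 8        # grouped-query attention
HEAD_DIM     = 128
BYTES_FP16   = 2

def weight_mem_gb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1024**3

def kv_cache_gb(context_len: int, concurrency: int, bytes_per_elem: int = BYTES_FP16) -> float:
    # K and V per token per layer: 2 * kv_heads * head_dim elements
    per_token = 2 * NUM_KV_HEADS * HEAD_DIM * NUM_LAYERS * bytes_per_elem
    return per_token * context_len * concurrency / 1024**3

print(f"FP16 weights : {weight_mem_gb(2):.1f} GB")    # ~60.5 GB -> needs 2 x 48 GB cards
print(f"INT4 weights : {weight_mem_gb(0.5):.1f} GB")  # ~15.1 GB
print(f"KV cache, 8K context x 4 routes: {kv_cache_gb(8192, 4):.1f} GB")  # ~8 GB on top of the weights
```

At FP16 the weights alone exceed a single 48 GB card, which is why the serving command later in this topic splits the model across two cards with --tensor-parallel-size 2.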
Performance in specific scenarios
| Scenario type | Input length | Output length | Concurrency | Single-route throughput (tokens/s) | TTFT (s) | TPOT (ms) | Video memory usage rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mathematical logical reasoning & code generation | 1K | 4K | 4 | 14.5 | 0.6 | 67.4 | 95% |
| Mathematical logical reasoning & code generation | 1K | 4K | 8 | 13.3 | 1.6 | 71.3 | 95% |
| Knowledge Q&A | 4K | 1K | 2 | 14.2 | 1.8 | 68.6 | 95% |
| Knowledge Q&A | 4K | 1K | 4 | 13 | 2.7 | 72.7 | 95% |
| Multi-round conversation system & long document processing | 4K | 4K | 2 | 14.64 | 1.7 | 71.5 | 95% |
| Multi-round conversation system & long document processing | 4K | 4K | 4 | 13.6 | 2.9 | 82.3 | 95% |
Mathematical logical reasoning & code generation scenario:
This scenario is characterized by short inputs and long outputs, with input lengths of 0.3K-2K tokens and output lengths of 0.8K-5K tokens.
At a concurrency of 4, the single-route throughput approaches 15 tokens/s and TTFT stays below 1s, giving the best balance between user experience and cost. At a concurrency of 8, the higher TTFT slightly affects the user experience but is still acceptable. If you want better cost efficiency, you can increase the concurrency.
Knowledge Q&A scenario:
This scenario is characterized by long inputs and short outputs, with input lengths of 1K-4K tokens and output lengths of 0.2K-1K tokens.
A single instance works best at a concurrency of 2. When the concurrency increases to 4, TTFT exceeds 2s; even accounting for network latency, the impact on user experience is still acceptable.
Multi-round conversation system & long document processing scenario:
This scenario is characterized by long inputs and long outputs, with input lengths of 2K-16K tokens and output lengths of 1K-4K tokens.
Increasing the input length not only increases video memory consumption but also significantly increases TTFT. A single instance works best at a concurrency of 2. Control the input length and concurrency according to your actual business needs.
For QwQ-32B at INT4 precision, a 12 GB video memory five-card instance is recommended
The 12 GB video memory five-card instance is provided as a bare metal server, with the following resource configuration:
| Environment parameter | Value |
| --- | --- |
| CPU | 24 cores × 2, 3.0-4.0 GHz |
| Memory | 256 GB |
| GPU | NVIDIA 12 GB × 5 |
| Operating system | Ubuntu 20.04 |
| Docker version | 28.0.1 |
| GPU driver | Driver Version: 570.124.06; CUDA Version: 12.4 |
| Inference framework | vLLM 0.7.2 |
Performance in specific scenarios
The 12 GB video memory five-card instance can meet single-route throughput requirements under both single-route and multi-route concurrency. However, because the single-card video memory is limited, TTFT performance is less than ideal. It is recommended to deploy mathematical logical reasoning and code generation workloads on this configuration. For scenarios with longer inputs, such as knowledge Q&A, multi-round conversations, and long document processing, use a 48 GB video memory dual-card instance instead.
| Scenario type | Input length | Output length | Concurrency | Single-route throughput (tokens/s) | TTFT (s) | TPOT (ms) | Video memory usage rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mathematical logical reasoning & code generation | 1K | 4K | 2 | 37 | 1.3 | 26.4 | 96.5% |
| Mathematical logical reasoning & code generation | 1K | 4K | 4 | 32.5 | 1.7 | 28.7 | 96.5% |
| Mathematical logical reasoning & code generation | 1K | 4K | 8 | 24.6 | 3.5 | 61.5 | 96.5% |
| Knowledge Q&A | 4K | 1K | 1 | 33.5 | 4.7 | 25.1 | 96.5% |
| Multi-round conversation system & long document processing | 4K | 4K | 1 | 35.8 | 4.7 | 26.6 | 96.5% |
| Multi-round conversation system & long document processing | 8K | 4K | 1 | 21.9 | 9.3 | 43.3 | 96.5% |
Mathematical logical reasoning & code generation scenario
At a concurrency of 2, the single-route throughput reaches 37 tokens/s with a TTFT of 1.3s, giving the best balance between user experience and cost. When the concurrency increases to 8, the impact on user experience becomes significant; if you want a more cost-effective solution, increase the concurrency to 4 instead.
Knowledge Q&A scenario & multi-round conversation system & long document processing scenario
Because the long inputs occupy a large amount of video memory, TTFT approaches 5s even at a concurrency of 1, which makes this configuration unsuitable for production use. It can, however, be used to set up PoC environments.
Setting up the testing environment
Creating and initializing a 48 GB video memory dual-card instance
Creating an instance through the console
Log on to the ENS console.
In the left-side navigation pane, click Instances.
On the Instance page, click Create Instance. For details about the parameters of an ENS instance, see Create Instance.
Configure the parameters according to your needs. The recommended configuration is as follows:
| Page | Parameter | Reference value |
| --- | --- | --- |
| Basic configuration | Billing method | Subscription |
| Basic configuration | Instance type | X86 Computing |
| Basic configuration | Instance type | NVIDIA 48 GB × 2 (for detailed specifications, consult your account manager) |
| Basic configuration | Image | Ubuntu (ubuntu_22_04_x64_20G_alibase_20240926) |
| Network and storage | Network | Self-built network |
| Network and storage | System disk | Ultra disk, 80 GB or more |
| Network and storage | Data disk | Ultra disk, 1 TB or more |
| System settings | Password settings | Password/key pair |
Confirm the order.
After you complete the system settings, click Confirm Order in the lower-right corner. The system configures the instance according to your settings and displays the price. After the payment succeeds, you are notified and can return to the ENS console.
You can find the instance you created in the instance list of the ENS console. If its status is Running, the instance is ready for use.
Creating an instance through OpenAPI
You can also create an instance through OpenAPI. You can quickly create an instance with OpenAPI in the Alibaba Cloud Developer Portal.
The following request parameters are for reference. Adjust them to your own environment:
{
  "InstanceType": "ens.gnxxxx",
  "InstanceChargeType": "PrePaid",
  "ImageId": "ubuntu_22_04_x64_20G_alibase_20240926",
  "ScheduleAreaLevel": "Region",
  "EnsRegionId": "cn-your-ens-region",
  "Password": "<your-password>",
  "InternetChargeType": "95BandwidthByMonth",
  "SystemDisk": {
    "Size": 80,
    "Category": "cloud_efficiency"
  },
  "DataDisk": [
    {
      "Category": "cloud_efficiency",
      "Size": 1024
    }
  ],
  "InternetMaxBandwidthOut": 5000,
  "Amount": 1,
  "NetWorkId": "n-xxxxxxxxxxxxxxx",
  "VSwitchId": "vsw-xxxxxxxxxxxxxxx",
  "InstanceName": "test",
  "HostName": "test",
  "PublicIpIdentification": true,
  "InstanceChargeStrategy": "instance"
}
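If you prefer to issue the call from code, the following Python sketch uses the generic CommonRequest interface from the aliyun-python-sdk-core package. The endpoint (ens.aliyuncs.com), API version (2017-11-10), action name (RunInstances), and the flattened SystemDisk.Size parameter are assumptions; verify them against the ENS API reference before use.

```python
# Minimal sketch of creating an ENS instance through the OpenAPI SDK.
# Assumptions to verify against the ENS API reference: endpoint, API version,
# and action name. Credentials and network IDs below are placeholders.
from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest

client = AcsClient("<your-access-key-id>", "<your-access-key-secret>", "cn-hangzhou")

request = CommonRequest()
request.set_domain("ens.aliyuncs.com")      # assumed ENS endpoint
request.set_version("2017-11-10")           # assumed ENS API version
request.set_action_name("RunInstances")     # assumed action name

# Mirror the JSON parameters above; adjust to your own region and network IDs.
params = {
    "InstanceType": "ens.gnxxxx",
    "InstanceChargeType": "PrePaid",
    "ImageId": "ubuntu_22_04_x64_20G_alibase_20240926",
    "ScheduleAreaLevel": "Region",
    "EnsRegionId": "cn-your-ens-region",
    "SystemDisk.Size": "80",
    "Amount": "1",
    "NetWorkId": "n-xxxxxxxxxxxxxxx",
    "VSwitchId": "vsw-xxxxxxxxxxxxxxx",
    "InstanceName": "test",
}
for key, value in params.items():
    request.add_query_param(key, value)

response = client.do_action_with_exception(request)
print(response)
```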
Instance login and disk initialization
Instance login
You can refer to Connect to an instance to log in to the instance.
Disk initialization
Root directory expansion.
After the instance is newly created or resized, you need to expand the root partition online, without restarting the instance.
# Install the cloud environment toolkit
sudo apt-get update
sudo apt-get install -y cloud-guest-utils
# Ensure the GPT partition tool sgdisk exists
type sgdisk || sudo apt-get install -y gdisk
# Expand the physical partition
sudo LC_ALL=en_US.UTF-8 growpart /dev/vda 3
# Adjust the file system size
sudo resize2fs /dev/vda3
# Verify the expansion result
df -h
Data disk mounting
You need to format and mount the data disk. The following commands are for reference; adjust them as needed.
# Identify the new disk
lsblk
# Format directly without partitioning
sudo mkfs -t ext4 /dev/vdb
# Create the mount point and configure mounting
sudo mkdir /data
echo "UUID=$(sudo blkid -s UUID -o value /dev/vdb) /data ext4 defaults,nofail 0 0" | sudo tee -a /etc/fstab
# Verify the mount
sudo mount -a
df -hT /data
# Grant the current user ownership of the mount point
sudo chown $USER:$USER /data
If you want to create an image based on this instance, you must delete the data disk entry that you appended to the /etc/fstab file (the line ending with ext4 defaults,nofail 0 0). If you do not delete it, instances created from your image will not be able to start.
Installing the vllm inference environment
Installing CUDA
You can refer to CUDA Toolkit 12.4 Downloads | NVIDIA Developer to complete the installation of CUDA.
# Install the CUDA Toolkit
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_570.124.06_linux.run
chmod +x cuda_12.4.0_570.124.06_linux.run
# This step takes a while and requires interaction with the installer interface
sudo sh cuda_12.4.0_570.124.06_linux.run
# Add environment variables: append the following two lines to ~/.bashrc (for example, with vim ~/.bashrc)
export PATH="$PATH:/usr/local/cuda-12.4/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"
# Reload the shell configuration
source ~/.bashrc
# Verify the installation
nvcc -V
nvidia-smi
Auxiliary software installation (optional)
uv is a convenient Python virtual environment and dependency management tool, well suited to machines that need to run multiple models. You can refer to Installation | uv (astral.sh) to complete the installation of uv.
# Install uv; by default it is installed in ~/.local/bin/
curl -LsSf https://astral.sh/uv/install.sh | sh
# Add ~/.local/bin to PATH: append the following line to ~/.bashrc, then reload the shell configuration
export PATH="$PATH:~/.local/bin"
source ~/.bashrc
# Create a clean venv environment
uv venv myenv --python 3.12 --seed
source myenv/bin/activate
If the CUDA environment variables you set earlier no longer take effect after installing uv and nvcc or nvidia-smi cannot be found, edit myenv/bin/activate (for example, with vim myenv/bin/activate) and add the following two lines after the existing export PATH statement:
export PATH="$PATH:/usr/local/cuda-12.4/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"
# Install vllm and modelscope
uv pip install vllm==0.7.2
uv pip install modelscope
# GPU monitoring tool, you can also use the default nvidia-smi
uv pip install nvitop
Downloading the QwQ-32B model and the vLLM benchmark script
# Download the model. Download it to the data disk /data to avoid running out of space
mkdir -p /data/Qwen/QwQ-32B
cd /data/Qwen/QwQ-32B
modelscope download --model Qwen/QwQ-32B --local_dir .
# Download dataset (optional)
wget https://www.modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split/resolve/master/ShareGPT_V3_unfiltered_cleaned_split.json
# Install git (if git is not available)
apt update
apt install git -y
# Clone the vLLM repository, which includes the benchmark scripts
git clone https://github.com/vllm-project/vllm.git
Online testing
Starting the vllm server
vllm serve /data/Qwen/QwQ-32B/ \
--host 127.0.0.1 \
--port 8080 \
--tensor-parallel-size 2 \
--trust-remote-code \
--served-model-name qw \
--gpu-memory-utilization 0.95 \
--enforce-eager \
--max-num-batched-tokens 8192 \
--max-model-len 8192 \
--enable-prefix-caching
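Before running the benchmark, you can verify that the server responds. The following minimal Python sketch calls the OpenAI-compatible /v1/chat/completions endpoint that vllm serve exposes, using the host, port, and served model name from the command above; it assumes the requests package is installed in your environment.

```python
# Quick sanity check against the OpenAI-compatible endpoint exposed by vllm serve.
# Host, port, and model name match the serve command above.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "qw",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "max_tokens": 256,
        "temperature": 0.6,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```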
Starting the test
python3 ./vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --served-model-name qw \
  --model /data/Qwen/QwQ-32B \
  --dataset-name random \
  --random-input 1024 \
  --random-output 4096 \
  --random-range-ratio 1 \
  --max-concurrency 4 \
  --num-prompts 10 \
  --host 127.0.0.1 \
  --port 8080 \
  --save-result \
  --result-dir /data/logs/ \
  --result-filename QwQ-32B-4-1-4.log
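With --save-result, the benchmark writes a JSON summary to the file given by --result-dir and --result-filename. The following sketch pulls out the headline numbers; the exact field names (for example mean_ttft_ms, mean_tpot_ms, output_throughput) can vary between vLLM versions, so check the keys present in your own result file first.

```python
# Extract the headline metrics from the saved benchmark result. Field names
# such as mean_ttft_ms may differ between vLLM versions; inspect your file first.
import json

with open("/data/logs/QwQ-32B-4-1-4.log") as f:
    result = json.load(f)

for key in ("request_throughput", "output_throughput", "mean_ttft_ms", "mean_tpot_ms"):
    if key in result:
        print(f"{key}: {result[key]}")
```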
Test completion
The test results are as follows: