
Best practices for QwQ-32B inference on edge cloud

Updated at: 2025-03-28 05:40

This topic provides a detailed introduction to the features and key metrics of the QwQ-32B model, best practices for edge cloud deployment, and steps for setting up a testing environment. It offers a comprehensive guide to help you quickly understand model features, deployment requirements, and performance optimization methods. This enables you to efficiently deploy and use the model in an edge cloud environment, improving inference efficiency and reducing costs.

QwQ-32B model

Model introduction

The QwQ-32B model is an open-source reasoning model built on Qwen2.5-32B and further trained with reinforcement learning, which significantly improves its reasoning capabilities. Its core benchmarks in mathematics and code (AIME 24/25, LiveCodeBench) and general benchmarks (IFEval, LiveBench, and others) reach the level of the full DeepSeek-R1. It achieves a breakthrough of "small parameters, high performance" on reasoning tasks, providing a more cost-effective alternative to expensive large-model deployments.

Scenarios

The QwQ-32B model is suitable for mathematical and logical reasoning, long document processing, code generation, and similar scenarios, and it also performs well in Chinese knowledge Q&A and multi-round conversation. Typical inference scenarios are classified as follows:

Inference scenario type | Average input length (Tokens) | Average output length (Tokens) | Typical application cases
Mathematical logical reasoning | 0.5K-1.5K | 0.8K-3.6K | MATH problem solving, LSAT logical problem analysis
Knowledge Q&A | 1K-4K | 0.2K-1K | MMLU knowledge assessment, medical consultation
Multi-round conversation system | 2K-8K | 0.5K-2K | Customer service dialogue, psychological consultation
Long document processing | 8K-16K | 1K-4K | Paper summary, legal document analysis
Code generation/debugging | 0.3K-2K | 1K-5K | Function implementation, error fixing
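To see which bucket your own workload falls into, you can count prompt tokens with the model's tokenizer. The following is only a minimal sketch; it assumes the transformers package is available (it is installed later in this topic as a vLLM dependency) and that the model weights have been downloaded to /data/Qwen/QwQ-32B as described in the environment setup below:

# Count the tokens of a sample prompt with the QwQ-32B tokenizer
python3 - <<'EOF'
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/data/Qwen/QwQ-32B")
prompt = "Summarize the key obligations of each party in the attached contract."
print(len(tokenizer(prompt)["input_ids"]), "input tokens")
EOF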

Key metrics for model inference

Metric | Definition
Model precision | The numerical precision used for model weights and computation. Lower-precision versions occupy less video memory and cost fewer resources, but may lose accuracy on complex tasks.
Concurrency | The number of user requests processed simultaneously. Higher concurrency means greater business capacity, but it also increases video memory usage and video memory bandwidth consumption.
Input length | The number of tokens in the user-provided prompt, which directly affects video memory usage. Long inputs increase TTFT.
Output length | The number of tokens in the response generated by the model, which directly affects video memory usage. Excessively long outputs may be truncated or cause out-of-memory (OOM) errors.
TTFT (Time To First Token) | The time from when a request is submitted to when the first output token is received. High first-token latency makes the service feel sluggish. It is recommended to keep TTFT below 1 s and not let it exceed 2 s.
TPOT (Time Per Output Token) | The average time required to generate each output token (excluding the first token), reflecting how well generation speed matches reading speed. It is recommended to keep TPOT below 50 ms and not let it exceed 100 ms.
Single-route throughput | The output token rate per route (Tokens/s). Low single-route throughput leads to a poor user experience. It is recommended to keep it in the range of 10 Tokens/s to 30 Tokens/s.
Video memory usage rate | The percentage of video memory in use at runtime, consisting of model parameters, KV cache, and intermediate activations. High video memory usage (for example, >95%) easily triggers OOM and directly affects service availability.
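Two quick rules of thumb tie these metrics together: single-route throughput is roughly the inverse of TPOT, and most video memory is consumed by the model weights themselves. The following back-of-the-envelope check is only a rough sketch (it uses the nominal parameter count and standard FP16/INT4 byte sizes, not measured values):

# Per-route decode speed is roughly 1000 / TPOT (ms)
python3 -c 'tpot_ms = 67.4; print(round(1000 / tpot_ms, 1), "tokens/s")'   # about 14.8 tokens/s

# Approximate weight footprint of a 32B-parameter model at different precisions
python3 -c 'p = 32e9; print("FP16: %.0f GiB, INT4: %.0f GiB" % (p * 2 / 2**30, p * 0.5 / 2**30))'

The roughly 60 GiB of FP16 weights is why a dual-card 48 GB instance is recommended for FP16, while the roughly 15 GiB of INT4 weights plus KV cache can be spread across five 12 GB cards.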

Best practices for deploying the QwQ-32B model on edge cloud

Edge cloud provides heterogeneous computing resources in multiple specifications on widely distributed nodes to meet the computing power requirements of different scenarios. Single-card video memory ranges from 12 GB to 48 GB. The recommended configurations and inference performance for deploying QwQ-32B at different precisions on edge cloud are as follows:

  • For QwQ-32B at FP16 precision, a dual-card instance with 48 GB of video memory per card is recommended.

    • The 48 GB dual-card instance is provided as a virtual machine with the following resource configuration:

      Environment parameter | Value
      CPU | 96 cores
      Memory | 384 GB
      GPU | NVIDIA 48 GB * 2
      Operating system | Ubuntu 22.04
      Docker version | 26.1.3
      GPU driver | Driver Version: 570.124.06, CUDA Version: 12.4
      Inference framework | vLLM 0.7.2

    • Performance in specific scenarios

      Scenario type | Input length | Output length | Concurrency | Single-route throughput (Tokens/s) | TTFT (s) | TPOT (ms) | Video memory usage rate
      Mathematical logical reasoning & code generation | 1K | 4K | 4 | 14.5 | 0.6 | 67.4 | 95%
      Mathematical logical reasoning & code generation | 1K | 4K | 8 | 13.3 | 1.6 | 71.3 | 95%
      Knowledge Q&A | 4K | 1K | 2 | 14.2 | 1.8 | 68.6 | 95%
      Knowledge Q&A | 4K | 1K | 4 | 13 | 2.7 | 72.7 | 95%
      Multi-round conversation system & long document processing | 4K | 4K | 2 | 14.64 | 1.7 | 71.5 | 95%
      Multi-round conversation system & long document processing | 4K | 4K | 4 | 13.6 | 2.9 | 82.3 | 95%

      • Mathematical logical reasoning & code generation scenario:

        This scenario is characterized by short inputs and long outputs, with input lengths of 0.3K-2K and output lengths of 0.8K-5K.

        At a concurrency of 4, single-route throughput approaches 15 Tokens/s and TTFT stays under 1 s, giving the best balance between user experience and cost. At a concurrency of 8, the higher TTFT slightly degrades the user experience but remains acceptable. To further reduce cost per request, you can increase the concurrency.

      • Knowledge Q&A scenario:

        This scenario is characterized by long inputs and short outputs, with input lengths of 1K-4K and output lengths of 0.2K-1K.

        A single instance works best at a concurrency of 2. When the concurrency rises to 4, TTFT exceeds 2 s; even after accounting for network latency, the impact on user experience is still acceptable.

      • Multi-round conversation system & long document processing scenario:

        This scenario is characterized by long inputs and long outputs, with input lengths of 2K-16K and output lengths of 1K-4K.

        Longer inputs not only consume more video memory but also significantly increase TTFT. A single instance works best at a concurrency of 2. Control the input length and concurrency according to your actual business requirements.

  • For QwQ-32B at INT4 precision, a five-card instance with 12 GB of video memory per card is recommended.

    • The 12 GB five-card instance is provided as a bare metal server with the following resource configuration:

      Environment parameter | Value
      CPU | 24 cores × 2, 3.0-4.0 GHz
      Memory | 256 GB
      GPU | NVIDIA 12 GB * 5
      Operating system | Ubuntu 20.04
      Docker version | 28.0.1
      GPU driver | Driver Version: 570.124.06, CUDA Version: 12.4
      Inference framework | vLLM 0.7.2

    • Performance in specific scenarios

      The 12 GB five-card instance delivers sufficient single-route throughput for both single-route and multi-route concurrency. However, because single-card video memory is limited, TTFT is less than ideal. It is recommended to deploy mathematical logical reasoning and code generation workloads on this configuration. For scenarios with longer inputs, such as knowledge Q&A, multi-round conversation, and long document processing, use a 48 GB dual-card instance instead.

      Scenario type | Input length | Output length | Concurrency | Single-route throughput (Tokens/s) | TTFT (s) | TPOT (ms) | Video memory usage rate
      Mathematical logical reasoning & code generation | 1K | 4K | 2 | 37 | 1.3 | 26.4 | 96.5%
      Mathematical logical reasoning & code generation | 1K | 4K | 4 | 32.5 | 1.7 | 28.7 | 96.5%
      Mathematical logical reasoning & code generation | 1K | 4K | 8 | 24.6 | 3.5 | 61.5 | 96.5%
      Knowledge Q&A | 4K | 1K | 1 | 33.5 | 4.7 | 25.1 | 96.5%
      Multi-round conversation system & long document processing | 4K | 4K | 1 | 35.8 | 4.7 | 26.6 | 96.5%
      Multi-round conversation system & long document processing | 8K | 4K | 1 | 21.9 | 9.3 | 43.3 | 96.5%

      • Mathematical logical reasoning & code generation scenario

        At a concurrency of 2, single-route throughput reaches 37 Tokens/s with a TTFT of 1.3 s, giving the best balance between user experience and cost. At a concurrency of 8, the impact on user experience is significant; if you want a more cost-effective setup, increase the concurrency to 4 instead.

      • Knowledge Q&A, multi-round conversation system, and long document processing scenarios

        Because the long inputs occupy a large amount of video memory, TTFT approaches 5 s even at a concurrency of 1, which makes this configuration unsuitable for production use in these scenarios. It can still be used to set up PoC environments.

Setting up the testing environment

Creating and initializing a 48GB video memory dual-card instance

Creating an instance through the console

  1. Log on to the ENS console.

  2. In the left-side navigation pane, choose Computing & Image > Instance.

  3. On the Instance page, click Create Instance. For details about the parameters of an ENS instance, see Create Instance.

    1. Configure the instance as needed. The recommended configuration is as follows:

      Page | Parameter | Reference value
      Basic configuration | Billing method | Subscription
      Basic configuration | Instance type | X86 Computing, NVIDIA 48GB * 2 (for detailed specifications, consult your account manager)
      Basic configuration | Image | Ubuntu, ubuntu_22_04_x64_20G_alibase_20240926
      Network and storage | Network | Self-built network
      Network and storage | System disk | Ultra disk, 80 GB or more
      Network and storage | Data disk | Ultra disk, 1 TB or more
      System settings | Password settings | Password/key pair

    2. Confirm the order.

      After you complete the system settings, click Confirm Order in the lower-right corner. The system configures the instance according to your settings and displays the price. After you complete the payment, a success message appears and you can return to the ENS console.

      You can find the new instance in the instance list of the ENS console. When the instance status is Running, the instance is ready to use.

Creating an instance through OpenAPI

You can also create an instance by calling the OpenAPI. You can quickly create an instance with OpenAPI in the Alibaba Cloud Developer Portal.

The following sample request parameters are for reference; adjust them as needed:

{
  "InstanceType": "ens.gnxxxx",
  "InstanceChargeType": "PrePaid",
  "ImageId": "ubuntu_22_04_x64_20G_alibase_20240926",
  "ScheduleAreaLevel": "Region",
  "EnsRegionId": "cn-your-ens-region",
  "Password": "your-password",
  "InternetChargeType": "95BandwidthByMonth",
  "SystemDisk": {
    "Size": 80,
    "Category": "cloud_efficiency"
  },
  "DataDisk": [
    {
      "Category": "cloud_efficiency",
      "Size": 1024
    }
  ],
  "InternetMaxBandwidthOut": 5000,
  "Amount": 1,
  "NetWorkId": "n-xxxxxxxxxxxxxxx",
  "VSwitchId": "vsw-xxxxxxxxxxxxxxx",
  "InstanceName": "test",
  "HostName": "test",
  "PublicIpIdentification": true,
  "InstanceChargeStrategy": "instance"
}
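If you have the Alibaba Cloud CLI (aliyun) installed and configured, you can also submit the same parameters from the command line. The following is only a rough sketch under that assumption: the parameter names mirror the ENS RunInstances and DescribeInstances operations shown above, the disk parameters are omitted for brevity, and the region, network IDs, instance type, and password are placeholders that you must replace with your own values.

# Create the instance by calling the ENS RunInstances operation (replace all placeholders)
aliyun ens RunInstances \
  --InstanceType ens.gnxxxx \
  --InstanceChargeType PrePaid \
  --ImageId ubuntu_22_04_x64_20G_alibase_20240926 \
  --ScheduleAreaLevel Region \
  --EnsRegionId cn-your-ens-region \
  --Password your-password \
  --InternetChargeType 95BandwidthByMonth \
  --InternetMaxBandwidthOut 5000 \
  --Amount 1 \
  --NetWorkId n-xxxxxxxxxxxxxxx \
  --VSwitchId vsw-xxxxxxxxxxxxxxx \
  --InstanceName test \
  --HostName test \
  --PublicIpIdentification true \
  --InstanceChargeStrategy instance

# Check that the new instance reaches the Running state
aliyun ens DescribeInstances --InstanceName test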

Instance login and disk initialization

Instance login

You can refer to Connect to an instance to log in to the instance.
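If the instance has a public IP address and password login enabled, a direct SSH connection also works; for example (the IP address below is a placeholder):

# Log on as root with the password you set when creating the instance
ssh root@<public-IP-of-your-instance>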

Disk initialization

  1. Expand the root partition.

    After the instance is created or resized, expand the root partition online; no restart is required.

    # Install cloud environment toolkit
    sudo apt-get update
    sudo apt-get install -y cloud-guest-utils
    
    # Ensure GPT partition tool sgdisk exists
    type sgdisk || sudo apt-get install -y gdisk
    
    # Expand physical partition
    sudo LC_ALL=en_US.UTF-8 growpart /dev/vda 3
    
    # Adjust file system size
    sudo resize2fs /dev/vda3
    
    # Verify expansion result
    df -h


  2. Mount the data disk.

    Format the data disk and mount it. The following commands are for reference; adjust them as needed.

    # Identify new disk
    lsblk
    
    # Format directly without partitioning
    sudo mkfs -t ext4 /dev/vdb
    
    # Configure mounting
    sudo mkdir /data
    echo "UUID=$(sudo blkid -s UUID -o value /dev/vdb) /data ext4 defaults,nofail 0 0" | sudo tee -a /etc/fstab
    
    # Verify
    sudo mount -a
    df -hT /data
    
    # Grant your user ownership of the mount point
    sudo chown $USER:$USER /data


    Note

    If you want to create an image based on this instance, first delete the data disk entry that was added to the /etc/fstab file (the line ending with ext4 defaults,nofail 0 0). If you do not delete it, instances created from your image may fail to start.
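    A minimal sketch of removing that entry before creating the image (it matches the /data mount line added above):

    # Remove the /data mount entry from /etc/fstab before creating an image
    sudo sed -i '/\/data ext4 defaults,nofail/d' /etc/fstab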

Installing the vllm inference environment

Installing CUDA

You can refer to CUDA Toolkit 12.4 Downloads | NVIDIA Developer to complete the installation of CUDA.

# Download the CUDA Toolkit installer
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_570.124.06_linux.run
chmod +x cuda_12.4.0_570.124.06_linux.run

# This step takes a while and requires interaction with the installer's text UI
sudo sh cuda_12.4.0_570.124.06_linux.run

# Add the CUDA environment variables to ~/.bashrc (adjust the path if you installed a different version)
vim ~/.bashrc
export PATH="$PATH:/usr/local/cuda-12.4/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"
source ~/.bashrc

# Verify the installation
nvcc -V
nvidia-smi

Auxiliary software installation (optional)

uv is a convenient Python virtual environment and dependency management tool, well suited to hosts that need to run multiple models. You can refer to Installation | uv (astral.sh) to complete the installation of uv.

# Install uv, default installation in ~/.local/bin/
curl -LsSf https://astral.sh/uv/install.sh | sh

# Edit ~/.bashrc
export PATH="$PATH:~/.local/bin"

source ~/.bashrc

# Create a clean venv environment
uv venv myenv --python 3.12 --seed
source myenv/bin/activate

If the CUDA environment variables you set earlier become unavailable after you activate the uv environment (for example, nvcc or nvidia-smi cannot be found), edit myenv/bin/activate and append the following lines after its export PATH statement:

vim myenv/bin/activate
export PATH="$PATH:/usr/local/cuda-12.4/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"

# Install vllm and modelscope
uv pip install vllm==0.7.2
uv pip install modelscope

# GPU monitoring tool, you can also use the default nvidia-smi
uv pip install nvitop
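A quick check that the inference environment is usable inside the virtual environment:

# Confirm that vllm imports correctly and that both GPUs are visible
python3 -c "import torch, vllm; print(vllm.__version__, torch.cuda.device_count())"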

Downloading QwQ-32B model and VLLM benchmark script

# Download the model. Download it to the data disk /data to avoid running out of space
mkdir -p /data/Qwen/QwQ-32B
cd /data/Qwen/QwQ-32B
modelscope download --model Qwen/QwQ-32B --local_dir .

# Download the dataset (optional)
wget https://www.modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split/resolve/master/ShareGPT_V3_unfiltered_cleaned_split.json

# Install git (if git is not available)
sudo apt update
sudo apt install -y git

# Clone the vLLM repository, which includes the benchmark scripts
git clone https://github.com/vllm-project/vllm.git
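Optionally, confirm that the download completed; the BF16 weights of QwQ-32B take roughly 60 GB on disk (approximate figure):

# Check the size and contents of the model directory
du -sh /data/Qwen/QwQ-32B
ls /data/Qwen/QwQ-32B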

Online testing

Starting the vllm server

vllm serve /data/Qwen/QwQ-32B/ \
  --host 127.0.0.1 \
  --port 8080 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --served-model-name qw \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --max-num-batched-tokens 8192 \
  --max-model-len 8192 \
  --enable-prefix-caching
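After the server reports that it is ready, you can send a quick smoke-test request to the OpenAI-compatible endpoint exposed by vLLM (the model name matches --served-model-name):

# Smoke-test the server with a single chat completion request
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qw",
        "messages": [{"role": "user", "content": "What is 17 * 23?"}],
        "max_tokens": 256
      }'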

Starting the test

python3 ./vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --served-model-name qw \
  --model /data/Qwen/QwQ-32B \
  --dataset-name random \
  --random-input 1024 \
  --random-output 4096 \
  --random-range-ratio 1 \
  --max-concurrency 4 \
  --num-prompts 10 \
  --host 127.0.0.1 \
  --port 8080 \
  --save-result \
  --result-dir /data/logs/ \
  --result-filename QwQ-32B-4-1-4.log
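To reproduce several rows of the performance tables above, you can sweep the client-side concurrency in a loop. This is only a sketch; the result file names and the scaling of --num-prompts with concurrency are arbitrary choices:

# Run the 1K-input / 4K-output workload at several concurrency levels
mkdir -p /data/logs
for c in 2 4 8; do
  python3 ./vllm/benchmarks/benchmark_serving.py \
    --backend vllm --served-model-name qw --model /data/Qwen/QwQ-32B \
    --dataset-name random --random-input 1024 --random-output 4096 \
    --random-range-ratio 1 --max-concurrency $c --num-prompts $((c * 10)) \
    --host 127.0.0.1 --port 8080 \
    --save-result --result-dir /data/logs/ --result-filename QwQ-32B-1K-4K-c$c.log
done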

Test completion

When the benchmark finishes, it prints a summary of the results to the console and saves them to the file specified by --result-filename in the /data/logs/ directory.
