
Container Service for Kubernetes:Best practices for deploying the DeepSeek full version across multiple nodes in ACK

Last Updated: Feb 09, 2026

This topic describes best practices for deploying the DeepSeek full version model across multiple nodes in Alibaba Cloud Container Service for Kubernetes (ACK) to achieve optimal performance, scalability, and reliability.

Background information

DeepSeek Overview

DeepSeek is a series of large language models. The full version models support both hybrid-thinking and thinking-only modes, providing advanced reasoning capabilities for complex tasks. These models are particularly suitable for scenarios requiring deep analytical thinking and problem-solving.

Supported DeepSeek models:

  • Hybrid-thinking models (thinking is disabled by default): deepseek-v3.2, deepseek-v3.2-exp, deepseek-v3.1

  • Thinking-only models: deepseek-r1, deepseek-r1-0528, deepseek-r1 distilled models

Note

DeepSeek models are currently only available in the Beijing region.

ACK distributed deployment benefits

Deploying DeepSeek models across multiple nodes in ACK provides several advantages:

  • Scalability: Dynamically scale compute resources based on workload demands

  • High availability: Distribute workloads across multiple nodes to eliminate single points of failure

  • Resource optimization: Efficiently utilize GPU resources across the cluster

  • Performance: Reduce latency through intelligent load balancing and proximity scheduling

Prerequisites

  • Create an ACK managed cluster (Pro edition) with GPU nodes. The cluster Kubernetes version must be 1.22 or later.

    Ensure your cluster contains nodes with compatible NVIDIA GPUs (recommended: A10, V100, or A100). Use NVIDIA driver version 525 for optimal performance.

  • Ensure you have sufficient permissions to create and manage Kubernetes resources, including deployments, services, and persistent volumes.

  • Configure Object Storage Service (OSS) for model storage and caching. Create a bucket to store model weights and configuration files.

  • Set up proper network policies and security groups to allow inter-node communication and external access to the inference service.

Architecture design principles

Multi-node deployment strategy

When deploying DeepSeek models across multiple nodes, consider the following architectural approaches:

Horizontal Pod Autoscaler (HPA) approach

Deploy multiple replicas of the DeepSeek inference service across different nodes, allowing Kubernetes to automatically scale based on CPU, memory, or custom metrics.

Node affinity and anti-affinity

Use node affinity and pod anti-affinity rules to control how DeepSeek pods are distributed across nodes and availability zones:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: NotIn
          values:
          - node-1
          - node-2
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: deepseek-inference
        topologyKey: kubernetes.io/hostname

Resource management

Configure appropriate resource requests and limits for DeepSeek pods:

resources:
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "1"

Deployment

Step 1: Prepare the DeepSeek model

  1. Download the DeepSeek model files to your OSS bucket:

    #!/bin/bash
    # Download DeepSeek model from ModelScope
    git lfs install
    git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-V3.2.git /tmp/deepseek-v3.2
    # Upload to OSS
    ossutil cp -r /tmp/deepseek-v3.2 oss://your-bucket/models/deepseek-v3.2/
  2. Optimize the model for inference using TensorRT-LLM (optional but recommended for better performance):

    cd /workspace/tensorrt_llm/examples/deepseek
    python convert_checkpoint.py \
      --model_dir /models/deepseek-v3.2 \
      --output_dir /models/deepseek-v3.2-trtllm \
      --dtype float16

Step 2: Configure storage and caching

  1. Create a Fluid Dataset for efficient model loading:

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: deepseek-models
    spec:
      mounts:
      - mountPoint: oss://your-bucket/models/
        name: models
        path: /
        options:
          fs.oss.endpoint: oss-cn-beijing.aliyuncs.com
        encryptOptions:
          - name: fs.oss.accessKeyId
            valueFrom:
              secretKeyRef:
                name: oss-secret
                key: accessKeyId
          - name: fs.oss.accessKeySecret
            valueFrom:
              secretKeyRef:
                name: oss-secret
                key: accessKeySecret
  2. Configure JindoRuntime for caching:

    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: deepseek-models
    spec:
      replicas: 3
      tieredstore:
        levels:
          - mediumtype: MEM
            volumeType: emptyDir
            path: /dev/shm
            quota: 50Gi
            high: "0.95"
            low: "0.7"
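With three cache workers at a 50 Gi quota each, the memory tier provides 150 Gi of aggregate capacity. A quick sanity check can confirm whether that covers the model weights; the model size below is an assumed placeholder, not an actual DeepSeek checkpoint size:

```python
# Sanity-check the Fluid cache sizing: with the JindoRuntime above,
# aggregate memory-tier capacity is replicas x per-worker quota.
def aggregate_cache_gib(replicas: int, quota_gib: int) -> int:
    """Total memory-tier cache across all JindoRuntime workers."""
    return replicas * quota_gib

model_size_gib = 120  # assumed checkpoint size; substitute your model's actual size
cache = aggregate_cache_gib(replicas=3, quota_gib=50)
print(cache >= model_size_gib)  # True: 150 Gi of cache covers the assumed weights
```

If the aggregate is smaller than the weights, increase either the replica count or the per-worker quota so the hot model files stay resident in memory.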

Step 3: Deploy the inference service

  1. Create a Kubernetes deployment with multiple replicas:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: deepseek-inference
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: deepseek-inference
      template:
        metadata:
          labels:
            app: deepseek-inference
        spec:
          containers:
          - name: deepseek-server
            image: registry.cn-beijing.aliyuncs.com/your-registry/deepseek-inference:latest
            ports:
            - containerPort: 8000
            env:
            - name: MODEL_PATH
              value: "/models/deepseek-v3.2"
            - name: ENABLE_THINKING
              value: "true"
            resources:
              requests:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "8"
                memory: "32Gi"
                nvidia.com/gpu: "1"
            volumeMounts:
            - name: model-storage
              mountPath: /models
          volumes:
          - name: model-storage
            persistentVolumeClaim:
              claimName: deepseek-models
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app: deepseek-inference
                  topologyKey: kubernetes.io/hostname
  2. Create a load balancer service:

    apiVersion: v1
    kind: Service
    metadata:
      name: deepseek-inference-service
    spec:
      selector:
        app: deepseek-inference
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8000
      type: LoadBalancer
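Once the LoadBalancer is provisioned, clients call the service on port 80 of its external IP. The request shape below assumes the serving image exposes an OpenAI-compatible /v1/chat/completions endpoint (common for vLLM-style servers, but verify against your image's actual API); the helper only builds the request and performs no network I/O:

```python
import json

def chat_request(host: str, prompt: str, model: str = "deepseek-v3.2"):
    """Build the URL and JSON body for one chat completion call.
    The /v1/chat/completions path is an assumption about the serving
    framework; adjust it to match your inference image."""
    url = f"http://{host}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    })
    return url, body

url, body = chat_request("<LoadBalancer-IP>", "Summarize Kubernetes HPA in one sentence.")
print(url)
```

In practice you would POST the body to the returned URL with a standard HTTP client.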

Step 4: Configure auto scaling

  1. Configure Horizontal Pod Autoscaler:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: deepseek-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: deepseek-inference
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 80
  2. Set up a custom metric for GPU utilization (optional). Add the following entry to the HPA's metrics list; this requires a metrics adapter that exposes a gpu_utilization pod metric:

    # Custom metric for GPU utilization
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "75"
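The HPA's scaling decision follows a simple rule: the desired replica count is the current count scaled by the ratio of observed to target utilization, rounded up and clamped to the min/max bounds. A small sketch of that rule, using the CPU target and bounds from the manifest above:

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 2, max_r: int = 10) -> int:
    """HPA scaling rule: ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

# 3 replicas averaging 90% CPU against the 70% target -> scale out to 4
print(desired_replicas(3, 90, 70))  # 4
```

This is why conservative targets (70% CPU, 80% memory) leave headroom: scaling reacts before pods saturate.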

Monitoring and optimization

Monitoring setup

Implement comprehensive monitoring for your DeepSeek deployment:

  • Prometheus metrics for CPU, memory, and GPU utilization

  • Grafana dashboards for real-time visualization

  • Centralized logging with Fluentd or Logstash

  • Distributed tracing with Jaeger for request flow analysis

Performance optimization techniques

Apply these optimization strategies for better performance:

Request batching

Configure batching to improve throughput:

env:
- name: MAX_BATCH_SIZE
  value: "8"
- name: BATCH_TIMEOUT_MS
  value: "100"
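The two settings interact as follows: a batch is dispatched as soon as it reaches MAX_BATCH_SIZE, or when BATCH_TIMEOUT_MS elapses after the first request arrives, whichever comes first. A minimal sketch of that loop (variable names mirror the settings above; a real serving framework's batcher is more involved):

```python
import queue
import time

def collect_batch(q, max_batch_size=8, batch_timeout_ms=100):
    """Block for the first request, then fill the batch until either
    max_batch_size is reached or batch_timeout_ms elapses."""
    batch = [q.get()]  # wait for at least one request
    deadline = time.monotonic() + batch_timeout_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(10):
    q.put(f"request-{i}")
print(len(collect_batch(q)))  # 8: capped by max_batch_size
```

Larger batches raise throughput at the cost of per-request latency; the timeout bounds how long a lone request can wait.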

Model caching strategies

Use multi-tier caching to reduce model loading time:

  1. RAM-based caching for frequently accessed model parts

  2. NVMe SSD caching for larger model components

  3. Redis or Memcached for cross-node sharing

Connection pooling

Implement connection pooling to reduce overhead:

env:
- name: CONNECTION_POOL_SIZE
  value: "50"
- name: KEEP_ALIVE_TIMEOUT
  value: "300"
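Whether the serving image honors these variables depends on the image itself; the underlying idea is a fixed-size pool in which connections are created once and handed out per request instead of being opened fresh each time. A stripped-down sketch:

```python
import queue

class ConnectionPool:
    """Fixed-size pool: connections are created up front and reused,
    so each request skips connection-setup overhead."""
    def __init__(self, factory, size=50):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=None):
        """Borrow a connection; blocks if all are in use."""
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        """Return a connection so another request can reuse it."""
        self._pool.put(conn)

# Demo with a stand-in factory; in practice factory() would open an
# HTTP or gRPC connection to a backend pod.
pool = ConnectionPool(factory=object, size=3)
conn = pool.acquire()
pool.release(conn)
```

The pool size bounds concurrent upstream connections, which also protects backend pods from connection storms during scaling events.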

Best practices

Resource planning

  • GPU allocation: Allocate dedicated GPUs per DeepSeek instance to avoid resource contention

  • Memory sizing: Reserve at least 2x model size in RAM for optimal performance

  • CPU cores: Allocate 4-8 CPU cores per GPU instance for preprocessing tasks
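The 2x rule above turns into a quick estimate: raw weight size is parameter count times bytes per parameter, doubled for headroom. The 7B figure below is an illustrative distilled-model size, not a measured requirement:

```python
def model_ram_gib(params_billion, bytes_per_param, headroom=2.0):
    """Rule of thumb from above: reserve headroom x the raw weight size."""
    weights_gib = params_billion * 1e9 * bytes_per_param / 2**30
    return weights_gib * headroom

# e.g. a 7B distilled model served in FP16 (2 bytes per parameter)
print(round(model_ram_gib(7, 2)))  # 26 (GiB)
```

Run the same arithmetic for your actual checkpoint and dtype before fixing the memory requests in the Deployment manifest.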

Scaling strategies

  • Gradual scaling: Scale incrementally based on actual load patterns rather than peak estimates

  • Zone distribution: Distribute replicas across multiple availability zones for fault tolerance

  • Warm-up periods: Implement gradual traffic increase during scaling events

Cost optimization

  • Spot instances: Use spot instances for non-critical inference workloads

  • Autoscaling boundaries: Set appropriate min/max limits to balance performance and cost

  • Idle shutdown: Implement automatic shutdown for low-usage periods

Security considerations

  • Network policies: Restrict pod-to-pod communication using Kubernetes Network Policies

  • TLS encryption: Enable TLS for all external communications

  • RBAC: Implement Role-Based Access Control for cluster resources

  • Secrets management: Use Kubernetes Secrets or external vaults for credential management
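As a concrete starting point for the network-policy item above, the following NetworkPolicy restricts ingress to the inference pods to clients carrying a role: gateway label on the serving port. The client label is an assumption for illustration; adjust the selector to your actual client workloads:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deepseek-inference-ingress
spec:
  podSelector:
    matchLabels:
      app: deepseek-inference
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: gateway   # assumed label on the clients allowed to call the service
    ports:
    - protocol: TCP
      port: 8000
```

NetworkPolicies require a CNI plugin that enforces them; verify enforcement in your cluster before relying on this isolation.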

Troubleshooting

Common issues and solutions

GPU out of memory errors

Symptoms: CUDA out of memory errors during inference

Solutions:

  • Reduce batch size

  • Increase GPU memory allocation

  • Enable model quantization

  • Use model parallelism across multiple GPUs
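Reducing batch size is the first lever because KV-cache memory grows linearly with both batch size and sequence length. A back-of-envelope estimate, using illustrative 7B-class shapes rather than DeepSeek's actual architecture:

```python
def kv_cache_gib(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache holds one key and one value vector per token, per layer,
    per KV head; total size scales linearly with batch and sequence length."""
    total = 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem
    return total / 2**30

# Illustrative 7B-class config (assumed shapes, not DeepSeek's actual ones)
print(kv_cache_gib(batch=8, seq_len=4096, layers=32, kv_heads=32, head_dim=128))  # 16.0 (GiB)
```

Halving the batch size halves this figure, which is often enough to clear a CUDA out-of-memory error without changing the model.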

Slow inference performance

Symptoms: High latency and slow response times

Solutions:

  • Check GPU utilization metrics

  • Optimize model with TensorRT-LLM

  • Implement request batching

  • Review network connectivity between nodes

Auto scaling problems

Symptoms: Pods failing to start or scale properly

Solutions:

  • Verify resource quotas and limits

  • Check node resource availability

  • Review affinity and anti-affinity rules

  • Monitor cluster autoscaler logs

Diagnostic commands

Useful commands for troubleshooting:

# Check pod status and resource usage
kubectl top pods -l app=deepseek-inference

# View detailed pod information
kubectl describe pod -l app=deepseek-inference

# Check GPU utilization
kubectl exec -it <pod-name> -- nvidia-smi

# Monitor logs
kubectl logs -f -l app=deepseek-inference --tail=100

# Check HPA status
kubectl get hpa deepseek-hpa

# Verify service endpoints
kubectl get endpoints deepseek-inference-service
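The commands above can also be scripted for quick triage. The sketch below flags pods whose reported memory exceeds a threshold; the sample text mirrors the typical kubectl top pods column layout, and the pod names are placeholders:

```python
# Sample output in the typical `kubectl top pods` format
# (NAME CPU(cores) MEMORY(bytes)); feed it real command output in practice.
sample = """\
NAME                                  CPU(cores)   MEMORY(bytes)
deepseek-inference-5d8f9c7b6d-abcde   3870m        29Gi
deepseek-inference-5d8f9c7b6d-fghij   4105m        31Gi
"""

def pods_over_memory(top_output, threshold_gi):
    """Return names of pods whose memory column is at least threshold_gi GiB."""
    flagged = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        name, _cpu, mem = line.split()
        if mem.endswith("Gi") and int(mem[:-2]) >= threshold_gi:
            flagged.append(name)
    return flagged

print(pods_over_memory(sample, threshold_gi=30))  # ['deepseek-inference-5d8f9c7b6d-fghij']
```

Wiring this into a cron job or alerting rule catches pods drifting toward their 32 Gi memory limit before the kubelet evicts them.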