This topic describes best practices for deploying the DeepSeek full version model across multiple nodes in Alibaba Cloud Container Service for Kubernetes (ACK) to achieve optimal performance, scalability, and reliability.
Background information
DeepSeek Overview
DeepSeek is a series of large language models. The full version models support both hybrid-thinking and thinking-only modes, providing advanced reasoning capabilities for complex tasks. These models are particularly suitable for scenarios requiring deep analytical thinking and problem-solving.
Supported DeepSeek models:
Hybrid-thinking models (thinking mode disabled by default): deepseek-v3.2, deepseek-v3.2-exp, deepseek-v3.1
Thinking-only models: deepseek-r1, deepseek-r1-0528, deepseek-r1 distilled models
DeepSeek models are currently only available in the Beijing region.
ACK distributed deployment benefits
Deploying DeepSeek models across multiple nodes in ACK provides several advantages:
Scalability: Dynamically scale compute resources based on workload demands
High availability: Distribute workloads across multiple nodes to eliminate single points of failure
Resource optimization: Efficiently utilize GPU resources across the cluster
Performance: Reduce latency through intelligent load balancing and proximity scheduling
Prerequisites
Create an ACK managed cluster (Pro edition) with GPU nodes. The cluster Kubernetes version must be 1.22 or later.
Ensure your cluster contains nodes with compatible NVIDIA GPUs (recommended: A10, V100, or A100). Use NVIDIA driver version 525 for optimal performance.
Ensure you have sufficient permissions to create and manage Kubernetes resources, including deployments, services, and persistent volumes.
Configure Object Storage Service (OSS) for model storage and caching. Create a bucket to store model weights and configuration files.
Set up proper network policies and security groups to allow inter-node communication and external access to the inference service.
Architecture design principles
Multi-node deployment strategy
When deploying DeepSeek models across multiple nodes, consider the following architectural approaches:
Horizontal Pod Autoscaler (HPA) approach
Deploy multiple replicas of the DeepSeek inference service across different nodes, allowing Kubernetes to automatically scale based on CPU, memory, or custom metrics.
Node affinity and anti-affinity
Use node affinity rules to ensure optimal distribution of DeepSeek pods across different nodes and availability zones:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: NotIn
              values:
                - node-1
                - node-2
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: deepseek-inference
          topologyKey: kubernetes.io/hostname

Resource management
Configure appropriate resource requests and limits for DeepSeek pods:
resources:
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "1"

Deployment
Step 1: Prepare the DeepSeek model
Download the DeepSeek model files to your OSS bucket:
#!/bin/bash
# Download the DeepSeek model from ModelScope
git lfs install
git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-V3.2.git /tmp/deepseek-v3.2
# Upload to OSS
ossutil cp -r /tmp/deepseek-v3.2 oss://your-bucket/models/deepseek-v3.2/

Optimize the model for inference using TensorRT-LLM (optional, but recommended for better performance):
cd /workspace/tensorrt_llm/examples/deepseek
python convert_checkpoint.py \
  --model_dir /models/deepseek-v3.2 \
  --output_dir /models/deepseek-v3.2-trtllm \
  --dtype float16
Step 2: Configure storage and caching
Create a Fluid Dataset for efficient model loading:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: deepseek-models
spec:
  mounts:
    - mountPoint: oss://your-bucket/models/
      name: models
      path: /
      options:
        fs.oss.endpoint: oss-cn-beijing.aliyuncs.com
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: accessKeySecret

Configure JindoRuntime for caching:
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: deepseek-models
spec:
  replicas: 3
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 50Gi
        high: "0.95"
        low: "0.7"
Step 3: Deploy the inference service
Create a Kubernetes deployment with multiple replicas:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: deepseek-inference
  template:
    metadata:
      labels:
        app: deepseek-inference
    spec:
      containers:
        - name: deepseek-server
          image: registry.cn-beijing.aliyuncs.com/your-registry/deepseek-inference:latest
          ports:
            - containerPort: 8000
          env:
            - name: MODEL_PATH
              value: "/models/deepseek-v3.2"
            - name: ENABLE_THINKING
              value: "true"
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: deepseek-models
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: deepseek-inference
                topologyKey: kubernetes.io/hostname

Create a load balancer service:
apiVersion: v1
kind: Service
metadata:
  name: deepseek-inference-service
spec:
  selector:
    app: deepseek-inference
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer
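Once the load balancer has an external IP, you can verify the endpoint with a simple client. The sketch below builds a chat request body; the OpenAI-compatible `/v1/chat/completions` path and the `chat_template_kwargs` field are assumptions about the serving image and may differ for your server.

```python
import json

def build_chat_request(model: str, prompt: str, enable_thinking: bool = True) -> str:
    """Build an OpenAI-compatible chat-completions request body.

    The field names below are assumptions about the serving image;
    adjust them if your server uses a different schema.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        # Hypothetical request-level switch mirroring the ENABLE_THINKING env var.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return json.dumps(payload)

body = build_chat_request("deepseek-v3.2", "Explain pod anti-affinity in one sentence.")
# Send it with e.g.:
#   curl http://<EXTERNAL-IP>/v1/chat/completions \
#     -H 'Content-Type: application/json' -d "$BODY"
```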
Step 4: Configure auto scaling
Configure Horizontal Pod Autoscaler:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Set up custom metrics for GPU utilization (optional):
# Custom metric for GPU utilization
- type: Pods
  pods:
    metric:
      name: gpu_utilization
    target:
      type: AverageValue
      averageValue: "75"
Monitoring and optimization
Monitoring setup
Implement comprehensive monitoring for your DeepSeek deployment:
Prometheus metrics for CPU, memory, and GPU utilization
Grafana dashboards for real-time visualization
Centralized logging with Fluentd or Logstash
Distributed tracing with Jaeger for request flow analysis
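If the Prometheus Operator runs in your cluster, the inference pods' metrics endpoint can be scraped with a ServiceMonitor. This is a sketch: it assumes the Service from Step 3 carries the app: deepseek-inference label (add it to the Service metadata if it does not) and that the serving image exposes a named metrics port at /metrics.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: deepseek-inference
spec:
  selector:
    matchLabels:
      app: deepseek-inference
  endpoints:
    - port: metrics   # assumes the Service declares a port named "metrics"
      path: /metrics
      interval: 30s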
Performance optimization techniques
Apply these optimization strategies for better performance:
Request batching
Configure batching to improve throughput:
env:
  - name: MAX_BATCH_SIZE
    value: "8"
  - name: BATCH_TIMEOUT_MS
    value: "100"

Model caching strategies
Use multi-tier caching to reduce model loading time:
RAM-based caching for frequently accessed model parts
NVMe SSD caching for larger model components
Redis or Memcached for cross-node sharing
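The RAM and NVMe tiers can be expressed directly in the JindoRuntime from Step 2 by stacking tieredstore levels. The SSD path and quotas below are illustrative values, not measured requirements.

apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: deepseek-models
spec:
  replicas: 3
  tieredstore:
    levels:
      - mediumtype: MEM        # hot tier: frequently accessed model parts
        volumeType: emptyDir
        path: /dev/shm
        quota: 50Gi
      - mediumtype: SSD        # warm tier: larger model components
        path: /mnt/nvme/cache  # illustrative NVMe mount path on the node
        quota: 500Gi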
Connection pooling
Implement connection pooling to reduce overhead:
env:
  - name: CONNECTION_POOL_SIZE
    value: "50"
  - name: KEEP_ALIVE_TIMEOUT
    value: "300"

Best practices
Resource planning
GPU allocation: Allocate dedicated GPUs per DeepSeek instance to avoid resource contention
Memory sizing: Reserve at least 2x model size in RAM for optimal performance
CPU cores: Allocate 4-8 CPU cores per GPU instance for preprocessing tasks
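The 2x rule of thumb above can be turned into a quick sizing calculation. The sketch below estimates a RAM request from parameter count and weight precision; the example model size is hypothetical.

```python
def recommended_ram_gib(params_billions: float, bytes_per_param: int = 2,
                        headroom_factor: float = 2.0) -> float:
    """Estimate the RAM request for a model using the 'at least 2x
    model size' rule of thumb from this guide.

    bytes_per_param: 2 for FP16/BF16 weights, 1 for INT8, and so on.
    """
    model_size_gib = params_billions * 1e9 * bytes_per_param / (1024 ** 3)
    return model_size_gib * headroom_factor

# A hypothetical 7B-parameter distilled model in FP16:
# ~13 GiB of weights -> request ~26 GiB of RAM.
print(round(recommended_ram_gib(7), 1))  # -> 26.1
```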
Scaling strategies
Gradual scaling: Scale incrementally based on actual load patterns rather than peak estimates
Zone distribution: Distribute replicas across multiple availability zones for fault tolerance
Warm-up periods: Implement gradual traffic increase during scaling events
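Gradual scaling and warm-up can be enforced through the HPA's behavior field (part of the autoscaling/v2 API), which rate-limits how quickly replicas are added or removed. The fragment below goes under spec in the HorizontalPodAutoscaler from Step 4; the windows are illustrative.

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Pods
        value: 1              # add at most one replica per minute
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 minutes before scaling in
    policies:
      - type: Percent
        value: 50
        periodSeconds: 60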
Cost optimization
Spot instances: Use spot instances for non-critical inference workloads
Autoscaling boundaries: Set appropriate min/max limits to balance performance and cost
Idle shutdown: Implement automatic shutdown for low-usage periods
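Spot capacity can be targeted with a toleration plus preferred node affinity on the pod spec. The node-type: spot label and matching taint below are hypothetical; use whatever label and taint you apply to your spot node pool.

tolerations:
  - key: node-type        # hypothetical taint applied to spot nodes
    operator: Equal
    value: spot
    effect: NoSchedule
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: node-type   # hypothetical label on spot nodes
              operator: In
              values:
                - spot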
Security considerations
Network policies: Restrict pod-to-pod communication using Kubernetes Network Policies
TLS encryption: Enable TLS for all external communications
RBAC: Implement Role-Based Access Control for cluster resources
Secrets management: Use Kubernetes Secrets or external vaults for credential management
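As a starting point for the network policy recommendation, the sketch below admits ingress to the inference pods only from pods labeled role: gateway, and only on port 8000; the gateway label is hypothetical and should match your actual API gateway pods.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deepseek-inference-ingress
spec:
  podSelector:
    matchLabels:
      app: deepseek-inference
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: gateway   # hypothetical label on your API gateway pods
      ports:
        - protocol: TCP
          port: 8000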
Troubleshooting
Common issues and solutions
GPU out of memory errors
Symptoms: CUDA out of memory errors during inference
Solutions:
Reduce batch size
Increase GPU memory allocation
Enable model quantization
Use model parallelism across multiple GPUs
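If your serving image is based on vLLM, the quantization and model-parallelism remedies map to its standard engine flags. The container args below are a sketch, since the actual entrypoint depends on your image; note that --tensor-parallel-size=2 also requires requesting nvidia.com/gpu: "2" in the pod's resources.

args:
  - --model=/models/deepseek-v3.2
  - --tensor-parallel-size=2        # shard the model across two GPUs
  - --quantization=fp8              # quantize weights to reduce GPU memory
  - --gpu-memory-utilization=0.90   # cap the fraction of GPU memory used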
Slow inference performance
Symptoms: High latency and slow response times
Solutions:
Check GPU utilization metrics
Optimize model with TensorRT-LLM
Implement request batching
Review network connectivity between nodes
Auto scaling problems
Symptoms: Pods failing to start or scale properly
Solutions:
Verify resource quotas and limits
Check node resource availability
Review affinity and anti-affinity rules
Monitor cluster autoscaler logs
Diagnostic commands
Useful commands for troubleshooting:
# Check pod status and resource usage
kubectl top pods -l app=deepseek-inference
# View detailed pod information
kubectl describe pod -l app=deepseek-inference
# Check GPU utilization
kubectl exec -it <pod-name> -- nvidia-smi
# Monitor logs
kubectl logs -f -l app=deepseek-inference --tail=100
# Check HPA status
kubectl get hpa deepseek-hpa
# Verify service endpoints
kubectl get endpoints deepseek-inference-service