This topic describes best practices for deploying the DeepSeek full version model across multiple nodes in Alibaba Cloud Container Service for Kubernetes (ACK) to achieve optimal performance, scalability, and reliability.
Background information
DeepSeek Overview
DeepSeek is a series of large language models. The full version models support both hybrid-thinking and thinking-only modes, providing advanced reasoning capabilities for complex tasks. These models are particularly suitable for scenarios requiring deep analytical thinking and problem-solving.
Supported DeepSeek models:
Hybrid-thinking models (thinking mode disabled by default): deepseek-v3.2, deepseek-v3.2-exp, deepseek-v3.1
Thinking-only models: deepseek-r1, deepseek-r1-0528, deepseek-r1 distilled models
DeepSeek models are currently only available in the Beijing region.
ACK distributed deployment benefits
Deploying DeepSeek models across multiple nodes in ACK provides several advantages:
Scalability: Dynamically scale compute resources based on workload demands
High availability: Distribute workloads across multiple nodes to eliminate single points of failure
Resource optimization: Efficiently utilize GPU resources across the cluster
Performance: Reduce latency through intelligent load balancing and proximity scheduling
Prerequisites
Create an ACK managed cluster (Pro edition) with GPU nodes. The cluster Kubernetes version must be 1.22 or later.
Ensure your cluster contains nodes with compatible NVIDIA GPUs (recommended: A10, V100, or A100). Use NVIDIA driver version 525 for optimal performance.
Ensure you have sufficient permissions to create and manage Kubernetes resources, including deployments, services, and persistent volumes.
Configure Object Storage Service (OSS) for model storage and caching. Create a bucket to store model weights and configuration files.
Set up proper network policies and security groups to allow inter-node communication and external access to the inference service.
Architecture design principles
Multi-node deployment strategy
When deploying DeepSeek models across multiple nodes, consider the following architectural approaches:
Horizontal Pod Autoscaler (HPA) approach
Deploy multiple replicas of the DeepSeek inference service across different nodes, allowing Kubernetes to automatically scale based on CPU, memory, or custom metrics.
Node affinity and anti-affinity
Use node affinity rules to ensure optimal distribution of DeepSeek pods across different nodes and availability zones:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: NotIn
              values:
                - node-1
                - node-2
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: deepseek-inference
          topologyKey: kubernetes.io/hostname

Resource management
Configure appropriate resource requests and limits for DeepSeek pods:
resources:
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "1"

Deployment
Step 1: Prepare the DeepSeek model
Download the DeepSeek model files to your OSS bucket:
#!/bin/bash
# Download the DeepSeek model from ModelScope
git lfs install
git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-V3.2.git /tmp/deepseek-v3.2
# Upload to OSS
ossutil cp -r /tmp/deepseek-v3.2 oss://your-bucket/models/deepseek-v3.2/

Optimize the model for inference using TensorRT-LLM (optional, but recommended for better performance):
cd /workspace/tensorrt_llm/examples/deepseek
python convert_checkpoint.py \
  --model_dir /models/deepseek-v3.2 \
  --output_dir /models/deepseek-v3.2-trtllm \
  --dtype float16
Step 2: Configure storage and caching
Create a Fluid Dataset for efficient model loading:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: deepseek-models
spec:
  mounts:
    - mountPoint: oss://your-bucket/models/
      name: models
      path: /
      options:
        fs.oss.endpoint: oss-cn-beijing.aliyuncs.com
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: accessKeySecret

Configure JindoRuntime for caching:
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: deepseek-models
spec:
  replicas: 3
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 50Gi
        high: "0.95"
        low: "0.7"
Step 3: Deploy the inference service
Create a Kubernetes deployment with multiple replicas:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: deepseek-inference
  template:
    metadata:
      labels:
        app: deepseek-inference
    spec:
      containers:
        - name: deepseek-server
          image: registry.cn-beijing.aliyuncs.com/your-registry/deepseek-inference:latest
          ports:
            - containerPort: 8000
          env:
            - name: MODEL_PATH
              value: "/models/deepseek-v3.2"
            - name: ENABLE_THINKING
              value: "true"
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: deepseek-models
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: deepseek-inference
                topologyKey: kubernetes.io/hostname

Create a load balancer service:
apiVersion: v1
kind: Service
metadata:
  name: deepseek-inference-service
spec:
  selector:
    app: deepseek-inference
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer
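Once the load balancer has an external IP, you can verify the endpoint with a simple client. The sketch below builds a chat request body; the OpenAI-compatible `/v1/chat/completions` path and the `chat_template_kwargs` field are assumptions about the serving image and may differ for your server.

```python
import json

def build_chat_request(model: str, prompt: str, enable_thinking: bool = True) -> str:
    """Build an OpenAI-compatible chat-completions request body.

    The field names below are assumptions about the serving image;
    adjust them if your server uses a different schema.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        # Hypothetical request-level switch mirroring the ENABLE_THINKING env var.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return json.dumps(payload)

body = build_chat_request("deepseek-v3.2", "Explain pod anti-affinity in one sentence.")
# Send it with e.g.:
#   curl http://<EXTERNAL-IP>/v1/chat/completions \
#     -H 'Content-Type: application/json' -d "$BODY"
```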
Step 4: Configure auto scaling
Configure Horizontal Pod Autoscaler:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Set up custom metrics for GPU utilization (optional):
# Custom metric for GPU utilization
- type: Pods
  pods:
    metric:
      name: gpu_utilization
    target:
      type: AverageValue
      averageValue: "75"
Monitoring and optimization
Monitoring setup
Implement comprehensive monitoring for your DeepSeek deployment:
Prometheus metrics for CPU, memory, and GPU utilization
Grafana dashboards for real-time visualization
Centralized logging with Fluentd or Logstash
Distributed tracing with Jaeger for request flow analysis
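If the Prometheus Operator runs in your cluster, the inference pods' metrics endpoint can be scraped with a ServiceMonitor. This is a sketch: it assumes the Service from Step 3 carries the app: deepseek-inference label (add it to the Service metadata if it does not) and that the serving image exposes a named metrics port at /metrics.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: deepseek-inference
spec:
  selector:
    matchLabels:
      app: deepseek-inference
  endpoints:
    - port: metrics   # assumes the Service declares a port named "metrics"
      path: /metrics
      interval: 30s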
Performance optimization techniques
Apply these optimization strategies for better performance:
Request batching
Configure batching to improve throughput:
env:
  - name: MAX_BATCH_SIZE
    value: "8"
  - name: BATCH_TIMEOUT_MS
    value: "100"

Model caching strategies
Use multi-tier caching to reduce model loading time:
RAM-based caching for frequently accessed model parts
NVMe SSD caching for larger model components
Redis or Memcached for cross-node sharing
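The RAM and NVMe tiers can be expressed directly in the JindoRuntime from Step 2 by stacking tieredstore levels. The SSD path and quotas below are illustrative values, not measured requirements.

apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: deepseek-models
spec:
  replicas: 3
  tieredstore:
    levels:
      - mediumtype: MEM        # hot tier: frequently accessed model parts
        volumeType: emptyDir
        path: /dev/shm
        quota: 50Gi
      - mediumtype: SSD        # warm tier: larger model components
        path: /mnt/nvme/cache  # illustrative NVMe mount path on the node
        quota: 500Gi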
Connection pooling
Implement connection pooling to reduce overhead:
env:
  - name: CONNECTION_POOL_SIZE
    value: "50"
  - name: KEEP_ALIVE_TIMEOUT
    value: "300"

Best practices
Resource planning
GPU allocation: Allocate dedicated GPUs per DeepSeek instance to avoid resource contention
Memory sizing: Reserve at least 2x model size in RAM for optimal performance
CPU cores: Allocate 4-8 CPU cores per GPU instance for preprocessing tasks
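The 2x rule of thumb above can be turned into a quick sizing calculation. The sketch below estimates a RAM request from parameter count and weight precision; the example model size is hypothetical.

```python
def recommended_ram_gib(params_billions: float, bytes_per_param: int = 2,
                        headroom_factor: float = 2.0) -> float:
    """Estimate the RAM request for a model using the 'at least 2x
    model size' rule of thumb from this guide.

    bytes_per_param: 2 for FP16/BF16 weights, 1 for INT8, and so on.
    """
    model_size_gib = params_billions * 1e9 * bytes_per_param / (1024 ** 3)
    return model_size_gib * headroom_factor

# A hypothetical 7B-parameter distilled model in FP16:
# ~13 GiB of weights -> request ~26 GiB of RAM.
print(round(recommended_ram_gib(7), 1))  # -> 26.1
```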
Scaling strategies
Gradual scaling: Scale incrementally based on actual load patterns rather than peak estimates
Zone distribution: Distribute replicas across multiple availability zones for fault tolerance
Warm-up periods: Implement gradual traffic increase during scaling events
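Gradual scaling and warm-up can be enforced through the HPA's behavior field (part of the autoscaling/v2 API), which rate-limits how quickly replicas are added or removed. The fragment below goes under spec in the HorizontalPodAutoscaler from Step 4; the windows are illustrative.

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Pods
        value: 1              # add at most one replica per minute
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 minutes before scaling in
    policies:
      - type: Percent
        value: 50
        periodSeconds: 60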
Cost optimization
Spot instances: Use spot instances for non-critical inference workloads
Autoscaling boundaries: Set appropriate min/max limits to balance performance and cost
Idle shutdown: Implement automatic shutdown for low-usage periods
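Spot capacity can be targeted with a toleration plus preferred node affinity on the pod spec. The node-type: spot label and matching taint below are hypothetical; use whatever label and taint you apply to your spot node pool.

tolerations:
  - key: node-type        # hypothetical taint applied to spot nodes
    operator: Equal
    value: spot
    effect: NoSchedule
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: node-type   # hypothetical label on spot nodes
              operator: In
              values:
                - spot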
Security considerations
Network policies: Restrict pod-to-pod communication using Kubernetes Network Policies
TLS encryption: Enable TLS for all external communications
RBAC: Implement Role-Based Access Control for cluster resources
Secrets management: Use Kubernetes Secrets or external vaults for credential management
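As a starting point for the network policy recommendation, the sketch below admits ingress to the inference pods only from pods labeled role: gateway, and only on port 8000; the gateway label is hypothetical and should match your actual API gateway pods.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deepseek-inference-ingress
spec:
  podSelector:
    matchLabels:
      app: deepseek-inference
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: gateway   # hypothetical label on your API gateway pods
      ports:
        - protocol: TCP
          port: 8000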
Troubleshooting
Common issues and solutions
GPU out of memory errors
Symptoms: CUDA out of memory errors during inference
Solutions:
Reduce batch size
Increase GPU memory allocation
Enable model quantization
Use model parallelism across multiple GPUs
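If your serving image is based on vLLM, the quantization and model-parallelism remedies map to its standard engine flags. The container args below are a sketch, since the actual entrypoint depends on your image; note that --tensor-parallel-size=2 also requires requesting nvidia.com/gpu: "2" in the pod's resources.

args:
  - --model=/models/deepseek-v3.2
  - --tensor-parallel-size=2        # shard the model across two GPUs
  - --quantization=fp8              # quantize weights to reduce GPU memory
  - --gpu-memory-utilization=0.90   # cap the fraction of GPU memory used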
Slow inference performance
Symptoms: High latency and slow response times
Solutions:
Check GPU utilization metrics
Optimize model with TensorRT-LLM
Implement request batching
Review network connectivity between nodes
Auto scaling problems
Symptoms: Pods failing to start or scale properly
Solutions:
Verify resource quotas and limits
Check node resource availability
Review affinity and anti-affinity rules
Monitor cluster autoscaler logs
Diagnostic commands
Useful commands for troubleshooting:
# Check pod status and resource usage
kubectl top pods -l app=deepseek-inference
# View detailed pod information
kubectl describe pod -l app=deepseek-inference
# Check GPU utilization
kubectl exec -it <pod-name> -- nvidia-smi
# Monitor logs
kubectl logs -f -l app=deepseek-inference --tail=100
# Check HPA status
kubectl get hpa deepseek-hpa
# Verify service endpoints
kubectl get endpoints deepseek-inference-service