DeepSeek-R1 excels at math, coding, and reasoning tasks. DeepSeek also open-sourced six dense models distilled from DeepSeek-R1 and based on the Llama and Qwen architectures. This topic describes how to fine-tune DeepSeek-R1-Distill-Qwen-7B in PAI Model Gallery.
Supported models
Model Gallery supports LoRA supervised fine-tuning (SFT) for all six distill models. The following table lists the minimum resource configuration required to train each model with the default hyperparameters:

| Distill model | Base model | Training method | Minimum configuration |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | LoRA supervised fine-tuning | 1 x A10 (24 GB GPU memory) |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | LoRA supervised fine-tuning | 1 x A10 (24 GB GPU memory) |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | LoRA supervised fine-tuning | 1 x A10 (24 GB GPU memory) |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | LoRA supervised fine-tuning | 1 x GU8IS (48 GB GPU memory) |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | LoRA supervised fine-tuning | 2 x GU8IS (48 GB GPU memory) |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | LoRA supervised fine-tuning | 8 x GU100 (80 GB GPU memory) |
Train the model
- Go to the Model Gallery page:
  - Log on to the PAI console.
  - In the upper-left corner, select a region.
  - In the left pane, click Workspaces. On the Workspaces page, click the name of the target workspace.
  - In the left pane, choose QuickStart > Model Gallery.
- On the Model Gallery page, click the DeepSeek-R1-Distill-Qwen-7B model card to go to the model details page. The details page describes how to deploy and train the model, the required SFT data format, and how to invoke the deployed service.
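The model details page documents the exact SFT data format that the training job expects. As an illustration only, a common instruction-tuning layout is a JSON file of instruction/output pairs; the field names and file name below are assumptions, so verify them against the model card before uploading the data to OSS.

```python
# Illustrative sketch only: write a tiny SFT dataset in a common
# instruction/output JSON layout. The schema shown here is an assumption;
# the authoritative format is documented on the model details page.
import json

records = [
    {
        "instruction": "Solve step by step: what is 17 * 24?",
        "output": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    },
    {
        "instruction": "Write a Python function that reverses a string.",
        "output": "def reverse_string(s):\n    return s[::-1]",
    },
]

with open("train.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```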

- Click Train in the upper-right corner and configure the following key parameters:
  - Dataset Configuration: Upload your prepared training data (in the format described on the model details page; see the sketch above) to an OSS bucket and select it here.
  - Computing Resources: The minimum configuration for each model is listed in Supported models. Increasing memory-sensitive hyperparameters such as per_device_train_batch_size or max_length may require more GPU memory than the minimum configuration provides.
  - Hyperparameters: Adjust the following LoRA SFT hyperparameters based on your data and available resources. For details, see Guide to fine-tuning LLMs.
| Hyperparameter | Type | Default value (7B model) | Description |
| --- | --- | --- | --- |
| learning_rate | float | 5e-6 | Learning rate, which controls the magnitude of each weight update. |
| num_train_epochs | int | 6 | Number of training epochs, that is, full passes over the training dataset. |
| per_device_train_batch_size | int | 2 | Number of samples processed per GPU in each step. Larger values improve throughput but consume more GPU memory. |
| gradient_accumulation_steps | int | 2 | Number of steps over which gradients are accumulated before a weight update. The effective batch size is per_device_train_batch_size x gradient_accumulation_steps x number of GPUs. |
| max_length | int | 1024 | Maximum number of tokens per training sample. Samples that exceed this limit are discarded. |
| lora_rank | int | 8 | Rank (dimension) of the LoRA adapter matrices. |
| lora_alpha | int | 32 | LoRA scaling factor. |
| lora_dropout | float | 0 | Dropout rate applied to the LoRA layers during training to prevent overfitting. |
| lorap_lr_ratio | float | 16 | LoRA+ learning rate ratio λ = ηB/ηA, which applies different learning rates to adapter matrices A and B. Set to 0 to use standard LoRA. |
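For orientation, the LoRA-specific hyperparameters above correspond to fields of the Hugging Face PEFT LoraConfig. The sketch below is not the code that Model Gallery runs; it only illustrates what lora_rank, lora_alpha, and lora_dropout control. The target_modules list is an assumption based on typical Qwen-style attention projections, and lorap_lr_ratio (LoRA+) has no direct LoraConfig equivalent.

```python
# Rough PEFT equivalent of the LoRA hyperparameters above. This is an
# illustration under stated assumptions, not the Model Gallery trainer.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                 # lora_rank: dimension of the low-rank adapter matrices
    lora_alpha=32,       # scaling factor; the adapter update is scaled by lora_alpha / r
    lora_dropout=0.0,    # dropout applied inside the LoRA layers
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: typical attention projections
    task_type="CAUSAL_LM",
)
```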
- Click Train. The training page shows the job status and logs.
  - If training succeeds, the model is automatically registered in AI Asset Management > Models and can be deployed from there. See Register and manage models.
  - If training fails, click the icon next to Status or check the Task log tab for error details. For common errors, see FAQ and Model Gallery FAQ.
  - The Metric Curve tab shows how the training loss evolves.

- After training completes, click Deploy to deploy the fine-tuned model as an EAS service. The service is invoked in the same way as the original distill model. For details, see the model details page or One-click deployment of DeepSeek-V3 and DeepSeek-R1 models.
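Once the EAS service is running, it can be called in the same way as the original distill model. The sketch below assumes an OpenAI-compatible chat completions endpoint; the endpoint URL, token, and model name are placeholders, so copy the actual invocation information from the EAS service details in the console.

```python
# Minimal sketch for calling the deployed EAS service, assuming it exposes an
# OpenAI-compatible API. The URL, token, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="<EAS_SERVICE_URL>/v1",   # placeholder: endpoint from the EAS invocation details
    api_key="<EAS_SERVICE_TOKEN>",     # placeholder: token from the EAS invocation details
)

response = client.chat.completions.create(
    model="DeepSeek-R1-Distill-Qwen-7B",  # placeholder: use the model name shown by the service
    messages=[{"role": "user", "content": "Explain the Pythagorean theorem step by step."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```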

Billing
Model Gallery training jobs run on Deep Learning Containers (DLC) and are billed by job duration. Resources stop automatically when the job ends. See Billing of Deep Learning Containers (DLC).
FAQ
Why does my Model Gallery training job fail?
- Cause: max_length is too small. Samples that exceed this limit are discarded, and if too much data is discarded, the training or validation dataset may become empty, which causes the job to fail.
  Solution: Increase max_length, or shorten your samples so that they fit within the limit. To estimate how many samples exceed the limit, see the token-length check sketch after this list.
- Error: failed to compose dlc job specs, resource limiting triggered, you are trying to use more GPU resources than the threshold
  Solution: Training is limited to 2 GPUs in use at the same time. Wait for ongoing training jobs to finish, or submit a ticket to request a higher quota.
- Error: the specified vswitch vsw-**** cannot create the required resource ecs.gn7i-c32g1.8xlarge, zone not match
  Solution: The requested instance type is not available in the zone of the specified vSwitch. Try one of the following:
  - Leave the vSwitch field empty. DLC automatically selects a vSwitch based on resource inventory.
  - Switch to a different instance type.
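To estimate how many samples would be discarded by a given max_length before submitting a job, you can tokenize the data locally. The sketch below assumes the JSON instruction/output layout from the data-format example above and uses the public DeepSeek-R1-Distill-Qwen-7B tokenizer; adjust the field names to match your actual data.

```python
# Sketch: count how many samples exceed max_length, assuming an
# instruction/output JSON layout. Field names and file name are assumptions.
import json
from transformers import AutoTokenizer

MAX_LENGTH = 1024  # keep in sync with the max_length hyperparameter

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

with open("train.json", "r", encoding="utf-8") as f:
    records = json.load(f)

too_long = sum(
    1
    for record in records
    if len(tokenizer(record["instruction"] + record["output"])["input_ids"]) > MAX_LENGTH
)
print(f"{too_long} of {len(records)} samples exceed max_length={MAX_LENGTH}")
```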
How do I download the trained model from Model Gallery?
Set the model output path to an OSS directory when creating the training job, then download the model from OSS.
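If the output path points to OSS, the trained model files can be pulled with the ossutil command-line tool or the OSS Python SDK. The sketch below uses the oss2 SDK; the endpoint, bucket name, access keys, and output prefix are placeholders.

```python
# Sketch: download all trained model files under an OSS prefix with the oss2
# SDK. Endpoint, bucket, credentials, and prefix are placeholders.
import os
import oss2

auth = oss2.Auth("<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<your-bucket>")

prefix = "model-output/"      # placeholder: the OSS output path set for the training job
local_dir = "downloaded_model"

for obj in oss2.ObjectIterator(bucket, prefix=prefix):
    if obj.key.endswith("/"):
        continue  # skip directory placeholder objects
    local_path = os.path.join(local_dir, obj.key[len(prefix):])
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    bucket.get_object_to_file(obj.key, local_path)
```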

How can I improve poor model performance after fine-tuning?
Try the following approaches:
- Use a larger model with better baseline performance, such as a DeepSeek or Qwen3 model with a higher parameter count.
- Refine your prompts.
- Increase max_tokens at inference time so that responses are not truncated.
- Break complex tasks into smaller subtasks and have the model handle them separately, as sketched after this list.
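For the last point, one way to decompose a task is to send each subtask as its own request while carrying the previous answers forward as context. The sketch below reuses the placeholder OpenAI-compatible EAS client from the deployment step; the endpoint, token, and model name remain placeholders.

```python
# Sketch: split a complex request into subtasks and query the deployed service
# step by step, feeding each answer back as context for the next subtask.
from openai import OpenAI

client = OpenAI(base_url="<EAS_SERVICE_URL>/v1", api_key="<EAS_SERVICE_TOKEN>")  # placeholders

subtasks = [
    "List the assumptions needed to estimate a coffee shop's monthly revenue.",
    "Using those assumptions, show the revenue calculation step by step.",
]

messages = []
for task in subtasks:
    messages.append({"role": "user", "content": task})
    result = client.chat.completions.create(
        model="DeepSeek-R1-Distill-Qwen-7B",  # placeholder
        messages=messages,
        max_tokens=512,
    )
    answer = result.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(answer)
```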