Platform For AI: Fine-tune a Llama 3 8B model

Last Updated: Sep 27, 2024

Data Science Workshop (DSW) of Platform for AI (PAI) is an interactive modeling platform on which you can fine-tune custom models and optimize model performance. This topic describes how to fine-tune a Llama 3 model in DSW to adapt the model to specific scenarios and improve its performance on specific tasks. The Meta-Llama-3-8B-Instruct model is used as an example.

Background information

Llama 3 is the model family that Meta released in April 2024 as the latest addition to the Llama series. Llama 3 is trained on more than 15 trillion tokens, which is approximately 7 times the size of the Llama 2 dataset. Llama 3 supports a context length of 8K tokens and uses an improved tokenizer with a vocabulary of 128K tokens, which enables more precise and efficient processing of complex contexts and technical terms.

Llama 3 provides pretrained and instruction-tuned models in 8B and 70B sizes that are suitable for various scenarios.

  • 8B

    Llama 3 8B is suitable for efficient deployment and development based on consumer-grade GPUs. You can use Llama 3 8B in scenarios that require high response speed and cost-effectiveness.

    • Meta-Llama-3-8B: pretrained version

    • Meta-Llama-3-8B-Instruct: instruction-tuned version

  • 70B

    Llama 3 70B leverages a large-scale parameter size and is suitable for large-scale AI applications, advanced and complex tasks, and performance optimization tasks.

    • Meta-Llama-3-70B: pretrained version

    • Meta-Llama-3-70B-Instruct: instruction-tuned version

Prerequisites

  • A workspace is created. For more information, see Create a workspace.

  • A DSW instance is created. Take note of the following key parameters. For more information, see Create a DSW instance.

    • Instance type: We recommend that you use an instance type with at least 16 GB of GPU memory, such as a V100 GPU.

    • Python: Python 3.9 or later.

    • Image: In this example, the following image URL is used: dsw-registry-vpc.REGION.cr.aliyuncs.com/pai-training-algorithm/llm_deepspeed_peft:v0.0.3. Replace REGION with the ID of the region in which your DSW instance resides, such as cn-hangzhou or cn-shanghai. For example, if your DSW instance resides in the China (Hangzhou) region, the image URL is dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-training-algorithm/llm_deepspeed_peft:v0.0.3. The following table lists the region IDs.

      Region              Region ID
      China (Hangzhou)    cn-hangzhou
      China (Shanghai)    cn-shanghai
      China (Beijing)     cn-beijing
      China (Shenzhen)    cn-shenzhen

  • Before you use the Llama 3 model, read the official Meta license.

    Note

    If you cannot access the web page, configure a proxy and try again.

Step 1: Download the model

Method 1: Download the model in DSW

  1. Go to the DSW development environment.

    1. Log on to the PAI console.

    2. In the top navigation bar, select the region in which the DSW instance resides.

    3. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the default workspace.

    4. In the left-side navigation pane, choose Model Development and Training > Interactive Modeling (DSW).

    5. Click Open in the Actions column of the DSW instance that you want to manage to go to the development environment of the DSW instance.

  2. On the Launcher tab, click Python 3 in the Notebook pane of the Quick Start section.

  3. Run the following code in the Notebook to download the model file. The system automatically selects an appropriate download address and downloads the model file to the current directory.

    # Install the required libraries and download the model from the ModelScope hub.
    ! pip install modelscope==1.12.0 transformers==4.37.0
    from modelscope.hub.snapshot_download import snapshot_download
    # Download the model snapshot to the current directory.
    snapshot_download('LLM-Research/Meta-Llama-3-8B-Instruct', cache_dir='.', revision='master')
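
    With cache_dir='.', the model snapshot is typically saved to ./LLM-Research/Meta-Llama-3-8B-Instruct/ in the current working directory, which is the model path that the fine-tuning command in Step 3 expects. You can run the following check in the Notebook to confirm the download location (a minimal sketch):

    import os
    # List the downloaded model files to confirm the path that is used in later steps.
    print(os.listdir('./LLM-Research/Meta-Llama-3-8B-Instruct/'))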

Method 2: Download the model in Meta

Go to the Meta website to apply for the model.

Note

If you cannot access the web page, configure a proxy and try again.

Step 2: Prepare a dataset

In this example, an English poetry dataset is used to fine-tune the Llama 3 model to improve the poetic expressiveness of the generated poems. Run the following command in the Notebook of DSW to download the training dataset required by the model:

!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/tutorial/llm_instruct/en_poetry_train.json

You can also prepare a dataset that suits your business scenario in the same format as the sample training dataset. Fine-tuning a large language model (LLM) on a task-specific dataset improves the response accuracy of the model on those tasks.
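
To see the format that your own dataset must follow, you can inspect the sample dataset in the Notebook. The following snippet is a minimal sketch that assumes the file is a single JSON array; adjust it if your data uses a different layout:

import json

# Load the sample training dataset and print one record to see the expected fields.
with open('en_poetry_train.json') as f:
    data = json.load(f)
print(len(data))   # number of training samples
print(data[0])     # first record; follow the same structure for a custom dataset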

Step 3: Fine-tune the model

Lightweight LoRA training

In this example, the /ml/code/sft.py training script is used to perform lightweight Low-Rank Adaptation (LoRA) training on the model. After training, the system quantizes the model parameters to reduce the GPU memory required for inference.

The accelerate launch command starts the specified Python script with the provided parameters and runs the training on the computing resources that are specified in the multi_gpu.yaml configuration file.

! accelerate launch --num_processes 1 --config_file /ml/code/multi_gpu.yaml /ml/code/sft.py \
    --model_name  ./LLM-Research/Meta-Llama-3-8B-Instruct/ \
    --model_type llama \
    --train_dataset_name en_poetry_train.json \
    --num_train_epochs 3 \
    --batch_size 8 \
    --seq_length 128 \
    --learning_rate 5e-4 \
    --lr_scheduler_type linear \
    --target_modules k_proj o_proj q_proj v_proj \
    --output_dir lora_model/ \
    --apply_chat_template \
    --use_peft \
    --load_in_4bit \
    --peft_lora_r 32 \
    --peft_lora_alpha 32 

The following section describes the parameters used in this example. Modify the parameters based on your business requirements.

  • The accelerate launch command is used to launch and manage deep learning training scripts on multiple GPUs.

    • num_processes: the number of parallel processes. In this example, this parameter is set to 1 to disable multi-process parallel processing.

    • config_file /ml/code/multi_gpu.yaml: the path of the configuration file.

    • /ml/code/sft.py: the path of the Python script that you want to run.

  • To run the /ml/code/sft.py script, configure the following parameters:

    • --model_name ./LLM-Research/Meta-Llama-3-8B-Instruct/: the path of the pretrained model.

    • --model_type: the type of the model. In this example, Llama is used.

    • --train_dataset_name en_poetry_train.json: the path of the training dataset that is downloaded in Step 2.

    • --num_train_epochs: the number of training epochs. In this example, set the parameter to 3.

    • --batch_size: the size of the batch. In this example, set the parameter to 8.

    • --seq_length: the length of the sequence. In this example, set the parameter to 128.

    • --learning_rate: the learning rate. In this example, set the parameter to 5e-4, which is equal to 0.0005.

    • --lr_scheduler_type: the type of the learning rate scheduler. In this example, set the parameter to linear.

    • --target_modules: the model sections on which you want to focus during fine-tuning. In this example, set the parameter to k_proj o_proj q_proj v_proj.

    • --output_dir: the output directory in which the fine-tuned model is saved. In this example, set the parameter to lora_model/.

    • --apply_chat_template: Apply the chat template of the model to the training data during training.

    • --use_peft: Use Parameter-Efficient Fine-Tuning (PEFT) during training.

    • --load_in_4bit: Load the model weights with 4-bit precision to reduce memory consumption.

    • --peft_lora_r: the rank of the LoRA matrices. In this example, set the parameter to 32. For how the LoRA parameters fit together, see the sketch after this list.

    • --peft_lora_alpha: the alpha value of LoRA. In this example, set the parameter to 32.
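
For reference, the PEFT-related parameters above correspond roughly to the following peft LoraConfig. This is a minimal sketch for illustration, not the actual contents of /ml/code/sft.py, which may construct the configuration differently:

from peft import LoraConfig

# Hypothetical mapping of the command-line parameters to a LoRA configuration.
lora_config = LoraConfig(
    r=32,                # --peft_lora_r
    lora_alpha=32,       # --peft_lora_alpha
    target_modules=["k_proj", "o_proj", "q_proj", "v_proj"],  # --target_modules
    task_type="CAUSAL_LM",
)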

Fuse LoRA weights with the model

Run the following command to fuse the LoRA weights with the Llama 3 model:

! RANK=0 python /ml/code/convert.py \
    --model_name ./LLM-Research/Meta-Llama-3-8B-Instruct/ \
    --model_type llama \
    --output_dir trained_model/ \
    --adapter_dir lora_model/

The following section describes the parameters and values used in this example:

  • RANK=0: The RANK environment variable is used to specify the sequence number of the current process among all processes in distributed training. A value of 0 specifies that the current process is an independent process or serves as the main process in distributed training.

  • python /ml/code/convert.py: Run the convert.py script to fuse the LoRA weights into the base model.

  • --model_name ./LLM-Research/Meta-Llama-3-8B-Instruct/: the path to the model.

  • --model_type llama: the type of the model. In this example, Llama is used.

  • --output_dir trained_model/: the output directory in which the converted model and weights are saved.

  • --adapter_dir lora_model/: the directory in which the LoRA adapter weights reside.
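
Conceptually, fusing the LoRA adapter into the base model corresponds to the following steps. This is a minimal sketch based on the peft library and is not the actual contents of /ml/code/convert.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the trained LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    "./LLM-Research/Meta-Llama-3-8B-Instruct/", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base_model, "lora_model/")

# Merge the adapter weights into the base weights and save the fused model.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("trained_model/")
AutoTokenizer.from_pretrained("./LLM-Research/Meta-Llama-3-8B-Instruct/").save_pretrained("trained_model/")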

Step 4: Perform model inference

Run the following code in the Notebook to perform model inference and verify the fine-tuning result. In this example, the model is asked to generate a poem about spring:

import torch, transformers

# Load the fine-tuned model. To compare with the original model, use the commented-out path instead.
# model_id = "./LLM-Research/Meta-Llama-3-8B-Instruct/"
model_id = "./trained_model/"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "user", "content": "Write a poem on a topic 'spring' "},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)

# Stop generation at the end-of-sequence token or the end-of-turn token of the chat template.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=1024,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])

The following section provides a sample model response, which indicates that the model can generate high-quality poems after fine-tuning.

Here's a poem on the topic of "Spring":

As winter's chill begins to fade,
The earth awakens from its shade,
And spring's sweet breath begins to blow,
Bringing life to all that's cold and slow.

The trees regain their vibrant hue,
And flowers bloom, both old and new,
Their petals dancing in the breeze,
As sunshine warms the world with ease.

The air is filled with sweet perfume,
As blossoms burst forth in their room,
And robins sing their morning song,
As spring's awakening is strong.

The world is fresh, and new, and bright,
As spring's warm light begins to take flight,
And all around, new life unfolds,
As winter's grip begins to grow old.

So let us bask in spring's warm rays,
And let our spirits soar and sway,
For in this season, we're reborn,
And all around, new life is sworn.

I hope you enjoy it!

Step 5: Deploy the model

You can upload the fine-tuned model weights to Object Storage Service (OSS) and deploy the fine-tuned Llama 3 model in Elastic Algorithm Service (EAS) of PAI. For more information, see Deploy LLM applications in EAS.
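
For example, you can copy the fused weights in trained_model/ to an OSS bucket with the ossutil command-line tool before you create the EAS service. The bucket name and path below are placeholders, and the command assumes that ossutil is installed and configured in your environment:

! ossutil cp -r trained_model/ oss://<your-bucket>/llama3-8b-finetuned/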

Appendix: Use Llama 3 in DSW Gallery

DSW Gallery provides Notebook use cases for Llama 3. You can run these use cases on a DSW instance based on your business requirements. For more information, see Notebook Gallery.

References

For more information about the versions of ChatLLM-WebUI, see Release notes for ChatLLM WebUI.