Use QuickStart to Fine-Tune and Deploy Llama 2 Models

This article uses llama-2-7b-chat as an example to describe how to use QuickStart to deploy a model as a service in Elastic Algorithm Service (EAS) and call the service.

The QuickStart module in the Platform for AI (PAI) console provides a no-code method to train and deploy machine learning models. QuickStart enables you to deploy Llama 2 models as an online service that can be accessed on a web UI or by calling APIs. QuickStart also supports customization of Llama 2 models before deployment. You can use custom datasets to fine-tune Llama 2 models based on your business requirements.

Background Information

Llama 2 is a collection of generative text models that are developed by Meta and primarily pre-trained on English datasets. Llama 2 models come in three parameter sizes: 7 billion (7b), 13 billion (13b), and 70 billion (70b). Each size has a variant called Llama-2-chat, which is fine-tuned to enhance performance in dialogue scenarios.

The llama-2-7b-chat model provided by PAI is an adaptation of the Llama-2-7b-chat model that is available on Hugging Face. llama-2-7b-chat is a large language model (LLM) that is built on the Transformer architecture and trained by using diverse open-source datasets. You can use the model in a wide range of common English dialogue scenarios.

Limits

QuickStart is available in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Ulanqab), and Singapore.

Note: To use QuickStart in the China (Ulanqab) region, contact your account manager.

Billing

If you use custom datasets to fine-tune a model in QuickStart, you are charged for the Object Storage Service (OSS) storage that is used to store the fine-tuned model. For more information, see Billing overview.
You are charged for model deployment in EAS. If you use the fine-tuning feature, you are also charged for model training in Deep Learning Containers (DLC). For more information, see Billing of EAS and Billing of general computing resources.

Prerequisites

PAI is activated and the default workspace is created. For more information, see Activate PAI and create a default workspace.
OSS is activated and an OSS bucket is created in the same region as your PAI workspace. For more information, see Get started with OSS.
The terms and conditions of using Llama models are read and accepted.

Procedure

The llama-2-7b-chat model is suitable for most general-purpose scenarios that do not require specialized domain knowledge. If you want to tailor the model to your business scenario or add domain-specific knowledge to the model, you can fine-tune the model to improve performance.

Before you consider fine-tuning, note that LLMs including llama-2-7b-chat can also use basic knowledge that you provide during a conversation, without any retraining. Fine-tuning in QuickStart is based on Low-Rank Adaptation of Large Language Models (LoRA), which significantly reduces training costs and time compared with full-parameter fine-tuning.
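To give a rough sense of why LoRA is cheaper, the following sketch compares the number of trainable parameters for a single weight matrix under full fine-tuning and under LoRA. LoRA freezes the original matrix W and trains two small matrices B and A whose product B·A is added to W. The function name `lora_param_counts` is hypothetical, and the rank of 8 is an assumed, commonly used value rather than a documented QuickStart setting; 4096 is the hidden size of Llama 2 7B.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates every entry of the d_out x d_in matrix W.
    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in),
    so the effective weight becomes W + B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Illustrative values only: rank 8 is an assumption, not a QuickStart default.
full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank 8):    {lora:,} trainable parameters")
print(f"ratio:            {lora / full:.4%}")
```

For a 4096 x 4096 matrix at rank 8, this is 65,536 trainable parameters instead of 16,777,216, or roughly 0.39% of the original count.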

Deploy a Model without Fine-Tuning

1. Log on to the PAI console. In the left-side navigation pane, click QuickStart.

2. Select a workspace and click Enter Quick Start.

3. On the Model List page, enter llama-2-7b-chat in the search box and click Search.

Note: You can select another model based on your business requirements. The llama-2-7b-chat model requires at least 64 GiB of memory and 24 GiB of GPU memory. Make sure that the computing resources of the workspace are sufficient. Otherwise, the deployment may fail.

4. Click the llama-2-7b-chat card. On the model details page, click Deploy.

5. In the lower part of the Deploy panel, click Deploy.

6. In the Billing Notification dialog box, click OK.

7. The Service details tab appears. When the Status of the service changes to In operation, the model is deployed as an inference service.

8. After you deploy the model, you can call the inference service.

a) On the Service details tab, navigate to the Web Application section and click View Web App.

b) Call the inference service by using one of the following methods.

Use the web UI: On the Chat tab, enter a sentence in the dialog box and click Send to start a conversation.

Use APIs: At the bottom of the Chat tab, click Use via API to view API call details.
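As a minimal sketch of what an API call from your own code might look like, the following Python example builds an HTTP POST request with the standard library. Everything specific to the service is an assumption here: the endpoint URL, the token, the `Authorization` header, and the `{"prompt": ...}` payload shape are placeholders, and the helper names `build_chat_request` and `send` are hypothetical. Use the actual endpoint and request format shown in the Use via API panel.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, token: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that carries one chat prompt."""
    # Hypothetical payload shape; replace it with the one shown under "Use via API".
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder values, not a real endpoint or token.
    req = build_chat_request(
        "https://example.com/your-service-endpoint", "YOUR_TOKEN", "Hello!"
    )
    print(req.get_method(), req.full_url)
    # print(send(req))  # uncomment once the endpoint, token, and payload are real
```

Separating request construction from sending makes it possible to inspect the request before any network traffic occurs.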

Fine-Tune and Deploy a Model

1. Log on to the PAI console. In the left-side navigation pane, click QuickStart.

2. Select a workspace and click Enter Quick Start.

3. On the Model List page, enter llama-2-7b-chat in the search box and click Search.

4. Click the llama-2-7b-chat card. On the model details page, click Fine-tune.

5. Configure fine-tuning parameters.

By default, QuickStart uses the most common configurations of Computing resources and Hyper-parameters. You can modify the configurations based on your business requirements. The following table describes the parameters that are used in this example.

Parameter		Description
Job Configuration	Output Path	The path of the OSS bucket where the generated model file is stored. Note: If you configured a default storage path on the Workspace Details page, the path is automatically specified as the value of the Output Path parameter. For information about how to configure a default storage path for a workspace, see Manage workspaces.
Dataset Configuration	Training dataset	You can use the default training dataset that is provided by QuickStart. If you want to use a custom dataset, prepare the training data in the required format and upload the dataset by using the following options:
• Dataset Selection: For more information, see Create and manage datasets.
• OSS file or directory: For more information, see Get started with OSS.
The training data must be in the JSON format. Each data record consists of a question, an answer, and an ID, which are specified by the `instruction`, `output`, and `id` fields, respectively. Example:
`[
  { "instruction": "Does the following text contain a global topic? Why do Americans rarely hold military parades?", "output": "Yes", "id": 0 },
  { "instruction": "Does the following text contain a global topic? Breaking news! The timetable for the official vehicle reform of public institutions is released! ", "output": "No", "id": 1 }
]`
In addition to the training dataset, we recommend that you prepare a separate validation dataset. The validation dataset is used to evaluate the model performance and optimize the fine-tuning parameters.

6. Click Fine-tune to submit the training job.

7. In the Billing Notification dialog box, click OK.

8. The Task details tab appears. When the Job Status changes to Success, the model training is completed.

The trained model is saved in the OSS path that you specify. You can view the OSS path in the Output Path field under the Basic Information section.

Note: If you use the default training dataset and the default configurations of the hyperparameters and computing resources, the training job takes about 1.5 hours. If you use custom training data or configurations, the training duration may vary. Most training jobs complete within a few hours.

9. Deploy the fine-tuned model.

The procedure for deploying a fine-tuned model is the same as the procedure for deploying a model that is not fine-tuned. For more information, see the Deploy a Model without Fine-Tuning section of this article.
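Before you upload a custom dataset for the fine-tuning step, it can help to check it locally. The following sketch checks that each record has the `instruction`, `output`, and `id` fields described in the training dataset format, and then sets aside part of the data as a validation dataset. The helper names, the string-type check (inferred from the example records), the 10% validation fraction, and the fixed random seed are all assumptions made for illustration, not QuickStart requirements.

```python
import json
import random

REQUIRED_FIELDS = ("instruction", "output", "id")


def validate_records(records) -> list:
    """Return a list of problems; an empty list means the data looks valid."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list of records"]
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index}: not a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"record {index}: missing field '{field}'")
        # Assumption based on the example records: question and answer are strings.
        for field in ("instruction", "output"):
            if field in record and not isinstance(record[field], str):
                problems.append(f"record {index}: '{field}' is not a string")
    return problems


def split_records(records: list, validation_fraction: float = 0.1, seed: int = 0):
    """Shuffle a copy of the records and return (training, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    if len(shuffled) < 2:
        return shuffled, []
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# The two example records from the training dataset description.
sample = json.loads("""
[
  {"instruction": "Does the following text contain a global topic? Why do Americans rarely hold military parades?", "output": "Yes", "id": 0},
  {"instruction": "Does the following text contain a global topic? Breaking news! The timetable for the official vehicle reform of public institutions is released! ", "output": "No", "id": 1}
]
""")

print("problems:", validate_records(sample))
train, validation = split_records(sample)
print(len(train), "training record(s),", len(validation), "validation record(s)")
```

To check a real file, load it with `json.load` and pass the result to `validate_records` before you upload the file to OSS.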

What to Do Next

On the QuickStart page, you can click Job Management to view the details of the training jobs and model deployment.
