In the era of large language models (LLMs), model evaluation is a key part of model selection and optimization, and it is essential for accelerating the innovation and practice of Artificial Intelligence (AI). LLM evaluation in Platform for AI (PAI) supports various evaluation scenarios, such as comparative analysis of different base models, fine-tuned versions of the same model, and quantized versions of the same model. This article describes how to implement more comprehensive, accurate, and focused model evaluation based on specific dataset types for different user groups.
As model effectiveness has improved significantly in the era of LLMs, model evaluation has become increasingly important. Scientific and efficient evaluation helps developers measure and compare the performance of different models and guides them to select and optimize models accurately, which accelerates AI innovation and application development. This makes platform-based best practices for LLM evaluation especially valuable.
This article provides best practices for LLM evaluation in PAI to help AI developers evaluate LLMs. It helps you build an evaluation process that reflects the actual performance of a model and meets the specific requirements of your industry, so that you can achieve better results in the AI field.
The LLM evaluation platform of PAI allows you to compare model effectiveness across different LLM evaluation scenarios, such as comparisons between different base models, between fine-tuned versions of the same model, and between quantized versions of the same model.
This article describes how to use custom enterprise datasets together with commonly used public datasets, such as MMLU and C-Eval, to achieve more comprehensive, accurate, and focused model evaluation and to find LLMs that meet your business requirements. To cover the specific requirements of different development groups, the best practices address two types of users: enterprise developers and algorithm researchers.
LLM evaluation relies on QuickStart, which is free of charge. However, if you use QuickStart to evaluate models, you are charged for the Deep Learning Containers (DLC) evaluation jobs. For more information, see Billing of DLC.
If you select custom datasets for model evaluation, you are charged for using Object Storage Service (OSS). For more information, see Billing overview.
In most cases, enterprises accumulate a wealth of data in specific domains and need to make full use of this data when they optimize algorithms based on LLMs. Enterprise developers often evaluate open source or fine-tuned LLMs on custom datasets accumulated in specific domains to better understand how effective the LLMs are in those domains.
In model evaluation based on custom datasets, the LLM evaluation platform of PAI uses standard text matching methods from the NLP domain to calculate the matching score between the model output and the expected answer. A higher score indicates a better model. This evaluation method helps you determine whether a model is suitable for your business scenario based on scenario-specific data.
The following section highlights some key points during model evaluation. For more information, see Model evaluation.
1. Prepare a custom dataset.
[{"question": "Is it correct that Chinese invented papermaking?", "answer": "Yes"}]
[{"question": "Is it correct that Chinese invented gunpowder?", "answer": "Yes"}]
The question field is used to identify the question column, and the answer field is used to identify the answer column. A sketch that generates a file in this format follows.
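As a reference, the following is a minimal Python sketch that writes a custom dataset file in the format shown above. The file name and the sample records are illustrative assumptions; only the question and answer field names come from the format above. Check the Model evaluation documentation for the exact dataset requirements before you upload a file.

```python
import json

# Illustrative records; replace them with your own domain data.
records = [
    {"question": "Is it correct that Chinese invented papermaking?", "answer": "Yes"},
    {"question": "Is it correct that Chinese invented gunpowder?", "answer": "Yes"},
]

# Write one JSON array per line, matching the format shown above.
# "custom_eval_dataset.jsonl" is a placeholder file name; upload the resulting
# file when you create the custom dataset in the PAI console.
with open("custom_eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps([record], ensure_ascii=False) + "\n")
```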
2. Select a model based on your business requirements.
To evaluate a model in Model Gallery: In the left-side navigation pane of the PAI console, choose QuickStart > Model Gallery. Move the pointer over a model. The Evaluate button appears for models that can be evaluated.
To evaluate a fine-tuned model: Fine-tune a model that can be evaluated. Then, in the left-side navigation pane of the PAI console, choose QuickStart > Model Gallery and click Job Management. On the Training jobs tab, find the desired training job and click its name in the Task name/ID column. The Evaluate button appears in the upper-right corner of the job details page.
Model evaluation supports all models of the AutoModelForCausalLM type in Hugging Face.
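As a quick local check of whether a model falls into this category, the following sketch loads a model with the AutoModelForCausalLM class from the Hugging Face Transformers library. The model ID used here is only an example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model ID; replace it with the model that you plan to evaluate.
model_id = "Qwen/Qwen2-0.5B-Instruct"

# If both calls succeed, the model loads through the AutoModelForCausalLM
# interface, which is the model type that PAI model evaluation supports.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
print(type(model).__name__)
```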
3. Create and run an evaluation job.
In the upper-right corner of the model details page, click Evaluate.
The following table describes the key parameters.
Parameter | Description |
---|---|
Dataset Source | Select the custom dataset that you created in Step 1. |
Result Output Path | The OSS path in which the final evaluation results are saved. |
Resource Group Type | Select a public resource group or general-purpose computing resource group based on your business requirements. |
Job Resource | If you set the Resource Group Type parameter to Public Resource Group, the system recommends job resources based on the specifications of your model. |
Click Submit.
4. View the evaluation results.
In the left-side navigation pane, choose QuickStart > Model Gallery. On the page that appears, click Job Management. On the Job Management page, click the Evaluation Jobs tab. After the status of the evaluation job changes to Succeeded, click View Report in the Operation column. In the Custom Dataset Evaluation Result section of the Evaluation Report tab on the job details page, view the scores of the model based on the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bilingual Evaluation Understudy (BLEU) metrics.
In addition to the radar chart of metric scores, the report displays the evaluation details of each data record in the evaluation file.
To compare multiple evaluation jobs, choose QuickStart > Model Gallery in the left-side navigation pane and click Job Management. On the Job Management page, click the Evaluation Jobs tab, select the evaluation jobs that you want to compare, and click Compare. On the Custom Dataset Evaluation Result tab of the Evaluation Result Comparison page, view the comparison results.
The following information helps you analyze the evaluation results:
The default evaluation metrics for custom datasets include rouge-1-f, rouge-1-p, rouge-1-r, rouge-2-f, rouge-2-p, rouge-2-r, rouge-l-f, rouge-l-p, rouge-l-r, bleu-1, bleu-2, bleu-3, and bleu-4. A sketch of how scores of this kind are computed follows this list.
The final evaluation results are saved to the path specified by the Result Output Path parameter.
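To build intuition for the ROUGE and BLEU metrics listed above, the following is a minimal sketch that computes scores of this kind for a single prediction/reference pair by using the open source rouge-score and NLTK libraries. It is an illustration only and is not necessarily the exact implementation that PAI uses.

```python
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Chinese invented papermaking."               # expected answer from the dataset
prediction = "Papermaking was invented by the Chinese."   # model output

# ROUGE: rouge-1/rouge-2 measure unigram/bigram overlap, and rouge-L measures the
# longest common subsequence; each yields precision (p), recall (r), and F1 (f).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)
print("rouge-1-f:", round(rouge["rouge1"].fmeasure, 4))
print("rouge-l-r:", round(rouge["rougeL"].recall, 4))

# BLEU-n: modified n-gram precision with a brevity penalty; smoothing avoids
# zero scores on short texts.
smooth = SmoothingFunction().method1
bleu_1 = sentence_bleu([reference.split()], prediction.split(),
                       weights=(1, 0, 0, 0), smoothing_function=smooth)
print("bleu-1:", round(bleu_1, 4))
```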
Algorithm research is often built on public datasets. When algorithm researchers select open source or fine-tuned models, they refer to the evaluation results of the models on authoritative public datasets. However, in the era of LLMs, a large number of public datasets exist. As a result, algorithm researchers need to spend a long time researching and selecting public datasets that are appropriate for their domain and familiarizing themselves with the evaluation process of each dataset. To help algorithm researchers, PAI integrates public datasets from multiple domains and reproduces the official evaluation metrics of each dataset to provide accurate evaluation feedback and facilitate more efficient LLM research.
In model evaluation based on public datasets, the LLM evaluation platform of PAI classifies public datasets by domain to evaluate the comprehensive capabilities of LLMs, such as mathematical ability, knowledge ability, and reasoning ability. A higher score indicates a better model. This is the most common evaluation method in LLM evaluation.
The following section highlights some key points during model evaluation. For more information, see Model evaluation.
1. Description of public datasets:
Public datasets in PAI include MMLU, TriviaQA, HellaSwag, GSM8K, C-Eval, CMMLU, and TruthfulQA. Other public datasets are being added. The following table describes the available datasets, and a sketch that loads one of them for local inspection follows the table.
Dataset | Size | Data records | Domain |
---|---|---|---|
MMLU | 166 MB | 14042 | Knowledge |
TriviaQA | 14.3 MB | 17944 | Knowledge |
C-Eval | 1.55 MB | 12342 | Chinese |
CMMLU | 1.08 MB | 11582 | Chinese |
GSM8K | 4.17 MB | 1319 | Mathematics |
HellaSwag | 47.5 MB | 10042 | Reasoning |
TruthfulQA | 0.284 MB | 816 | Security |
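To get a feel for what these benchmarks contain before you run an evaluation, you can inspect a dataset locally. The following sketch loads the GSM8K test split from the Hugging Face Hub by using the datasets library. This is an illustrative shortcut for inspection only; the PAI evaluation workflow uses the datasets that PAI integrates.

```python
from datasets import load_dataset

# GSM8K on the Hugging Face Hub; the "main" configuration contains
# grade-school math word problems with step-by-step reference answers.
gsm8k_test = load_dataset("gsm8k", "main", split="test")

print(len(gsm8k_test))      # 1319 records, as listed in the table above
sample = gsm8k_test[0]
print(sample["question"])   # the math word problem
print(sample["answer"])     # the reference solution and final answer
```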
2. Select a model based on your business requirements.
To evaluate a model in Model Gallery: In the left-side navigation pane of the PAI console, choose QuickStart > Model Gallery. Move the pointer over a model. The Evaluate button appears for models that can be evaluated.
To evaluate a fine-tuned model: Fine-tune a model that can be evaluated. Then, in the left-side navigation pane of the PAI console, choose QuickStart > Model Gallery and click Job Management. On the Training jobs tab, find the desired training job and click its name in the Task name/ID column. The Evaluate button appears in the upper-right corner of the job details page.
Model evaluation supports all models of the AutoModelForCausalLM type in Hugging Face.
3. Create and run an evaluation job.
In the upper-right corner of the model details page, click Evaluate.
The following table describes the key parameters.
Parameter | Description |
---|---|
Dataset Source | Select a public dataset. |
Result Output Path | The OSS path in which the final evaluation results are saved. |
Resource Group Type | Select a public resource group or general-purpose computing resource group based on your business requirements. |
Job Resource | If you set the Resource Group Type parameter to Public Resource Group, the system recommends job resources based on the specifications of your model. |
Click Submit.
4. View the evaluation results.
In the left-side navigation pane, choose QuickStart > Model Gallery. On the page that appears, click Job Management. On the Job Management page, click the Evaluation Jobs tab. After the status of the evaluation job changes to Succeeded, click View Report in the Operation column. In the Evaluation Results of Public Datasets section of the Evaluation Report tab on the job details page, view the scores of the model in various domains and datasets.
To compare multiple evaluation jobs, choose QuickStart > Model Gallery in the left-side navigation pane and click Job Management. On the Job Management page, click the Evaluation Jobs tab, select the evaluation jobs that you want to compare, and click Compare. On the Evaluation Results of Public Datasets tab of the Evaluation Result Comparison page, view the comparison results.
The following information helps you analyze the evaluation results:
The final evaluation results are saved to the path specified by the Result Output Path parameter. A sketch that downloads the result files from OSS follows.
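If you want to process the result files programmatically, the following sketch downloads them from the OSS output path by using the oss2 SDK. The endpoint, bucket name, and object prefix are placeholders, and the actual file names under the output path depend on the evaluation job.

```python
import os
import oss2

# Placeholders: replace them with your own credentials, endpoint, bucket name,
# and the Result Output Path prefix of your evaluation job.
auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"], os.environ["OSS_ACCESS_KEY_SECRET"])
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "your-bucket-name")
result_prefix = "llm-eval/results/"  # hypothetical output path prefix

# List and download every object under the output path for local analysis.
os.makedirs("eval_results", exist_ok=True)
for obj in oss2.ObjectIterator(bucket, prefix=result_prefix):
    if obj.key.endswith("/"):  # skip directory placeholder objects
        continue
    local_path = os.path.join("eval_results", os.path.basename(obj.key))
    bucket.get_object_to_file(obj.key, local_path)
    print("downloaded:", obj.key)
```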