In the era of large language models (LLMs), model evaluation is a key part of model selection and model optimization, and it is essential for accelerating the innovation and practice of Artificial Intelligence (AI). LLM evaluation in Platform for AI (PAI) supports various evaluation scenarios, such as comparing different base models, fine-tuned versions of the same model, and quantized versions of the same model. This topic describes how to implement more comprehensive, accurate, and focused model evaluation based on specific dataset types for different user groups.
Background information
Introduction
As LLMs become more capable, model evaluation becomes more important. Rigorous and efficient evaluation helps developers measure and compare the performance of different models, and guides them in selecting and optimizing models. This accelerates AI innovation and application development. Platform-based best practices are therefore valuable for LLM evaluation.
This topic provides best practices for LLM evaluation in PAI. It helps AI developers build an evaluation process that reflects the actual performance of a model and meets industry-specific requirements. The best practices cover the following steps:
Prepare and select a dataset for model evaluation.
Select an open source or fine-tuned model based on your business requirements.
Create an evaluation job and select appropriate evaluation metrics.
Interpret evaluation results in single-job or multi-job scenarios.
Platform characteristics
The LLM evaluation platform of PAI allows you to compare the model effectiveness in different LLM evaluation scenarios. Examples:
Comparison of base models: Qwen2-7B-Instruct vs. Baichuan2-7B-Chat
Comparison of fine-tuned versions of the same model: different epoch versions of Qwen2-7B-Instruct trained on data in specific domains
Comparison of quantized versions of the same model: Qwen2-7B-Instruct-GPTQ-Int4 vs. Qwen2-7B-Instruct-GPTQ-Int8
This topic describes how to use custom enterprise datasets together with commonly used public datasets, such as MMLU and C-Eval, to achieve more comprehensive, accurate, and focused model evaluation and find LLMs that meet your business requirements. Two user groups, enterprise developers and algorithm researchers, are used as examples to cover the requirements of different development groups. The best practices have the following characteristics:
Provides a complete end-to-end evaluation pipeline that requires no code development, and supports one-click evaluation of mainstream open source LLMs and fine-tuned models.
Allows you to upload custom dataset files and provides more than 10 built-in general Natural Language Processing (NLP) evaluation metrics, so that evaluation results are displayed without the need to develop evaluation scripts.
Supports model evaluation based on common public datasets from multiple domains, fully reproduces the official evaluation methods, and displays the evaluation results in a radar chart, so that you do not need to download datasets or learn the evaluation process of each dataset.
Supports multi-model and multi-job evaluation at the same time, compares and displays the evaluation results in charts, and displays the details of a single evaluation result to facilitate comprehensive comparison and analytics.
Makes the evaluation process open and transparent and allows the results to be reproduced. The evaluation code is open source in the eval-scope code repository that is jointly built with ModelScope. This allows you to view the details and reproduce evaluation results in a convenient manner.
Billing
LLM evaluation relies on QuickStart. QuickStart is free of charge. If you use QuickStart to evaluate models, you are charged for Deep Learning Containers (DLC) evaluation jobs. For more information, see Billing of DLC.
If you select custom datasets for model evaluation, you are charged for using Object Storage Service (OSS). For more information, see Billing overview.
Scenario 1: Evaluate models by using custom datasets for enterprise developers
In most cases, enterprises accumulate a wealth of data in specific domains. Enterprises need to make full use of the data to optimize algorithms by using LLMs. Enterprise developers often evaluate open source or fine-tuned LLMs based on custom datasets accumulated in specific domains to better understand the effectiveness of LLMs in the domains.
In model evaluation based on custom datasets, the LLM evaluation platform of PAI uses standard text matching methods from the NLP domain to calculate a matching score between the output of a model and the reference answer. A higher score indicates a better model. Because the score is calculated on scenario-specific data, this method helps you determine whether the model is suitable for your business scenario.
The following section highlights some key points during model evaluation. For more information, see Model evaluation.
Prepare a custom dataset.
Format description of a custom dataset:
If you perform model evaluation based on a custom dataset, you must prepare a dataset file in the JSONL format, such as llmuses_general_qa_test.jsonl (76 KB). The following sample shows the content of the dataset file:
[{"question": "Is it correct that Chinese invented papermaking?", "answer": "Yes"}]
[{"question": "Is it correct that Chinese invented gunpowder?", "answer": "Yes"}]
The question field identifies the question column, and the answer field identifies the answer column.
Upload a dataset file in the required format to OSS. For more information, see Upload objects.
Create a custom dataset based on the dataset file in OSS. For more information, see Create and manage datasets.
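The dataset format described above can be produced and checked locally before you upload the file to OSS. The following Python sketch is illustrative only: the helper names write_dataset and validate_dataset are hypothetical, and the script assumes that every line must follow the shape of the sample (a JSON array that wraps objects with question and answer string fields). The platform may accept other shapes that this check rejects.

```python
import json
import os
import tempfile


def write_dataset(records, path):
    """Write (question, answer) pairs, one JSON array per line as in the sample."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in records:
            row = [{"question": question, "answer": answer}]
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def validate_dataset(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problem was found."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            # Assumption: each line is a non-empty JSON array, as in the sample file.
            if not isinstance(row, list) or not row:
                problems.append((line_no, "expected a non-empty JSON array"))
                continue
            for item in row:
                if not isinstance(item, dict):
                    problems.append((line_no, "entry is not a JSON object"))
                    continue
                for field in ("question", "answer"):
                    value = item.get(field)
                    if not isinstance(value, str) or not value.strip():
                        problems.append((line_no, f"missing or empty '{field}'"))
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name used only for this demonstration.
        path = os.path.join(tmp, "my_eval_set.jsonl")
        write_dataset([("Is it correct that Chinese invented papermaking?", "Yes")], path)
        print(validate_dataset(path))
```

Running the check before the upload catches malformed lines early, which is cheaper than discovering them after an evaluation job has started.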
Select a model based on your business requirements.
Open source model
In the left-side navigation pane of the PAI console, choose QuickStart > Model Gallery. Move the pointer over a model. The Evaluate button appears for models that can be evaluated.
Fine-tuned model
In the left-side navigation pane of the PAI console, choose QuickStart > Model Gallery. Move the pointer over a model. The Evaluate button appears for models that can be evaluated. Fine-tune a model that can be evaluated. Then, on the Model Gallery page, click Job Management. On the Training jobs tab of the Job Management page, find the training job and click its name in the Task name/ID column. The Evaluate button appears in the upper-right corner of the job details page.
Model evaluation supports all models of the AutoModelForCausalLM type in Hugging Face.
Create and run an evaluation job.
In the upper-right corner of the model details page, click Evaluate.
The following table describes the key parameters.
Parameter | Description
Dataset Source | Select the custom dataset that you created in Step 1.
Result Output Path | The OSS path in which the final evaluation results are saved.
Resource Group Type | Select a public resource group or general-purpose computing resource group based on your business requirements.
Job Resource | If you set the Resource Group Type parameter to Public Resource Group, the system recommends job resources based on the specifications of your model.
Click Submit.
View the evaluation results.
Evaluation results of a single evaluation job
In the left-side navigation pane, choose QuickStart > Model Gallery. On the page that appears, click Job Management. On the Job Management page, click the Evaluation Jobs tab. After the value of the Status parameter changes to Succeeded, click View Report in the Operation column of the job. In the Custom Dataset Evaluation Result section of the Evaluation Report tab on the job details page, view the scores of the model based on the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bilingual Evaluation Understudy (BLEU) metrics.
The radar chart also displays the evaluation details of each data record in an evaluation file.
Comparison results of multiple evaluation jobs
In the left-side navigation pane, choose QuickStart > Model Gallery. On the page that appears, click Job Management. On the Job Management page, click the Evaluation Jobs tab. Select the evaluation jobs that you want to compare and click Compare. On the Custom Dataset Evaluation Result tab of the Evaluation Result Comparison page, view the comparison results.
The following section provides evaluation result analysis:
The default evaluation metrics for custom datasets include rouge-1-f, rouge-1-p, rouge-1-r, rouge-2-f, rouge-2-p, rouge-2-r, rouge-l-f, rouge-l-p, rouge-l-r, bleu-1, bleu-2, bleu-3, and bleu-4.
rouge-n metrics are used to calculate the N-gram overlap. N indicates the number of consecutive words. rouge-1 and rouge-2 are the most commonly used metrics. rouge-1 corresponds to unigram, and rouge-2 corresponds to bigram. rouge-l metrics are based on the longest common subsequence (LCS).
BLEU is a popular metric for evaluating machine translation quality. It is scored by calculating the N-gram overlap between the machine translations and the reference translations. bleu-n metrics measure the N-gram match.
The final evaluation results are saved to the path specified by the Result Output Path parameter.
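The ROUGE and BLEU metrics described above can be illustrated with a minimal Python sketch. It is not the platform's implementation: it assumes whitespace tokenization and a single reference, and bleu_n_precision computes only the clipped N-gram precision that BLEU is built on, without the brevity penalty or smoothing that a full BLEU score applies. All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, cand_total, ref_total):
    """Turn an overlap count into precision (p), recall (r), and F1 (f)."""
    p = overlap / cand_total if cand_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


def rouge_n(candidate, reference, n):
    """ROUGE-N: n-gram overlap between the model output and the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L: precision, recall, and F1 based on the LCS length."""
    return prf(lcs_length(candidate, reference), len(candidate), len(reference))


def bleu_n_precision(candidate, reference, n):
    """Clipped n-gram precision, the building block of bleu-n (no brevity penalty)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    return sum((cand & ref).values()) / total if total else 0.0


if __name__ == "__main__":
    output = "the cat sat on the mat".split()
    answer = "the cat is on the mat".split()
    print("rouge-1:", rouge_n(output, answer, 1))
    print("rouge-2:", rouge_n(output, answer, 2))
    print("rouge-l:", rouge_l(output, answer))
    print("bleu-2 precision:", bleu_n_precision(output, answer, 2))
```

In this example, five of the six unigrams and three of the five bigrams in the model output also appear in the reference, which is why the rouge-1 scores are higher than the rouge-2 scores. The -p, -r, and -f suffixes of the rouge metrics correspond to the p, r, and f values returned here.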
Scenario 2: Evaluate models by using public datasets for algorithm researchers
Algorithm research is often built on public datasets. When algorithm researchers select open source or fine-tuned models, they refer to the evaluation results of the models on authoritative public datasets. However, so many public datasets exist in the era of LLMs that researchers must spend a long time finding the datasets that are appropriate for their domain and learning the evaluation process of each dataset. To reduce this effort, PAI integrates public datasets from multiple domains and reproduces the official evaluation metrics for each dataset, which provides accurate evaluation feedback and makes LLM research more efficient.
In model evaluation based on public datasets, the LLM evaluation platform of PAI classifies public datasets by domain to evaluate the comprehensive capabilities of LLMs, such as mathematical ability, knowledge ability, and reasoning ability. A higher score indicates a better model. This is the most common evaluation method in LLM evaluation.
The following section highlights some key points during model evaluation. For more information, see Model evaluation.
Description of public datasets:
Public datasets in PAI include MMLU, TriviaQA, HellaSwag, GSM8K, C-Eval, CMMLU, and TruthfulQA. Other public datasets are being added.
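Dataset | Size | Data records | Domain
MMLU | 166 MB | 14042 | Knowledge
TriviaQA | 14.3 MB | 17944 | Knowledge
C-Eval | 1.55 MB | 12342 | Chinese
CMMLU | 1.08 MB | 11582 | Chinese
GSM8K | 4.17 MB | 1319 | Mathematics
HellaSwag | 47.5 MB | 10042 | Inference
TruthfulQA | 0.284 MB | 816 | Security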
Select a model based on your business requirements.
Open source model
In the left-side navigation pane of the PAI console, choose QuickStart > Model Gallery. Move the pointer over a model. The Evaluate button appears for models that can be evaluated.
Fine-tuned model
In the left-side navigation pane of the PAI console, choose QuickStart > Model Gallery. Move the pointer over a model. The Evaluate button appears for models that can be evaluated. Fine-tune a model that can be evaluated. Then, on the Model Gallery page, click Job Management. On the Training jobs tab of the Job Management page, find the training job and click its name in the Task name/ID column. The Evaluate button appears in the upper-right corner of the job details page.
Model evaluation supports all models of the AutoModelForCausalLM type in Hugging Face.
Create and run an evaluation job.
In the upper-right corner of the model details page, click Evaluate.
The following table describes the key parameters.
Parameter | Description
Dataset Source | Select a public dataset.
Result Output Path | The OSS path in which the final evaluation results are saved.
Resource Group Type | Select a public resource group or general-purpose computing resource group based on your business requirements.
Job Resource | If you set the Resource Group Type parameter to Public Resource Group, the system recommends job resources based on the specifications of your model.
Click Submit.
View the evaluation results.
Evaluation results of a single evaluation job
In the left-side navigation pane, choose QuickStart > Model Gallery. On the page that appears, click Job Management. On the Job Management page, click the Evaluation Jobs tab. After the value of the Status parameter changes to Succeeded, click View Report in the Operation column of the job. In the Evaluation Results of Public Datasets section of the Evaluation Report tab on the job details page, view the scores of the model in various domains and datasets.
Comparison results of multiple evaluation jobs
In the left-side navigation pane, choose QuickStart > Model Gallery. On the page that appears, click Job Management. On the Job Management page, click the Evaluation Jobs tab. Select the evaluation jobs that you want to compare and click Compare. On the Evaluation Results of Public Datasets tab of the Evaluation Result Comparison page, view the comparison results.
The following section provides evaluation result analysis:
The radar chart on the left displays the scores of the model in different domains. Each domain may have multiple datasets. For datasets that belong to the same domain, the LLM evaluation platform of PAI uses the average of the evaluation scores as the score of the model in the domain.
The radar chart on the right displays the scores of the model on each public dataset. For more information about the evaluation scope of each public dataset, see the description of public datasets in this scenario.
The final evaluation results are saved to the path specified by the Result Output Path parameter.
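The domain scores shown in the left radar chart are described above as the average of the scores of the datasets that belong to each domain. The following Python sketch illustrates that averaging. The dataset-to-domain mapping is inferred from the description of public datasets in this topic, the function name domain_scores is hypothetical, and the scores in the usage example are made-up values, not real evaluation results.

```python
from collections import defaultdict
from statistics import mean

# Assumption: mapping inferred from the description of public datasets in this topic.
DATASET_DOMAIN = {
    "MMLU": "Knowledge",
    "TriviaQA": "Knowledge",
    "C-Eval": "Chinese",
    "CMMLU": "Chinese",
    "GSM8K": "Mathematics",
    "HellaSwag": "Inference",
    "TruthfulQA": "Security",
}


def domain_scores(dataset_scores):
    """Average per-dataset scores into one score per domain."""
    grouped = defaultdict(list)
    for dataset, score in dataset_scores.items():
        grouped[DATASET_DOMAIN[dataset]].append(score)
    return {domain: mean(scores) for domain, scores in grouped.items()}


if __name__ == "__main__":
    # Made-up scores for illustration only.
    example = {"MMLU": 0.70, "TriviaQA": 0.50, "GSM8K": 0.80}
    for domain, score in domain_scores(example).items():
        print(f"{domain}: {score:.2f}")
```

Because a domain score is a plain average, a domain that contains a single dataset, such as Mathematics, has the same value in both radar charts.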