Model Gallery lets you evaluate large language models (LLMs) using custom or public datasets, helping you identify models that meet your business requirements.
Overview
Model evaluation lets you assess LLMs using custom or public datasets.
Custom dataset-based evaluation includes:
Rule-based evaluation uses Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bilingual Evaluation Understudy (BLEU) metrics to measure how closely the predicted results of a model match the expected results.
Judge model-based evaluation uses a judge model provided by PAI to score each question-response pair. The scores are used to judge model performance.
Public dataset-based evaluation loads multiple public datasets, performs model predictions, and provides an industry-standard evaluation reference based on the evaluation framework specific to each dataset.
Model evaluation supports all Hugging Face models that can be loaded as AutoModelForCausalLM.
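For reference, a model is compatible if it loads through the AutoModelForCausalLM class of the Hugging Face Transformers library. The following is a minimal sketch that illustrates this; the Qwen/Qwen2-7B-Instruct checkpoint is only an example assumption, not a requirement:

```python
# Minimal sketch: any checkpoint that loads through AutoModelForCausalLM can be evaluated.
# The model ID below is only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"  # replace with the model you plan to evaluate
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Is it correct that Chinese invented papermaking?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```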
Latest feature:
Use a judge model based on Qwen2 to score model responses in open-ended and complex scenarios. This feature is free for a limited time. You can try it in Professional mode.
Scenarios
Model evaluation supports various business needs:
Model benchmark test: Evaluate model capabilities using public datasets and compare results against industry benchmarks.
Domain evaluation: Compare pre-trained and fine-tuned model results to assess domain-specific knowledge.
Regression test: Verify model performance against a test set to ensure it meets deployment standards.
Prerequisites
An Object Storage Service (OSS) bucket is created. For more information, see Get started with the OSS console.
Billing
When you use the model evaluation feature, you are charged for OSS storage and Deep Learning Containers (DLC) evaluation jobs. For more information, see Billing overview and Billing of Deep Learning Containers (DLC).
Data preparations
Model evaluation supports both custom and public datasets (such as C-Eval).
Public dataset: Pre-uploaded datasets maintained by PAI, ready for direct use.
The public datasets include MMLU, TriviaQA, HellaSwag, GSM8K, C-Eval, and TruthfulQA. More public datasets will be integrated in the future.
Custom dataset: Upload a JSONL file to OSS and create a custom dataset. For details, see Upload objects and Create and manage datasets. Example format:
[{"question": "Is it correct that Chinese invented papermaking?", "answer": "Yes"}]
[{"question": "Is it correct that Chinese invented gunpowder?", "answer": "Yes"}]
The question field identifies the question column, and the answer field identifies the answer column. You can also select the columns on the evaluation page. For judge model evaluation, the answer column is optional.
Sample file: eval.jsonl. Note that the file is in Chinese.
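The following is a minimal sketch of producing such a JSONL file and uploading it to OSS with the oss2 SDK; the endpoint, bucket name, object path, and credentials are placeholders that you must replace with your own values:

```python
# Minimal sketch: write an evaluation dataset in the JSONL format shown above
# and upload it to OSS. Endpoint, bucket, path, and credentials are placeholders.
import json
import oss2

records = [
    {"question": "Is it correct that Chinese invented papermaking?", "answer": "Yes"},
    {"question": "Is it correct that Chinese invented gunpowder?", "answer": "Yes"},
]

with open("eval.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        # Each line mirrors the example format above: a one-element JSON list.
        f.write(json.dumps([record], ensure_ascii=False) + "\n")

auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<your-bucket>")
bucket.put_object_from_file("model-eval/eval.jsonl", "eval.jsonl")
```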
Procedure
Select a model
To find a suitable model:
Go to the Model Gallery page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. Find and click your workspace name to open the Workspace Details page.
In the left-side navigation pane, choose Model Gallery to go to the Model Gallery page.
Find a model that is suitable for your business.
On the Model Gallery page, click a model to go to the Overview tab of the model details page.

On the Overview tab of the model details page, click Evaluate in the upper-right corner. The Evaluate button is displayed only for models that can be evaluated.

To evaluate a fine-tuned model, click Job Management and then click the training job. If a model can be evaluated, any model fine-tuned from it can also be evaluated.

Evaluate a model
Choose between Simple mode and Professional mode to evaluate your model.
Simple mode
Select a public or custom dataset for evaluation. To use the judge model, switch to Professional mode.

On the Create Evaluation Job page, configure the Job Name parameter.
Configure the Result Output Path. Ensure the directory is used only for this evaluation job to avoid results being overwritten.

Select a dataset. You can use a custom dataset (following the format in Data preparations) or a PAI-provided public dataset.


Select a GPU instance type (A10 or V100 recommended) and click Submit. After initialization completes, click the Evaluation Report tab to view results.

Professional mode
Professional mode supports multiple public datasets, custom datasets with judge model evaluation, and hyperparameter configuration.

Click Switch to Professional Mode.

Select datasets. In professional mode, you can select public datasets and a custom dataset.
You can select multiple public datasets.
The custom dataset supports judge model evaluation and general metric evaluation.
You can specify question and answer columns for a custom dataset. If you use the judge model, the answer column is optional.
You can also use a data file stored in OSS that meets the format requirements.



Configure the hyperparameters of the evaluated model.
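The available hyperparameters depend on the evaluated model. As an assumption only, the following dictionary lists text-generation parameters that are commonly exposed; the exact names and defaults on the evaluation page may differ:

```python
# Hypothetical example of common text-generation hyperparameters.
# The parameters actually shown on the evaluation page depend on the model.
generation_hyperparameters = {
    "temperature": 0.7,     # sampling randomness; lower values are more deterministic
    "top_p": 0.9,           # nucleus sampling threshold
    "top_k": 50,            # sample only from the top-k candidate tokens
    "max_new_tokens": 512,  # maximum length of each generated answer
}
```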

Click Submit. After initialization completes, click the Evaluation Report tab to view results.

View evaluation results
Evaluation job list
On the Model Gallery page, click Job Management next to the search box.

On the Job Management page, click the Model Evaluation tab.

Evaluation results of a single evaluation job
In the job list on the Model Evaluation tab, find your evaluation job and click View Report. The Evaluation Report tab displays custom and public dataset scores.
Evaluation results based on a custom dataset

If you select general metric evaluation for an evaluation job, the radar chart displays the scores of the model based on ROUGE and BLEU metrics.
The default metrics for the custom dataset include rouge-1-f, rouge-1-p, rouge-1-r, rouge-2-f, rouge-2-p, rouge-2-r, rouge-l-f, rouge-l-p, rouge-l-r, bleu-1, bleu-2, bleu-3, and bleu-4.
ROUGE metrics:
rouge-n metrics are used to calculate the N-gram overlap. N indicates the number of consecutive words. rouge-1 and rouge-2 are the most commonly used. rouge-1 corresponds to unigram, and rouge-2 corresponds to bigram.
rouge-1-p (Precision): the proportion of the unigrams in the system summary that also appear in the reference summary.
rouge-1-r (Recall): the proportion of the unigrams in the reference summary that appear in the system summary.
rouge-1-f (F-score): the harmonic mean of precision and recall.
rouge-2-p (Precision): the proportion of the bigrams in the system summary that also appear in the reference summary.
rouge-2-r (Recall): the proportion of the bigrams in the reference summary that appear in the system summary.
rouge-2-f (F-score): the harmonic mean of precision and recall.
rouge-l metrics are based on the longest common subsequence (LCS).
rouge-l-p (Precision): the length of the LCS between the system summary and the reference summary divided by the number of words in the system summary.
rouge-l-r (Recall): the length of the LCS between the system summary and the reference summary divided by the number of words in the reference summary.
rouge-l-f (F-score): the harmonic mean of the LCS-based precision and recall.
BLEU metrics:
BLEU is a widely used metric for evaluating machine translation quality. It is computed from the N-gram overlap between the machine translation and the reference translations.
bleu-1: unigram matching.
bleu-2: bigram matching.
bleu-3: trigram matching (three consecutive words).
bleu-4: 4-gram matching.
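For reference, the following sketch shows how these ROUGE and BLEU scores can be computed locally, assuming the third-party rouge and nltk packages; it is not necessarily the implementation that PAI uses:

```python
# Local sketch of ROUGE and BLEU scoring, assuming the third-party
# "rouge" and "nltk" packages; not necessarily the implementation PAI uses.
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

prediction = "the chinese invented papermaking"
reference = "yes the chinese invented papermaking"

# ROUGE: returns rouge-1, rouge-2, and rouge-l, each with f, p, and r values.
rouge_scores = Rouge().get_scores(prediction, reference)[0]
print(rouge_scores["rouge-1"])  # {'f': ..., 'p': ..., 'r': ...}

# BLEU: bleu-1 through bleu-4 use cumulative N-gram weights.
smooth = SmoothingFunction().method1
ref_tokens = [reference.split()]
pred_tokens = prediction.split()
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu(ref_tokens, pred_tokens, weights=weights, smoothing_function=smooth)
    print(f"bleu-{n}: {score:.4f}")
```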
If you use the judge model for the evaluation job, statistics of the judge model scores are displayed in a list.
The judge model is fine-tuned based on Qwen2. It performs on par with GPT-4 on open-source benchmarks such as AlignBench and achieves better evaluation results in some scenarios.
The page displays four statistical indicators for the scores given by the judge model for the evaluated model:
Mean: The average score given by the judge model to the generated results (excluding invalid scores), with a minimum value of 1 and a maximum value of 5. A higher mean indicates better model responses.
Median: The median score given by the judge model to the generated results (excluding invalid scores), with a minimum value of 1 and a maximum value of 5. A higher median indicates better model responses.
Standard Deviation: The standard deviation of the scores given by the judge model to the generated results (excluding invalid scores). When the mean and median are the same, a smaller standard deviation indicates better model performance.
Skewness: The skewness of the score distribution (excluding invalid scores). Positive skewness suggests a longer tail on the right side (higher score range), while negative skewness indicates a longer tail on the left side (lower score range).
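As a reference, the following sketch shows how these four statistics can be derived from a list of judge scores; the score list is a made-up example with invalid scores already removed:

```python
# Sketch of the four statistics reported for judge model scores (1-5 scale).
# The score list is a made-up example; invalid scores are assumed to be removed.
import numpy as np
from scipy.stats import skew

scores = np.array([5, 4, 4, 3, 5, 2, 4, 5])

print("mean:", scores.mean())
print("median:", np.median(scores))
print("standard deviation:", scores.std())
print("skewness:", skew(scores))
```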
Additionally, the bottom of the page displays detailed evaluation results for each data entry in the evaluation dataset.
Evaluation results based on public datasets
If you select public datasets for model evaluation, the radar chart displays the scores of the model on the public datasets.

The radar chart on the left displays the scores of the model in different domains. Each domain may have multiple datasets. For datasets that belong to the same domain, the average of the evaluation scores is used as the score of the model in the domain.
The radar chart on the right displays the scores of the model in each public dataset. For more information about the evaluation scope of each public dataset, see the official introduction of the dataset.
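As an illustration of the domain aggregation described above, the following sketch averages per-dataset scores into per-domain scores; the dataset-to-domain mapping and the scores are made up for the example:

```python
# Sketch of per-domain aggregation: datasets in the same domain are averaged.
# The dataset-to-domain mapping and the scores below are made-up examples.
from collections import defaultdict

dataset_scores = {"MMLU": 0.62, "C-Eval": 0.58, "GSM8K": 0.41, "TruthfulQA": 0.55}
dataset_domain = {"MMLU": "knowledge", "C-Eval": "knowledge", "GSM8K": "math", "TruthfulQA": "truthfulness"}

domain_scores = defaultdict(list)
for name, score in dataset_scores.items():
    domain_scores[dataset_domain[name]].append(score)

for domain, values in domain_scores.items():
    print(domain, round(sum(values) / len(values), 3))
```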
Comparison of evaluation results for multiple models
To compare multiple models, select the evaluation jobs on the Model Evaluation tab and click Compare.

Comparison results of models based on custom datasets

Comparison results of models based on public datasets

Result analysis
Evaluation results include custom dataset and public dataset scores:
Evaluation results based on custom datasets:
Text matching scores: Higher scores indicate better alignment between model output and ground truth.
Judge model scores: Higher mean/median and lower standard deviation indicate better model performance.
Custom datasets help evaluate whether a model fits your specific business scenario.
Evaluation results based on public datasets: Evaluate comprehensive LLM capabilities (math, code, reasoning) using open-source benchmarks. Higher scores indicate better performance. PAI continues to integrate more public datasets.
References
You can also use the PAI SDK for Python to evaluate models. See these notebooks: