Elastic Algorithm Service (EAS) provides simplified deployment methods for different scenarios. You can configure parameters to deploy a Retrieval-Augmented Generation (RAG)-based large language model (LLM) chatbot. This significantly shortens the service deployment time. When you use the chatbot to perform model inference, the chatbot effectively retrieves relevant information from the knowledge base and combines the retrieved information with answers from LLM applications to provide accurate and informative answers. This significantly improves the quality of Q&A and overall performance. The chatbot is suitable for Q&A, summarization, and other natural language processing (NLP) tasks that rely on specific knowledge bases. This topic describes how to deploy a RAG-based LLM chatbot and how to perform model inference.
Background information
LLM applications have limitations in generating accurate and real-time responses, and are therefore not suitable for scenarios that require precise information, such as customer service and Q&A scenarios. To resolve these issues, the RAG technique is used to enhance the performance of LLM applications. This significantly improves the quality of Q&A, summarization, and other NLP tasks that rely on specific knowledge bases.
RAG improves answer accuracy and makes answers more informative by combining LLM applications, such as Qwen, with information retrieval components. When a query is initiated, RAG uses an information retrieval component to find documents or information fragments related to the query in the knowledge base, and integrates the retrieved content together with the original query into the LLM application. The LLM application then uses its induction and generation capabilities to generate factual answers based on the latest information. You do not need to retrain the LLM application.
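To make this flow concrete, the following minimal sketch shows the retrieve-then-generate pattern in Python. The retrieve and generate functions are hypothetical placeholders that stand in for the vector database and the LLM service; the sketch only illustrates how retrieved context and the original query are combined into one prompt.

# Minimal, illustrative RAG flow. retrieve() and generate() are hypothetical
# placeholders for the vector database lookup and the LLM call.
def retrieve(query: str, top_k: int = 3) -> list:
    # In the deployed service, this step queries the vector database.
    return ["PAI is an AI platform provided by Alibaba Cloud."]

def generate(prompt: str) -> str:
    # In the deployed service, this step calls the LLM.
    return "PAI is an AI platform provided by Alibaba Cloud..."

def rag_answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question based on the following context.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )
    return generate(prompt)

print(rag_answer("What is PAI?"))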
The chatbot that is deployed in EAS integrates LLM applications with RAG to overcome the limits of LLM applications in terms of accuracy and timeliness. This chatbot provides accurate and informative answers in various Q&A scenarios and helps improve the overall performance and user experience of NLP tasks.
Prerequisites
Virtual private cloud (VPC), vSwitch, and security group are created. For more information, see Create and manage a VPC and Create a security group.
Note: If you use Facebook AI Similarity Search (Faiss) to build a vector database, the VPC, vSwitch, and security group are not required.
An Object Storage Service (OSS) bucket or File Storage NAS (NAS) file system is created to store fine-tuned model files. This prerequisite must be met if you use a fine-tuned model to deploy the chatbot. For more information, see Get started by using the OSS console or Create a file system.
Note: If you use Faiss to build a vector database, you must prepare an OSS bucket.
Usage notes
This practice is subject to the maximum number of tokens of an LLM service and is designed to help you understand the basic retrieval feature of a RAG-based LLM chatbot.
The conversation length supported by the chatbot is limited by the server resources of the LLM service and the default maximum number of tokens.
If you do not need to perform multiple rounds of conversation, we recommend that you disable the with chat history feature of the chatbot on the WebUI page. This effectively reduces the possibility of reaching the token limit. For more information, see How do I disable the with chat history feature of the RAG-based chatbot?
Step 1: Deploy the RAG service
To deploy a RAG-based LLM chatbot and connect it to a vector database, perform the following steps:
Log on to the PAI console. Select a region and a workspace. Then, click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section, click RAG-based Smart Dialogue Deployment.
On the RAG-based LLM Chatbot Deployment page, configure the following key parameters.
Basic Information
Parameter
Description
Model Source
The source of the model. Valid values:
Open Source Model: PAI provides a variety of preset open source models, including Qwen, Llama, ChatGLM, Baichuan, Falcon, Yi, Mistral, Gemma, and DeepSeek. You can select and deploy a model with the appropriate parameter size.
Custom Fine-tuned Model: PAI supports models that you fine-tuned for specific scenarios.
Model Type
If you use an Open Source Model, select a model with the appropriate parameter size.
If you use a Custom Fine-tuned Model, you need to specify the model type, parameter size, and precision.
Model Settings
If you use a Custom Fine-tuned Model, specify the path in which the model is stored. The system reads the model configuration file from this path when it deploys the model. Valid values:
Note: We recommend that you first run the fine-tuned model in Hugging Face Transformers to confirm that the output meets your expectations before you deploy the model as an EAS service.
OSS: Select the OSS path in which the fine-tuned model file is stored.
NAS: Select the NAS file system in which the fine-tuned model file is stored, the source path and the mount path.
Resource Configuration
Parameter
Description
Resource Configuration
After you select a model, the system recommends appropriate resource configurations. If you switch to another specification, the model service may fail to start.
Inference Acceleration
Inference acceleration can be enabled for the Qwen, Llama2, ChatGLM, or Baichuan2 model that is deployed on A10 or GU30 instances. Valid values:
BladeLLM Inference Acceleration: The BladeLLM inference acceleration engine ensures high concurrency and low latency. You can use BladeLLM to accelerate LLM inference in a cost-effective manner.
Open-source vLLM Inference Acceleration
Vector Database Settings
You can use one of the following types of vector database: Faiss, Elasticsearch, Hologres, OpenSearch, or RDS PostgreSQL. Select a type based on your business requirements.
FAISS
You can use Faiss to quickly build a local vector database in an EAS instance without the need to purchase or activate online vector databases.
Parameter
Description
Vector Database Type
Select FAISS.
OSS Path
The OSS path of the vector database. Select an OSS path in the current region. You can create an OSS path if no OSS path is available. For more information, see Get started by using the OSS console.
Note: If you use a Custom Fine-tuned Model, make sure that the OSS paths of the vector database and the model are different.
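For reference, the following sketch shows what a Faiss index does conceptually: it stores vectors and returns the nearest neighbors of a query vector. This is only an illustration of the underlying library, not part of the deployment steps, and it assumes that the faiss-cpu and numpy packages are installed.

# Illustration only: a tiny in-memory Faiss index (requires faiss-cpu and numpy).
import faiss
import numpy as np

dim = 8                                    # embedding dimension
index = faiss.IndexFlatL2(dim)             # exact L2 search index
vectors = np.random.random((100, dim)).astype("float32")
index.add(vectors)                         # store document embeddings
query = np.random.random((1, dim)).astype("float32")
distances, ids = index.search(query, 3)    # top 3 nearest neighbors
print(ids[0], distances[0])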
Elasticsearch
Specify the connection information of an Elasticsearch cluster. For information about how to create and prepare an Elasticsearch cluster, see Prepare a vector database by using Elasticsearch.
Parameter
Description
Vector Database Type
Select Elasticsearch.
Private Endpoint and Port
The private endpoint and port number of the Elasticsearch cluster. Format: http://&lt;private endpoint&gt;:&lt;port number&gt;. For information about how to obtain the private endpoint and port number of the Elasticsearch cluster, see View the basic information of a cluster.
Index Name
The name of the index. You can enter a new index name or an existing index name. If you use an existing index name, the index schema must meet the requirements of the RAG-based chatbot. For example, you can enter the name of the index that is automatically created when you deploy the RAG-based chatbot by using EAS.
Account
The logon name that you specified when you created the Elasticsearch cluster. Default logon name: elastic.
Password
The password that you configured when you created the Elasticsearch cluster. If you forget the password, see Reset the access password for an Elasticsearch cluster.
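Optionally, you can verify that the endpoint, account, and password are correct from a machine that can reach the Elasticsearch cluster before you deploy the service. The following sketch uses the requests library with HTTP basic authentication; the endpoint and password are placeholders.

# Optional connectivity check for the Elasticsearch cluster (placeholder values).
import requests

endpoint = "http://es-cn-xxxx.elasticsearch.aliyuncs.com:9200"  # private endpoint and port
resp = requests.get(endpoint, auth=("elastic", "your_password"), timeout=10)
print(resp.status_code)   # 200 indicates that the endpoint, account, and password are valid
print(resp.json())        # basic cluster information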
Hologres
Specify the connection information of a Hologres instance. To purchase a Hologres instance, see Purchase a Hologres instance.
Parameter
Description
Vector Database Type
Select Hologres.
Invocation Information
The VPC endpoint host of the Hologres instance. Go to the Instance Details page in the Hologres console. In the Network Information section, click Copy next to Select VPC and use the part of the domain name before :80 as the host.
Database Name
The name of the database in the Hologres instance. For more information about how to create a database, see Create a database.
Account
The custom account that you created. For more information, see Create a custom account. When you create the custom account, select Super Administrator (SuperUser) in the Select Member Role section.
Password
The password of the custom account that you created.
Table Name
The name of the table. You can enter a new table name or an existing table name. If you use an existing table name, the table schema must meet the requirements of the RAG-based chatbot. For example, you can enter the name of the Hologres table that is automatically created when you deploy the RAG-based chatbot by using EAS.
OpenSearch
Specify the connection information of an OpenSearch instance of Vector Search Edition. For information about how to create and prepare an OpenSearch instance, see Prepare an OpenSearch Vector Search Edition instance.
Parameter
Description
Vector Database Type
Select OpenSearch.
Endpoint
The public endpoint of the OpenSearch instance. You must first configure Internet access for the OpenSearch instance. For more information, see Prepare an OpenSearch Vector Search Edition instance.
Instance ID
Obtain the instance ID from the OpenSearch instance list.
Username
The username of the OpenSearch instance.
Password
The password of the OpenSearch instance.
Table Name
Enter the name of the index table of the OpenSearch instance. For information about how to prepare the index table, see Prepare an OpenSearch Vector Search Edition instance.
RDS PostgreSQL
Specify the connection information of the ApsaraDB RDS for PostgreSQL instance. For information about how to create and prepare an ApsaraDB RDS for PostgreSQL instance, see Prepare a vector database by using ApsaraDB RDS for PostgreSQL.
Parameter
Description
Vector Database Type
Select RDS PostgreSQL.
Host Address
The internal endpoint of the ApsaraDB RDS for PostgreSQL instance. You can log on to the ApsaraDB RDS console and view the endpoint on the Database Connection page of the instance.
Port
The port number. Default value: 5432.
Database
The name of the database. For information about how to create a database and an account, see Create a database and an account.
When you create an account, select Privileged Account for Account Type.
When you create a database, select the created privileged account from the Authorized By drop-down list.
Table Name
The name of the database table.
Account
The privileged account that you created. For more information, see Create a database and an account.
Password
The password of the privileged account that you created.
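Optionally, you can verify the connection information from a machine in the same VPC before you deploy the service. The following sketch uses the psycopg2 library; the endpoint, database, account, and password are placeholders.

# Optional connectivity check for the ApsaraDB RDS for PostgreSQL instance (placeholder values).
import psycopg2

conn = psycopg2.connect(
    host="pgm-xxxx.pg.rds.aliyuncs.com",   # internal endpoint of the instance
    port=5432,
    dbname="your_database",
    user="your_privileged_account",
    password="your_password",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())                  # a result indicates that the connection information is valid
conn.close()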
VPC Configuration
Parameter
Description
VPC
If you use Hologres, Elasticsearch, OpenSearch, or RDS PostgreSQL to build a vector database, select the VPC in which the vector database is deployed.
Note: If you use OpenSearch to build a vector database, you can select a VPC that is different from the VPC in which the RAG application resides. However, make sure that the VPC has Internet access and that the associated Elastic IP address (EIP) is added to the public IP address whitelist of the OpenSearch instance. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet and Configure the public access whitelist.
If you use Faiss to build a vector database, you do not need to configure the VPC.
vSwitch
The vSwitch in the selected VPC.
Security Group Name
The security group in the selected VPC.
Click Deploy.
When the Service Status changes to Running, the RAG-based chatbot is deployed.
Step 2: Test the chatbot through WebUI
Perform the following steps to upload your knowledge base files on the WebUI page and test the Q&A chatbot.
1. Connect to the vector database
After you deploy the RAG-based chatbot, click View Web App in the Service Type column to enter the web UI.
Configure the embedding model. The system uses the embedding model to convert text chunks into vectors.
Embedding Model Name: Four models are available. By default, the optimal model is selected.
Embedding Dimension: This parameter has a direct impact on the performance of the model. After you select an embedding model, the system automatically configures this parameter.
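The following sketch illustrates what the embedding step does: the embedding model maps each text chunk to a fixed-length vector, and the vector length is the embedding dimension that must match the schema of the vector database. The sentence-transformers package and the model name in the sketch are examples for illustration, not the models listed on the web UI.

# Illustration of text embedding (requires the sentence-transformers package).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # example model with 384-dimensional embeddings
chunks = [
    "PAI is an AI platform provided by Alibaba Cloud.",
    "EAS is the online model service of PAI.",
]
embeddings = model.encode(chunks)                 # one vector per chunk
print(embeddings.shape)                           # (2, 384): the embedding dimension is 384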
Check whether the vector database is connected.
The system automatically recognizes and applies the vector database settings that you configured when you deployed the chatbot. If you use Hologres to build the vector database, click Connect Hologres to check whether the vector database in Hologres is connected. If the connection fails, check whether the vector database is correctly configured based on the descriptions in Step 1. Then, reconnect the database.
2. Upload knowledge base files
Upload your knowledge base files. The system automatically stores the knowledge base in the PAI-RAG format to the vector database for retrieval. You can also use existing knowledge bases in the database, but the knowledge bases must meet the PAI-RAG format requirements. Otherwise, errors may occur during retrieval.
On the Upload tab, configure the chunk parameters.
The following parameters control the granularity of document chunking and specify whether to enable Q&A extraction. The effect of the chunk parameters is illustrated in the sketch after the table.
Parameter
Description
Chunk Size
The size of each chunk. Unit: bytes. Default value: 500.
Chunk Overlap
The overlap between adjacent chunks. Default value: 10.
Process with QA Extraction Model
Specifies whether to extract Q&A information. If you select Yes, the system automatically extracts questions and corresponding answers in pairs after knowledge files are uploaded. This way, more accurate answers are returned in data queries.
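The following sketch shows how Chunk Size and Chunk Overlap affect chunking. It is a simplified character-based illustration; the service performs semantic-based chunking, so the actual chunk boundaries differ.

# Simplified illustration of chunking with overlap (character-based, not semantic).
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 10) -> list:
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

doc = "PAI provides EAS for online model services. " * 40
for i, chunk in enumerate(chunk_text(doc, chunk_size=500, chunk_overlap=10)):
    print(i, len(chunk))   # each chunk is at most 500 characters and overlaps the previous chunk by 10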
On the Files tab or Directory tab, upload one or more business data files, or upload a directory that contains the business data files. Supported file types: .txt, .pdf, Excel (.xlsx or .xls), .csv, Word (.docx or .doc), Markdown, and .html. Example: rag_chatbot_test_doc.txt.
Click Upload. Before the files are uploaded, the system performs data cleansing and semantic-based chunking on the business data files. Data cleansing includes text extraction and hyperlink replacement.
3. Configure model inference parameters
On the Chat tab, configure Q&A policies.
Retrieval policies
Parameter
Description
Streaming Output
Specifies whether to return results in streaming mode. If you select Streaming Output, the results are returned in streaming mode.
Retrieval Mode
The retrieval method.
Note: In most complex scenarios, vector database-based retrieval delivers good performance. However, in some vertical fields that lack information or in scenarios that require exact matching, vector database-based retrieval may not achieve the same effect as traditional keyword-based retrieval. Keyword-based retrieval is simpler and more efficient because it calculates the keyword overlap between user queries and knowledge files. PAI provides keyword-based retrieval algorithms, such as BM25. Vector database-based retrieval and keyword-based retrieval have their own advantages and disadvantages, and combining their results can improve the overall accuracy and efficiency. The reciprocal rank fusion (RRF) algorithm obtains a total score for each file from the ranks at which the file is sorted by the different retrieval methods (see the sketch after this table). If you select Hybrid, PAI uses the RRF algorithm by default to combine the results returned by vector database-based retrieval and keyword-based retrieval.
Reranker Type
Most vector databases compromise data accuracy to provide high computing efficiency. As a result, the top K results that are returned from the vector database may not be the most relevant. You can use a rerank model to perform a higher-precision re-rank operation on the top K results that are returned from the vector database to obtain more relevant and accurate knowledge files.
Top K
The number of the most relevant results that are returned from the vector database.
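The following sketch shows how RRF combines two ranked result lists. It uses a common unweighted form of RRF, score(d) = sum of 1 / (k + rank(d)) over the retrieval methods, with the frequently used constant k = 60; the document IDs are hypothetical.

# Illustration of reciprocal rank fusion (RRF) over two ranked result lists.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_results = ["doc_a", "doc_b", "doc_c"]     # from vector database-based retrieval
keyword_results = ["doc_b", "doc_d", "doc_a"]    # from keyword-based retrieval, for example BM25
print(rrf([vector_results, keyword_results]))    # doc_b and doc_a rank highest because both methods return them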
RAG (Retrieval + LLM) policies
PAI provides various prompt policies. You can select a predefined prompt template or specify a custom prompt template for better inference results. The retrieval-augmented generation (RAG) system fills the returned results and user query into a prompt template, and then submits the prompt to the LLM.
You can also configure the following parameters in RAG (Retrieval + LLM) mode: Streaming Output, Retrieval Mode, and Reranker Type. For more information, see the Retrieval policies tab of this section.
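To make the template-filling step concrete, the following sketch fills retrieved results and a user query into a prompt template. The template text is an example of a custom prompt, not the predefined template used by the service.

# Illustration of how retrieved results and the user query are filled into a prompt template.
PROMPT_TEMPLATE = (
    "Answer the question based only on the following context.\n"
    "Context:\n{context}\n"
    "Question: {question}\n"
    "Answer:"
)

retrieved_docs = [
    "PAI is an AI platform provided by Alibaba Cloud.",
    "EAS supports one-click deployment of RAG services.",
]
question = "What is PAI?"
prompt = PROMPT_TEMPLATE.format(context="\n".join(retrieved_docs), question=question)
print(prompt)   # this prompt is what the RAG system submits to the LLM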
4. Perform model inference
Retrieval
The chatbot returns the top K relevant results from the vector database.
LLM
The chatbot uses only the LLM to generate an answer.
RAG (Retrieval + LLM)
The chatbot fills the returned results from the database and user query into a prompt template, and then submits the prompt to the LLM to generate an answer.
After you test the Q&A performance of the RAG-based chatbot on the web UI, you can call API operations provided by Platform for AI (PAI) to apply the RAG-based chatbot to your business system. For more information, see Step 3: Call API operations to perform model inference in this topic.
Step 3: Call API operations to perform model inference
Obtain the invocation information of the RAG-based chatbot.
Click the name of the RAG-based chatbot to go to the Service Details page.
In the Basic Information section, click View Endpoint Information.
On the Public Endpoint tab of the Invocation Method dialog box, obtain the service endpoint and token.
Connect to the vector database through the WebUI and upload knowledge base files.
You can also write your knowledge base directly to the vector database based on the structure of the generated table, provided that the structure conforms to the PAI-RAG format.
Call the service through APIs.
PAI allows you to call the RAG-based chatbot by using the following API operations in different query modes: service/query/retrieval in retrieval mode, service/query/llm in LLM mode, and service/query in RAG mode. Sample code:
cURL command
Initiate a single-round conversation request
Method 1: Call the service/query/retrieval operation.
curl -X 'POST' '<service_url>service/query/retrieval' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}' # Replace <service_url> and <service_token> with the service endpoint and service token that you obtained in Step 1.
Method 2: Call the service/query/llm operation.
curl -X 'POST' '<service_url>service/query/llm' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}' # Replace <service_url> and <service_token> with the service endpoint and service token that you obtained in Step 1.
You can add other adjustable inference parameters, such as {"question":"What is PAI?", "temperature": 0.9}.
Method 3: Call the service/query operation.
curl -X 'POST' '<service_url>service/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}' # Replace <service_url> and <service_token> with the service endpoint and service token that you obtained in Step 1.
You can add other adjustable inference parameters, such as {"question":"What is PAI?", "temperature": 0.9}.
Initiate a multi-round conversational search request
You can initiate a multi-round conversational search request only in RAG and LLM query modes. The following sample code shows an example on how to initiate a multi-round conversational search request in RAG query mode:
# Send the request.
curl -X 'POST' '<service_url>service/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}'

# Provide the session ID returned for the request. This ID uniquely identifies a conversation in the conversation history. After the session ID is provided, the corresponding conversation is stored and is automatically included in subsequent requests to call an LLM.
curl -X 'POST' '<service_url>service/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What are the benefits of PAI?","session_id": "ed7a80e2e20442eab****"}'

# Provide the chat_history parameter, which contains the conversation history between you and the chatbot. The parameter value is a list in which each element indicates a single round of conversation in the {"user":"Inputs","bot":"Outputs"} format. Multiple conversations are sorted in chronological order.
curl -X 'POST' '<service_url>service/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question":"What are the features of PAI?", "chat_history": [{"user":"What is PAI", "bot":"PAI is an AI platform provided by Alibaba Cloud..."}]}'

# If you provide both the session_id and chat_history parameters, the conversation history is appended to the conversation that corresponds to the specified session ID.
curl -X 'POST' '<service_url>service/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question":"What are the features of PAI?", "chat_history": [{"user":"What is PAI", "bot":"PAI is an AI platform provided by Alibaba Cloud..."}], "session_id": "1702ffxxad3xxx6fxxx97daf7c"}'
Python
The following sample code shows an example on how to initiate a single-round conversational search request:
import requests

EAS_URL = 'http://xxxx.****.cn-beijing.pai-eas.aliyuncs.com'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': 'MDA5NmJkNzkyMGM1Zj****YzM4M2YwMDUzZTdiZmI5YzljYjZmNA==',
}

def test_post_api_query_llm():
    url = EAS_URL + '/service/query/llm'
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"======= Question =======\n {data['question']}")
    print(f"======= Answer =======\n {ans['answer']} \n\n")

def test_post_api_query_retrieval():
    url = EAS_URL + '/service/query/retrieval'
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"======= Question =======\n {data['question']}")
    print(f"======= Answer =======\n {ans['docs']}\n\n")

def test_post_api_query_rag():
    url = EAS_URL + '/service/query'
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"======= Question =======\n {data['question']}")
    print(f"======= Answer =======\n {ans['answer']}")
    print(f"======= Retrieved Docs =======\n {ans['docs']}\n\n")

# LLM
test_post_api_query_llm()
# Retrieval
test_post_api_query_retrieval()
# RAG (Retrieval + LLM)
test_post_api_query_rag()
Set the EAS_URL parameter to the endpoint of the RAG-based chatbot. Make sure to remove the forward slash (/) at the end of the endpoint. Set the Authorization parameter to the token of the RAG-based chatbot.
Initiate a multi-round conversational search request
You can initiate a multi-round conversational search request only in RAG (Retrieval + LLM) and LLM query modes. Sample code:
import requests

EAS_URL = 'http://xxxx.****.cn-beijing.pai-eas.aliyuncs.com'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': 'MDA5NmJkN****jNlMDgzYzM4M2YwMDUzZTdiZmI5YzljYjZmNA==',
}

def test_post_api_query_llm_with_chat_history():
    url = EAS_URL + '/service/query/llm'
    # Round 1 query
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"=======Round 1: Question =======\n {data['question']}")
    print(f"=======Round 1: Answer =======\n {ans['answer']} session_id: {ans['session_id']} \n")
    # Round 2 query
    data_2 = {
        "question": "What are the benefits of PAI?",
        "session_id": ans['session_id']
    }
    response_2 = requests.post(url, headers=headers, json=data_2)
    if response_2.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response_2.status_code}')
    ans_2 = dict(response_2.json())
    print(f"=======Round 2: Question =======\n {data_2['question']}")
    print(f"=======Round 2: Answer =======\n {ans_2['answer']} session_id: {ans_2['session_id']} \n\n")

def test_post_api_query_rag_with_chat_history():
    url = EAS_URL + '/service/query'
    # Round 1 query
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"=======Round 1: Question =======\n {data['question']}")
    print(f"=======Round 1: Answer =======\n {ans['answer']} session_id: {ans['session_id']}")
    print(f"=======Round 1: Retrieved Docs =======\n {ans['docs']}\n")
    # Round 2 query
    data = {
        "question": "What are the features of PAI?",
        "session_id": ans['session_id']
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"=======Round 2: Question =======\n {data['question']}")
    print(f"=======Round 2: Answer =======\n {ans['answer']} session_id: {ans['session_id']}")
    print(f"=======Round 2: Retrieved Docs =======\n {ans['docs']}")

# LLM
test_post_api_query_llm_with_chat_history()
# RAG (Retrieval + LLM)
test_post_api_query_rag_with_chat_history()
Set the EAS_URL parameter to the endpoint of the RAG-based chatbot. Make sure to remove the forward slash (/) at the end of the endpoint. Set the Authorization parameter to the token of the RAG-based chatbot.
References
You can also use EAS to deploy the following items:
You can deploy an LLM application that can be called by using the web UI or API operations. After the LLM application is deployed, use the LangChain framework to integrate enterprise knowledge bases into the LLM application to implement intelligent Q&A and automation features. For more information, see Quickly deploy open source LLMs in EAS.
You can deploy an AI video generation model service by using ComfyUI and Stable Video Diffusion models. This helps you complete tasks such as short video generation and animation on social media platforms. For more information, see Use ComfyUI to deploy an AI video generation model service.
You can deploy a model service based on Stable Diffusion WebUI by configuring a few parameters. For more information, see Use Stable Diffusion web UI to deploy an AI painting service.
FAQ
How do I disable the with chat history feature of the RAG-based chatbot?
On the web UI page of the RAG-based chatbot, clear the With Chat History check box.