Elastic Algorithm Service (EAS) provides simplified deployment methods for different scenarios. You can configure parameters to deploy a Retrieval-Augmented Generation (RAG)-based large language model (LLM) chatbot. This significantly shortens the service deployment time. When you use the chatbot to perform model inference, the chatbot effectively retrieves relevant information from the knowledge base and combines the retrieved information with answers from LLM applications to provide accurate and informative answers. This significantly improves the quality of Q&A and overall performance. The chatbot is suitable for Q&A, summarization, and other natural language processing (NLP) tasks that rely on specific knowledge bases. This topic describes how to deploy a RAG-based LLM chatbot and how to perform model inference.
Background information
LLM applications have limitations in generating accurate and real-time responses, and are therefore not suitable for scenarios that require precise information, such as customer service and Q&A scenarios. To resolve these issues, the RAG technique is used to enhance the performance of LLM applications. This significantly improves the quality of Q&A, summarization, and other NLP tasks that rely on specific knowledge bases.
RAG improves answer accuracy and makes answers more informative by combining LLM applications, such as Qwen, with information retrieval components. When a query is initiated, RAG uses an information retrieval component to find documents or information fragments related to the query in the knowledge base, and integrates the retrieved content together with the original query into the LLM application. The LLM application then uses its induction and generation capabilities to generate factual answers based on the latest information. You do not need to retrain the LLM application.
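To make this flow concrete, the following minimal sketch shows the retrieve-then-generate pattern in Python. The retrieve and generate functions are hypothetical placeholders that stand in for the vector database and the LLM service; the sketch only illustrates how retrieved context and the original query are combined into one prompt.

# Minimal, illustrative RAG flow. retrieve() and generate() are hypothetical
# placeholders for the vector database lookup and the LLM call.
def retrieve(query: str, top_k: int = 3) -> list:
    # In the deployed service, this step queries the vector database.
    return ["PAI is an AI platform provided by Alibaba Cloud."]

def generate(prompt: str) -> str:
    # In the deployed service, this step calls the LLM.
    return "PAI is an AI platform provided by Alibaba Cloud..."

def rag_answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question based on the following context.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )
    return generate(prompt)

print(rag_answer("What is PAI?"))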
The chatbot that is deployed in EAS integrates LLM applications with RAG to overcome the limits of LLM applications in terms of accuracy and timeliness. This chatbot provides accurate and informative answers in various Q&A scenarios and helps improve the overall performance and user experience of NLP tasks.
Prerequisites
Virtual private cloud (VPC), vSwitch, and security group are created. For more information, see Create and manage a VPC and Create a security group.
Note: If you use Facebook AI Similarity Search (Faiss) to build a vector database, the VPC, vSwitch, and security group are not required.
An Object Storage Service (OSS) bucket or File Storage NAS (NAS) file system is created to store fine-tuned model files. This prerequisite must be met if you use a fine-tuned model to deploy the chatbot. For more information, see Get started by using the OSS console or Create a file system.
Note: If you use Faiss to build a vector database, you must prepare an OSS bucket.
Usage notes
This practice is subject to the maximum number of tokens of an LLM service and is designed to help you understand the basic retrieval feature of a RAG-based LLM chatbot.
The conversation length supported by the chatbot is limited by the server resources of the LLM service and the default maximum number of tokens.
If you do not need to perform multiple rounds of conversation, we recommend that you disable the with chat history feature of the chatbot on the WebUI page. This effectively reduces the possibility of reaching the token limit. For more information, see How do I disable the with chat history feature of the RAG-based chatbot?
Step 1: Deploy the RAG service
To deploy a RAG-based LLM chatbot and connect it to a vector database, perform the following steps:
Log on to the PAI console. Select a region and a workspace. Then, click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section, click RAG-based Smart Dialogue Deployment.
On the RAG-based LLM Chatbot Deployment page, configure the following key parameters.
Basic Information
Parameter
Description
Model Source
The source of the model. Valid values:
Open Source Model: PAI provides a variety of preset open source models, including Qwen, Llama, ChatGLM, Baichuan, Falcon, Yi, Mistral, Gemma, and DeepSeek. You can select and deploy a model with the appropriate parameter size.
Custom Fine-tuned Model: PAI supports models that you fine-tuned for specific scenarios.
Model Type
If you use an Open Source Model, select a model with the appropriate parameter size.
If you use a Custom Fine-tuned Model, you need to specify the model type, parameter size, and precision.
Model Settings
If you use a Custom Fine-tuned Model, specify the path in which the model is stored. The system reads the model configuration file from this path when it deploys the model. Valid values:
Note: We recommend that you first run the fine-tuned model in Hugging Face Transformers to confirm that the output meets your expectations before you deploy the model as an EAS service.
OSS: Select the OSS path in which the fine-tuned model file is stored.
NAS: Select the NAS file system in which the fine-tuned model file is stored, the source path and the mount path.
Resource Configuration
Parameter
Description
Resource Configuration
After you select a model, the system recommends appropriate resource configurations. If you switch to another specification, the model service may fail to start.
Inference Acceleration
Inference acceleration can be enabled for the Qwen, Llama2, ChatGLM, or Baichuan2 model that is deployed on A10 or GU30 instances. Valid values:
BladeLLM Inference Acceleration: The BladeLLM inference acceleration engine ensures high concurrency and low latency. You can use BladeLLM to accelerate LLM inference in a cost-effective manner.
Open-source vLLM Inference Acceleration
Vector Database Settings
You can use one of the following types of vector database: Faiss, Elasticsearch, Hologres, OpenSearch, or RDS PostgreSQL. Select a type based on your business requirements.
FAISS
You can use Faiss to quickly build a local vector database in an EAS instance without the need to purchase or activate online vector databases.
Parameter
Description
Vector Database Type
Select FAISS.
OSS Path
The OSS path of the vector database. Select an OSS path in the current region. You can create an OSS path if no OSS path is available. For more information, see Get started by using the OSS console.
Note: If you use a Custom Fine-tuned Model, make sure that the OSS paths of the vector database and the model are different.
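For reference, the following sketch shows what a Faiss index does conceptually: it stores vectors and returns the nearest neighbors of a query vector. This is only an illustration of the underlying library, not part of the deployment steps, and it assumes that the faiss-cpu and numpy packages are installed.

# Illustration only: a tiny in-memory Faiss index (requires faiss-cpu and numpy).
import faiss
import numpy as np

dim = 8                                    # embedding dimension
index = faiss.IndexFlatL2(dim)             # exact L2 search index
vectors = np.random.random((100, dim)).astype("float32")
index.add(vectors)                         # store document embeddings
query = np.random.random((1, dim)).astype("float32")
distances, ids = index.search(query, 3)    # top 3 nearest neighbors
print(ids[0], distances[0])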
Elasticsearch
Specify the connection information of an Elasticsearch cluster. For information about how to create and prepare an Elasticsearch cluster, see Prepare a vector database by using Elasticsearch.
Parameter
Description
Vector Database Type
Select Elasticsearch.
Private Endpoint and Port
The private endpoint and port number of the Elasticsearch cluster. Format: http://&lt;private endpoint&gt;:&lt;port number&gt;. For information about how to obtain the private endpoint and port number of the Elasticsearch cluster, see View the basic information of a cluster.
Index Name
The name of the index. You can enter a new index name or an existing index name. If you use an existing index name, the index schema must meet the requirements of the RAG-based chatbot. For example, you can enter the name of the index that is automatically created when you deploy the RAG-based chatbot by using EAS.
Account
The logon name that you specified when you created the Elasticsearch cluster. Default logon name: elastic.
Password
The password that you configured when you created the Elasticsearch cluster. If you forget the password, see Reset the access password for an Elasticsearch cluster.
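Optionally, you can verify that the endpoint, account, and password are correct from a machine that can reach the Elasticsearch cluster before you deploy the service. The following sketch uses the requests library with HTTP basic authentication; the endpoint and password are placeholders.

# Optional connectivity check for the Elasticsearch cluster (placeholder values).
import requests

endpoint = "http://es-cn-xxxx.elasticsearch.aliyuncs.com:9200"  # private endpoint and port
resp = requests.get(endpoint, auth=("elastic", "your_password"), timeout=10)
print(resp.status_code)   # 200 indicates that the endpoint, account, and password are valid
print(resp.json())        # basic cluster information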
Hologres
Specify the connection information of a Hologres instance. To purchase a Hologres instance, see Purchase a Hologres instance.
Parameter
Description
Vector Database Type
Select Hologres.
Invocation Information
The VPC endpoint host of the Hologres instance. Go to the Instance Details page in the Hologres console. In the Network Information section, click Copy next to Select VPC and use the part of the domain name before :80 as the host.
Database Name
The name of the database in the Hologres instance. For more information about how to create a database, see Create a database.
Account
The custom account that you created. For more information, see Create a custom account. When you create the custom account, select Super Administrator (SuperUser) in the Select Member Role section.
Password
The password of the custom account that you created.
Table Name
The name of the table. You can enter a new table name or an existing table name. If you use an existing table name, the table schema must meet the requirements of the RAG-based chatbot. For example, you can enter the name of the Hologres table that is automatically created when you deploy the RAG-based chatbot by using EAS.
OpenSearch
Specify the connection information of an OpenSearch instance of Vector Search Edition. For information about how to create and prepare an OpenSearch instance, see Prepare an OpenSearch Vector Search Edition instance.
Parameter
Description
Vector Database Type
Select OpenSearch.
Endpoint
The public endpoint of the OpenSearch instance. You must first configure Internet access for the OpenSearch instance. For more information, see Prepare an OpenSearch Vector Search Edition instance.
Instance ID
Obtain the instance ID from the OpenSearch instance list.
Username
The username of the OpenSearch instance.
Password
The password of the OpenSearch instance.
Table Name
Enter the name of the index table of the OpenSearch instance. For information about how to prepare the index table, see Prepare an OpenSearch Vector Search Edition instance.
RDS PostgreSQL
Specify the connection information of the ApsaraDB RDS for PostgreSQL instance. For information about how to create and prepare an ApsaraDB RDS for PostgreSQL instance, see Prepare a vector database by using ApsaraDB RDS for PostgreSQL.
Parameter
Description
Vector Database Type
Select RDS PostgreSQL.
Host Address
The internal endpoint of the ApsaraDB RDS for PostgreSQL instance. You can log on to the ApsaraDB RDS console and view the endpoint on the Database Connection page of the instance.
Port
The port number. Default value: 5432.
Database
The name of the database. For information about how to create a database and an account, see Create a database and an account.
When you create an account, select Privileged Account for Account Type.
When you create a database, select the created privileged account from the Authorized By drop-down list.
Table Name
The name of the database table.
Account
The privileged account that you created. For more information, see Create a database and an account.
Password
The password of the privileged account that you created.
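Optionally, you can verify the connection information from a machine in the same VPC before you deploy the service. The following sketch uses the psycopg2 library; the endpoint, database, account, and password are placeholders.

# Optional connectivity check for the ApsaraDB RDS for PostgreSQL instance (placeholder values).
import psycopg2

conn = psycopg2.connect(
    host="pgm-xxxx.pg.rds.aliyuncs.com",   # internal endpoint of the instance
    port=5432,
    dbname="your_database",
    user="your_privileged_account",
    password="your_password",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())                  # a result indicates that the connection information is valid
conn.close()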
VPC Configuration
Parameter
Description
VPC
If you use Hologres, Elasticsearch, OpenSearch, or RDS PostgreSQL to build a vector database, select the VPC in which the vector database is deployed.
Note: If you use OpenSearch to build a vector database, you can select a VPC that is different from the VPC in which the RAG application resides. However, make sure that the VPC has Internet access and that the associated Elastic IP address (EIP) is added to the public IP address whitelist of the OpenSearch instance. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet and Configure the public access whitelist.
If you use Faiss to build a vector database, you do not need to configure the VPC.
vSwitch
The vSwitch in the selected VPC.
Security Group Name
The security group in the selected VPC.
Click Deploy.
When the Service Status changes to Running, the RAG-based chatbot is deployed.
Step 2: Test the chatbot through WebUI
Perform the following steps to upload your knowledge base files on the WebUI page and test the Q&A chatbot.
1. Connect to the vector database
After you deploy the RAG-based chatbot, click View Web App in the Service Type column to enter the web UI.
Configure the embedding model. The system uses the embedding model to convert text chunks into vectors.
Embedding Model Name: Four models are available. By default, the optimal model is selected.
Embedding Dimension: This parameter has a direct impact on the performance of the model. After you select an embedding model, the system automatically configures this parameter.
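The following sketch illustrates what the embedding step does: the embedding model maps each text chunk to a fixed-length vector, and the vector length is the embedding dimension that must match the schema of the vector database. The sentence-transformers package and the model name in the sketch are examples for illustration, not the models listed on the web UI.

# Illustration of text embedding (requires the sentence-transformers package).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # example model with 384-dimensional embeddings
chunks = [
    "PAI is an AI platform provided by Alibaba Cloud.",
    "EAS is the online model service of PAI.",
]
embeddings = model.encode(chunks)                 # one vector per chunk
print(embeddings.shape)                           # (2, 384): the embedding dimension is 384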
Check whether the vector database is connected.
The system automatically recognizes and applies the vector database settings that you configured when you deployed the chatbot. If you use Hologres to build the vector database, click Connect Hologres to check whether the vector database in Hologres is connected. If the connection fails, check whether the vector database is correctly configured based on the descriptions in Step 1. Then, reconnect the database.
2. Upload knowledge base files
Upload your knowledge base files. The system automatically stores the knowledge base in the PAI-RAG format to the vector database for retrieval. You can also use existing knowledge bases in the database, but the knowledge bases must meet the PAI-RAG format requirements. Otherwise, errors may occur during retrieval.
On the Upload tab, configure the chunk parameters.
The following parameters control the granularity of document chunking and specify whether to enable Q&A extraction. The effect of the chunk parameters is illustrated in the sketch after the table.
Parameter
Description
Chunk Size
The size of each chunk. Unit: bytes. Default value: 500.
Chunk Overlap
The overlap between adjacent chunks. Default value: 10.
Process with QA Extraction Model
Specifies whether to extract Q&A information. If you select Yes, the system automatically extracts questions and corresponding answers in pairs after knowledge files are uploaded. This way, more accurate answers are returned in data queries.
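The following sketch shows how Chunk Size and Chunk Overlap affect chunking. It is a simplified character-based illustration; the service performs semantic-based chunking, so the actual chunk boundaries differ.

# Simplified illustration of chunking with overlap (character-based, not semantic).
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 10) -> list:
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

doc = "PAI provides EAS for online model services. " * 40
for i, chunk in enumerate(chunk_text(doc, chunk_size=500, chunk_overlap=10)):
    print(i, len(chunk))   # each chunk is at most 500 characters and overlaps the previous chunk by 10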
On the Files tab or Directory tab, upload one or more business data files, or upload a directory that contains the business data files. Supported file types: .txt, .pdf, Excel (.xlsx or .xls), .csv, Word (.docx or .doc), Markdown, and .html. Example: rag_chatbot_test_doc.txt.
Click Upload. Before the files are uploaded, the system performs data cleansing and semantic-based chunking on the business data files. Data cleansing includes text extraction and hyperlink replacement.
3. Configure model inference parameters
On the Chat tab, configure Q&A policies.
Retrieval policies
Parameter
Description
Streaming Output
Specifies whether to return results in streaming mode. If you select Streaming Output, the results are returned in streaming mode.
Retrieval Mode
The retrieval method.
Note: In most complex scenarios, vector database-based retrieval delivers good performance. However, in some vertical fields that lack information or in scenarios that require exact matching, vector database-based retrieval may not achieve the same effect as traditional keyword-based retrieval. Keyword-based retrieval is simpler and more efficient because it calculates the keyword overlap between user queries and knowledge files. PAI provides keyword-based retrieval algorithms, such as BM25. Vector database-based retrieval and keyword-based retrieval have their own advantages and disadvantages, and combining their results can improve the overall accuracy and efficiency. The reciprocal rank fusion (RRF) algorithm obtains a total score for each file from the ranks at which the file is sorted by the different retrieval methods (see the sketch after this table). If you select Hybrid, PAI uses the RRF algorithm by default to combine the results returned by vector database-based retrieval and keyword-based retrieval.
Reranker Type
Most vector databases compromise data accuracy to provide high computing efficiency. As a result, the top K results that are returned from the vector database may not be the most relevant. You can use a rerank model to perform a higher-precision re-rank operation on the top K results that are returned from the vector database to obtain more relevant and accurate knowledge files.
Top K
The number of the most relevant results that are returned from the vector database.
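The following sketch shows how RRF combines two ranked result lists. It uses a common unweighted form of RRF, score(d) = sum of 1 / (k + rank(d)) over the retrieval methods, with the frequently used constant k = 60; the document IDs are hypothetical.

# Illustration of reciprocal rank fusion (RRF) over two ranked result lists.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_results = ["doc_a", "doc_b", "doc_c"]     # from vector database-based retrieval
keyword_results = ["doc_b", "doc_d", "doc_a"]    # from keyword-based retrieval, for example BM25
print(rrf([vector_results, keyword_results]))    # doc_b and doc_a rank highest because both methods return them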
RAG (Retrieval + LLM) policies
PAI provides various prompt policies. You can select a predefined prompt template or specify a custom prompt template for better inference results. The retrieval-augmented generation (RAG) system fills the returned results and user query into a prompt template, and then submits the prompt to the LLM.
You can also configure the following parameters in RAG (Retrieval + LLM) mode: Streaming Output, Retrieval Mode, and Reranker Type. For more information, see the Retrieval policies tab of this section.
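To make the template-filling step concrete, the following sketch fills retrieved results and a user query into a prompt template. The template text is an example of a custom prompt, not the predefined template used by the service.

# Illustration of how retrieved results and the user query are filled into a prompt template.
PROMPT_TEMPLATE = (
    "Answer the question based only on the following context.\n"
    "Context:\n{context}\n"
    "Question: {question}\n"
    "Answer:"
)

retrieved_docs = [
    "PAI is an AI platform provided by Alibaba Cloud.",
    "EAS supports one-click deployment of RAG services.",
]
question = "What is PAI?"
prompt = PROMPT_TEMPLATE.format(context="\n".join(retrieved_docs), question=question)
print(prompt)   # this prompt is what the RAG system submits to the LLM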
4. Perform model inference
Retrieval
The chatbot returns the top K relevant results from the vector database.
LLM
The chatbot uses only the LLM to generate an answer.
RAG (Retrieval + LLM)
The chatbot fills the returned results from the database and user query into a prompt template, and then submits the prompt to the LLM to generate an answer.
After you test the Q&A performance of the RAG-based chatbot on the web UI, you can call API operations provided by Platform for AI (PAI) to apply the RAG-based chatbot to your business system. For more information, see Step 3: Call API operations to perform model inference in this topic.
Step 3: Call API operations to perform model inference
Obtain the invocation information of the RAG-based chatbot.
Click the name of the RAG-based chatbot to go to the Service Details page.
In the Basic Information section, click View Endpoint Information.
On the Public Endpoint tab of the Invocation Method dialog box, obtain the service endpoint and token.
Connect to the vector database through the WebUI and upload knowledge base files.
You can also write your knowledge base directly to the vector database based on the structure of the generated table, provided that the structure conforms to the PAI-RAG format.
Call the service through APIs.
PAI allows you to call the RAG-based chatbot by using the following API operations in different query modes: service/query/retrieval in retrieval mode, service/query/llm in LLM mode, and service/query in RAG mode. Sample code:
cURL command
Initiate a single-round conversation request
Method 1: Call the service/query/retrieval operation.
curl -X 'POST' '<service_url>service/query/retrieval' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}' # Replace <service_url> and <service_token> with the service endpoint and service token that you obtained in Step 1.
Method 2: Call the service/query/llm operation.
curl -X 'POST' '<service_url>service/query/llm' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}' # Replace <service_url> and <service_token> with the service endpoint and service token that you obtained in Step 1.
You can add other adjustable inference parameters, such as {"question":"What is PAI?", "temperature": 0.9}.
Method 3: Call the service/query operation.
curl -X 'POST' '<service_url>service/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}' # Replace <service_url> and <service_token> with the service endpoint and service token that you obtained in Step 1.
You can add other adjustable inference parameters, such as {"question":"What is PAI?", "temperature": 0.9}.
Initiate a multi-round conversational search request
You can initiate a multi-round conversational search request only in RAG and LLM query modes. The following sample code shows an example on how to initiate a multi-round conversational search request in RAG query mode:
# Send the request.
curl -X 'POST' '<service_url>service/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}'

# Provide the session ID returned for the request. This ID uniquely identifies a conversation in the conversation history. After the session ID is provided, the corresponding conversation is stored and is automatically included in subsequent requests to call an LLM.
curl -X 'POST' '<service_url>service/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What are the benefits of PAI?","session_id": "ed7a80e2e20442eab****"}'

# Provide the chat_history parameter, which contains the conversation history between you and the chatbot. The parameter value is a list in which each element indicates a single round of conversation in the {"user":"Inputs","bot":"Outputs"} format. Multiple conversations are sorted in chronological order.
curl -X 'POST' '<service_url>service/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question":"What are the features of PAI?", "chat_history": [{"user":"What is PAI", "bot":"PAI is an AI platform provided by Alibaba Cloud..."}]}'

# If you provide both the session_id and chat_history parameters, the conversation history is appended to the conversation that corresponds to the specified session ID.
curl -X 'POST' '<service_url>service/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question":"What are the features of PAI?", "chat_history": [{"user":"What is PAI", "bot":"PAI is an AI platform provided by Alibaba Cloud..."}], "session_id": "1702ffxxad3xxx6fxxx97daf7c"}'
Python
The following sample code shows an example on how to initiate a single-round conversational search request:
import requests

EAS_URL = 'http://xxxx.****.cn-beijing.pai-eas.aliyuncs.com'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': 'MDA5NmJkNzkyMGM1Zj****YzM4M2YwMDUzZTdiZmI5YzljYjZmNA==',
}

def test_post_api_query_llm():
    url = EAS_URL + '/service/query/llm'
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"======= Question =======\n {data['question']}")
    print(f"======= Answer =======\n {ans['answer']} \n\n")

def test_post_api_query_retrieval():
    url = EAS_URL + '/service/query/retrieval'
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"======= Question =======\n {data['question']}")
    print(f"======= Answer =======\n {ans['docs']}\n\n")

def test_post_api_query_rag():
    url = EAS_URL + '/service/query'
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"======= Question =======\n {data['question']}")
    print(f"======= Answer =======\n {ans['answer']}")
    print(f"======= Retrieved Docs =======\n {ans['docs']}\n\n")

# LLM
test_post_api_query_llm()
# Retrieval
test_post_api_query_retrieval()
# RAG (Retrieval + LLM)
test_post_api_query_rag()
Set the EAS_URL parameter to the endpoint of the RAG-based chatbot. Make sure to remove the forward slash (/) at the end of the endpoint. Set the Authorization parameter to the token of the RAG-based chatbot.
Initiate a multi-round conversational search request
You can initiate a multi-round conversational search request only in RAG (Retrieval + LLM) and LLM query modes. Sample code:
import requests

EAS_URL = 'http://xxxx.****.cn-beijing.pai-eas.aliyuncs.com'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': 'MDA5NmJkN****jNlMDgzYzM4M2YwMDUzZTdiZmI5YzljYjZmNA==',
}

def test_post_api_query_llm_with_chat_history():
    url = EAS_URL + '/service/query/llm'
    # Round 1 query
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"=======Round 1: Question =======\n {data['question']}")
    print(f"=======Round 1: Answer =======\n {ans['answer']} session_id: {ans['session_id']} \n")
    # Round 2 query
    data_2 = {
        "question": "What are the benefits of PAI?",
        "session_id": ans['session_id']
    }
    response_2 = requests.post(url, headers=headers, json=data_2)
    if response_2.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response_2.status_code}')
    ans_2 = dict(response_2.json())
    print(f"=======Round 2: Question =======\n {data_2['question']}")
    print(f"=======Round 2: Answer =======\n {ans_2['answer']} session_id: {ans_2['session_id']} \n\n")

def test_post_api_query_rag_with_chat_history():
    url = EAS_URL + '/service/query'
    # Round 1 query
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"=======Round 1: Question =======\n {data['question']}")
    print(f"=======Round 1: Answer =======\n {ans['answer']} session_id: {ans['session_id']}")
    print(f"=======Round 1: Retrieved Docs =======\n {ans['docs']}\n")
    # Round 2 query
    data = {
        "question": "What are the features of PAI?",
        "session_id": ans['session_id']
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"=======Round 2: Question =======\n {data['question']}")
    print(f"=======Round 2: Answer =======\n {ans['answer']} session_id: {ans['session_id']}")
    print(f"=======Round 2: Retrieved Docs =======\n {ans['docs']}")

# LLM
test_post_api_query_llm_with_chat_history()
# RAG (Retrieval + LLM)
test_post_api_query_rag_with_chat_history()
Set the EAS_URL parameter to the endpoint of the RAG-based chatbot. Make sure to remove the forward slash (/) at the end of the endpoint. Set the Authorization parameter to the token of the RAG-based chatbot.
References
You can also use EAS to deploy the following items:
You can deploy an LLM application that can be called by using the web UI or API operations. After the LLM application is deployed, use the LangChain framework to integrate enterprise knowledge bases into the LLM application to implement intelligent Q&A and automation features. For more information, see Quickly deploy open source LLMs in EAS.
You can deploy an AI video generation model service by using ComfyUI and Stable Video Diffusion models. This helps you complete tasks such as short video generation and animation on social media platforms. For more information, see Use ComfyUI to deploy an AI video generation model service.
You can deploy a model service based on Stable Diffusion WebUI by configuring a few parameters. For more information, see Use Stable Diffusion web UI to deploy an AI painting service.
FAQ
How do I disable the with chat history feature of the RAG-based chatbot?
On the web UI page of the RAG-based chatbot, clear the With Chat History check box.