Elastic Algorithm Service (EAS) provides simplified deployment methods for different scenarios. By configuring a small number of parameters, you can deploy a Retrieval-Augmented Generation (RAG)-based large language model (LLM) chatbot, which significantly shortens the service deployment time. During inference, the chatbot retrieves relevant information from the knowledge base and combines it with the answers generated by the LLM to provide accurate and informative responses, which significantly improves the quality of Q&A and the overall performance. The chatbot is suitable for Q&A, summarization, and other natural language processing (NLP) tasks that rely on specific knowledge bases. This article describes how to deploy a RAG-based LLM chatbot and how to perform model inference.
LLM applications have limitations in generating accurate and up-to-date responses. Therefore, they are not suitable for scenarios that require precise information, such as customer service or Q&A scenarios. The RAG technique is used to address these issues and enhance the performance of LLM applications, which significantly improves the quality of Q&A, summarization, and other NLP tasks that rely on specific knowledge bases.
RAG improves answer accuracy and enriches answers with more information by combining LLM applications, such as Tongyi Qianwen, with information retrieval components. When a query is initiated, RAG uses the information retrieval component to find documents or information fragments that are related to the query in the knowledge base, and integrates the retrieved content together with the original query into the LLM application. The LLM application then uses its comprehension and generation capabilities to produce factual answers based on the latest information. You do not need to retrain the LLM application.
The chatbot that is deployed in EAS integrates LLM applications with RAG to overcome the limitations of LLM applications in terms of accuracy and timeliness. This chatbot provides accurate and informative answers in various Q&A scenarios and helps improve the overall performance and user experience of NLP tasks.
Note: If you use Facebook AI Similarity Search (Faiss) to build a vector database, the preceding prerequisites are not required.
The vector database and EAS must be deployed in the same region.
You can use one of the following services to build a vector database: Faiss, Elasticsearch, Hologres, or AnalyticDB for PostgreSQL. When you build the vector database, save the parameter configurations that are required to connect to the vector database.
Faiss streamlines the process of building an on-premises vector database. You do not need to purchase or activate the service.
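For reference, the following minimal sketch shows how a local Faiss flat index stores and searches embeddings. The embedding dimension, vectors, and top-k value are illustrative placeholders, not values that EAS uses.

import numpy as np
import faiss

dimension = 768                               # assumed embedding size of the text-embedding model
index = faiss.IndexFlatL2(dimension)          # exact L2 search; no training step required

doc_embeddings = np.random.rand(100, dimension).astype("float32")  # stand-in for document-chunk embeddings
index.add(doc_embeddings)

query_embedding = np.random.rand(1, dimension).astype("float32")
distances, ids = index.search(query_embedding, 3)                  # retrieve the 3 nearest chunks
print(ids, distances)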
1. Create an Alibaba Cloud Elasticsearch cluster. For more information, see Create an Alibaba Cloud Elasticsearch cluster.
2. Click the name of the instance to go to the Basic Information page. Copy the values of the Internal Endpoint and Internal Port parameters and save them to your on-premises machine.
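Optionally, you can verify that the saved endpoint and port are reachable from a machine in the same VPC. The following sketch assumes the default elastic account and uses placeholder values.

import requests

# Replace with the internal endpoint and port copied from the Basic Information page.
ES_ENDPOINT = "http://es-cn-xxxx.elasticsearch.aliyuncs.com:9200"
response = requests.get(ES_ENDPOINT, auth=("elastic", "your_password"), timeout=10)
print(response.status_code, response.json())   # 200 and cluster information indicate a working connection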
1. Purchase a Hologres instance and create a database. For more information, see Purchase a Hologres instance and Create a database. You must save the name of the database to your on-premises machine.
2. View the invocation information in the Hologres console. On the instance details page, copy the endpoint of the instance, remove the :80 suffix, and save the endpoint to your on-premises machine.
3. In the left-side navigation pane, click Account Management to create a custom account. Save the account and password to your on-premises machine. This information is used for subsequent connections to the Hologres instance. For information about how to create a custom account, see the "Create a custom account" section in Manage users.
Set the Select Member Role parameter to Super Administrator (SuperUser).
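Because Hologres is compatible with the PostgreSQL protocol, you can optionally confirm the saved endpoint, database, and custom account with a short psycopg2 connection test. All values below are placeholders.

import psycopg2

# Endpoint without the :80 suffix; the VPC endpoint of Hologres listens on port 80.
conn = psycopg2.connect(
    host="hgprecn-cn-xxxx-cn-beijing-vpc.hologres.aliyuncs.com",
    port=80,
    dbname="your_database",
    user="your_custom_account",
    password="your_password",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()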
1. Create an instance in the AnalyticDB for PostgreSQL console. For more information, see Create an instance.
Set the Vector Engine Optimization parameter to Enabled.
2. Click the name of the instance to go to the Basic Information page. In the Database Connection Information section, copy the internal and public endpoints of the instance and save them to your on-premises machine.
Note: If the instance resides in the same VPC as EAS, you need only the internal endpoint.
3. Create a database account. Save the database account and password to your on-premises machine. This information is used for subsequent connections to the database. For more information, see Create a database account.
4. Configure a whitelist that consists of trusted IP addresses. For more information, see Configure an IP address whitelist.
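You can check the AnalyticDB for PostgreSQL configuration in the same way, using the endpoint, account, and password that you saved in the preceding steps (placeholders below). The connection succeeds only if the client IP address is in the whitelist.

import psycopg2

conn = psycopg2.connect(
    host="gp-xxxx.gpdb.rds.aliyuncs.com",   # internal or public endpoint from the Basic Information page
    port=5432,
    dbname="your_database",
    user="your_database_account",
    password="your_password",
)
print("connected:", conn.closed == 0)
conn.close()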
1. Go to the EAS-Online Model Services page.
2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section, click RAG-based Smart Dialogue Deployment.
3. On the RAG-based LLM Chatbot Deployment page, configure the parameters. The following tables describe the key parameters in different sections.
| Parameter | Description |
| --- | --- |
| Service Name | Enter the name of the service. |
| Model Source | Valid values: Open Source Model and Custom Fine-tuned Model. |
| Model Type | Select a model type based on your business requirements. If you set Model Source to Custom Fine-tuned Model, you must configure the parameter quantity and precision for the model type. |
| Model Settings | If you set Model Source to Custom Fine-tuned Model, you must configure the path in which the fine-tuned model file is stored. Note: Make sure that the model file format is compatible with Hugging Face Transformers. Valid values: • Mount OSS: Select the OSS path in which the fine-tuned model file is stored. • Mount NAS: Select the NAS file system in which the fine-tuned model file is stored and the source path of the NAS file system. |
| Parameter | Description |
| --- | --- |
| Resource Configuration | • If you set Model Source to Open Source Model, the system automatically selects a default instance type based on the selected model type. • If you set Model Source to Custom Fine-tuned Model, you must select an instance type that matches the model. For more information, see Deploy LLM applications in EAS. |
| Inference Acceleration | Inference acceleration can be enabled for the Qwen, Llama2, ChatGLM, or Baichuan2 models that are deployed on A10 or GU30 instances. Valid values: • BladeLLM Inference Acceleration: The BladeLLM inference acceleration engine ensures high concurrency and low latency. You can use BladeLLM to accelerate LLM inference in a cost-effective manner. • Open-source vLLM Inference Acceleration |
Select a service to build a vector database based on your business requirements.
| Parameter | Description |
| --- | --- |
| Vector Database Type | Select FAISS. |
| Database Folder Name | Enter the name of the database folder. Example: /code. |
| Index Folder Name | Enter the name of the index folder. Example: faiss_index. |
| Parameter | Description |
| --- | --- |
| Vector Database Type | Select ElasticSearch. |
| Private Endpoint and Port | Enter the private endpoint and port number that you obtained in Step 1. Specify the parameter in the http://<private endpoint>:<port number> format. |
| Index Name | Enter the name of the index. |
| Account | Enter the logon name that you configured when you created the Elasticsearch cluster in Step 1. |
| Password | Enter the logon password that you configured when you created the Elasticsearch cluster in Step 1. |
| Parameter | Description |
| --- | --- |
| Vector Database Type | Select Hologres. |
| Invocation Information | Enter the Hologres invocation information that you obtained in Step 1. |
| Database | Enter the name of the database that you created in Step 1. |
| Account | Enter the custom account that you created in Step 1. |
| Password | Enter the password of the custom account that you created in Step 1. |
| Database Table | Enter the name of the database table. Example: test_table. |
| Parameter | Description |
| --- | --- |
| Vector Database Type | Select AnalyticDB. |
| Database Endpoint | Enter the public endpoint of the database that you obtained in Step 1. Note: If the instance resides in the same VPC as EAS, you need only the internal endpoint. |
| Database Name | To view the name of the database, log on to the database. For more information, see Connect to a database. |
| Account | Enter the database account that you created in Step 1. |
| Password | Enter the password of the database account that you created in Step 1. |
| Database Folder Name | Enter the name of the database folder. Example: test_db. |
| Delete Table | Select a policy for processing an existing database table that has the same name. Valid values: • Delete: deletes the existing table that has the same name and creates a new table. If no table with the same name exists, a new table is directly created. • Do not delete: retains the existing table that has the same name and appends the data to it. |
| Parameter | Description |
| --- | --- |
| VPC | • If you use Hologres, AnalyticDB for PostgreSQL, Elasticsearch, or Milvus to build the vector database, select the VPC in which the vector database is deployed. • If you use Faiss to build the vector database, you do not need to configure the VPC. |
| vSwitch | Select a vSwitch in the preceding VPC. Not required if you use Faiss. |
| Security Group Name | Select a security group in the preceding VPC. Not required if you use Faiss. |
4. Click Deploy.
If the value in the Service Status column changes to Running, the RAG-based chatbot is deployed.
This section describes how to debug the RAG-based chatbot on the web UI. After you test the Q&A performance of the RAG-based chatbot on the web UI, you can call API operations provided by Platform for AI (PAI) to apply the RAG-based chatbot to your business system. For more information, see Step 4: Call API operations to perform model inference in this article.
1. After you deploy the RAG-based chatbot, click View Web App in the Service Type column to enter the web UI.
2. Configure the machine learning model.
3. Check whether the vector database is connected.
The system automatically recognizes and applies the vector database settings that are configured when you deploy the chatbot. The settings cannot be modified. If you use Hologres to build the vector database, click Connect Hologres to check whether the vector database in Hologres is connected.
On the Upload tab, upload the specified business data files.
1. Configure semantic-based chunking parameters.
| File type | Example | Description |
| --- | --- | --- |
| text | rag_chatbot_test_doc.txt | Configure the following parameters to control the granularity at which files are split into chunks (see the sketch after this table). • Chunk Size: the size of each chunk. Default value: 200. Unit: bytes. • Chunk Overlap: the portion of overlap between adjacent chunks. Default value: 0. |
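The following minimal sketch, which is not the EAS implementation, illustrates how Chunk Size and Chunk Overlap interact when a text file is split into chunks.

def split_into_chunks(text, chunk_size=200, chunk_overlap=0):
    # Each chunk holds up to chunk_size characters; consecutive chunks share chunk_overlap characters.
    step = max(chunk_size - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

with open("rag_chatbot_test_doc.txt", encoding="utf-8") as f:
    chunks = split_into_chunks(f.read(), chunk_size=200, chunk_overlap=20)
print(len(chunks), chunks[0][:50])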
On the Chat tab, configure Q&A policies for retrieval-based queries.
| Parameter | Description |
| --- | --- |
| Top K | The number of the most relevant results that are returned from the vector database. |
| Similarity Distance Threshold | A smaller value indicates a higher level of similarity between vectors. If the similarity distance between two vectors is less than the threshold, the two vectors are considered similar. We recommend that you retain the default value of this parameter. |
| Re-Rank Model | Most vector databases trade some accuracy for high computing efficiency. As a result, the top K results that are returned from the vector database may not be the most relevant. In this case, you can use the open-source BAAI/bge-reranker-base or BAAI/bge-reranker-large model to perform a higher-precision re-rank operation on the top K results returned from the vector database and obtain more relevant and accurate knowledge files. |
| Keyword Retrieval | Valid values: • Embedding Only: Only vector database-based retrieval is used. • Keyword Ensemble: Vector database-based retrieval is combined with keyword-based retrieval. Note: In most complex scenarios, vector database-based retrieval delivers good performance. However, in vertical fields in which corpora are scarce or in scenarios that require exact matching, it may not perform as well as traditional keyword-based sparse retrieval, which is simpler and more efficient because it calculates the keyword overlap between user queries and knowledge files. PAI provides keyword-based retrieval algorithms, such as BM25, for this purpose. Vector database-based retrieval and keyword-based retrieval have their own advantages and disadvantages, and combining their results can improve the overall accuracy and efficiency. The reciprocal rank fusion (RRF) algorithm calculates a weighted sum over the ranks at which a file appears in the results of the different retrieval methods to obtain a total score (see the sketch after this table). If you select Keyword Ensemble, PAI uses the RRF algorithm by default to combine the results of vector database-based retrieval and keyword-based retrieval. |
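The following sketch shows the standard form of reciprocal rank fusion. It is illustrative only and may not use the exact weighting that PAI applies.

def rrf_fuse(ranked_lists, k=60):
    # Each document scores 1 / (k + rank) in every list it appears in; scores are summed across lists.
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]    # top results from vector database-based retrieval
keyword_hits = ["doc1", "doc5", "doc3"]   # top results from BM25 keyword-based retrieval
print(rrf_fuse([vector_hits, keyword_hits]))   # documents found by both methods rank first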
On the Chat tab, configure Q&A policies for RAG-based queries. The RAG-based chatbot combines the retrieval method with LLM applications. PAI provides various prompt policies. You can select a predefined prompt template or specify a custom prompt template for better inference results.
• Retrieval: The chatbot returns only the top K relevant results from the vector database.
• LLM: The chatbot uses only the LLM application to generate an answer.
• RAG (Retrieval + LLM): The chatbot inserts the results returned from the vector database and the query into the selected prompt template, and sends the assembled prompt to the LLM application to generate an answer. A sketch of this prompt assembly follows.
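The following sketch shows the general idea of assembling a RAG prompt. The template text and retrieved chunks are illustrative placeholders, not the predefined templates that PAI provides.

PROMPT_TEMPLATE = """Answer the question based on the following context.

Context:
{context}

Question: {question}
Answer:"""

def build_rag_prompt(question, retrieved_chunks):
    # The retrieved chunks become the context section; the user query fills the question slot.
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

chunks = ["PAI is the Platform for AI of Alibaba Cloud.", "EAS deploys models as online services."]
print(build_rag_prompt("What is PAI?", chunks))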
1. Obtain the invocation information of the RAG-based chatbot.
2. Connect to the vector database on the web UI and upload business data files. For more information, see 1. Configure the RAG-based chatbot and 2. Upload specified business data files in this article.
3. Call the RAG-based chatbot by using APIs.
PAI allows you to call the RAG-based chatbot by using the chat/retrieval, chat/llm, or chat/rag API. Sample code:
chat/retrieval
curl -X 'POST' '<service_url>chat/retrieval' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?","score_threshold": 900, "vector_topk": 3}'
# Replace <service_url> with the service endpoint that you obtained in Step 1, and <service_token> with the service token that you obtained in Step 1.
chat/llm
curl -X 'POST' '<service_url>chat/llm' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}'
# Replace <service_url> with the service endpoint that you obtained in Step 1, and <service_token> with the service token that you obtained in Step 1.
You can add other adjustable inference parameters, such as {"question": "What is PAI?", "topk": 3, "topp": 0.8, "temperature": 0.9}.
chat/rag
curl -X 'POST' '<service_url>chat/rag' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?","score_threshold": 900, "vector_topk": 3}'
# Replace <service_url> with the service endpoint that you obtained in Step 1, and <service_token> with the service token that you obtained in Step 1.
You can add other adjustable inference parameters, such as {"question": "What is PAI?", "score_threshold": 900, "vector_topk": 3, "topk": 3, "topp": 0.8, "temperature": 0.9}.
import requests

EAS_URL = 'http://chatbot-langchain.xx.cn-beijing.pai-eas.aliyuncs.com'

def test_post_api_chat():
    url = EAS_URL + '/chat/retrieval'
    # url = EAS_URL + '/chat/llm'
    # url = EAS_URL + '/chat/rag'
    headers = {
        'accept': 'application/json',
        'Content-Type': 'application/json',
        'Authorization': 'xxxxx==',
    }
    data = {
        "question": "What is PAI?", "score_threshold": 900, "vector_topk": 3
    }
    # The chat/llm and chat/rag APIs support other adjustable inference parameters.
    """
    data = {
        "question": "What is PAI?", "topk": 3, "topp": 0.8, "temperature": 0.9
    }
    """
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = response.json()
    return ans['response']

print(test_post_api_chat())
Set EAS_URL to the endpoint of the RAG-based chatbot and Authorization to the token of the RAG-based chatbot.