This topic describes how to deploy a web application based on the open source model Tongyi Qianwen (Qwen) and perform model inference on the web page or by using API operations in Elastic Algorithm Service (EAS) of Platform for AI (PAI).
Background information
Tongyi Qianwen-7b (Qwen-7B) is a 7 billion-parameter model of the Tongyi Qianwen foundation model series that is developed by Alibaba Cloud. Qwen-7B is a large language model (LLM) that is based on Transformer and trained on ultra-large-scale pre-training data. The pre-training data covers a wide range of data types, including a large number of texts, professional books, and code. In addition, the LLM AI assistant Qwen-7B-Chat is developed by using the alignment mechanism based on Qwen-7B.
Prerequisites
EAS is activated. The default workspace and pay-as-you-go resources are created. For more information, see Activate PAI and create a default workspace.
Deploy Qwen-7B
Perform the following steps to deploy Qwen-7B as an AI-powered web application.
Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
On the Custom Deployment page, configure the required parameters. The following table describes key parameters.
Parameter
Description
Service Name
The name of the service. In this example, the service name qwen_demo is specified.
Deployment Mode
Select Deploy Web App by Using Image.
Image Configuration
Click PAI Image, select modelscope-inference from the image drop-down list, and then select 1.8.1 from the image version drop-down list.
Command
python app.py
Port Number
8000
Environment Variables
Click Add and configure the following environment variables:
MODEL_ID: qwen/Qwen-7B-Chat
TASK: chat
REVISION: v1.0.5
For information about the related configurations, see the description of Qwen-7B-Chat on the ModelScope website.
Resource Type
Select Public Resources.
Deployment Resources
Click GPU and select the ml.gu7i.c16m60.1-gu30 instance type.
NoteIn this example, the training requires an instance of the GPU type that has at least 20 GB of memory. We recommend that you use ml.gu7i.c16m60.1-gu30 to reduce costs.
Additional System Disk
Additional System Disk: 100. Unit: GB.
Click Deploy. Go to the Elastic Algorithm Service (EAS) page. When the Service Status changes to Running, the model is deployed.
NoteIn most cases, deployment requires approximately 5 minutes to complete. The amount of time that is required to complete a deployment varies based on the resource availability, service load, and configuration.
Perform model inference
After you deploy the model, you can perform model inference by using different methods.
Perform model inference on the web UI
Find the service that you want to view and click View Web App in the Service Type column.
Perform model inference on the web UI.
Perform model inference by using online debugging
Click Online Debugging in the Actions column of the service that you want to view. The Online Debugging tab appears.
In the Body section, specify the request in the JSON format and click Send Request. The response is returned in the Debugging Information section on the right side.
NoteIn this example, the debugging information is in the list format. The
input
field is the input content, and thehistory
field is the history dialogue. The body is a list that contains two sections. The first section is the question, and the second section is the answer to the question.You can start the inference by entering a request without the
history
field. Example:{"input": "Where is the provincial capital of Zhejiang?"}
The service returns the result that contains the
history
field. Example:Status Code: 200 Content-Type: application/json Date: Mon, 14 Aug 2023 12:01:45 GMT Server: envoy Vary: Accept-Encoding X-Envoy-Upstream-Service-Time: 511 Body: {"response":"The provincial capital of Zhejiang is Hangzhou. ","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."]]}
You can include the history field in the following request to perform a continuous conversation. Example:
{"input": "What about Jiangsu?", "history": [["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."]]}
The service returns the result. Example:
Status Code: 200 Content-Type: application/json Date: Mon, 14 Aug 2023 12:01:23 GMT Server: envoy Vary: Accept-Encoding X-Envoy-Upstream-Service-Time: 522 Body: {"response":"The provincial capital of Jiangsu is Nanjing.","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."],[ "What about Jiangsu?","The provincial capital of Jiangsu is Nanjing."]]}
Perform model inference by using APIs
You can call the service by calling API operations.
On the service details page, click View Endpoint Information. In the Invocation Method dialog box, obtain the values of the Public Endpoint and Token parameters.
Call the service based on the information that you obtained in the terminal. Example:
curl -d '{"input": "What about Jiangsu?", "history": [["Where is the provincial capital of Zhejiang?", "The provincial capital of Zhejiang is Hangzhou."]]}' -H "Authorization: xxx" http://xxxx.com
The service returns the result. Example:
{"response":"The provincial capital of Jiangsu is Nanjing.","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."],["What about Jiangsu?","The provincial capital of Jiangsu is Nanjing."]]}
Send an HTTP request to the service based on your business requirements. For more information about debugging, refer to the SDK that is provided by PAI in the Deploy inference services topic. Sample Python code:
import requests
import json
data = {"input": "Who are you?"}
response = requests.post(url='http://qwen-demo.16623xxxxx.cn-hangzhou.pai-eas.aliyuncs.com/',
headers={"Authorization": "yourtoken"},
data=json.dumps(data))
print(response.text)
data = {"input": "What can you do?", "history": json.load (response.text)["history"]}
response = requests.post(url='http://qwen-demo.16623xxxxx.cn-hangzhou.pai-eas.aliyuncs.com/',
headers={"Authorization": "yourtoken"},
data=json.dumps(data))
print(response.text)
Perform model inference in streaming mode
On the service details page, click View Endpoint Information. In the Invocation Method dialog box, obtain the values of the Public Endpoint and Token parameters.
In the terminal, run the following Python code to send a streaming request based on the information that you obtained.
#encoding=utf-8 from websockets.sync.client import connect import os import platform def clear_screen(): if platform.system() == "Windows": os.system("cls") else: os.system("clear") def print_history(history): print("Welcome to the Qwen-7B model. Start the conversation by entering a content. Press clear to clear the conversation history and stop to terminate the program.") for pair in history: print(f"\nUser: {pair[0]}\nQwen-7B: {pair[1]}") def main(): history, response = [], '' clear_screen() print_history(history) with connect("<service_url>", additional_headers={"Authorization": "<token>"}) as websocket: while True: query = input("\nUser: ") if query.strip() == "stop": break websocket.send(query) while True: msg = websocket.recv() if msg == '<EOS>': break clear_screen() print_history(history) print(f"\nUser: {query}") print("\nQwen-7B: ", end="") print(msg) response = msg history.append((query, response)) if __name__ == "__main__": main()
Replace <service_url> with the endpoint that you obtained in Step 1 and replace http in the endpoint with ws.
Replace <token> with the service token that you obtained in Step 1.
References
For more information about EAS, see Overview of online model services EAS.