This topic describes how to deploy a web application based on the open source model Tongyi Qianwen (Qwen) and perform model inference on the web page or by using API operations in Elastic Algorithm Service (EAS) of Platform for AI (PAI).
Background information
Tongyi Qianwen-7B (Qwen-7B) is a 7-billion-parameter model in the Tongyi Qianwen foundation model series developed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model (LLM) trained on ultra-large-scale pre-training data. The pre-training data covers a wide range of data types, including large volumes of text, professional books, and code. In addition, Qwen-7B-Chat is an LLM-based AI assistant that is built on Qwen-7B by using the alignment mechanism.
Deploy Qwen-7B
Perform the following steps to deploy Qwen-7B as an AI-powered web application.
Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
On the Custom Deployment page, configure the required parameters. The following table describes key parameters.
Parameter | Description |
Service Name | The name of the service. In this example, the service name qwen_demo is specified. |
Deployment Mode | Select Deploy Web App by Using Image. |
Image Configuration | Click PAI Image, select modelscope-inference from the image drop-down list, and then select 1.8.1 from the image version drop-down list. |
Command | python app.py |
Port Number | 8000 |
Environment Variables | Click Add and configure the following environment variables: MODEL_ID = qwen/Qwen-7B-Chat, TASK = chat, and REVISION = v1.0.5. For information about the related configurations, see the description of Qwen-7B-Chat on the ModelScope website. |
Resource Type | Select Public Resources. |
Deployment Resources | Click GPU and select the ml.gu7i.c16m60.1-gu30 instance type. Note: In this example, the deployment requires a GPU-accelerated instance that has at least 20 GB of memory. We recommend that you use ml.gu7i.c16m60.1-gu30 to reduce costs. |
Additional System Disk | Set the value to 100. Unit: GB. |
Click Deploy. Go to the Elastic Algorithm Service (EAS) page. When the Service Status changes to Running, the model is deployed.
Note
In most cases, deployment requires approximately 5 minutes to complete. The amount of time that is required to complete a deployment varies based on the resource availability, service load, and configuration.
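The values from the parameter table can be kept in one place and checked before you fill in the console form. The following is a minimal sketch only: the dictionary layout, the key names, and the missing_env_vars helper are hypothetical and are not an EAS configuration schema or API.

```python
# Values copied from the parameter table above. The dictionary layout and
# key names are hypothetical and are not an EAS configuration schema.
DEPLOYMENT_SETTINGS = {
    "service_name": "qwen_demo",
    "image": "modelscope-inference",
    "image_version": "1.8.1",
    "command": "python app.py",
    "port": 8000,
    "environment": {
        "MODEL_ID": "qwen/Qwen-7B-Chat",
        "TASK": "chat",
        "REVISION": "v1.0.5",
    },
    "instance_type": "ml.gu7i.c16m60.1-gu30",
    "additional_system_disk_gb": 100,
}

# The three environment variables that this topic configures.
REQUIRED_ENV_VARS = ("MODEL_ID", "TASK", "REVISION")


def missing_env_vars(settings):
    """Return the required environment variables that are absent or empty."""
    env = settings.get("environment", {})
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


# An empty list means that all three variables are present.
print(missing_env_vars(DEPLOYMENT_SETTINGS))
```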
Perform model inference
After you deploy the model, you can perform model inference by using different methods.
Perform model inference on the web UI
Find the service that you deployed and click View Web App in the Service Type column.
On the web UI that appears, enter your question to perform model inference.
Perform model inference by using online debugging
Click Online Debugging in the Actions column of the service that you want to view. The Online Debugging tab appears.
In the Body section, specify the request in the JSON format and click Send Request. The response is returned in the Debugging Information section on the right side.
Note
In this example, the input field is the input content, and the history field is the conversation history. The history field is in the list format. Each entry in the list contains two elements: the first element is the question, and the second element is the answer to the question.
You can start the inference by entering a request without the history field. Example:
{"input": "Where is the provincial capital of Zhejiang?"}
The service returns the result that contains the history field. Example:
Status Code: 200
Content-Type: application/json
Date: Mon, 14 Aug 2023 12:01:45 GMT
Server: envoy
Vary: Accept-Encoding
X-Envoy-Upstream-Service-Time: 511
Body: {"response":"The provincial capital of Zhejiang is Hangzhou. ","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."]]}
To continue the conversation, include the history field that is returned in the previous response in the next request. Example:
{"input": "What about Jiangsu?", "history": [["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."]]}
The service returns the result. Example:
Status Code: 200
Content-Type: application/json
Date: Mon, 14 Aug 2023 12:01:23 GMT
Server: envoy
Vary: Accept-Encoding
X-Envoy-Upstream-Service-Time: 522
Body: {"response":"The provincial capital of Jiangsu is Nanjing.","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."],[ "What about Jiangsu?","The provincial capital of Jiangsu is Nanjing."]]}
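The multi-turn pattern shown above, in which the history value of a response is copied into the next request, can be sketched in Python. This is an illustrative sketch only: the function name build_followup_request is hypothetical, and the field names input, history, and response are taken from the examples in this topic.

```python
import json


def build_followup_request(previous_body, user_input):
    """Build the next JSON request body from the previous response body."""
    previous = json.loads(previous_body)
    request = {
        "input": user_input,
        # Reuse the history returned by the service. Use an empty list if
        # the previous response does not contain the field.
        "history": previous.get("history", []),
    }
    return json.dumps(request, ensure_ascii=False)


# The response body of the first example request in this topic.
first_response = (
    '{"response":"The provincial capital of Zhejiang is Hangzhou. ",'
    '"history":[["Where is the provincial capital of Zhejiang?",'
    '"The provincial capital of Zhejiang is Hangzhou."]]}'
)

# Prints a request body that you can paste into the Body section.
print(build_followup_request(first_response, "What about Jiangsu?"))
```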
Perform model inference by using APIs
You can call the service by calling API operations.
On the service details page, click View Endpoint Information. In the Invocation Method dialog box, obtain the values of the Public Endpoint and Token parameters.

In the terminal, call the service by using the information that you obtained. Replace xxx with the token and http://xxxx.com with the public endpoint. Example:
curl -d '{"input": "What about Jiangsu?", "history": [["Where is the provincial capital of Zhejiang?", "The provincial capital of Zhejiang is Hangzhou."]]}' -H "Authorization: xxx" http://xxxx.com
The service returns the result. Example:
{"response":"The provincial capital of Jiangsu is Nanjing.","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."],["What about Jiangsu?","The provincial capital of Jiangsu is Nanjing."]]}
Send an HTTP request to the service based on your business requirements. For more information about debugging, refer to the SDK that is provided by PAI in the Deploy inference services topic. Sample Python code:
import json

import requests

# Replace the URL and the token with the Public Endpoint and Token values of your service.
url = 'http://qwen-demo.16623xxxxx.cn-hangzhou.pai-eas.aliyuncs.com/'
headers = {"Authorization": "yourtoken"}

# First turn: send a question without the history field.
data = {"input": "Who are you?"}
response = requests.post(url=url, headers=headers, data=json.dumps(data))
print(response.text)

# Second turn: pass the history that is returned in the first response.
data = {"input": "What can you do?", "history": json.loads(response.text)["history"]}
response = requests.post(url=url, headers=headers, data=json.dumps(data))
print(response.text)
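If you call the service in more than one place, you may want to wrap the request in a small helper. The following sketch uses the urllib module of the Python standard library instead of requests so that it has no third-party dependencies. The chat function, its parameters, and the opener argument (added only so that the helper can be exercised without a live service) are hypothetical, and the sketch assumes that the response always contains the response and history fields shown in the preceding examples.

```python
import json
import urllib.request


def chat(url, token, user_input, history=None, opener=urllib.request.urlopen):
    """Send one turn to the service and return (answer, updated history)."""
    payload = {"input": user_input}
    if history:
        # Include the earlier turns only when a conversation is in progress.
        payload["history"] = history
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": token},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for non-2xx status codes.
    with opener(request, timeout=60) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"], body["history"]
```

A conversation then becomes a sequence of calls that pass the returned history forward, for example `answer, history = chat(url, token, "Who are you?")` followed by `answer, history = chat(url, token, "What can you do?", history)`, where url and token are the Public Endpoint and Token values of your service.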
Perform model inference in streaming mode
On the service details page, click View Endpoint Information. In the Invocation Method dialog box, obtain the values of the Public Endpoint and Token parameters.

In the terminal, run the following Python code to send a streaming request based on the information that you obtained.
# Requires the websockets package, version 11.0 or later (for websockets.sync).
from websockets.sync.client import connect
import os
import platform


def clear_screen():
    # Clear the terminal on Windows or on Unix-like systems.
    if platform.system() == "Windows":
        os.system("cls")
    else:
        os.system("clear")


def print_history(history):
    print("Welcome to the Qwen-7B model. Start the conversation by entering text. Enter clear to clear the conversation history, or stop to terminate the program.")
    for pair in history:
        print(f"\nUser: {pair[0]}\nQwen-7B: {pair[1]}")


def main():
    history, response = [], ''
    clear_screen()
    print_history(history)
    # Replace <service_url> and <token> with the values that you obtained.
    with connect("<service_url>", additional_headers={"Authorization": "<token>"}) as websocket:
        while True:
            query = input("\nUser: ")
            if query.strip() == "stop":
                break
            websocket.send(query)
            # Redraw the screen for each message until the end marker arrives.
            while True:
                msg = websocket.recv()
                if msg == '<EOS>':
                    break
                clear_screen()
                print_history(history)
                print(f"\nUser: {query}")
                print("\nQwen-7B: ", end="")
                print(msg)
                response = msg
            history.append((query, response))


if __name__ == "__main__":
    main()
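The receive loop in the preceding code can be separated from the terminal handling so that it is easier to reuse. The following is a sketch under one assumption inferred from that code: each message carries the answer generated so far, and the message '<EOS>' marks the end of the answer. The names read_streamed_reply and FakeSocket are hypothetical. FakeSocket only imitates the recv method of a connection so that the helper can be tried without a deployed service.

```python
def read_streamed_reply(websocket, end_marker="<EOS>", on_update=None):
    """Receive messages until end_marker and return the last one received."""
    latest = ""
    while True:
        msg = websocket.recv()
        if msg == end_marker:
            return latest
        latest = msg
        if on_update is not None:
            # Let the caller redraw the screen or update a UI for each message.
            on_update(msg)


class FakeSocket:
    """Stand-in for a connection that replays a fixed list of messages."""

    def __init__(self, messages):
        self._messages = iter(messages)

    def recv(self):
        return next(self._messages)


# Offline demonstration with made-up messages.
demo = FakeSocket(["Hang", "Hangzhou", "<EOS>"])
print(read_streamed_reply(demo, on_update=lambda m: print("partial:", m)))
```

In the main function above, the inner while loop could then be replaced with a single call that passes the open websocket and a callback that redraws the screen.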
References
For more information about EAS, see Overview of online model services EAS.