Elastic Algorithm Service (EAS) of Platform for AI (PAI) is a model serving platform for online inference that provides a one-click solution for deploying and using large language models (LLMs). EAS lets you deploy a variety of open source LLM applications and supports the following deployment methods: standard deployment, BladeLLM-based accelerated deployment, and vLLM-based accelerated deployment. The BladeLLM-based and vLLM-based accelerated deployments are designed for high concurrency and low latency. This topic describes how to deploy and call an LLM in EAS and answers frequently asked questions.
Prerequisites
PAI is activated and a default workspace is created. For more information, see Activate PAI and create a default workspace.
If you use a Resource Access Management (RAM) user to deploy the model, make sure that the RAM user has the permissions to use EAS. For more information, see Grant the permissions that are required to use EAS.
Deploy an EAS service
Go to the Elastic Algorithm Service (EAS) page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which you want to deploy the model.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section of the Deploy Service page, select LLM Deployment.
On the LLM Deployment page, configure the parameters described in the following table. Use the default values for other parameters.
| Section | Parameter | Description |
| --- | --- | --- |
| Basic Information | Service Name | Specify a name for the service. Example: llm_demo001. |
| Basic Information | Version | Set this parameter to Open-source Model Quick Deployment. |
| Basic Information | Model Type | Set this parameter to qwen2.5-7b-instruct. EAS provides various model types to meet your business requirements, such as DeepSeek-R1, Qwen2-VL, and Meta-Llama-3.2-1B. |
| Basic Information | Deployment Method | Set this parameter to Standard Deployment. Do not use any accelerated framework. |
| Resource Deployment | Resource Type | Set this parameter to Public Resources. |
| Resource Deployment | Deployment Resources | After you select a model type, the system automatically recommends a proper instance type. |
Click Deploy. The model deployment requires approximately five minutes.
Use WebUI to perform inference
Find the deployed service and click View Web App in the Service Type column.
Test the inference performance on the WebUI page.
Enter a sentence in the input text box and click Send to start a conversation. Sample input: Provide a learning plan for personal finance.
FAQ
How do I switch to another open source LLM?
EAS provides the following open source LLMs: DeepSeek-R1, Llama, UI-TARS, QVQ, Gemma2, and Baichuan2. To switch between these models, perform the following steps:
On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update in the Actions column.
On the LLM Deployment page, set the Model Type parameter to the desired open source LLM. The system automatically updates the value of the Deployment Resources parameter.
Click Update.
How do I improve concurrency and reduce latency for the inference service?
EAS provides BladeLLM and vLLM, which are inference acceleration engines that you can use to ensure high concurrency and low latency. To use the inference acceleration engines, perform the following steps:
On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update in the Actions column.
In the Basic Information section of the LLM Deployment page, set the Deployment Method parameter to Accelerated deployment and select BladeLLM or vLLM.
Click Update.
You can also set the Version parameter to High-performance Deployment during LLM deployment. High-performance deployment refers to fast deployment based on the BladeLLM engine developed by PAI. For more information, see Get started with BladeLLM.
How do I mount a custom model?
When you set the Version parameter to High-performance Deployment during LLM deployment, you can mount a custom model. Only Qwen and Llama text models can be deployed, including the open source, fine-tuned, and quantized versions. In this example, Object Storage Service (OSS) is used to mount a custom model.
Upload the model and related configuration files to your OSS bucket. For information about how to create a bucket and upload objects, see Create buckets and Upload objects.
The following figure shows a sample of the model files that you need to prepare:
The config.json file must be uploaded. You must configure the config.json file based on the Huggingface model format. For more information about the sample file, see config.json.
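Before you upload the files, it can help to confirm locally that the model directory contains a config.json file that parses as JSON. The following is a minimal sketch and not part of EAS: check_model_dir is a hypothetical helper name, and the check only verifies that config.json exists and holds a JSON object. It does not validate the file against the Huggingface model format.

```python
import json
from pathlib import Path


def check_model_dir(model_dir: str) -> dict:
    """Return the parsed config.json of a local model directory.

    Hypothetical helper for illustration. Raises FileNotFoundError if
    config.json is missing and ValueError if it is not a JSON object.
    """
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"config.json not found in {model_dir}")
    with config_path.open(encoding="utf-8") as f:
        # json.JSONDecodeError is a subclass of ValueError.
        config = json.load(f)
    if not isinstance(config, dict):
        raise ValueError("config.json must contain a JSON object")
    return config


if __name__ == "__main__":
    # "./my_model" is a placeholder path for the directory you plan to upload.
    print(sorted(check_model_dir("./my_model").keys()))
```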
On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update in the Actions column.
On the LLM Deployment page, specify the following parameters and click Update.
| Section | Parameter | Description |
| --- | --- | --- |
| Basic Information | Version | Set this parameter to High-performance Deployment. |
| Basic Information | Image Version | Set this parameter to blade-llm:0.9.0. |
| Basic Information | Model Settings | Set this parameter to Custom Model and click OSS. Select the OSS path in which the custom model is stored. |
| Resource Deployment | Deployment Resources | Select an instance type. For more information, see Limits. |
How do I call API operations to perform inference?
Invocation methods vary based on the deployment mode. You can select an appropriate invocation method based on your deployment option.
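For HTTP calls, the deployment methods differ mainly in the path that is appended to the service endpoint. The following sketch summarizes that difference by building, but not sending, a POST request for each method. build_request and API_PATHS are hypothetical names, the paths are taken from the curl examples later in this topic, and the endpoint and token in the example are placeholders. The request body also differs by method, so the caller supplies it.

```python
import json
import urllib.request

# Paths taken from the curl examples in this topic. For standard deployment,
# the curl examples post to the service endpoint itself.
API_PATHS = {
    "standard": "",
    "bladellm": "/v1/completions",
    "vllm": "/v1/chat/completions",
}


def build_request(method: str, endpoint: str, token: str,
                  payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a deployment method."""
    url = endpoint.rstrip("/") + API_PATHS[method]
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder endpoint and token; replace them with your own values.
    req = build_request(
        "bladellm",
        "http://example.com/api/predict/llm_demo001",
        "<your-token>",
        {"prompt": "What is the capital of Canada?", "stream": "true"},
    )
    print(req.get_method(), req.full_url)
```

To send the request, you could pass the returned object to urllib.request.urlopen.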
Standard deployment
Retrieve the service endpoint and token.
Go to the Elastic Algorithm Service (EAS) page of the workspace in which the service is deployed.
Click the name of the desired service to view its details page.
In the Basic Information section, click View Call Information. On the Public Endpoint Call tab, retrieve the service token and endpoint.
To call API operations to perform inference, use one of the following methods:
Use HTTP
Non-streaming mode
You can run curl commands to send the following types of standard HTTP requests.
STRING requests
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
Replace $authorization with the service token. Replace $host with the service endpoint. The chatllm_data.txt file is a plain text file that contains the prompt, such as what is the capital of Canada?
Structured requests
curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"
Use the chatllm_data.json file to configure inference parameters. The following sample code provides a format example of the chatllm_data.json file:
{
  "max_new_tokens": 4096,
  "use_stream_chat": false,
  "prompt": "What is the capital of Canada?",
  "system_prompt": "Act like you are a knowledgeable assistant who can provide information on geography and related topics.",
  "history": [
    [
      "Can you tell me what's the capital of France?",
      "The capital of France is Paris."
    ]
  ],
  "temperature": 0.8,
  "top_k": 10,
  "top_p": 0.8,
  "do_sample": true,
  "use_cache": true
}
The following table describes the parameters in the preceding code. Configure the parameters based on your business requirements.
| Parameter | Description | Default value |
| --- | --- | --- |
| max_new_tokens | The maximum number of output tokens. | 2048 |
| use_stream_chat | Specifies whether to return the output tokens in streaming mode. | true |
| prompt | The user prompt. | "" |
| system_prompt | The system prompt. | "" |
| history | The dialogue history. The value is in the List[Tuple(str, str)] format. | [()] |
| temperature | The randomness of the model output. A larger value specifies a higher randomness. The value 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. | 0.95 |
| top_k | The number of outputs selected from the generated results. | 30 |
| top_p | The probability threshold of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. | 0.8 |
| do_sample | Specifies whether to enable output sampling. | true |
| use_cache | Specifies whether to enable KV cache. | true |
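The defaults in the table can be collected into a small helper that builds the body of a structured request. This is a minimal sketch: build_payload and DEFAULTS are hypothetical names, the default values are copied from the table, and the range check covers only temperature and top_p, the two parameters for which the table states a range. The table lists [()] as the default for history; an empty list is used here as an assumption.

```python
import json

# Default values copied from the parameter table.
# Assumption: an empty list stands in for the documented history default [()].
DEFAULTS = {
    "max_new_tokens": 2048,
    "use_stream_chat": True,
    "prompt": "",
    "system_prompt": "",
    "history": [],
    "temperature": 0.95,
    "top_k": 30,
    "top_p": 0.8,
    "do_sample": True,
    "use_cache": True,
}


def build_payload(prompt: str, **overrides) -> dict:
    """Merge overrides into the documented defaults and validate ranges."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    # A fresh history list is created so that calls do not share state.
    payload = {**DEFAULTS, "history": [], **overrides, "prompt": prompt}
    for name in ("temperature", "top_p"):
        if not 0.0 <= payload[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return payload


if __name__ == "__main__":
    body = build_payload("What is the capital of Canada?",
                         use_stream_chat=False, temperature=0.8)
    # The printed JSON could be saved as chatllm_data.json.
    print(json.dumps(body, indent=2))
```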
You can also implement your client based on the Python requests package. You can use the --prompt parameter to specify the request content, such as python xxx.py --prompt "What is the capital of Canada?".

import argparse
import json
from typing import Tuple

import requests


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    # If no history is passed, a sample round of dialogue is used.
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. "
                "It is known for its iconic landmarks, such as the Golden Gate Bridge "
                "and Alcatraz Island, as well as its vibrant culture, diverse population, "
                "and tech industry. The city is also home to many famous companies and "
                "startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers,
                             json=pload, stream=use_stream_chat)
    return response


def get_response(response: requests.Response) -> Tuple[str, list]:
    # The server returns a JSON response that includes the inference result
    # and the dialogue history.
    data = json.loads(response.content)
    output = data["response"]
    history = data["history"]
    return output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = False
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<Public endpoint of the EAS service>"
    authorization = "<Public token of the EAS service>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    # System prompts can be included in the requests.
    system_prompt = "Act like you are a programmer with 5+ years of experience."

    # Dialogue history can be included in the client request. The client
    # manages the dialogue history to implement multi-round dialogues. In most
    # cases, information from the previous round of dialogue is used. The
    # information is in the List[Tuple(str, str)] format.
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)
    output, history = get_response(response)
    print(f" --- output: {output} \n --- history: {history}", flush=True)
Take note of the following parameters:
Set the host parameter to the service endpoint.
Set the authorization parameter to the service token.
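The sample client receives the updated dialogue history in each response and can send it back in the next request to continue the conversation. The following sketch shows one way a client might keep that history bounded between rounds. trim_history and the max_rounds limit are assumptions for illustration and are not EAS parameters.

```python
from typing import List, Sequence, Tuple

# One round of dialogue: (user prompt, model reply).
Round = Tuple[str, str]


def trim_history(history: Sequence[Sequence[str]],
                 max_rounds: int = 5) -> List[Round]:
    """Keep only the most recent rounds as (prompt, reply) tuples.

    JSON decoding returns each round as a list, so rounds are normalized
    to tuples here.
    """
    if max_rounds < 0:
        raise ValueError("max_rounds must not be negative")
    recent = list(history)[-max_rounds:] if max_rounds else []
    return [(str(question), str(answer)) for question, answer in recent]


if __name__ == "__main__":
    # Synthetic history for illustration only.
    history = [["q1", "a1"], ["q2", "a2"], ["q3", "a3"]]
    print(trim_history(history, max_rounds=2))
```

In the sample client, the history returned by get_response could be passed through trim_history before the next call to post_http_request.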
Streaming mode
In streaming mode, the HTTP SSE method is used. You can use the --prompt parameter to specify the request content, such as python xxx.py --prompt "What is the capital of Canada?".

import argparse
import json
from typing import Iterator, Tuple

import requests


def clear_line(n: int = 1) -> None:
    # Move the cursor up and clear the line, n times.
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    # If no history is passed, a sample round of dialogue is used.
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. "
                "It is known for its iconic landmarks, such as the Golden Gate Bridge "
                "and Alcatraz Island, as well as its vibrant culture, diverse population, "
                "and tech industry. The city is also home to many famous companies and "
                "startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers,
                             json=pload, stream=use_stream_chat)
    return response


def get_streaming_response(response: requests.Response) -> Iterator[Tuple[str, list]]:
    # Chunks are separated by a NUL byte. Each chunk is a JSON object.
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["response"]
            history = data["history"]
            yield output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = True
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<Public endpoint of the EAS service>"
    authorization = "<Public token of the EAS service>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    system_prompt = "Act like you are a programmer with 5+ years of experience."
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    for h, history in get_streaming_response(response):
        print(
            f" --- stream line: {h} \n --- history: {history}", flush=True)
Take note of the following parameters:
Set the host parameter to the service endpoint.
Set the authorization parameter to the service token.
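The sample client splits the response body on a NUL byte (b"\0") and parses each chunk as JSON. The sketch below applies the same framing to raw bytes so that the parsing logic can be tried without a running service. parse_chunks is a hypothetical helper, the response and history keys are taken from the sample code, and the sample bytes are synthetic.

```python
import json
from typing import Iterator, List, Tuple


def parse_chunks(raw: bytes,
                 delimiter: bytes = b"\0") -> Iterator[Tuple[str, List]]:
    """Split a byte buffer on the delimiter and decode each JSON chunk."""
    for chunk in raw.split(delimiter):
        if not chunk:
            # Skip empty pieces, such as the one after a trailing delimiter.
            continue
        data = json.loads(chunk.decode("utf-8"))
        yield data["response"], data["history"]


if __name__ == "__main__":
    # Synthetic bytes for illustration; not captured from a real service.
    sample = (b'{"response": "first", "history": []}\0'
              b'{"response": "second", "history": []}\0')
    for output, history in parse_chunks(sample):
        print(output, history)
```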
Use WebSocket
The WebSocket protocol can efficiently handle the dialogue history. You can use the WebSocket method to connect to the service and perform one or more rounds of dialogue. Sample code:
import os
import time
import json
import struct
from multiprocessing import Process

import websocket

# Number of questions sent in the multi-round test (see on_open_2).
ROUNDS = 5
questions = 0


def on_message_1(ws, message):
    if message == "<EOS>":
        print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
              time.time(), message), flush=True)
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    else:
        print("{}".format(time.time()))
        print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
              time.time(), message), flush=True)


def on_message_2(ws, message):
    global questions
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming after all questions are answered
    if message == "<EOS>":
        questions = questions + 1
        if questions == ROUNDS:
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_message_3(ws, message):
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_error(ws, error):
    print('error happened: ', str(error))


def on_close(ws, a, b):
    print("### closed ###", a, b)


def on_pong(ws, pong):
    print('pong:', pong)


# stream chat validation test
def on_open_1(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Show me a golang code example: """
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['do_sample'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# multi-round query validation test
def on_open_2(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {"max_new_tokens": 6144}
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['use_stream_chat'] = True
    params_dict['prompt'] = "Hello!"
    # Add the system prompt to the existing parameters instead of replacing them.
    params_dict['system_prompt'] = "Act like you are a programmer with 5+ years of experience."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please write a sorting algorithm in Python."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please convert the programming language to Java."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please introduce yourself."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please summarize the dialogue above."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# Langchain validation test.
def on_open_3(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Can you tell me what's the MNN?"""
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['use_stream_chat'] = False
    params_dict['langchain'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


authorization = "<Public token of the EAS service>"
host = "ws://" + "<Public endpoint of the EAS service without the http:// prefix>"


def single_call(on_open_func, on_message_func, on_close_func=on_close):
    ws = websocket.WebSocketApp(
        host,
        on_open=on_open_func,
        on_message=on_message_func,
        on_error=on_error,
        on_pong=on_pong,
        on_close=on_close_func,
        header=[
            'Authorization: ' + authorization],
    )
    # setup ping interval to keep long connection.
    ws.run_forever(ping_interval=2)


if __name__ == "__main__":
    for i in range(5):
        p1 = Process(target=single_call, args=(on_open_1, on_message_1))
        p2 = Process(target=single_call, args=(on_open_2, on_message_2))
        p3 = Process(target=single_call, args=(on_open_3, on_message_3))
        p1.start()
        p2.start()
        p3.start()
        p1.join()
        p2.join()
        p3.join()
Take note of the following parameters:
Set the authorization parameter to the service token.
Set the host parameter to the service endpoint. Replace the http prefix in the endpoint with ws.
Use the use_stream_chat parameter to specify whether the output is returned in streaming mode. Default value: True.
Refer to the on_open_2 function in the preceding code to implement a multi-round dialogue.
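The prefix replacement described in the preceding notes can also be done in code instead of by hand. The following is a minimal sketch: to_ws_url is a hypothetical helper, the example endpoint is a placeholder, and the mapping of https to wss is an assumption based on the usual WebSocket convention, because this topic only mentions replacing http with ws.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ws_url(endpoint: str) -> str:
    """Convert an HTTP(S) service endpoint into a WebSocket URL."""
    parts = urlsplit(endpoint)
    # Assumption: https maps to wss, following the usual convention.
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return urlunsplit(
        (scheme, parts.netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    # Placeholder endpoint for illustration.
    print(to_ws_url("http://example.com/api/predict/llm_demo001"))
```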
BladeLLM-based accelerated deployment
View the endpoint and token of the service.
On the Elastic Algorithm Service (EAS) page, find the desired service and click Invocation Method in the Service Type column.
In the Invocation Method dialog box, you can view the endpoint and token of the service.
Run the following code in the terminal to call the service and obtain the generated text in streaming mode:
# Call EAS service
curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: AUTH_TOKEN_FOR_EAS" \
    -d '{"prompt":"What is the capital of Canada?", "stream":"true"}' \
    <service_url>/v1/completions
Take note of the following parameters:
Authorization: the token of your service.
<service_url>: the endpoint of your service.
Sample response:
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" The"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":1,"total_tokens":8},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" capital"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":2,"total_tokens":9},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" of"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":3,"total_tokens":10},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" Canada"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":4,"total_tokens":11},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" is"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":5,"total_tokens":12},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" Ottawa"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":6,"total_tokens":13},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":"."}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":7,"total_tokens":14},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"text":""}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":8,"total_tokens":15},"error_info":null}
data: [DONE]
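Each data: line in the sample response carries one piece of the generated text in choices[0].text, and the stream ends with data: [DONE]. The following sketch shows how a client might reassemble the full text from such lines. collect_text is a hypothetical helper, and the events in the example are shortened copies of the sample response that keep only the fields the helper reads.

```python
import json
from typing import Iterable


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate choices[0].text from 'data:' lines until [DONE]."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            # Ignore blank lines and anything that is not a data line.
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        parts.append(event["choices"][0]["text"])
    return "".join(parts)


if __name__ == "__main__":
    # Shortened copies of the events in the sample response.
    sample = [
        'data: {"choices":[{"finish_reason":"","index":0,"text":" The"}]}',
        'data: {"choices":[{"finish_reason":"","index":0,"text":" capital"}]}',
        'data: {"choices":[{"finish_reason":"stop","index":0,"text":""}]}',
        'data: [DONE]',
    ]
    print(repr(collect_text(sample)))
```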
vLLM-based accelerated deployment
View the endpoint and token of the service.
On the Elastic Algorithm Service (EAS) page, find the desired service and, in the Service Method column, select Call Information.
In the Call Information dialog box, note the endpoint and token of the service.
In the terminal, run the following code to call the service:
Python
from openai import OpenAI

##### API configuration #####
openai_api_key = "<EAS API KEY>"
openai_api_base = "<EAS API Endpoint>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use the first model that the service reports.
models = client.models.list()
model = models.data[0].id
print(model)


def main():
    stream = True
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is the capital of Canada?",
                    }
                ],
            }
        ],
        model=model,
        max_completion_tokens=2048,
        stream=stream,
    )

    if stream:
        for chunk in chat_completion:
            # delta.content can be None for some chunks; print an empty
            # string in that case instead of the text "None".
            print(chunk.choices[0].delta.content or "", end="")
    else:
        result = chat_completion.choices[0].message.content
        print(result)


if __name__ == "__main__":
    main()
Take note of the following parameters:
<EAS API KEY>: Set this parameter to the service token that you obtained.
<EAS API Endpoint>: Set this parameter to the service endpoint that you obtained.
CLI
curl -X POST <service_url>/v1/chat/completions -d '{
    "model": "Qwen2.5-7B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a helpful and harmless assistant."
                }
            ]
        },
        {
            "role": "user",
            "content": "What is the capital of Canada?"
        }
    ]
}' -H "Content-Type: application/json" -H "Authorization: <your-token>"
Take note of the following parameters:
<service_url>: Set this parameter to the service endpoint that you obtained.
<your-token>: Set this parameter to the service token that you obtained.
References
For more information about EAS, see EAS overview.
After you use the LangChain framework on the WebUI page, the knowledge base is also available when you call the service by using API operations. We recommend that you store your knowledge base in an on-premises vector database. For more information, see RAG-based LLM chatbot.
For more information about the versions of ChatLLM-WebUI, see Release notes for ChatLLM WebUI.