Deploy LLMs with EAS - Platform For AI

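EAS (Elastic Algorithm Service) is the online model service that PAI provides for online inference scenarios. It offers a one-click solution for automatically deploying and using large language models (LLMs). With EAS, you can deploy a variety of open-source LLM service applications using one of three deployment methods: standard deployment, accelerated deployment with BladeLLM, and accelerated deployment with vLLM. Accelerated deployment provides high concurrency and low latency. This topic describes how to deploy and call an LLM through EAS with one click, and covers common problems and their solutions.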

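Prerequisites

PAI is activated and a default workspace is created. For details, see Activate PAI and create a default workspace.
If you deploy the model as a RAM user, grant the RAM user the management permissions on EAS. For details, see Cloud product dependencies and authorization: EAS.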

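Deploy an EAS service

Go to the Elastic Algorithm Service (EAS) page.
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to use.
3. In the left-side navigation pane of the workspace page, choose Model Deployment > Elastic Algorithm Service (EAS) to open the Elastic Algorithm Service (EAS) page.
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section, click LLM Deployment.

On the LLM Deployment page, configure the following key parameters and keep the default values for the others.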

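Parameter		Description
Basic Information	Service Name	A custom service name. This example uses llm_demo001.
	Version	Select Open-source Model Quick Deployment.
	Model Type	This example uses qwen2.5-7b-instruct. EAS offers many other model types for different needs, such as DeepSeek-R1, Qwen2-VL, and Meta-Llama-3.2-1B.
	Deployment Method	Select Standard Deployment, which uses no acceleration framework.
Resource Deployment	Resource Type	Select Public Resources.
	Deployment Resources	After you select a model type, the system automatically recommends a suitable resource specification.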

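Click Deploy. The model deployment completes in about 5 minutes.

Start the WebUI for model inference

In the Service Type column of the target service, click View Web App.
On the WebUI page, verify model inference.
In the text box on the ChatLLM-WebUI page, enter a message, for example "Provide a learning plan for personal finance", and click Send to start the conversation.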

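FAQ and solutions

How do I switch to another open-source LLM?

EAS can quickly pull open-source LLM files such as DeepSeek-R1, Llama, UI-TARS, QVQ, gemma2, and baichuan2 from third-party sources. To switch to and deploy one of these models, perform the following steps:

In the Actions column of the target service, click Update.
Change Model Type to another open-source LLM. The system updates the resource specification accordingly.
Click Update.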

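How do I increase inference concurrency and reduce latency?

EAS supports the BladeLLM and vLLM inference acceleration engines, which provide high concurrency and low latency with one click. Perform the following steps:

In the Actions column of the target service, click Update.
In the Basic Information section, change Deployment Method to Accelerated Deployment and select BladeLLM or vLLM as the acceleration framework.
Click Update.

Alternatively, when you deploy an LLM, set the deployment version to High-performance Deployment to deploy quickly on BladeLLM, the engine developed by PAI. For details, see Get started with BladeLLM.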

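How do I mount a custom model?

Custom models can be mounted when the deployment version is set to High-performance Deployment. Only text models of the Qwen and Llama series are supported, including open-source, fine-tuned, and quantized versions. The following steps use OSS mounting as an example:

Upload the custom model and its configuration files to a directory in your own OSS bucket. For how to create a bucket and upload files, see Create a bucket in the console and Upload files in the console.
Prepare the model files together with their configuration files.
The configuration files must include config.json, which must follow the Hugging Face model format. For a sample file, see config.json.
In the Actions column of the target service, click Update.

On the LLM Deployment page, configure the following parameters, and then click Update.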

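Parameter		Description
Basic Information	Version	Select High-performance Deployment.
	Image Version	Select blade-llm:0.9.0.
	Model Settings	Select Custom Model, click OSS, and select the OSS path where the custom model is stored.
Resource Deployment	Deployment Resources	Select a resource specification by referring to the usage limits.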
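
Before uploading a custom model, it can help to confirm locally that the model directory contains a config.json file that parses as JSON. The following is a minimal sketch of such a check. The function name check_model_dir is hypothetical, and the check only verifies that the file exists and is a JSON object; it does not validate the Hugging Face schema.

```python
import json
from pathlib import Path


def check_model_dir(model_dir: str) -> dict:
    """Hypothetical pre-upload check: return the parsed config.json."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"config.json not found in {model_dir}")
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # A model config is expected to be a JSON object, not a list or scalar.
    if not isinstance(config, dict):
        raise ValueError("config.json must contain a JSON object")
    return config
```

Calling check_model_dir on the local directory before uploading to OSS would surface a missing or malformed config.json early.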

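How do I use the API for model inference?

The calling method depends on the deployment method. Choose the one that matches your deployment.

Standard deployment

Obtain the service endpoint and token.
1. Go to Elastic Algorithm Service (EAS) and select a workspace to enter EAS.
2. Click the target service name to open the service details page.
3. In the Basic Information section, click View Invocation Information. On the Public Endpoint tab, obtain the service token and endpoint.

Call the API for model inference.

Call the service over HTTP

Non-streaming calls

The client uses standard HTTP. From the command line, you can send the following two types of requests:

Send a String request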
```
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
```
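Where: replace $authorization with the service token and $host with the service endpoint. chatllm_data.txt is a plain-text file that contains the question, for example "What is the capital of Canada?".

Send a structured request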

curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"

Use the chatllm_data.json file to set the inference parameters. The file has the following format:

{
  "max_new_tokens": 4096,
  "use_stream_chat": false,
  "prompt": "What is the capital of Canada?",
  "system_prompt": "Act like you are a knowledgeable assistant who can provide information on geography and related topics.",
  "history": [
    [
      "Can you tell me what's the capital of France?",
      "The capital of France is Paris."
    ]
  ],
  "temperature": 0.8,
  "top_k": 10,
  "top_p": 0.8,
  "do_sample": true,
  "use_cache": true
}

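The parameters are described below. Add or remove them as needed.

Parameter	Description	Default
max_new_tokens	Maximum number of output tokens to generate.	2048
use_stream_chat	Whether to use streaming output.	true
prompt	The user prompt.	""
system_prompt	The system prompt.	""
history	The conversation history, of type List[Tuple(str, str)].	[()]
temperature	Controls the randomness of the model output. A larger value means more randomness, and 0 gives deterministic output. Float in the range 0 to 1.	0.95
top_k	Number of candidate outputs to select from the generated results.	30
top_p	Selects outputs from the generated results by percentage. Float in the range 0 to 1.	0.8
do_sample	Enables output sampling.	true
use_cache	Enables the KV cache.	true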
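
The defaults and ranges in the table can be collected into a small request builder. The following is a minimal sketch, not part of the EAS SDK: build_payload and DEFAULTS are hypothetical names, and the range check simply mirrors the 0 to 1 interval that the table lists for temperature and top_p.

```python
from typing import Any, Dict, List, Optional, Tuple

# Defaults copied from the parameter table above.
DEFAULTS: Dict[str, Any] = {
    "max_new_tokens": 2048,
    "use_stream_chat": True,
    "system_prompt": "",
    "temperature": 0.95,
    "top_k": 30,
    "top_p": 0.8,
    "do_sample": True,
    "use_cache": True,
}


def build_payload(prompt: str,
                  history: Optional[List[Tuple[str, str]]] = None,
                  **overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {**DEFAULTS, **overrides, "prompt": prompt}
    # The table gives 0 to 1 as the range for temperature and top_p.
    for name in ("temperature", "top_p"):
        if not 0.0 <= payload[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # history is a list of (question, answer) pairs; JSON encodes pairs as lists.
    payload["history"] = [list(pair) for pair in (history or [])]
    return payload
```

For example, build_payload("What is the capital of Canada?", temperature=0.8, use_stream_chat=False) produces a dictionary that can be serialized into chatllm_data.json.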

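You can build your own client with the Python requests library, as in the following sample code. Use the --prompt command-line argument to specify the request content, for example: python xxx.py --prompt "What is the capital of Canada?".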

import argparse
import json
from typing import Iterable, List

import requests

def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
                It is known for its iconic landmarks, such as the Golden Gate Bridge \
                and Alcatraz Island, as well as its vibrant culture, diverse population, \
                and tech industry. The city is also home to many famous companies and \
                startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers,
                             json=pload, stream=use_stream_chat)
    return response

# The server returns a JSON response that contains the inference result and the conversation history.
def get_response(response: requests.Response) -> tuple:
    data = json.loads(response.content)
    output = data["response"]
    history = data["history"]
    return output, history

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")

    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = False
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

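    host = "<EAS service public endpoint>"
    authorization = "<EAS service public token>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    # The client request can set the system prompt of the language model.
    system_prompt = "Act like you are programmer with \
                5+ years of experience."

    # The client request can carry the conversation history. The client maintains the current user's conversation record to implement multi-turn conversations. Typically, pass the history returned by the previous turn. The history format is List[Tuple(str, str)].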
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)
    output, history = get_response(response)
    print(f" --- output: {output} \n --- history: {history}", flush=True)


Where:

host: the service endpoint.
authorization: the service token.
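
The sample leaves history empty, but the response also returns the updated history, which can be fed into the next request to hold a multi-turn conversation. The following is a minimal sketch of that loop. chat_turn is a hypothetical helper, and fake_send is a stand-in for the real HTTP call so that the example runs offline; only the response and history field names come from the sample code above.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]


def chat_turn(send: Callable[[Dict], Dict],
              prompt: str,
              history: History) -> Tuple[str, History]:
    """Send one turn and return (answer, updated history)."""
    reply = send({"prompt": prompt,
                  "history": history,
                  "use_stream_chat": False})
    # JSON decoding yields lists, so convert each pair back to a tuple.
    new_history = [tuple(pair) for pair in reply["history"]]
    return reply["response"], new_history


def fake_send(payload: Dict) -> Dict:
    """Offline stand-in for requests.post(host, json=payload).json()."""
    answer = f"echo: {payload['prompt']}"
    history = [list(pair) for pair in payload["history"]]
    history.append([payload["prompt"], answer])
    return {"response": answer, "history": history}


history: History = []
for question in ("What is the capital of Canada?", "How can I get there?"):
    answer, history = chat_turn(fake_send, question, history)
    print(answer)
```

In a real client, fake_send would be replaced by a function that posts the payload to the service endpoint with the Authorization header and returns the decoded JSON body.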

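Streaming calls

Streaming calls use HTTP SSE. The other settings are the same as for non-streaming calls. Sample code follows. Use the --prompt command-line argument to specify the request content, for example: python xxx.py --prompt "What is the capital of Canada?".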

import argparse
import json
from typing import Iterable, List

import requests


def clear_line(n: int = 1) -> None:
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
                It is known for its iconic landmarks, such as the Golden Gate Bridge \
                and Alcatraz Island, as well as its vibrant culture, diverse population, \
                and tech industry. The city is also home to many famous companies and \
                startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers,
                             json=pload, stream=use_stream_chat)
    return response


def get_streaming_response(response: requests.Response) -> Iterable[tuple]:
    # Messages in the stream are JSON documents separated by a NUL byte.
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["response"]
            history = data["history"]
            yield output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = True
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = ""
    authorization = ""

    print(f"Prompt: {prompt!r}\n", flush=True)
    system_prompt = "Act like you are programmer with \
                5+ years of experience."
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    for h, history in get_streaming_response(response):
        print(
            f" --- stream line: {h} \n --- history: {history}", flush=True)

Where:

host: the service endpoint.
authorization: the service token.
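
The streaming sample relies on iter_lines with a NUL delimiter to split the byte stream into JSON messages. The following sketch shows the same splitting logic on plain byte chunks, which can be useful for testing without a live service. iter_messages is a hypothetical helper, and the assumption that every message is a complete JSON object with response and history fields is taken from the sample code above.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_messages(chunks: Iterable[bytes],
                  delimiter: bytes = b"\0") -> Iterator[Tuple[str, list]]:
    """Reassemble delimiter-separated JSON messages from raw byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A single chunk may complete zero, one, or several messages.
        while delimiter in buffer:
            raw, buffer = buffer.split(delimiter, 1)
            if raw:
                data = json.loads(raw.decode("utf-8"))
                yield data["response"], data["history"]
    # Handle a final message that is not followed by a delimiter.
    if buffer.strip():
        data = json.loads(buffer.decode("utf-8"))
        yield data["response"], data["history"]
```

Because the buffer is only parsed once a delimiter arrives, a message that is split across two network chunks is still decoded as a whole.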

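Call the service over WebSocket

To better maintain conversation state, you can also keep a WebSocket connection open to the service for single-turn or multi-turn conversations. Sample code: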

import os
import time
import json
import struct
from multiprocessing import Process

import websocket

round = 5
questions = 0


def on_message_1(ws, message):
    if message == "<EOS>":
        print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
              time.time(), message), flush=True)
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    else:
        print("{}".format(time.time()))
        print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
              time.time(), message), flush=True)


def on_message_2(ws, message):
    global questions
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    if message == "<EOS>":
        questions = questions + 1
        if questions == 5:
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_message_3(ws, message):
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_error(ws, error):
    print('error happened: ', str(error))


def on_close(ws, a, b):
    print("### closed ###", a, b)


def on_pong(ws, pong):
    print('pong:', pong)

# stream chat validation test
def on_open_1(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Show me a golang code example: """
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['do_sample'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    # raw_req = f"""To open a Websocket connection to the server: """

    ws.send(raw_req)
    # end the client-side streaming


# multi-round query validation test
def on_open_2(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {"max_new_tokens": 6144}
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['use_stream_chat'] = True
    params_dict['prompt'] = "Hello!"
    # Add the system prompt to the existing parameters instead of
    # replacing the dict, which would discard the settings above.
    params_dict['system_prompt'] = (
        "Act like you are programmer with 5+ years of experience.")
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Write a sorting algorithm in Python"
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Rewrite it as a Java implementation"
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please introduce yourself"
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Summarize the conversation above"
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# Langchain validation test.
def on_open_3(ws):
    global round
    print('Opening Websocket connection to the server ... ')

    params_dict = {}
    # params_dict['prompt'] = """To open a Websocket connection to the server: """
    params_dict['prompt'] = """Can you tell me what's the MNN?"""
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['use_stream_chat'] = False
    params_dict['langchain'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


authorization = ""
host = "ws://" + ""


def single_call(on_open_func, on_message_func, on_close_func=on_close):
    ws = websocket.WebSocketApp(
        host,
        on_open=on_open_func,
        on_message=on_message_func,
        on_error=on_error,
        on_pong=on_pong,
        on_close=on_close_func,
        header=[
            'Authorization: ' + authorization],
    )

    # setup ping interval to keep long connection.
    ws.run_forever(ping_interval=2)


if __name__ == "__main__":
    for i in range(5):
        p1 = Process(target=single_call, args=(on_open_1, on_message_1))
        p2 = Process(target=single_call, args=(on_open_2, on_message_2))
        p3 = Process(target=single_call, args=(on_open_3, on_message_3))

        p1.start()
        p2.start()
        p3.start()

        p1.join()
        p2.join()
        p3.join()

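Where:

authorization: the service token.
host: the service endpoint, with the leading http in the endpoint replaced by ws.
use_stream_chat: the request parameter that controls whether output is streamed to the client. The default value is True, which means the server returns streaming data.
For multi-turn conversations, follow the implementation of the on_open_2 function in the sample code above.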
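
Replacing the leading http with ws can be done programmatically instead of by hand. The following is a minimal sketch using the standard library. to_ws_url is a hypothetical helper, and the mapping of https to wss is an assumption that follows from the same prefix replacement; the text above only mentions http.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ws_url(endpoint: str) -> str:
    """Convert an HTTP(S) service endpoint into a WebSocket URL."""
    parts = urlsplit(endpoint)
    # http -> ws as described above; https -> wss is assumed by analogy.
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return urlunsplit(
        (scheme, parts.netloc, parts.path, parts.query, parts.fragment))
```

The result can be assigned to host in the WebSocket sample in place of the manual "ws://" concatenation.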

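Accelerated deployment: BladeLLM

View the service endpoint and token.
1. On the Elastic Algorithm Service (EAS) page, click Invocation Information in the Service Type column of the target service.
2. In the Invocation Information dialog box, view the service endpoint and token.

Run the following command in a terminal to call the service and stream the generated text.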

# Call EAS service
curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: AUTH_TOKEN_FOR_EAS" \
    -d '{"prompt":"What is the capital of Canada?", "stream":"true"}' \
    <service_url>/v1/completions

Where:

Authorization: set to the service token obtained in the preceding step.
<service_url>: replace with the service endpoint obtained in the preceding step.

Sample response:

data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" The"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":1,"total_tokens":8},"error_info":null}

data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" capital"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":2,"total_tokens":9},"error_info":null}

data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" of"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":3,"total_tokens":10},"error_info":null}

data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" Canada"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":4,"total_tokens":11},"error_info":null}

data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" is"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":5,"total_tokens":12},"error_info":null}

data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" Ottawa"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":6,"total_tokens":13},"error_info":null}

data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":"."}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":7,"total_tokens":14},"error_info":null}

data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"text":""}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":8,"total_tokens":15},"error_info":null}

data: [DONE]
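
Each data: line in the response carries one token in choices[0].text, and the stream ends with data: [DONE]. The following is a minimal sketch of how a client might reassemble the full text from such lines. collect_completion is a hypothetical helper, and it assumes the line format shown in the sample response above.

```python
import json
from typing import Iterable


def collect_completion(lines: Iterable[str]) -> str:
    """Join the text fragments of a streamed completion."""
    parts = []
    for line in lines:
        line = line.strip()
        # Skip blank separator lines and anything that is not a data event.
        if not line.startswith("data:"):
            continue
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break
        event = json.loads(body)
        for choice in event.get("choices", []):
            parts.append(choice.get("text") or "")
    return "".join(parts)
```

Applied to the sample response above, the fragments " The", " capital", " of", " Canada", " is", " Ottawa", and "." would be joined into one sentence.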

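Accelerated deployment: vLLM

View the service endpoint and token.
1. On the Elastic Algorithm Service (EAS) page, click Invocation Information in the Service Type column of the target service.
2. In the Invocation Information dialog box, view the service endpoint and token.

Run the following code in a terminal to call the service.

Python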

from openai import OpenAI

##### API configuration #####
openai_api_key = "<EAS API KEY>"
openai_api_base = "<EAS API Endpoint>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
print(model)


def main():

    stream = True

    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is the capital of Canada?",
                    }
                ],
            }
        ],
        model=model,
        max_completion_tokens=2048,
        stream=stream,
    )

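    if stream:
        for chunk in chat_completion:
            # delta.content can be None (for example, in the final chunk),
            # so fall back to an empty string.
            print(chunk.choices[0].delta.content or "", end="")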
    else:
        result = chat_completion.choices[0].message.content
        print(result)


if __name__ == "__main__":
    main()

Where:

<EAS API KEY>: replace with the service token that you obtained.
<EAS API Endpoint>: replace with the service endpoint that you obtained.

Command line

curl -X POST <service_url>/v1/chat/completions -d '{
    "model": "Qwen2.5-7B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a helpful and harmless assistant."
                }
            ]
        },
        {
            "role": "user",
            "content": "What is the capital of Canada?"
        }
    ]
}' -H "Content-Type: application/json" -H "Authorization: <your-token>"

Where:

<service_url>: replace with the service endpoint that you obtained.
<your-token>: replace with the service token that you obtained.
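
The same request can also be prepared from Python with only the standard library. The following is a minimal sketch that builds the request object without sending it. build_chat_request is a hypothetical helper, and the URL path and headers simply mirror the curl command above.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(service_url: str,
                       token: str,
                       user_text: str,
                       model: str,
                       system_text: Optional[str] = None
                       ) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=service_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": token},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; the sketch stops before that step so that it can be inspected or tested offline.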
