Deploy large language models (LLMs) with EAS - Platform For AI

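EAS provides scenario-based deployment: by configuring a few parameters, you can deploy popular open-source large language model (LLM) services with one click and obtain LLM inference capabilities. This topic describes how to deploy and call an LLM service through EAS, and lists common problems and their solutions.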

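Background information

As large models such as ChatGPT and Tongyi Qianwen (Qwen) are widely adopted in the industry, LLM-based inference has become one of the most popular applications. EAS makes it easy to deploy a variety of open-source LLM services, including Llama3, Qwen, Llama2, ChatGLM, Baichuan, Yi-6B, Mistral-7B, and Falcon-7B. An LLM service deployed on EAS supports both WebUI and API calls, and can also integrate your own business data through LangChain to generate customized answers based on a local knowledge base.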

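About LangChain:
LangChain is an open-source framework that lets AI developers combine large language models (LLMs) such as GPT-4 with external data, to obtain better performance and results while consuming as few computing resources as possible.
How LangChain works:
A large data source, such as a 20-page PDF file, is split into chunks. An embedding model (such as BGE or text2vec) converts each chunk into a numeric vector, and the vectors are stored in a dedicated vector database.
LangChain first applies natural language processing to the knowledge base that the user uploads and stores it locally as the knowledge base of the LLM. At each inference, it first searches the local knowledge base for text chunks that are similar to the input question, and then feeds the retrieved content together with the user's question to the LLM to generate a customized answer based on the local knowledge base.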
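The retrieval flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not LangChain's API: the `embed` function is a hypothetical bag-of-words stand-in for a real embedding model such as BGE or text2vec, and an in-memory list stands in for the vector database.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_into_chunks(document: str, size: int = 12) -> list:
    """Split a document into chunks of at most `size` words."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(question: str, chunks: list, top_n: int = 1) -> list:
    """Return the `top_n` chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:top_n]


def build_prompt(question: str, context_chunks: list) -> str:
    """Combine the retrieved chunks and the question into one LLM prompt."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

In a real deployment the embedding model and the vector store replace `embed` and the list scan; the overall shape (chunk, embed, search, prompt) stays the same.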

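Prerequisites

If you need to deploy a custom model, complete the following preparations:

Prepare the custom model files and the related configuration files.
The configuration files must include a config.json file, written in the Hugging Face model format. For a sample file, see config.json.
Create an Object Storage Service (OSS) bucket or a NAS file system to store the custom model files. You can also register the model files as AI assets in PAI for easier management and maintenance. For OSS, see Console quick start for the detailed steps.
Upload the custom model files and the related configuration files to the OSS bucket. For the detailed steps, see Console quick start.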

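Limits

Currently, the inference acceleration engines support only the following models: Qwen2-7b, Qwen1.5-1.8b, Qwen1.5-7b, Qwen1.5-14b, llama3-8b, llama2-7b, llama2-13b, chatglm3-6b, baichuan2-7b, baichuan2-13b, falcon-7b, yi-6b, mistral-7b-instruct-v0.2, gemma-2b-it, gemma-7b-it, and deepseek-coder-7b-instruct-v1.5.
Only EAS services deployed without inference acceleration can use the LangChain feature.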
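The two limits above can be checked before deployment with a small helper. This is a minimal sketch: the function name and its return format are hypothetical, and the model list is copied from the limits stated above.

```python
# Models that the inference acceleration engines support, per the limits above.
ACCELERATED_MODELS = {
    "qwen2-7b", "qwen1.5-1.8b", "qwen1.5-7b", "qwen1.5-14b",
    "llama3-8b", "llama2-7b", "llama2-13b", "chatglm3-6b",
    "baichuan2-7b", "baichuan2-13b", "falcon-7b", "yi-6b",
    "mistral-7b-instruct-v0.2", "gemma-2b-it", "gemma-7b-it",
    "deepseek-coder-7b-instruct-v1.5",
}


def check_deployment(model: str, accelerated: bool = False,
                     use_langchain: bool = False) -> list:
    """Return a list of human-readable problems; an empty list means no conflict."""
    problems = []
    if accelerated and model.lower() not in ACCELERATED_MODELS:
        problems.append(f"{model} is not supported by inference acceleration")
    if accelerated and use_langchain:
        problems.append("LangChain is unavailable when inference acceleration is on")
    return problems
```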

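Deploy the EAS service

The following two deployment methods are supported:

Method 1: Scenario-based model deployment (recommended)

Log on to the PAI console, select the target region at the top of the page, select the target workspace on the right, and then click to enter EAS.
On the Elastic Algorithm Service (EAS) page, click Deploy Service, and then in the scenario-based model deployment section, click LLM Deployment.

On the LLM Deployment page, configure the following key parameters.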

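Parameter		Description
Basic information	Service name	A custom name for the model service.
	Model source	Two model sources are supported: open-source public models and self-owned fine-tuned models.
	Model type	When the model source is an open-source public model, the supported model types include Qwen, Llama, ChatGLM, Baichuan, Falcon, Yi, Mistral, Gemma, and DeepSeek. When the model source is a self-owned fine-tuned model, select the model type, parameter count, and precision that match your model.
	Model settings	When the model source is a self-owned fine-tuned model, select where the model is stored. For OSS, set the type to Object Storage Service and specify the OSS path of the model files.
Resource configuration	Resource selection	For an open-source public model, the system automatically recommends a suitable instance type after you select the model type. For a self-owned fine-tuned model, the system automatically sets the instance type after you configure the model type. You can also choose an instance type yourself based on the model's parameter count. For details, see How do I switch to another open-source LLM.
	Inference acceleration	Inference acceleration is available when the model type is Qwen2-7b, Qwen1.5-1.8b, Qwen1.5-7b, Qwen1.5-14b, llama3-8b, llama2-7b, llama2-13b, chatglm3-6b, baichuan2-7b, baichuan2-13b, falcon-7b, yi-6b, mistral-7b-instruct-v0.2, gemma-2b-it, gemma-7b-it, or deepseek-coder-7b-instruct-v1.5. Valid values: no acceleration; PAI-BladeLLM inference acceleration; open-source vLLM inference acceleration. Note: when inference acceleration is enabled, the deployed EAS service cannot use the LangChain feature.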

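Click Deploy.

Method 2: Custom model deployment

Log on to the PAI console, select the target region at the top of the page, select the target workspace on the right, and then click to enter EAS.
Click Deploy Service, and then in the custom model deployment section, click Custom Deployment.

On the Custom Deployment page, configure the following key parameters. For the other parameters, see Service deployment: Console.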

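Parameter		Description
Basic information	Service name	A custom name for the model service.
Environment information	Deployment method	Select image deployment and select the Enable Web App check box.
	Image configuration	In the official image list, select chat-llm-webui > chat-llm-webui:3.0. Note: because versions iterate quickly, select the latest image version when you deploy. To use inference acceleration, set the image version as follows: chat-llm-webui:3.0-vllm uses the vLLM inference acceleration engine; chat-llm-webui:3.0-blade uses the BladeLLM inference acceleration engine. Note: when inference acceleration is enabled, the deployed EAS service cannot use the LangChain feature.
	Model settings	Configure this parameter if you need to mount a custom model. For an OSS mount, set the following parameters. OSS: the OSS path that contains the custom model files, for example `oss://bucket-test/data-oss/`. Mount path: `/data`. Read-only: turn the switch off.
	Command	After you select the image version, the system automatically sets the command `python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-7B-Chat` and the port number. By default, this command starts the Qwen model with 7B parameters. For the options supported in the command, see More parameter settings. To start another open-source LLM, replace the command with the one for that model. For details, see How do I switch to another open-source LLM. To deploy a custom model, add the following parameters to the command. --model-path: set to `/data`, which must match the mount path in the model settings. --model-type: the model type. For sample commands for different model types, see Commands.
Resource deployment	Resource type	Select public resources.
	Deployment resources	The instance type must be a GPU type. For the default Qwen model with 7B parameters, ml.gu7i.c16m60.1-gu30 is recommended (most cost-effective). To deploy another open-source LLM, select an instance type that matches the model's parameter count. For guidance, see How do I switch to another open-source LLM.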

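More parameter settings

Parameter	Description	Default value
--model-path	Sets a preset model name or a custom model path. Example 1: load a preset model. EAS provides the preset meta-llama/Llama-2-* models (including 7b-hf, 7b-chat-hf, 13b-hf, and 13b-chat-hf), for example `python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf`. Example 2: load a local custom model, for example `python webui/webui_server.py --port=8000 --model-path=/llama2-7b-chat`.	The default model of the service is meta-llama/Llama-2-7b-chat-hf.
--cpu	Runs model inference on the CPU, for example `python webui/webui_server.py --port=8000 --cpu`.	GPU inference is used by default.
--precision	Sets the precision of the llama2 model. Precisions such as fp32 and fp16 are supported, for example `python webui/webui_server.py --port=8000 --precision=fp32`.	The system automatically sets the precision of 7b models based on the GPU memory size.
--port	The port that the WebUI service listens on. Example: `python webui/webui_server.py --port=8000`.	8000
--api-only	Starts the service in API-only mode. By default, the service starts both the WebUI and the API server. Example: `python webui/webui_server.py --api-only`.	False
--no-api	Starts the service in WebUI-only mode. By default, the service starts both the WebUI and the API server. Example: `python webui/webui_server.py --no-api`.	False
--max-new-tokens	The maximum number of output tokens to generate. Example: `python api/api_server.py --port=8000 --max-new-tokens=1024`.	2048
--temperature	Controls the randomness of the model output. A larger value means more randomness; 0 yields fixed output. Float in the range 0 to 1. Example: `python api/api_server.py --port=8000 --temperature=0.8`.	0.95
--max_round	The number of historical conversation rounds supported during inference. Example: `python api/api_server.py --port=8000 --max_round=10`.	5
--top_k	The number of candidate outputs to select from the generated results. Positive integer. Example: `python api/api_server.py --port=8000 --top_k=10`.	None
--top_p	Selects outputs from the generated results by percentage. Float in the range 0 to 1. Example: `python api/api_server.py --port=8000 --top_p=0.9`.	None
--no-template	Models such as Llama2 and Falcon provide a default prompt template. If this parameter is not set, the default template is used. If it is set, you must specify your own template. Example: `python api/api_server.py --port=8000 --no-template`.	The default prompt template is used.
--log-level	The log output level: DEBUG, INFO, WARNING, or ERROR. Example: `python api/api_server.py --port=8000 --log-level=DEBUG`.	INFO
--export-history-path	The EAS LLM service can export conversation records in the background. Specify the export path with this command-line parameter when you start the service. The path is usually an OSS mount path. The service exports the conversation records of each 1-hour period to a file. Example: `python api/api_server.py --port=8000 --export-history-path=/your_mount_path`.	Disabled by default
--export-interval	The interval at which records are exported, in seconds. For example, `--export-interval=3600` exports the conversation records of the last hour to a file.	3600
--backend	Sets the inference acceleration engine for EAS. Valid values: `--backend=blade` for PAI-BladeLLM inference acceleration; `--backend=vllm` for open-source vLLM inference acceleration. Note: inference acceleration is supported only when the model type is Qwen2-7b, Qwen1.5-1.8b, Qwen1.5-7b, Qwen1.5-14b, llama3-8b, llama2-7b, llama2-13b, chatglm3-6b, baichuan2-7b, baichuan2-13b, falcon-7b, yi-6b, mistral-7b-instruct-v0.2, gemma-2b-it, gemma-7b-it, or deepseek-coder-7b-instruct-v1.5.	No acceleration by default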

Commands

Model type	Command
Llama2	`python webui/webui_server.py --port=8000 --model-path=/data --model-type=llama2`
ChatGLM2	`python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm2`
ChatGLM3	`python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm3`
Qwen (Tongyi Qianwen)	`python webui/webui_server.py --port=8000 --model-path=/data --model-type=qwen`
ChatGLM	`python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm`
Falcon-7B	`python webui/webui_server.py --port=8000 --model-path=/data --model-type=falcon`
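The commands in the table differ only in the `--model-type` value, so they can be generated. The helper below is a hypothetical sketch: its name is an assumption, and the set of model types contains only the values listed in the table above.

```python
import shlex

# --model-type values taken from the table above.
MODEL_TYPES = {"llama2", "chatglm2", "chatglm3", "qwen", "chatglm", "falcon"}


def build_run_command(model_type: str, mount_path: str = "/data",
                      port: int = 8000) -> str:
    """Build the command for a custom model mounted at `mount_path`."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type}")
    args = [
        "python", "webui/webui_server.py",
        f"--port={port}",
        f"--model-path={mount_path}",   # must match the mount path in the model settings
        f"--model-type={model_type}",
    ]
    # Quote each argument so that paths with spaces survive a shell.
    return " ".join(shlex.quote(arg) for arg in args)
```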

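Click Deploy.

Call the EAS service

Call the EAS service through the WebUI

In the Service Type column of the target service, click View Web App.
On the WebUI page, verify model inference.
In the text box on the ChatLLM-WebUI page, enter a message, for example "Please provide a study plan for personal finance", and click Send to start the conversation.
Use LangChain to integrate your own business data and generate customized answers based on a local knowledge base.
1. On the tab bar at the top of the WebUI page, select LangChain.
2. In the lower-left corner of the WebUI page, follow the on-screen instructions to upload your custom data. Files in .txt, .md, .docx, and .pdf formats are supported.
  For example, upload a README.md file and click Vectorstore knowledge in the lower-left corner. The returned result indicates that the custom data has been loaded.
3. In the input box at the bottom of the WebUI page, enter a question related to your business data.
  For example, enter "How do I install deepspeed" in the input box and click Send to start the conversation.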

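Call the EAS service through the API

Obtain the service endpoint and token.
1. Go to the Elastic Algorithm Service (EAS) page. For details, see Deploy the EAS service.
2. On that page, click the name of the target service to open the service details page.
3. In the basic information section, click View Call Information, and obtain the service token and endpoint on the public endpoint tab.

Call the API to perform model inference.

Call the service over HTTP

Non-streaming calls

The client uses standard HTTP. When you call the service with curl, the following two request types are supported:

Send a string request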
```
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
```
Where: replace $authorization with the service token and $host with the service endpoint. chatllm_data.txt is a plain text file that contains the question.

Send a structured request

```
curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"
```

Use the chatllm_data.json file to set the inference parameters. The file has the following format:

```
{
  "max_new_tokens": 4096,
  "use_stream_chat": false,
  "prompt": "How to install it?",
  "system_prompt": "Act like you are programmer with 5+ years of experience.",
  "history": [
    [
      "Can you tell me what's the bladellm?",
      "BladeLLM is an framework for LLM serving, integrated with acceleration techniques like quantization, ai compilation, etc. , and supporting popular LLMs like OPT, Bloom, LLaMA, etc."
    ]
  ],
  "temperature": 0.8,
  "top_k": 10,
  "top_p": 0.8,
  "do_sample": true,
  "use_cache": true
}
```

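The parameters are described below. Add or remove them as needed.

Parameter	Description	Default value
max_new_tokens	The maximum number of output tokens to generate.	2048
use_stream_chat	Whether to use streaming output.	true
prompt	The user prompt.	""
system_prompt	The system prompt.	""
history	The conversation history, of type List[Tuple(str, str)].	[()]
temperature	Controls the randomness of the model output. A larger value means more randomness; 0 yields fixed output. Float in the range 0 to 1.	0.95
top_k	The number of candidate outputs to select from the generated results.	30
top_p	Selects outputs from the generated results by percentage. Float in the range 0 to 1.	0.8
do_sample	Enables output sampling.	true
use_cache	Enables the KV cache.	true

You can also implement your own client with the Python requests package. Sample code: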
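Before sending a structured request, it can help to validate the parameters on the client. The following is a minimal sketch: `build_payload` is a hypothetical helper, the allowed keys are the ones in the table above, and only the explicitly set parameters are sent so that the service applies its own defaults for the rest.

```python
import json

# Request parameters documented in the table above.
ALLOWED_KEYS = {
    "max_new_tokens", "use_stream_chat", "system_prompt", "history",
    "temperature", "top_k", "top_p", "do_sample", "use_cache",
}


def build_payload(prompt: str, **params) -> dict:
    """Return a request body containing the prompt and the given parameters."""
    unknown = set(params) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # temperature and top_p are documented as floats in the range 0 to 1.
    for name in ("temperature", "top_p"):
        if name in params and not 0 <= params[name] <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    return {"prompt": prompt, **params}


if __name__ == "__main__":
    body = build_payload("How to install it?", temperature=0.8, use_stream_chat=False)
    # Write the body to the file that the curl command reads.
    with open("chatllm_data.json", "w", encoding="utf-8") as f:
        json.dump(body, f, ensure_ascii=False, indent=2)
```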

import argparse
import json
from typing import Iterable, List

import requests

def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
                It is known for its iconic landmarks, such as the Golden Gate Bridge \
                and Alcatraz Island, as well as its vibrant culture, diverse population, \
                and tech industry. The city is also home to many famous companies and \
                startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers,
                             json=pload, stream=use_stream_chat)
    return response

# The server returns a JSON response that contains the inference result and the conversation history.
def get_response(response: requests.Response) -> tuple:
    data = json.loads(response.content)
    output = data["response"]
    history = data["history"]
    return output, history

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")

    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = False
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

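    host = "<EAS service public endpoint>"
    authorization = "<EAS service public token>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    # The system prompt of the language model can be set in the client request.
    system_prompt = "Act like you are programmer with \
                5+ years of experience."

    # The conversation history can be set in the client request. The client keeps the current user's conversation records to implement multi-round conversations. Typically, reuse the history returned by the previous round. The history format is List[Tuple(str, str)].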
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)
    output, history = get_response(response)
    print(f" --- output: {output} \n --- history: {history}", flush=True)

Where:

host: the service endpoint.
authorization: the service token.

Streaming calls

Streaming calls use HTTP SSE. The other settings are the same as for non-streaming calls. Sample code:
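Multi-round conversations work by passing the history returned by one round into the next request. The loop below sketches that pattern in isolation. `run_conversation` and `fake_send` are hypothetical names; `fake_send` is an offline stand-in so the example runs without a deployed service.

```python
from typing import Callable, List, Tuple


def run_conversation(prompts: List[str], send: Callable) -> Tuple[List[str], list]:
    """Send prompts one by one, feeding each returned history into the next call."""
    history: list = []
    outputs: List[str] = []
    for prompt in prompts:
        output, history = send(prompt, history)
        outputs.append(output)
    return outputs, history


def fake_send(prompt: str, history: list):
    """Offline stand-in for the service: echoes the prompt and records the round."""
    output = f"echo: {prompt}"
    return output, history + [[prompt, output]]


if __name__ == "__main__":
    outputs, history = run_conversation(["Hello", "Summarize our chat"], fake_send)
    print(outputs)
    print(history)
```

Against a real service, `send` would wrap `post_http_request` and `get_response` from the sample code. Note that the sample `post_http_request` substitutes an example history when the history is empty, so the first round is not sent with a blank history.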

import argparse
import json
from typing import Iterable, List

import requests


def clear_line(n: int = 1) -> None:
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
                It is known for its iconic landmarks, such as the Golden Gate Bridge \
                and Alcatraz Island, as well as its vibrant culture, diverse population, \
                and tech industry. The city is also home to many famous companies and \
                startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers,
                             json=pload, stream=use_stream_chat)
    return response


def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["response"]
            history = data["history"]
            yield output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = True
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<EAS service public endpoint>"
    authorization = "<EAS service public token>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    system_prompt = "Act like you are programmer with \
                5+ years of experience."
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    for h, history in get_streaming_response(response):
        print(
            f" --- stream line: {h} \n --- history: {history}", flush=True)

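Where:

host: the service endpoint.
authorization: the service token.

Call the service over WebSocket

To better maintain the conversation state, you can also use WebSocket to keep a connection to the service open and complete single-round or multi-round conversations. Sample code: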
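The streaming sample reads chunks separated by a null byte (`b"\0"`) and decodes each one as JSON. The same parsing can be exercised offline on a byte string, as sketched below. `parse_stream` is a hypothetical helper and the payload in the usage example is made up; whether each chunk carries the full text so far or only the new part is not stated here, so check the actual responses of your service.

```python
import json
from typing import Iterator, Tuple


def parse_stream(raw: bytes, delimiter: bytes = b"\0") -> Iterator[Tuple[str, list]]:
    """Yield (response, history) for every non-empty delimited JSON chunk."""
    for chunk in raw.split(delimiter):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            yield data["response"], data["history"]


if __name__ == "__main__":
    # Made-up payload that imitates two streamed chunks.
    raw = (b'{"response": "Hel", "history": []}\0'
           b'{"response": "Hello", "history": [["hi", "Hello"]]}\0')
    for text, history in parse_stream(raw):
        print(text, history)
```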

import os
import time
import json
import struct
from multiprocessing import Process

import websocket

round = 5
questions = 0


def on_message_1(ws, message):
    if message == "<EOS>":
        print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
              time.time(), message), flush=True)
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    else:
        print("{}".format(time.time()))
        print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
              time.time(), message), flush=True)


def on_message_2(ws, message):
    global questions
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    if message == "<EOS>":
        questions = questions + 1
        if questions == 5:
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_message_3(ws, message):
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_error(ws, error):
    print('error happened: ', str(error))


def on_close(ws, a, b):
    print("### closed ###", a, b)


def on_pong(ws, pong):
    print('pong:', pong)

# stream chat validation test
def on_open_1(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Show me a golang code example: """
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['do_sample'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    # raw_req = f"""To open a Websocket connection to the server: """

    ws.send(raw_req)
    # end the client-side streaming


# multi-round query validation test
def on_open_2(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {"max_new_tokens": 6144}
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['use_stream_chat'] = True
    params_dict['prompt'] = "Hello!"
    # Add the system prompt to the existing parameters instead of replacing the dict,
    # so that the sampling parameters and the first prompt are not discarded.
    params_dict['system_prompt'] = (
        "Act like you are programmer with 5+ years of experience."
    )
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please write a sorting algorithm in Python."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please rewrite it in Java."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please introduce yourself."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please summarize the conversation above."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# Langchain validation test.
def on_open_3(ws):
    global round
    print('Opening Websocket connection to the server ... ')

    params_dict = {}
    # params_dict['prompt'] = """To open a Websocket connection to the server: """
    params_dict['prompt'] = """Can you tell me what's the MNN?"""
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['use_stream_chat'] = False
    params_dict['langchain'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


authorization = ""
host = "ws://" + ""


def single_call(on_open_func, on_message_func, on_close_func=on_close):
    ws = websocket.WebSocketApp(
        host,
        on_open=on_open_func,
        on_message=on_message_func,
        on_error=on_error,
        on_pong=on_pong,
        on_close=on_close_func,
        header=[
            'Authorization: ' + authorization],
    )

    # setup ping interval to keep long connection.
    ws.run_forever(ping_interval=2)


if __name__ == "__main__":
    for i in range(5):
        p1 = Process(target=single_call, args=(on_open_1, on_message_1))
        p2 = Process(target=single_call, args=(on_open_2, on_message_2))
        p3 = Process(target=single_call, args=(on_open_3, on_message_3))

        p1.start()
        p2.start()
        p3.start()

        p1.join()
        p2.join()
        p3.join()

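Where:

authorization: the service token.
host: the service endpoint, with the leading http in the address replaced by ws.
use_stream_chat: this request parameter controls whether the client receives streaming output. The default value is True, which means that the server returns streaming data.
To implement multi-round conversations, follow the implementation of the on_open_2 function in the sample code above.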
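Replacing the scheme of the endpoint can be done with the standard library instead of string slicing. In this sketch, `to_ws_url` is a hypothetical helper and the endpoint in the usage line is a made-up example. Mapping https to wss follows the usual WebSocket convention and is an assumption here, since the text above only mentions http.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ws_url(http_url: str) -> str:
    """Convert an http(s) endpoint into the matching ws(s) endpoint."""
    parts = urlsplit(http_url)
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"expected an http or https URL, got: {http_url}")
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(to_ws_url("http://example.com/api/predict/my_llm_service"))
```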

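Common problems and solutions

How do I switch to another open-source LLM

Perform the following steps:

In the Actions column of the target service, click Update.

Switch to another open-source LLM.

Scenario-based model deployment
On the LLM Deployment page, change the model type to another open-source LLM, and then click Update.

Custom model deployment

On the Update Service page, update the command and the instance type for the model that you want to deploy according to the following table, and then click Update.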

Model name	Command	Recommended GPUs
Qwen2-7b (Qwen2, 7B parameters)	`python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen2-7B-Instruct`	1 × GU30, 1 × A10, 1 × V100 (32 GB)
Qwen2-72b (Qwen2, 72B parameters)	`python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen2-72B-Instruct`	2 × A100 (80 GB), 4 × A100 (40 GB), 8 × V100 (32 GB)
Qwen2-57b-A14b	`python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen2-57B-A14B-Instruct`	2 × A100 (80 GB), 4 × A100 (40 GB), 4 × V100 (32 GB)
Qwen1.5-1.8b (Qwen1.5, 1.8B parameters)	`python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-1.8B-Chat`	1 × T4, 1 × V100 (16 GB), 1 × GU30, 1 × A10
Qwen1.5-7b (Qwen1.5, 7B parameters)	`python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-7B-Chat`	1 × GU30, 1 × A10
Qwen1.5-14b (Qwen1.5, 14B parameters)	`python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-14B-Chat`	1 × V100 (32 GB), 1 × A100 (40 GB), 1 × A100 (80 GB), 2 × GU30, 2 × A10
Qwen1.5-32b (Qwen1.5, 32B parameters)	`python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-32B-Chat`	1 × A100 (80 GB), 4 × V100 (32 GB)
Qwen1.5-72b (Qwen1.5, 72B parameters)	`python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-72B-Chat`	8 × V100 (32 GB), 2 × A100 (80 GB), 4 × A100 (40 GB)
Qwen1.5-110b (Qwen1.5, 110B parameters)	`python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-110B-Chat`	8 × A100 (40 GB), 4 × A100 (80 GB)
llama3-8b	`python webui/webui_server.py --port=8000 --model-path=/huggingface/meta-Llama-3-8B-Instruct/ --model-type=llama3`	1 × GU30, 1 × A10, 1 × V100 (32 GB)
llama3-70b	`python webui/webui_server.py --port=8000 --model-path=/huggingface/meta-Llama-3-70B-Instruct/ --model-type=llama3`	2 × A100 (80 GB), 4 × A100 (40 GB), 8 × V100 (32 GB)
Llama2-7b	`python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf`	1 × GU30, 1 × A10, 1 × V100 (32 GB)
Llama2-13b	`python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf`	1 × V100 (32 GB), 2 × GU30, 2 × A10
llama2-70b	`python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-70b-chat-hf`	8 × V100 (32 GB), 2 × A100 (80 GB), 4 × A100 (40 GB)
chatglm3-6b	`python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm3-6b`	1 × GU30, 1 × A10, 1 × V100 (16 GB), 1 × V100 (32 GB)
baichuan2-7b	`python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-7B-Chat`	1 × GU30, 1 × A10, 1 × V100 (32 GB)
baichuan2-13b	`python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-13B-Chat`	2 × GU30, 2 × A10, 1 × V100 (32 GB)
falcon-7b	`python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-7b-instruct`	1 × GU30, 1 × A10, 1 × V100 (32 GB)
falcon-40b	`python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-40b-instruct`	8 × V100 (32 GB), 2 × A100 (80 GB), 4 × A100 (40 GB)
falcon-180b	`python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-180B-chat`	8 × A100 (80 GB)
Yi-6b	`python webui/webui_server.py --port=8000 --model-path=01-ai/Yi-6B-Chat`	1 × GU30, 1 × A10, 1 × V100 (16 GB), 1 × V100 (32 GB)
Yi-34b	`python webui/webui_server.py --port=8000 --model-path=01-ai/Yi-34B-Chat`	4 × V100 (16 GB), 1 × A100 (80 GB), 4 × A10
mistral-7b-instruct-v0.2	`python webui/webui_server.py --port=8000 --model-path=mistralai/Mistral-7B-Instruct-v0.2`	1 × GU30, 1 × A10, 1 × V100 (32 GB)
mixtral-8x7b-instruct-v0.1	`python webui/webui_server.py --port=8000 --model-path=mistralai/Mixtral-8x7B-Instruct-v0.1`	4 × A100 (80 GB)
gemma-2b-it	`python webui/webui_server.py --port=8000 --model-path=google/gemma-2b-it`	1 × T4, 1 × V100 (16 GB), 1 × GU30, 1 × A10
gemma-7b-it	`python webui/webui_server.py --port=8000 --model-path=google/gemma-7b-it`	1 × GU30, 1 × A10, 1 × V100 (32 GB)
deepseek-coder-7b-instruct-v1.5	`python webui/webui_server.py --port=8000 --model-path=deepseek-ai/deepseek-coder-7b-instruct-v1.5`	1 × GU30, 1 × A10, 1 × V100 (32 GB)
deepseek-coder-33b-instruct	`python webui/webui_server.py --port=8000 --model-path=deepseek-ai/deepseek-coder-33b-instruct`	1 × A100 (80 GB), 2 × A100 (40 GB), 4 × V100 (32 GB)
deepseek-v2-lite	`python webui/webui_server.py --port=8000 --model-path=deepseek-ai/DeepSeek-V2-Lite-Chat`	1 × A10, 1 × A100 (40 GB)
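When a model is not in the table, a back-of-the-envelope estimate of the memory needed for the weights can help narrow the choice of GPU. This is a rough sketch, not the method behind the recommendations above: it counts only the weights (parameter count times bytes per parameter) and ignores the KV cache, activations, and framework overhead, so the real requirement is higher. The function name is hypothetical.

```python
# Bytes used by one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (7, 14, 72):
        print(f"{size}B parameters at fp16: about {weight_memory_gb(size)} GB of weights")
```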
