Elastic Algorithm Service (EAS) of Platform for AI (PAI) provides a scenario-based deployment mode that allows you to deploy an open source large language model (LLM) by configuring several parameters. This topic describes how to use EAS to deploy and call an LLM. This topic also provides answers to some frequently asked questions.
Background information
The application of LLMs, such as the Generative Pre-trained Transformer (GPT) and TongYi Qianwen (Qwen) series of models, has garnered significant attention, especially in inference tasks. EAS allows you to easily deploy open source LLMs as an inference service. Supported LLMs include Llama 3, Qwen, Llama 2, ChatGLM, Baichuan, Yi-6B, Mistral-7B, and Falcon-7B. You can call the models by using the WebUI or API operations, and you can also use the LangChain framework to generate custom responses based on your business data.
What is LangChain:
LangChain is an open source framework that allows AI developers to integrate LLMs like GPT-4 with external data to improve performance and optimize resource utilization.
How does LangChain work:
LangChain splits the source data (such as a 20-page PDF file) into smaller chunks, converts the chunks into numerical vectors by using embedding models (such as BGE and text2vec), and then stores the vectors in a vector database.
This way, the LLM can use the data in the vector database to generate responses. For each user query, LangChain retrieves the chunks that are most relevant to the query from the vector database, includes the retrieved information and the query in a prompt, and then sends the prompt to the LLM to generate an answer.
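The split, embed, store, and retrieve flow described above can be sketched in plain Python. This is a minimal illustration and does not use the LangChain API: the bag-of-words `embed` function is a toy stand-in for a real embedding model such as BGE or text2vec, and `ToyVectorStore`, `split_into_chunks`, and `build_prompt` are hypothetical names introduced only for this example.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_into_chunks(text: str, chunk_size: int = 40) -> List[str]:
    """Split the source text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real deployment would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database (hypothetical helper)."""

    def __init__(self) -> None:
        self.items: List[Tuple[Counter, str]] = []

    def add(self, chunks: List[str]) -> None:
        for chunk in chunks:
            self.items.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 1) -> List[str]:
        query_vector = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vector, item[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


def build_prompt(query: str, store: ToyVectorStore, k: int = 1) -> str:
    """Combine the retrieved chunks and the query into one prompt for the LLM."""
    context = "\n".join(store.search(query, k))
    return f"Answer the question based on the context.\nContext:\n{context}\nQuestion: {query}"


if __name__ == "__main__":
    store = ToyVectorStore()
    source = "DeepSpeed can be installed with pip. EAS deploys models as online services."
    store.add(split_into_chunks(source, chunk_size=6))
    print(build_prompt("How to install deepspeed", store))
```

The prompt that `build_prompt` returns is what would be sent to the LLM. The model itself is not retrained; it only sees the retrieved text as additional context.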
Prerequisites
If you want to deploy a custom model, make sure the following prerequisites are met:
Model files and related configuration files are prepared. The following figure is a sample of model files.
The config.json file must be included in the configuration files. You must configure the config.json file based on the Huggingface model format. For more information about the sample file, see config.json.
An Object Storage Service (OSS) bucket or a NAS file system is created to store the model files. You can also register the model files as AI assets of PAI for easier management and maintenance. For more information about how to create an OSS bucket, see Get started by using the OSS console.
The model files and configuration files are uploaded to the OSS bucket. For more information, see Get started by using the OSS console.
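Before you upload a custom model, it can help to confirm locally that the model directory contains a parseable config.json file. The following sketch only checks that the file exists and holds a JSON object. The function name `check_model_dir` is hypothetical, and the sketch does not validate the individual fields that the Hugging Face model format expects.

```python
import json
from pathlib import Path


def check_model_dir(model_dir: str) -> dict:
    """Return the parsed config.json, or raise if it is missing or not a JSON object."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"config.json not found in {model_dir}")
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    if not isinstance(config, dict):
        raise ValueError("config.json must contain a JSON object")
    return config


if __name__ == "__main__":
    import sys

    # Usage: python check_model_dir.py /path/to/local/model
    config = check_model_dir(sys.argv[1])
    print("config.json keys:", sorted(config))
```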
Limits
The inference acceleration engines provided by EAS support only the following models: Qwen2-7b, Qwen1.5-1.8b, Qwen1.5-7b, Qwen1.5-14b, llama3-8b, llama2-7b, llama2-13b, chatglm3-6b, baichuan2-7b, baichuan2-13b, falcon-7b, yi-6b, mistral-7b-instruct-v0.2, gemma-2b-it, gemma-7b-it, deepseek-coder-7b-instruct-v1.5.
The LangChain framework is not supported by the inference acceleration engines.
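The two limits above can be expressed as a small pre-deployment check. This is a sketch only: the model names are copied from the list above, while `can_accelerate` and `check_deployment` are hypothetical helpers and not part of any EAS SDK.

```python
from typing import List

# Models supported by the inference acceleration engines, as listed above.
ACCELERATED_MODELS = frozenset({
    "qwen2-7b", "qwen1.5-1.8b", "qwen1.5-7b", "qwen1.5-14b",
    "llama3-8b", "llama2-7b", "llama2-13b", "chatglm3-6b",
    "baichuan2-7b", "baichuan2-13b", "falcon-7b", "yi-6b",
    "mistral-7b-instruct-v0.2", "gemma-2b-it", "gemma-7b-it",
    "deepseek-coder-7b-instruct-v1.5",
})


def can_accelerate(model_name: str) -> bool:
    """Return True if the model is in the supported list (case-insensitive)."""
    return model_name.lower() in ACCELERATED_MODELS


def check_deployment(model_name: str, acceleration: bool, use_langchain: bool) -> List[str]:
    """Return a list of conflicts between the planned settings and the limits."""
    problems = []
    if acceleration and not can_accelerate(model_name):
        problems.append(f"{model_name} is not supported by the inference acceleration engines")
    if acceleration and use_langchain:
        problems.append("LangChain cannot be used when inference acceleration is enabled")
    return problems


if __name__ == "__main__":
    print(check_deployment("Qwen1.5-7b", acceleration=True, use_langchain=True))
```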
Deploy an LLM in EAS
The following deployment methods are supported:
Method 1: Scenario-based model deployment (recommended)
Log on to the PAI console. Select a region and a workspace. Then, click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section, select LLM Deployment.
On the LLM Deployment page, configure the following key parameters. For information about other parameters, see Deploy a model service in the PAI console.
Parameter
Description
Basic Information
Service Name
Specify a name for the model service.
Model Source
Source of the model. Valid values:
Open Source Model
Custom fine-tuned Model
Model Type
If you set Model Source to Open Source Model, you can select models of the following types: Qwen, Llama, ChatGLM, Baichuan, Falcon, Yi, Mistral, Gemma, and DeepSeek.
If you set Model Source to Custom fine-tuned Model, you need to specify the Model, Parameter Quantity, and Precision parameters based on your custom model.
Model Settings
If you set Model Source to Custom fine-tuned Model, you need to specify the storage location of the model. Take OSS as an example: set Type to Mount OSS and select the OSS directory where your model files are stored.
Resource Configuration
Resource Configuration
If you set Model Source to Open Source Model, the system recommends appropriate resource configurations after you select a Model Type.
If you set Model Source to Custom fine-tuned Model, the system automatically generates resource configurations after you configure Model Type. You can also specify Resource Configuration based on the parameter quantity of your model. For more information, see the How do I switch to another open source LLM? section of this topic.
Inference Acceleration
If you set Model Type to Qwen2-7b, Qwen1.5-1.8b, Qwen1.5-7b, Qwen1.5-14b, llama3-8b, llama2-7b, llama2-13b, chatglm3-6b, baichuan2-7b, baichuan2-13b, falcon-7b, yi-6b, mistral-7b-instruct-v0.2, gemma-2b-it, gemma-7b-it, or deepseek-coder-7b-instruct-v1.5, the Inference Acceleration feature is supported. Valid values:
Not Accelerated
BladeLLM Inference Acceleration
Open-source vLLM Inference Acceleration
Note: If you enable the Inference Acceleration feature, the deployed EAS service cannot use the LangChain framework.
Click Deploy.
Method 2: Custom deployment
Log on to the PAI console. Select a region and a workspace. Then, click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
On the Create Service page, configure the following key parameters. For information about other parameters, see Deploy a model service in the PAI console.
Parameter
Description
Model Service Information
Service Name
Specify a name for the model service.
Deployment Method
Select Deploy Web App by Using Image.
Select Image
Select chat-llm-webui from the PAI Image drop-down list and select 3.0 as the image version.
Note: The image version is frequently updated. We recommend that you select the latest version.
If you want to enable the Inference Acceleration feature, select one of the following image versions:
3.0-blade: uses the BladeLLM inference acceleration engine.
3.0-vllm: uses the vLLM inference acceleration engine.
Note: If you enable the Inference Acceleration feature, the deployed EAS service cannot use the LangChain framework.
Specify Model Settings
If you need to deploy a custom model, click Specify Model Settings and configure your model. Take OSS as an example. Configure the following parameters:
Select Mount OSS Path and select the OSS directory where the model files are stored. Example: oss://bucket-test/data-oss/.
Set Mount Path to /data.
Do not select Enable Read-only Mode.
Command
After you select an image version, the system automatically configures the following command and the port:
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-7B-Chat
By default, the command deploys the Qwen1.5-7B-Chat model. For information about the parameters of the command, see the More parameters section of this topic.
If you want to deploy another open source LLM, specify the Command for that model. For more information, see the How do I switch to another open source LLM? section of this topic.
If you need to deploy a custom model, add the following parameters to Command:
--model-path: Set this parameter to /data, which is the value of the Mount Path parameter.
--model-type: Specify the type of the model.
For information about the commands for different types of models, see the Command section of this topic.
Resource Deployment Information
Resource Configuration Mode
Select General.
Resource Configuration
You must select a GPU type. For best cost-effectiveness, we recommend that you use the ml.gu7i.c16m60.1-gu30 Instance Type to deploy the Qwen-7B models.
To deploy another open source model, select an instance type that matches the parameter quantity of the model. For more information, see the How do I switch to another open source LLM? section of this topic.
Click Deploy.
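For a custom model, the Mount Path and Command settings in Method 2 depend on each other: the files in the mounted OSS directory appear under the mount path, and --model-path must point to that path. The following sketch shows the relationship. `to_mounted_path` and `build_command` are hypothetical helpers, the sketch assumes that objects under the mounted directory map one-to-one to files under the mount path, and the `llama3` model type is taken from the command table in the FAQ purely as an illustration.

```python
import posixpath
import shlex
from typing import Optional


def to_mounted_path(oss_uri: str, oss_mount_source: str, mount_path: str) -> str:
    """Map an object under the mounted OSS directory to its path inside the service."""
    source = oss_mount_source.rstrip("/") + "/"
    if not oss_uri.startswith(source):
        raise ValueError(f"{oss_uri} is not under {source}")
    relative = oss_uri[len(source):]
    return posixpath.join(mount_path, relative) if relative else mount_path


def build_command(model_path: str, model_type: Optional[str] = None, port: int = 8000) -> str:
    """Compose the startup command in the same shape as the default command."""
    args = ["python", "webui/webui_server.py", f"--port={port}", f"--model-path={model_path}"]
    if model_type:
        args.append(f"--model-type={model_type}")
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Example values from this topic: OSS directory and mount path.
    print(to_mounted_path("oss://bucket-test/data-oss/config.json",
                          "oss://bucket-test/data-oss/", "/data"))
    print(build_command("/data", model_type="llama3"))
```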
Call an EAS service
Call EAS services by using the web UI
Find the deployed service and click View Web App in the Service Type column.
Test the inference performance on the WebUI page.
Enter a sentence in the input text box and click Send to start a conversation. Sample input: Provide a learning plan for personal finance.
Use the LangChain framework to integrate your own business data into the service and generate customized answers based on your local knowledge base.
On the WebUI page of the service that you deployed, click the LangChain tab.
In the lower-left corner of the ChatLLM-LangChain-WebUI page, follow the on-screen instructions to upload a knowledge base. You can upload files in the following formats: TXT, Markdown, DOCX, and PDF.
For example, you can upload a README.md file and click Vectorstore knowledge. The following result indicates that the data in the file is loaded.
Enter a question about the data you uploaded in the input text box and click Send to start a conversation.
Sample input: How to install deepspeed.
Call EAS services by using API operations
Obtain the service endpoint and token.
Go to the Elastic Algorithm Service (EAS) page. For more information, see the Deploy an LLM in EAS section of this topic.
Click the name of the service to go to the Service Details tab.
In the Basic Information section, click Invocation Method. On the Public Endpoint tab of the dialog box that appears, obtain the token and endpoint.
To call API operations to perform inference, use one of the following methods:
Use HTTP
Non-streaming mode
You can run cURL commands to send the following types of standard HTTP requests.
STRING requests
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
Replace $authorization with the token. Replace $host with the endpoint. The chatllm_data.txt file is a plain text file that contains the prompt.
Structured requests
curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"
Use the chatllm_data.json file to configure inference parameters. The following sample code provides a format example of the chatllm_data.json file:
{
  "max_new_tokens": 4096,
  "use_stream_chat": false,
  "prompt": "How to install it?",
  "system_prompt": "Act like you are a programmer with 5+ years of experience.",
  "history": [
    [
      "Can you tell me what's the bladellm?",
      "BladeLLM is a framework for LLM serving, integrated with acceleration techniques like quantization, AI compilation, etc., and supporting popular LLMs like OPT, Bloom, LLaMA, etc."
    ]
  ],
  "temperature": 0.8,
  "top_k": 10,
  "top_p": 0.8,
  "do_sample": true,
  "use_cache": true
}
The following table describes the parameters in the preceding code. Configure the parameters based on your business requirements.
| Parameter | Description | Default value |
| --- | --- | --- |
| max_new_tokens | The maximum number of output tokens. | 2048 |
| use_stream_chat | Specifies whether to return the output tokens in streaming mode. | true |
| prompt | The user prompt. | "" |
| system_prompt | The system prompt. | "" |
| history | The dialogue history. The value is in the List[Tuple(str, str)] format. | [()] |
| temperature | The randomness of the model output. A larger value produces more random output. A value of 0 produces a fixed output. The value is of the Float type and ranges from 0 to 1. | 0.95 |
| top_k | The number of highest-probability candidate tokens from which each output token is sampled. | 30 |
| top_p | The cumulative probability threshold of the candidate tokens from which each output token is sampled. The value is of the Float type and ranges from 0 to 1. | 0.8 |
| do_sample | Specifies whether to enable output sampling. | true |
| use_cache | Specifies whether to enable KV cache. | true |
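The parameter defaults and value ranges above can be captured in a small helper that builds the request body before it is written to chatllm_data.json or sent to the service. This is a sketch under stated assumptions: `build_payload` is a hypothetical helper, the empty list used for `history` is an assumption for "no history", and the range checks only mirror the ranges documented above, not validation that the service itself performs.

```python
import copy
import json

# Defaults as documented in the parameter table above.
DEFAULTS = {
    "max_new_tokens": 2048,
    "use_stream_chat": True,
    "prompt": "",
    "system_prompt": "",
    "history": [],  # assumption: an empty list means no dialogue history
    "temperature": 0.95,
    "top_k": 30,
    "top_p": 0.8,
    "do_sample": True,
    "use_cache": True,
}


def build_payload(prompt: str, **overrides) -> dict:
    """Merge overrides into the documented defaults and check the documented ranges."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = copy.deepcopy(DEFAULTS)
    payload.update(overrides)
    payload["prompt"] = prompt
    for name in ("temperature", "top_p"):
        if not 0 <= payload[name] <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    if payload["max_new_tokens"] <= 0:
        raise ValueError("max_new_tokens must be positive")
    return payload


if __name__ == "__main__":
    body = build_payload("How to install it?", use_stream_chat=False, temperature=0.8, top_k=10)
    print(json.dumps(body, indent=2))
```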
You can also implement your own client based on the Python requests package. Example:
import argparse
import json
from typing import Tuple

import requests


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    # If no history is passed in, a sample round of dialogue is used.
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. "
                "It is known for its iconic landmarks, such as the Golden Gate Bridge "
                "and Alcatraz Island, as well as its vibrant culture, diverse population, "
                "and tech industry. The city is also home to many famous companies and "
                "startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers,
                             json=pload, stream=use_stream_chat)
    return response


# The server returns a JSON response that includes the inference result and dialogue history.
def get_response(response: requests.Response) -> Tuple[str, list]:
    data = json.loads(response.content)
    output = data["response"]
    history = data["history"]
    return output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = False
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<EAS service public endpoint>"
    authorization = "<EAS service public token>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    # System prompts can be included in the requests.
    system_prompt = "Act like you are a programmer with 5+ years of experience."
    # Dialogue history can be included in the requests. The client manages the
    # history to implement multi-round dialogues. In most cases, the information
    # from the previous round of dialogue is used. The information is in the
    # List[Tuple(str, str)] format.
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)
    output, history = get_response(response)
    print(f" --- output: {output} \n --- history: {history}", flush=True)
Take note of the following parameters:
Set the host parameter to the service endpoint.
Set the authorization parameter to the service token.
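Because the client manages the dialogue history, a multi-round dialogue amounts to sending back the history that the previous response returned. The following sketch isolates that logic from the network call so that it can be tried without a deployed service. `ChatSession` and `fake_send` are hypothetical names; in real use, the `send` callable would post the payload to the service endpoint and return the parsed JSON, which is assumed here to contain the `response` and `history` fields as in the sample client.

```python
from typing import Callable


class ChatSession:
    """Carries the dialogue history from one request to the next."""

    def __init__(self, send: Callable[[dict], dict], system_prompt: str = "") -> None:
        self._send = send
        self.system_prompt = system_prompt
        self.history: list = []

    def ask(self, prompt: str) -> str:
        payload = {
            "prompt": prompt,
            "system_prompt": self.system_prompt,
            "history": self.history,
            "use_stream_chat": False,
        }
        data = self._send(payload)
        # Keep the history returned by the server for the next round.
        self.history = data["history"]
        return data["response"]


def fake_send(payload: dict) -> dict:
    """Stand-in for the HTTP call: echoes the prompt and appends one round of history."""
    reply = f"echo: {payload['prompt']}"
    return {"response": reply, "history": payload["history"] + [[payload["prompt"], reply]]}


if __name__ == "__main__":
    session = ChatSession(fake_send, system_prompt="Act like you are a programmer.")
    print(session.ask("Please write a sorting algorithm in Python."))
    print(session.ask("Please convert the programming language to Java."))
    print(len(session.history), "rounds stored")
```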
Streaming mode
In streaming mode, the HTTP SSE method is used. Sample code:
import argparse
import json
from typing import Iterable, Tuple

import requests


def clear_line(n: int = 1) -> None:
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    # If no history is passed in, a sample round of dialogue is used.
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. "
                "It is known for its iconic landmarks, such as the Golden Gate Bridge "
                "and Alcatraz Island, as well as its vibrant culture, diverse population, "
                "and tech industry. The city is also home to many famous companies and "
                "startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers,
                             json=pload, stream=use_stream_chat)
    return response


# Each message in the stream is a JSON object terminated by a NUL byte.
def get_streaming_response(response: requests.Response) -> Iterable[Tuple[str, list]]:
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["response"]
            history = data["history"]
            yield output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = True
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = ""           # Set this to the service endpoint.
    authorization = ""  # Set this to the service token.

    print(f"Prompt: {prompt!r}\n", flush=True)
    system_prompt = "Act like you are a programmer with 5+ years of experience."
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    for h, history in get_streaming_response(response):
        print(f" --- stream line: {h} \n --- history: {history}", flush=True)
Take note of the following parameters:
Set the host parameter to the service endpoint.
Set the authorization parameter to the service token.
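The streaming client above relies on `iter_lines` with a NUL delimiter to separate messages. The following sketch shows the same framing logic on raw byte chunks, which can be useful if you read the stream with a different HTTP library. `iter_stream_messages` is a hypothetical helper, and the sketch assumes, as in the sample client, that every message is a UTF-8 JSON object with `response` and `history` fields followed by a NUL byte.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_stream_messages(byte_chunks: Iterable[bytes],
                         delimiter: bytes = b"\0") -> Iterator[Tuple[str, list]]:
    """Reassemble NUL-delimited JSON messages from arbitrarily split byte chunks."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # A single network chunk may hold zero, one, or several complete messages.
        while delimiter in buffer:
            raw, buffer = buffer.split(delimiter, 1)
            if raw:
                data = json.loads(raw.decode("utf-8"))
                yield data["response"], data["history"]
    # Handle a final message that was not followed by a delimiter.
    if buffer.strip():
        data = json.loads(buffer.decode("utf-8"))
        yield data["response"], data["history"]


if __name__ == "__main__":
    chunks = [
        b'{"response": "He',
        b'llo", "history": []}\0{"response": "Hello!", "history": []}\0',
    ]
    for output, history in iter_stream_messages(chunks):
        print(output, history)
```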
Use WebSocket
The WebSocket protocol can efficiently handle the dialogue history. You can use the WebSocket method to connect to the service and perform one or more rounds of dialogue. Sample code:
import os
import time
import json
import struct
from multiprocessing import Process

import websocket

# Number of prompts sent by on_open_2 in one multi-round dialogue.
rounds = 5
questions = 0


def on_message_1(ws, message):
    if message == "<EOS>":
        print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
              time.time(), message), flush=True)
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    else:
        print("{}".format(time.time()))
        print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
              time.time(), message), flush=True)


def on_message_2(ws, message):
    global questions
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # Close the connection after the answers to all rounds are received.
    if message == "<EOS>":
        questions = questions + 1
        if questions == rounds:
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_message_3(ws, message):
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_error(ws, error):
    print('error happened: ', str(error))


def on_close(ws, a, b):
    print("### closed ###", a, b)


def on_pong(ws, pong):
    print('pong:', pong)


# stream chat validation test
def on_open_1(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Show me a golang code example: """
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['do_sample'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    # raw_req = f"""To open a Websocket connection to the server: """
    ws.send(raw_req)
    # end the client-side streaming


# multi-round query validation test
def on_open_2(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {"max_new_tokens": 6144}
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['use_stream_chat'] = True
    params_dict['prompt'] = "Hello!"
    # Add the system prompt as a key. Reassigning params_dict here would
    # discard the prompt and the parameters that are set above.
    params_dict['system_prompt'] = "Act like you are a programmer with 5+ years of experience."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please write a sorting algorithm in Python."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please convert the programming language to Java."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please introduce yourself."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please summarize the dialogue above."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# Langchain validation test.
def on_open_3(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    # params_dict['prompt'] = """To open a Websocket connection to the server: """
    params_dict['prompt'] = """Can you tell me what's the MNN?"""
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['use_stream_chat'] = False
    params_dict['langchain'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


authorization = ""    # Set this to the service token.
host = "ws://" + ""   # Append the service endpoint without the http:// prefix.


def single_call(on_open_func, on_message_func, on_close_func=on_close):
    ws = websocket.WebSocketApp(
        host,
        on_open=on_open_func,
        on_message=on_message_func,
        on_error=on_error,
        on_pong=on_pong,
        on_close=on_close_func,
        header=[
            'Authorization: ' + authorization],
    )
    # setup ping interval to keep long connection.
    ws.run_forever(ping_interval=2)


if __name__ == "__main__":
    for i in range(5):
        p1 = Process(target=single_call, args=(on_open_1, on_message_1))
        p2 = Process(target=single_call, args=(on_open_2, on_message_2))
        p3 = Process(target=single_call, args=(on_open_3, on_message_3))
        p1.start()
        p2.start()
        p3.start()
        p1.join()
        p2.join()
        p3.join()
Take note of the following parameters:
Set the authorization parameter to the service token.
Set the host parameter to the service endpoint. Replace the http prefix in the endpoint with ws.
Use the use_stream_chat parameter to specify whether the service returns output in streaming mode. Default value: True.
Refer to the on_open_2 function in the preceding code to implement a multi-round dialogue.
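Replacing the http prefix with ws, as noted above, can be done with the standard library rather than by string concatenation. This is a small sketch: `to_ws_url` is a hypothetical helper, example.com is a placeholder host, and the https to wss mapping is an assumption based on the usual WebSocket convention, because this topic only mentions the http prefix.

```python
from urllib.parse import urlsplit, urlunsplit

# http -> ws is described in this topic; https -> wss is an assumed convention.
SCHEME_MAP = {"http": "ws", "https": "wss"}


def to_ws_url(endpoint: str) -> str:
    """Return the endpoint with its HTTP scheme replaced by the WebSocket scheme."""
    parts = urlsplit(endpoint)
    if parts.scheme not in SCHEME_MAP:
        raise ValueError(f"unsupported scheme in endpoint: {endpoint!r}")
    return urlunsplit(parts._replace(scheme=SCHEME_MAP[parts.scheme]))


if __name__ == "__main__":
    # The host below is a placeholder, not a real service endpoint.
    print(to_ws_url("http://example.com/api/predict/demo"))
```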
FAQ
How do I switch to another open source LLM?
Perform the following steps:
On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update in the Actions column.
Switch to another open source LLM.
Scenario-based Model Deployment
On the LLM Deployment page, set Model Type to the desired LLM and click Deploy.
Custom Model Deployment
On the Deploy Service page, modify the Command and Node Type parameters and then click Deploy. The following table describes the parameter configurations for different models.
Name
Command
Recommended specification
Qwen2-7b
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen2-7B-Instruct
1 × GU30
1 × NVIDIA A10
1 × NVIDIA V100 (32 GB)
Qwen2-72b
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen2-72B-Instruct
2 × NVIDIA A100 (80 GB)
4 × NVIDIA A100 (40 GB)
8 × NVIDIA V100 (32 GB)
Qwen2-57b-A14b
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen2-57B-A14B-Instruct
2 × NVIDIA A100 (80 GB)
4 × NVIDIA A100 (40 GB)
4 × NVIDIA V100 (32 GB)
Qwen1.5-1.8b
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-1.8B-Chat
1 × NVIDIA T4
1 × NVIDIA V100 (16 GB)
1 × GU30
1 × NVIDIA A10
Qwen1.5-7b
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-7B-Chat
1 × GU30
1 × NVIDIA A10
Qwen1.5-14b
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-14B-Chat
1 × NVIDIA V100 (32 GB)
1 × NVIDIA A100 (40 GB)
1 × NVIDIA A100 (80 GB)
2 × GU30
2 × NVIDIA A10
Qwen1.5-32b
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-32B-Chat
1 × NVIDIA A100 (80 GB)
4 × NVIDIA V100 (32 GB)
Qwen1.5-72b
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-72B-Chat
8 × NVIDIA V100 (32 GB)
2 × NVIDIA A100 (80 GB)
4 × NVIDIA A100 (40 GB)
Qwen1.5-110b
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-110B-Chat
8 × NVIDIA A100 (40 GB)
4 × NVIDIA A100 (80 GB)
llama3-8b
python webui/webui_server.py --port=8000 --model-path=/huggingface/meta-Llama-3-8B-Instruct/ --model-type=llama3
1 × GU30
1 × NVIDIA A10
1 × NVIDIA V100 (32 GB)
llama3-70b
python webui/webui_server.py --port=8000 --model-path=/huggingface/meta-Llama-3-70B-Instruct/ --model-type=llama3
2 × NVIDIA A100 (80 GB)
4 × NVIDIA A100 (40 GB)
8 × NVIDIA V100 (32 GB)
Llama2-7b
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf
1 × GU30
1 × NVIDIA A10
1 × NVIDIA V100 (32 GB)
Llama2-13b
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf
1 × NVIDIA V100 (32 GB)
2 × GU30
2 × NVIDIA A10
llama2-70b
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-70b-chat-hf
8 × NVIDIA V100 (32 GB)
2 × NVIDIA A100 (80 GB)
4 × NVIDIA A100 (40 GB)
chatglm3-6b
python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm3-6b
1 × GU30
1 × NVIDIA A10
1 × NVIDIA V100 (16 GB)
1 × NVIDIA V100 (32 GB)
baichuan2-7b
python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-7B-Chat
1 × GU30
1 × NVIDIA A10
1 × NVIDIA V100 (32 GB)
baichuan2-13b
python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-13B-Chat
2 × GU30
2 × NVIDIA A10
1 × NVIDIA V100 (32 GB)
falcon-7b
python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-7b-instruct
1 × GU30
1 × NVIDIA A10
1 × NVIDIA V100 (32 GB)
falcon-40b
python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-40b-instruct
8 × NVIDIA V100 (32 GB)
2 × NVIDIA A100 (80 GB)
4 × NVIDIA A100 (40 GB)
falcon-180b
python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-180B-chat
8 × NVIDIA A100 (80 GB)
Yi-6b
python webui/webui_server.py --port=8000 --model-path=01-ai/Yi-6B-Chat
1 × GU30
1 × NVIDIA A10
1 × NVIDIA V100 (16 GB)
1 × NVIDIA V100 (32 GB)
Yi-34b
python webui/webui_server.py --port=8000 --model-path=01-ai/Yi-34B-Chat
4 × NVIDIA V100 (16 GB)
1 × NVIDIA A100 (80 GB)
4 × NVIDIA A10
mistral-7b-instruct-v0.2
python webui/webui_server.py --port=8000 --model-path=mistralai/Mistral-7B-Instruct-v0.2
1 × GU30
1 × NVIDIA A10
1 × NVIDIA V100 (32 GB)
mixtral-8x7b-instruct-v0.1
python webui/webui_server.py --port=8000 --model-path=mistralai/Mixtral-8x7B-Instruct-v0.1
4 × NVIDIA A100 (80 GB)
gemma-2b-it
python webui/webui_server.py --port=8000 --model-path=google/gemma-2b-it
1 × NVIDIA T4
1 × NVIDIA V100 (16 GB)
1 × GU30
1 × NVIDIA A10
gemma-7b-it
python webui/webui_server.py --port=8000 --model-path=google/gemma-7b-it
1 × GU30
1 × NVIDIA A10
1 × NVIDIA V100 (32 GB)
deepseek-coder-7b-instruct-v1.5
python webui/webui_server.py --port=8000 --model-path=deepseek-ai/deepseek-coder-7b-instruct-v1.5
1 × GU30
1 × NVIDIA A10
1 × NVIDIA V100 (32 GB)
deepseek-coder-33b-instruct
python webui/webui_server.py --port=8000 --model-path=deepseek-ai/deepseek-coder-33b-instruct
1 × NVIDIA A100 (80 GB)
2 × NVIDIA A100 (40 GB)
4 × NVIDIA V100 (32 GB)
deepseek-v2-lite
python webui/webui_server.py --port=8000 --model-path=deepseek-ai/DeepSeek-V2-Lite-Chat
1 × NVIDIA A10
1 × NVIDIA A100 (40 GB)
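When you deploy a model that is not listed in the preceding table, a rough lower bound on GPU memory can be estimated from the parameter quantity and precision: each parameter takes 2 bytes at 16-bit precision, so a 7B model needs about 14 GB for the weights alone. The following sketch performs that arithmetic. It is a rule of thumb only: it ignores the KV cache, activations, and framework overhead, it is not how the recommendations in the table were derived, and `weight_memory_gb` and `min_gpus` are hypothetical helpers.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """Memory for the weights only, in GB (1 GB = 10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(params_billion: float, gpu_memory_gb: float, precision: str = "fp16") -> int:
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(params_billion, precision) / gpu_memory_gb)


if __name__ == "__main__":
    print(weight_memory_gb(7), "GB of weights for a 7B model at fp16")
    print(min_gpus(72, 80), "GPUs with 80 GB each hold the weights of a 72B model at fp16")
```

Treat the result as a floor and leave headroom for the KV cache, which grows with the context length and the number of concurrent requests.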
References
You can use EAS to deploy a dialog service that integrates an LLM with Retrieval-Augmented Generation (RAG). After you use LangChain to integrate your business data, you can use the WebUI or API operations to verify the inference capability of the model. For more information, see RAG-based LLM chatbot.