| Parameter | Description | Default value |
| --- | --- | --- |
| --model-path | Specify the preset model name or a custom model path.<br>Example 1: Load a preset model. You can use a preset model in the meta-llama/Llama-2-* series, including Llama-2-7b-hf, Llama-2-7b-chat-hf, Llama-2-13b-hf, and Llama-2-13b-chat-hf. Example: `python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf`.<br>Example 2: Load an on-premises custom model. Example: `python webui/webui_server.py --port=8000 --model-path=/llama2-7b-chat`. | meta-llama/Llama-2-7b-chat-hf |
| --cpu | Use the CPU to perform model inference. Example: `python webui/webui_server.py --port=8000 --cpu`. | By default, the GPU is used for model inference. |
| --precision | Specify the precision of the Llama 2 model. Valid values: fp32 and fp16. Example: `python webui/webui_server.py --port=8000 --precision=fp32`. | The system automatically selects the precision of the 7B model based on the GPU memory size. |
| --port | Specify the listening port of the server. Example: `python webui/webui_server.py --port=8000`. | 8000 |
| --api-only | Allow users to access the service only by calling API operations. By default, the service starts both the WebUI and the API server. Example: `python webui/webui_server.py --api-only`. | False |
| --no-api | Allow users to access the service only by using the WebUI. By default, the service starts both the WebUI and the API server. Example: `python webui/webui_server.py --no-api`. | False |
| --max-new-tokens | The maximum number of output tokens. Example: `python api/api_server.py --port=8000 --max-new-tokens=1024`. | 2048 |
| --temperature | The randomness of the model output. A larger value indicates higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. Example: `python api/api_server.py --port=8000 --temperature=0.8`. | 0.95 |
| --max_round | The maximum number of rounds of dialogue supported during inference. Example: `python api/api_server.py --port=8000 --max_round=10`. | 5 |
| --top_k | The number of highest-probability candidates from which output tokens are selected (top-k sampling). The value is a positive integer. Example: `python api/api_server.py --port=8000 --top_k=10`. | None |
| --top_p | The cumulative probability threshold used to select output tokens (top-p sampling). The value is of the Float type and ranges from 0 to 1. Example: `python api/api_server.py --port=8000 --top_p=0.9`. | None |
| --no-template | Models such as Llama 2 and Falcon provide a default prompt template. If you do not specify this parameter, the default prompt template is used. If you specify this parameter, no default template is applied and you must construct the prompt yourself. Example: `python api/api_server.py --port=8000 --no-template`. | If you do not specify this parameter, the default prompt template is automatically used. |
| --log-level | The log output level. Valid values: DEBUG, INFO, WARNING, and ERROR. Example: `python api/api_server.py --port=8000 --log-level=DEBUG`. | INFO |
| --export-history-path | Export the conversation history. If you specify this parameter when you start the service, EAS exports the conversation records generated during each export interval to a file in the specified output path. In most cases, specify the mount path of an OSS bucket. Example: `python api/api_server.py --port=8000 --export-history-path=/your_mount_path`. | By default, this feature is disabled. |
| --export-interval | The interval at which conversation records are exported. Unit: seconds. For example, if you set the --export-interval parameter to 3600, the conversation records of the previous hour are exported to a file. | 3600 |
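
The following is a minimal sketch of a WebUI server startup command that combines several of the flags above. Combining the flags in a single command is an assumption based on the per-flag examples in the table; the port, precision, and model path are sample values that you should adjust for your deployment.

```bash
# Start the WebUI server (WebUI and API) with the preset Llama-2-7b-chat-hf
# model and fp16 precision. Combining these flags in one command is an
# assumption based on the per-flag examples in the table above.
python webui/webui_server.py \
  --port=8000 \
  --model-path=meta-llama/Llama-2-7b-chat-hf \
  --precision=fp16

# To load an on-premises custom model and run inference on the CPU instead:
# python webui/webui_server.py --port=8000 --model-path=/llama2-7b-chat --cpu
```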
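
Similarly, the generation and export parameters can be passed to the API server. The sketch below assumes the flags can be combined in one command; the values are taken from the examples in the table, and /your_mount_path is a placeholder for the mount path of an OSS bucket.

```bash
# Start the API server with custom generation settings and hourly export of
# the conversation history. /your_mount_path is a placeholder for an OSS
# bucket mount path.
python api/api_server.py \
  --port=8000 \
  --max-new-tokens=1024 \
  --temperature=0.8 \
  --top_k=10 \
  --top_p=0.9 \
  --log-level=INFO \
  --export-history-path=/your_mount_path \
  --export-interval=3600
```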