
Platform For AI: Release notes for ChatLLM WebUI

Last Updated: Sep 23, 2024

This topic describes the release notes for ChatLLM Web User Interface (WebUI).

Important versions

Each release below lists the date, the image version, the built-in library version, and a description of the changes.

2024.6.21

Image version:

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4

    Tag: chat-llm-webui:3.0

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-flash-attn

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm

    Tag: chat-llm-webui:3.0-vllm

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm-flash-attn

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-blade

    Tag: chat-llm-webui:3.0-blade

Built-in library version:

  • Torch: 2.3.0

  • Torchvision: 0.18.0

  • Transformers: 4.41.2

  • vLLM: 0.5.0.post1

  • vllm-flash-attn: 2.5.9

  • Blade: 0.7.0

Description:

  • Deployment of the Rerank model is supported.

  • Simultaneous or individual deployment of the Embedding, Rerank, and LLM models is supported.

  • The Transformers backend supports DeepSeek-V2, Yi1.5, and Qwen2.

  • The model type of Qwen1.5 is changed to qwen1.5.

  • The vLLM backend supports Qwen2.

  • The BladeLLM backend supports Llama3 and Qwen2.

  • The HuggingFace backend supports batch input.

  • The BladeLLM backend supports OpenAI Chat.

  • Access to BladeLLM Metrics is fixed.

  • The Transformers backend supports FP8 model deployment.

  • The Transformers backend supports multiple quantization toolkits: AWQ, HQQ, and Quanto.

  • The vLLM backend supports FP8.

  • The vLLM and Blade backends support stop words as an inference parameter (see the sketch after this list).

  • The Transformers backend supports H20 GPUs.
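
The stop-word support called out above corresponds to the standard stop field of an OpenAI-style request. The following is a minimal sketch, assuming a service deployed with the vLLM or Blade backend that exposes an OpenAI-compatible endpoint; the endpoint URL, token, and model name are placeholders.

```python
import requests

# Placeholders: replace with your EAS service endpoint and token.
ENDPOINT = "http://<service-endpoint>/v1/chat/completions"
TOKEN = "<eas-token>"

payload = {
    "model": "qwen2",  # placeholder: model name as deployed
    "messages": [{"role": "user", "content": "List three colors."}],
    "max_tokens": 64,
    # Generation halts as soon as any of these stop words is emitted.
    "stop": ["\n\n", "4."],
}

resp = requests.post(ENDPOINT, json=payload,
                     headers={"Authorization": TOKEN}, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```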

2024.4.30

Image version:

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-flash-attn

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm-flash-attn

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-blade

Built-in library version:

  • Torch: 2.3.0

  • Torchvision: 0.18.0

  • Transformers: 4.40.2

  • vLLM: 0.4.2

  • Blade: 0.5.1

Description:

  • Deployment of the Embedding model is supported.

  • The vLLM backend supports returning token usage statistics (see the sketch after this list).

  • Deployment of the Sentence-Transformers model is supported.

  • The Transformers backend supports the following models: yi-9B, qwen2-moe, llama3, qwencode, qwen1.5-32B/110B, phi-3, and gemma-1.1-2/7B.

  • The vLLM backend supports the following models: yi-9B, qwen2-moe, SeaLLM, llama3, and phi-3.

  • The Blade backend supports qwen1.5 and SeaLLM.

  • Multi-model deployment of LLM and Embedding is supported.

  • A flash-attn image is released for the Transformers backend.

  • A flash-attn image is released for the vLLM backend.
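
Token usage return means the response body carries an OpenAI-style usage object alongside the generated text. A minimal sketch, again with a placeholder endpoint, token, and model name:

```python
import requests

ENDPOINT = "http://<service-endpoint>/v1/chat/completions"  # placeholder
TOKEN = "<eas-token>"                                       # placeholder

resp = requests.post(
    ENDPOINT,
    json={"model": "llama3",  # placeholder model name
          "messages": [{"role": "user", "content": "Hello!"}]},
    headers={"Authorization": TOKEN},
    timeout=60,
)
resp.raise_for_status()

# OpenAI-style usage accounting returned by the vLLM backend.
usage = resp.json().get("usage", {})
print("prompt:", usage.get("prompt_tokens"),
      "completion:", usage.get("completion_tokens"),
      "total:", usage.get("total_tokens"))
```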

2024.3.28

Image version:

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-vllm

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-blade

Built-in library version:

  • Torch: 2.1.2

  • Torchvision: 0.16.2

  • Transformers: 4.38.2

  • vLLM: 0.3.3

  • Blade: 0.4.8

Description:

  • The Blade inference backend is added, which supports multiple GPUs on a single server and quantization.

  • The Transformers backend performs inference based on the tokenizer chat template.

  • The HuggingFace backend supports Multi-LoRA inference.

  • Blade supports deployment of quantized models.

  • Blade supports automatic model splitting.

  • The Transformers backend supports DeepSeek and Gemma.

  • The vLLM backend supports DeepSeek and Gemma.

  • The Blade backend supports qwen1.5 and yi models.

  • The vLLM and Blade images enable access to the /metrics endpoint (see the sketch after this list).

  • The Transformers backend supports token statistics for streaming outputs.
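
The /metrics endpoint mentioned above serves Prometheus-format text. The following is a minimal scraping sketch with a placeholder endpoint and token; the metric names that appear depend on the backend in use.

```python
import requests

METRICS_URL = "http://<service-endpoint>/metrics"  # placeholder endpoint

resp = requests.get(METRICS_URL,
                    headers={"Authorization": "<eas-token>"},  # placeholder token
                    timeout=10)
resp.raise_for_status()

# Prometheus text format: one "name{labels} value" sample per line,
# with comment lines starting with "#".
for line in resp.text.splitlines():
    if line and not line.startswith("#"):
        print(line)
```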

2024.2.22

Image version:

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1-vllm

Built-in library version:

  • Torch: 2.1.2

  • Torchvision: 0.16.0

  • Transformers: 4.37.2

  • vLLM: 0.3.0

Description:

  • vLLM supports modifications to all inference parameters during inference.

  • vLLM supports Multi-LoRA.

  • vLLM supports deployment of quantized models.

  • vLLM images no longer rely on the LangChain demo.

  • The Transformers inference backend supports qwen1.5 and qwen2 models.

  • The vLLM inference backend supports qwen1.5 and qwen2 models.

2024.1.23

Image version:

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0

  • eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0-vllm

Built-in library version:

  • Torch: 2.1.2

  • Torchvision: 0.16.2

  • Transformers: 4.37.2

  • vLLM: 0.2.6

Description:

  • The backend images are split so that each is independently built and published. The BladeLLM backend is added.

  • Standard OpenAI APIs are supported (see the sketch after this list).

  • Baichuan and other models support performance statistics.

  • The following models are supported: yi-6b-chat, yi-34b-chat, and secgpt.

  • The openai/v1/chat/completions endpoint supports the chatglm3 history format.

  • Asynchronous streaming mode is improved.

  • vLLM supports model alignment with HuggingFace.

  • The backend call interface is improved.

  • The error log is improved.
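
Because the service exposes standard OpenAI APIs, the stock OpenAI Python client can be pointed at it directly. A minimal sketch, assuming the placeholder endpoint below and that the EAS token is accepted as the API key:

```python
from openai import OpenAI

# Placeholders: base_url is the EAS service endpoint, api_key is the EAS token.
client = OpenAI(
    base_url="http://<service-endpoint>/v1",
    api_key="<eas-token>",
)

resp = client.chat.completions.create(
    model="chatglm3",  # placeholder: model name as deployed
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
)
print(resp.choices[0].message.content)
```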

2023.12.6

Image version:

eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.1

Tag: chat-llm-webui:2.1

Built-in library version:

  • Torch: 2.0.1

  • Torchvision: 0.15.2

  • Transformers: 4.33.3

  • vLLM: 0.2.0

Description:

  • The HuggingFace backend supports the following models: mistral, zephyr, yi-6b, yi-34b, qwen-72b, qwen-1.8b, qwen7b-int4, qwen14b-int4, qwen7b-int8, qwen14b-int8, qwen-72b-int4, qwen-72b-int8, qwen-1.8b-int4, and qwen-1.8b-int8.

  • The vLLM backend supports Qwen and ChatGLM1/2/3 models.

  • The HuggingFace inference backend supports flash attention.

  • ChatGLM models support performance statistics metrics.

  • The command-line parameter --history-format is added, which supports specifying roles.

  • The LangChain demo supports Qwen models.

  • The FastAPI streaming API is improved.

2023.9.13

Image version:

eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.0

Tag: chat-llm-webui:2.0

Built-in library version:

  • Torch: 2.0.1+cu117

  • Torchvision: 0.15.2+cu117

  • Transformers: 4.33.3

  • vLLM: 0.2.0

Description:

  • Multiple inference backends are supported: vLLM and HuggingFace.

  • The LangChain demo supports ChatLLM and Llama2 models.

  • The following models are supported: Baichuan, Baichuan2, Qwen, Falcon, Llama2, ChatGLM, ChatGLM2, ChatGLM3, and yi.

  • HTTP and WebSocket conversations support streaming mode (see the sketch after this list).

  • The number of output tokens is included in non-streaming output mode.

  • All models support multi-round conversations.

  • Export of conversation history is supported.

  • Setting a system prompt and concatenating prompts without a template are supported.

  • Configuration of inference parameters is supported.

  • A debug log mode is supported, which outputs inference time.

  • By default, the vLLM backend uses the tensor parallelism (TP) scheme across multiple GPUs.

  • Model deployment with Float32, Float16, Int8, and Int4 precision is supported.
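
The exact request schema of the 2.x HTTP interface is not reproduced here, so the sketch below only illustrates the general pattern of consuming a streamed HTTP response; the endpoint, payload fields, and token are hypothetical placeholders.

```python
import requests

ENDPOINT = "http://<service-endpoint>"    # hypothetical endpoint
payload = {"prompt": "Tell me a story.",  # hypothetical request fields
           "stream": True}

with requests.post(ENDPOINT, json=payload,
                   headers={"Authorization": "<eas-token>"},  # placeholder token
                   stream=True, timeout=300) as resp:
    resp.raise_for_status()
    # Print each chunk as soon as it arrives instead of waiting
    # for the full response.
    for chunk in resp.iter_content(chunk_size=None):
        print(chunk.decode("utf-8", errors="replace"), end="", flush=True)
```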

References

Elastic Algorithm Service (EAS) provides a scenario-based deployment method for ChatLLM, allowing you to deploy popular open-source large language model (LLM) services by configuring a few parameters. For more information about how to deploy and call LLM services, see LLM Deployment.