| Date | Image version | Built-in library version | Description |
| --- | --- | --- | --- |
| 2024.6.21 | eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4 (Tag: chat-llm-webui:3.0)<br>eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-flash-attn<br>eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm (Tag: chat-llm-webui:3.0-vllm)<br>eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm-flash-attn<br>eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-blade (Tag: chat-llm-webui:3.0-blade) | Torch: 2.3.0<br>Torchvision: 0.18.0<br>Transformers: 4.41.2<br>vLLM: 0.5.0.post1<br>vllm-flash-attn: 2.5.9<br>Blade: 0.7.0 | Deployment of the Rerank model is supported.<br>The Embedding, Rerank, and LLM models can be deployed together or individually.<br>The Transformers backend supports DeepSeek-V2, Yi-1.5, and Qwen2.<br>The model type of Qwen1.5 is changed to qwen1.5.<br>The vLLM backend supports Qwen2.<br>The BladeLLM backend supports Llama3 and Qwen2.<br>The HuggingFace backend supports batch input.<br>The BladeLLM backend supports the OpenAI Chat API.<br>Access to BladeLLM metrics is fixed.<br>The Transformers backend supports FP8 model deployment.<br>The Transformers backend supports multiple quantization toolkits: AWQ, HQQ, and Quanto.<br>The vLLM backend supports FP8.<br>The vLLM and Blade inference parameters support stop words.<br>The Transformers backend supports H20 GPUs. |
| 2024.4.30 | eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3<br>eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-flash-attn<br>eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm<br>eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm-flash-attn<br>eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-blade | Torch: 2.3.0<br>Torchvision: 0.18.0<br>Transformers: 4.40.2<br>vLLM: 0.4.2<br>Blade: 0.5.1 | Deployment of the Embedding model is supported.<br>The vLLM backend supports returning token usage.<br>Deployment of Sentence-Transformers models is supported.<br>The Transformers backend supports the following models: yi-9B, qwen2-moe, llama3, qwencode, qwen1.5-32B/110B, phi-3, and gemma-1.1-2/7B.<br>The vLLM backend supports the following models: yi-9B, qwen2-moe, SeaLLM, llama3, and phi-3.<br>The Blade backend supports qwen1.5 and SeaLLM.<br>Multi-model deployment of LLM and Embedding is supported.<br>The Transformers backend releases a flash-attn image.<br>The vLLM backend releases a flash-attn image. |
| 2024.3.28 | eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2<br>eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-vllm<br>eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-blade | Torch: 2.1.2<br>Torchvision: 0.16.2<br>Transformers: 4.38.2<br>vLLM: 0.3.3<br>Blade: 0.4.8 | The Blade inference backend is added, which supports multiple GPUs on one server and quantization.<br>The Transformers backend performs inference based on the tokenizer's chat template.<br>The HuggingFace backend supports Multi-LoRA inference.<br>Blade supports deployment of quantized models.<br>Blade supports automatic model splitting.<br>The Transformers backend supports DeepSeek and Gemma.<br>The vLLM backend supports DeepSeek and Gemma.<br>The Blade backend supports the qwen1.5 and yi models.<br>The vLLM and Blade images enable /metrics access.<br>The Transformers backend supports token statistics for streaming outputs. |
| 2024.2.22 | | Torch: 2.1.2<br>Torchvision: 0.16.0<br>Transformers: 4.37.2<br>vLLM: 0.3.0 | vLLM supports modifying all inference parameters during inference.<br>vLLM supports Multi-LoRA.<br>vLLM supports deployment of quantized models.<br>vLLM images no longer rely on the LangChain demo.<br>The Transformers inference backend supports the qwen1.5 and qwen2 models.<br>The vLLM inference backend supports the qwen1.5 and qwen2 models. |
| 2024.1.23 | | Torch: 2.1.2<br>Torchvision: 0.16.2<br>Transformers: 4.37.2<br>vLLM: 0.2.6 | The backend image is split, and each backend is compiled and published independently.<br>The BladeLLM backend is added.<br>Standard OpenAI APIs are supported (see the example after this table).<br>Baichuan and other models support performance statistics.<br>The following models are supported: yi-6b-chat, yi-34b-chat, and secgpt.<br>openai/v1/chat/completions supports the chatglm3 history-format.<br>Asynchronous streaming mode is improved.<br>vLLM supports model alignment with HuggingFace.<br>The backend call interface is improved.<br>Error logging is improved. |
| 2023.12.6 | eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.1 (Tag: chat-llm-webui:2.1) | Torch: 2.0.1<br>Torchvision: 0.15.2<br>Transformers: 4.33.3<br>vLLM: 0.2.0 | The HuggingFace backend supports the following models: mistral, zephyr, yi-6b, yi-34b, qwen-72b, qwen-1.8b, qwen7b-int4, qwen14b-int4, qwen7b-int8, qwen14b-int8, qwen-72b-int4, qwen-72b-int8, qwen-1.8b-int4, and qwen-1.8b-int8.<br>The vLLM backend supports the Qwen and ChatGLM1/2/3 models.<br>The HuggingFace inference backend supports flash attention.<br>ChatGLM models support performance statistics metrics.<br>The --history-format command line parameter is added, which supports specifying roles.<br>The LangChain demo supports Qwen models.<br>The FastAPI streaming API is improved. |
| 2023.9.13 | eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.0 (Tag: chat-llm-webui:2.0) | | Multiple backends (vLLM and HuggingFace) are supported.<br>The LangChain demo supports the ChatLLM and Llama2 models.<br>The following models are supported: Baichuan, Baichuan2, Qwen, Falcon, Llama2, ChatGLM, ChatGLM2, ChatGLM3, and yi.<br>HTTP and WebSocket support conversation streaming mode.<br>The number of output tokens is included in non-streaming output mode.<br>All models support multi-round conversations.<br>Export of conversation history is supported.<br>System prompt settings and prompt splicing without a template are supported.<br>Configuration of inference parameters is supported.<br>Log debug mode is supported, including output of inference time.<br>By default, the vLLM backend uses the TP (tensor parallelism) scheme for multiple GPUs.<br>Model deployment with Float32, Float16, Int8, and Int4 precision is supported. |
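
For reference, the following is a minimal sketch of calling a deployed service through the OpenAI-compatible API noted above (standard OpenAI APIs, stop words, and streaming). The endpoint URL, token, and model name are placeholders for your own deployment, not values taken from this document:

```python
from openai import OpenAI

# Placeholders: substitute the endpoint and token of your own EAS service.
client = OpenAI(
    base_url="http://<your-service-endpoint>/v1",
    api_key="<your-eas-service-token>",
)

# Streaming chat completion. The "stop" field is the standard OpenAI
# parameter; presumably this is how the stop-word support of the vLLM and
# Blade backends (2024.6.21 entry) is exposed.
stream = client.chat.completions.create(
    model="<model-name>",  # assumption: depends on how the service was deployed
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
    stop=["</s>"],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

Non-streaming calls work the same way with `stream=False`; in that case the response should also carry the token-usage fields mentioned in the 2024.4.30 entry.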