Multimodal large language models (MLLMs) process and integrate multimodal data, such as text, images, and audio, which lets them handle complex scenarios and tasks that require cross-modal comprehension and generation. You can use Elastic Algorithm Service (EAS) to deploy MLLMs as inference services with a few clicks and obtain their inference capabilities. This article describes how to deploy and call MLLM inference services by using PAI-EAS.
In recent years, large language models (LLMs) have achieved unprecedented results in language tasks. LLMs generate natural language text and demonstrate strong capabilities in many types of tasks, such as sentiment analysis, machine translation, and text summarization. However, these models are limited to text and cannot process other forms of data, such as images, audio, or video. Only models with multimodal comprehension can come close to human-like understanding.
MLLMs were introduced to address this limitation. As models such as GPT-4o gain wide adoption in the industry, MLLMs have become increasingly popular.
EAS allows you to deploy popular MLLMs as inference services with a few clicks.
Before you deploy a service, make sure that Platform for AI (PAI) is activated and a default workspace is created. For more information, see Activate PAI and create a default workspace.
If you want to deploy a model as a RAM user, make sure that the RAM user has the permissions to manage EAS. For more information, see Grant the permissions that are required to use EAS.
1. Go to the EAS page.
2. On the Model Online Service (EAS) page, click Deploy Service. In the Custom Model Deployment section, click Custom Deploy.
3. On the Create Service page, configure the parameters. The following table describes the key parameters. For information about other parameters, see Deploy a model service in the PAI console.
Category | Parameter | Description |
---|---|---|
Model Service Information | Deployment Method | Select Deploy Web App by Using Image. |
Model Service Information | Select Image | Select chat-mllm-webui from the PAI Image drop-down list and 1.0 from the version drop-down list. Note: We recommend that you select the latest version of the image when you deploy the model service. |
Model Service Information | Command to Run | After you select an image, the system automatically configures this parameter. You can modify the model_type parameter to deploy different models. For the supported model types, see the Models table that follows. |
Resource Deployment Information | Resource Configuration | Select the GPU configuration. We recommend that you select the ml.gu7i.c16m60.1-gu30 instance type, which is the most cost-effective. |
Models
model_type | Model link |
---|---|
qwen_vl_chat | qwen/Qwen-VL-Chat |
qwen_vl_chat_int4 | qwen/Qwen-VL-Chat-Int4 |
qwen_vl | qwen/Qwen-VL |
glm4v_9b_chat | ZhipuAI/glm-4v-9b |
llava1_5-7b-instruct | swift/llava-1___5-7b-hf |
llava1_5-13b-instruct | swift/llava-1___5-13b-hf |
internvl_chat_v1_5_int8 | AI-ModelScope/InternVL-Chat-V1-5-int8 |
internvl-chat-v1_5 | AI-ModelScope/InternVL-Chat-V1-5 |
mini-internvl-chat-2b-v1_5 | OpenGVLab/Mini-InternVL-Chat-2B-V1-5 |
mini-internvl-chat-4b-v1_5 | OpenGVLab/Mini-InternVL-Chat-4B-V1-5 |
internvl2-2b | OpenGVLab/InternVL2-2B |
internvl2-4b | OpenGVLab/InternVL2-4B |
internvl2-8b | OpenGVLab/InternVL2-8B |
internvl2-26b | OpenGVLab/InternVL2-26B |
internvl2-40b | OpenGVLab/InternVL2-40B |
4. After you configure the parameters, click Deploy.
1. Obtain the endpoint and token of the service.
2. Call API operations to perform model inference.
PAI provides the following APIs:
Obtain the inference result. The following sample code provides an example on how to use Python to perform model inference:
import base64
import codecs
import json

import requests


def post_get_history(url='http://127.0.0.1:7860', headers=None):
    """Fetch the chat history of the service as a JSON string."""
    r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
    data = r.content.decode('utf-8')
    return data


def post_infer(prompt, image=None, chat_history=None, temperature=0.2, top_p=0.7,
               max_output_tokens=512, use_stream=True, url='http://127.0.0.1:7860', headers=None):
    """Send an inference request and print the result."""
    # Copy the headers so that neither a shared default nor the caller's dict is mutated.
    headers = dict(headers or {})
    datas = {
        "prompt": prompt,
        "image": image,
        "chat_history": chat_history if chat_history is not None else [],
        "temperature": temperature,
        "top_p": top_p,
        "max_output_tokens": max_output_tokens,
        "use_stream": use_stream,
    }

    if use_stream:
        headers.update({'Accept': 'text/event-stream'})
        response = requests.post(f'{url}/infer_forward', json=datas, headers=headers, stream=True, timeout=1500)
        if response.status_code != 200:
            print(f"Request failed with status code {response.status_code}")
            return
        process_stream(response)
    else:
        r = requests.post(f'{url}/infer_forward', json=datas, headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        print(data)


def image_to_base64(image_path):
    """
    Convert an image file to a Base64 encoded string.
    :param image_path: The file path to the image.
    :return: A Base64 encoded string representation of the image.
    """
    with open(image_path, "rb") as image_file:
        # Read the binary data of the image
        image_data = image_file.read()
        # Encode the binary data to Base64
        base64_encoded_data = base64.b64encode(image_data)
        # Convert bytes to string and remove any newline characters
        base64_string = base64_encoded_data.decode('utf-8').replace('\n', '')
    return base64_string


def process_stream(response, previous_text=""):
    """Print the streamed response incrementally as new text arrives."""
    MARK_RESPONSE_END = '##END'  # DO NOT CHANGE
    # Decode incrementally so that a multi-byte character that is split
    # across two chunks does not raise a UnicodeDecodeError.
    decoder = codecs.getincrementaldecoder('utf-8')()
    current_response = ""
    for chunk in response.iter_content(chunk_size=100):
        if chunk:
            current_response += decoder.decode(chunk)
            parts = current_response.split(MARK_RESPONSE_END)
            for part in parts[:-1]:
                # Print only the text that was not printed for the previous part
                new_part = part[len(previous_text):]
                if new_part:
                    print(new_part, end='', flush=True)
                previous_text = part
            current_response = parts[-1]

    remaining_new_text = current_response[len(previous_text):]
    if remaining_new_text:
        print(remaining_new_text, end='', flush=True)


if __name__ == '__main__':
    hosts = 'xxx'  # The service endpoint
    head = {
        'Authorization': 'xxx'  # The service token
    }

    # get chat history
    chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']

    prompt = 'Please describe the image'
    image_path = 'path_to_your_image'
    image_base_64 = image_to_base64(image_path)

    post_infer(prompt=prompt, image=image_base_64, chat_history=chat_history, use_stream=False, url=hosts, headers=head)
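The streaming branch of the sample code implies a simple framing convention: each time new tokens are available, the service appears to send the full text generated so far, followed by the `##END` marker. The following sketch isolates that logic in a pure function so that it can be checked without a running service. The function name `iter_new_text` and the sample frames are illustrative assumptions, and the framing itself is inferred from `process_stream` rather than taken from an API reference.

```python
from typing import Iterable, Iterator

MARK_RESPONSE_END = "##END"  # Frame delimiter used by the sample code


def iter_new_text(chunks: Iterable[str], marker: str = MARK_RESPONSE_END) -> Iterator[str]:
    """Yield only the newly generated text from cumulative, marker-delimited frames.

    Assumption (inferred from the sample code): every complete frame holds
    the whole response so far and ends with `marker`. Text that follows the
    last marker is treated as an incomplete frame and is not emitted here.
    """
    pending = ""   # Text received after the last complete frame
    previous = ""  # Last complete frame that was already emitted
    for chunk in chunks:
        pending += chunk
        # Everything before the last marker is complete; the tail stays pending.
        *frames, pending = pending.split(marker)
        for frame in frames:
            new_text = frame[len(previous):]
            if new_text:
                yield new_text
            previous = frame


# Hypothetical frames, cut at arbitrary boundaries as a network transfer might do.
demo_chunks = ["A cat##EN", "DA cat on a##END", "A cat on a sofa##END"]
streamed = "".join(iter_new_text(demo_chunks))
print(streamed)
```

Keeping the parsing separate from the HTTP call makes it possible to unit test the marker handling, including the case in which a marker is split across two chunks.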
The following table describes the values that you must set in the sample code.
Parameter | Description |
---|---|
hosts | The service endpoint that you obtained in Step 1. |
authorization | The service token that you obtained in Step 1. |
prompt | The content of the question. A question in English is recommended. |
image_path | The local path of the image file. |
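Hard-coding the endpoint and token in a script makes it easy to leak them. The following is a minimal sketch that reads both values from environment variables and returns them in the shape that the sample code expects. The variable names `EAS_ENDPOINT` and `EAS_TOKEN` and the function name `load_service_config` are hypothetical and are not defined by EAS.

```python
import os
from typing import Dict, Mapping, Tuple


def load_service_config(env: Mapping[str, str] = os.environ) -> Tuple[str, Dict[str, str]]:
    """Return (hosts, head) for use with the sample functions.

    EAS_ENDPOINT and EAS_TOKEN are hypothetical variable names chosen for
    this sketch; EAS itself does not define them.
    """
    endpoint = env.get("EAS_ENDPOINT", "").strip()
    token = env.get("EAS_TOKEN", "").strip()
    if not endpoint or not token:
        raise ValueError("Set EAS_ENDPOINT and EAS_TOKEN before calling the service.")
    # Drop a trailing slash so that f'{url}/infer_forward' does not contain '//'.
    return endpoint.rstrip("/"), {"Authorization": token}


# Placeholder values for illustration only.
hosts, head = load_service_config(
    {"EAS_ENDPOINT": "http://example.invalid/api/", "EAS_TOKEN": "placeholder-token"}
)
print(hosts, head)
```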
The following table describes the request parameters of the inference API.
Parameter | Data type | Description | Default value |
---|---|---|---|
prompt | String | The content of the question. This parameter is required. | No default value |
image | Base64 | The Base64-encoded image. | No default value |
chat_history | List[List] | The chat history. | [] |
temperature | Float | The randomness of the model output. A larger value produces more random output. The value 0 produces a fixed output. Valid values: 0 to 1. | 0.2 |
top_p | Float | The proportion of candidate outputs, by cumulative probability, from which the result is sampled (top-p sampling). | 0.7 |
max_output_tokens | Int | The maximum number of output tokens. | 512 |
use_stream | Bool | Specifies whether to enable the streaming output mode. Valid values: True and False. | True |
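The request parameters can be assembled and checked before any network call is made. The following sketch builds the JSON body with the documented defaults and rejects the two violations that the table makes explicit: a missing prompt and a temperature outside the range of 0 to 1. The function name `build_infer_payload` is an assumption of this sketch, and other parameters are deliberately not validated because their valid ranges are not documented here.

```python
from typing import Any, Dict, List, Optional


def build_infer_payload(
    prompt: str,
    image: Optional[str] = None,
    chat_history: Optional[List[list]] = None,
    temperature: float = 0.2,
    top_p: float = 0.7,
    max_output_tokens: int = 512,
    use_stream: bool = True,
) -> Dict[str, Any]:
    """Build the JSON body for the /infer_forward request."""
    if not prompt:
        raise ValueError("prompt is required")
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    return {
        "prompt": prompt,
        "image": image,
        # Create a new list per call instead of sharing a mutable default.
        "chat_history": list(chat_history) if chat_history is not None else [],
        "temperature": temperature,
        "top_p": top_p,
        "max_output_tokens": max_output_tokens,
        "use_stream": use_stream,
    }


payload = build_infer_payload("Please describe the image", use_stream=False)
print(payload)
```

The resulting dictionary can be passed as the `json` argument of `requests.post`, as the sample code does with `datas`.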
Obtain the chat history. The following sample code provides an example of how to use Python to obtain the chat history:
import json

import requests


def post_get_history(url='http://127.0.0.1:7860', headers=None):
    """Fetch the chat history of the service as a JSON string."""
    r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
    data = r.content.decode('utf-8')
    return data


if __name__ == '__main__':
    hosts = 'xxx'  # The service endpoint
    head = {
        'Authorization': 'xxx'  # The service token
    }
    chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
    print(chat_history)
Parameter | Description |
---|---|
hosts | The service endpoint that you obtained in Step 1. |
authorization | The service token that you obtained in Step 1. |
The following table describes the response parameter.
Parameter | Data type | Description |
---|---|---|
chat_history | List[List] | The chat history. |
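Because the sample code indexes the decoded JSON directly, an unexpected response body surfaces as a `KeyError` or a later type error. The following sketch validates the shape that the table describes (a `chat_history` key that holds a list of lists) before the value is reused. The function name `parse_chat_history` is hypothetical, and the contents of the inner list in the sample body are invented for illustration because the layout of each inner list is not documented here.

```python
import json
from typing import List


def parse_chat_history(body: str) -> List[list]:
    """Decode a /get_history response body and check that it is a list of lists."""
    data = json.loads(body)
    history = data.get("chat_history") if isinstance(data, dict) else None
    if not isinstance(history, list) or not all(isinstance(turn, list) for turn in history):
        raise ValueError("Unexpected response: 'chat_history' must be a list of lists")
    return history


# Invented sample body; the real contents of each inner list may differ.
sample_body = '{"chat_history": [["Please describe the image", "A cat on a sofa."]]}'
history = parse_chat_history(sample_body)
print(len(history))
```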
Clear the chat history. The following sample code provides an example of how to use Python to clear the chat history:
import requests


def post_clear_history(url='http://127.0.0.1:7860', headers=None):
    """Clear the chat history of the service and return the response body."""
    r = requests.post(f'{url}/clear_history', headers=headers, timeout=1500)
    data = r.content.decode('utf-8')
    return data


if __name__ == '__main__':
    hosts = 'xxx'  # The service endpoint
    head = {
        'Authorization': 'xxx'  # The service token
    }
    clear_info = post_clear_history(url=hosts, headers=head)
    print(clear_info)
Parameter | Description |
---|---|
hosts | The service endpoint that you obtained in Step 1. |
authorization | The service token that you obtained in Step 1. |