
Platform For AI: Deploy a multimodal LLM in EAS

Last Updated: Mar 12, 2026

Deploy and call multimodal large language models through EAS for image and text processing.

Overview

Multimodal large language models (MLLMs) process text, images, and audio together, integrating the different data types to handle complex contexts and tasks. EAS lets you deploy an MLLM with one click in about 5 minutes.

Prerequisites

Deploy a model service

  1. Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. On the Custom Deployment page, configure the following parameters. For other parameters, see Parameters for custom deployment in the console.

    Environment Information

    • Deployment Method: Select Image-based Deployment and turn on Enable Web App.

    • Image Configuration: Select Alibaba Cloud Image > chat-mllm-webui > chat-mllm-webui:1.0. Note: Select the latest image version.

    • Command: Automatically configured after you select an image. Modify the model_type parameter to deploy a different model. For valid values, see the Models table.

    Resource Information

    • Deployment Resources: Select a GPU instance type. The ml.gu7i.c16m60.1-gu30 instance type is the most cost-effective choice.

    Models

    model_type                      Model link
    qwen_vl_chat                    qwen/Qwen-VL-Chat
    qwen_vl_chat_int4               qwen/Qwen-VL-Chat-Int4
    qwen_vl                         qwen/Qwen-VL
    glm4v_9b_chat                   ZhipuAI/glm-4v-9b
    llava1_5-7b-instruct            swift/llava-1___5-7b-hf
    llava1_5-13b-instruct           swift/llava-1___5-13b-hf
    internvl_chat_v1_5_int8         AI-ModelScope/InternVL-Chat-V1-5-int8
    internvl-chat-v1_5              AI-ModelScope/InternVL-Chat-V1-5
    mini-internvl-chat-2b-v1_5      OpenGVLab/Mini-InternVL-Chat-2B-V1-5
    mini-internvl-chat-4b-v1_5      OpenGVLab/Mini-InternVL-Chat-4B-V1-5
    internvl2-2b                    OpenGVLab/InternVL2-2B
    internvl2-4b                    OpenGVLab/InternVL2-4B
    internvl2-8b                    OpenGVLab/InternVL2-8B
    internvl2-26b                   OpenGVLab/InternVL2-26B
    internvl2-40b                   OpenGVLab/InternVL2-40B

  4. Click Deploy.

Call a service

Use WebUI for model inference

  1. On the Elastic Algorithm Service (EAS) page, click the service name, click Web application in the upper-right corner, and follow the instructions to open WebUI.

  2. On the WebUI page, perform model inference.

Use API for model inference

  1. Obtain the endpoint and token.

    1. On the Elastic Algorithm Service (EAS) page, click the service name. In the Basic Information section, click View Endpoint Information.

    2. In the Invocation Information pane, obtain the token and endpoint.
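    The endpoint and token obtained here are used as the request URL and the Authorization header in every API call. A minimal sketch of how they fit together (build_invocation is a hypothetical helper; <service_url> and <token> are placeholders for the values from the console):

    ```python
    def build_invocation(endpoint, token, api_path):
        """Combine the EAS endpoint, token, and API path into a URL and headers."""
        url = f"{endpoint.rstrip('/')}/{api_path.lstrip('/')}"
        headers = {'Authorization': token}
        return url, headers

    url, headers = build_invocation('http://<service_url>/', '<token>', '/infer_forward')
    print(url)      # http://<service_url>/infer_forward
    print(headers)  # {'Authorization': '<token>'}
    ```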

  2. Call APIs for model inference.

    Available APIs:

    Get inference result

    Obtain inference result.

    Note

    WebUI and API calls cannot be used at the same time. If you have already chatted through the WebUI, call the clear chat history API before calling the infer forward API.
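    To illustrate the required order, the sketch below builds the two requests with the standard library's urllib but does not send them (build_post is a hypothetical helper; <service_url> and <token> are placeholders):

    ```python
    import json
    from urllib import request

    def build_post(base_url, path, token, payload=None):
        """Build (but do not send) a POST request to an EAS API path."""
        data = json.dumps(payload).encode('utf-8') if payload is not None else None
        req = request.Request(f"{base_url.rstrip('/')}/{path}", data=data, method='POST')
        req.add_header('Authorization', token)
        return req

    # Clear the WebUI chat history first, then send the question:
    calls = [
        build_post('http://<service_url>', 'clear_history', '<token>'),
        build_post('http://<service_url>', 'infer_forward', '<token>',
                   {'prompt': 'Please describe the image'}),
    ]
    print([c.full_url for c in calls])
    ```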

    Replace the following parameters in the sample code:

    • hosts: The service endpoint obtained in Step 1.

    • authorization: The service token obtained in Step 1.

    • prompt: The question content. English is recommended.

    • image_path: The local path of the image.

    Request input and output descriptions:

    • Input parameters:

      • prompt (String, required): The question content. No default value.

      • image (String): The image, encoded in Base64. Default: None.

      • chat_history (List[List]): The chat history. Default: [].

      • temperature (Float): Controls the randomness of the model output. Higher values produce more random output; 0 produces deterministic output. Valid range: 0 to 1. Default: 0.2.

      • top_p (Float): The proportion of candidate tokens to sample from the generated results. Default: 0.7.

      • max_output_tokens (Int): The maximum number of output tokens. Default: 512.

      • use_stream (Bool): Specifies whether to enable streaming output. Valid values: True and False. Default: True.

    • Output: answer to the question (STRING type).
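    Taken together, the parameters above form a JSON request body like the following sketch (values are illustrative; image is omitted for brevity):

    ```python
    import json

    # Minimal infer_forward request body built from the parameters above.
    # Only prompt is required; the remaining fields show the documented defaults.
    body = {
        "prompt": "Please describe the image",
        "chat_history": [],
        "temperature": 0.2,
        "top_p": 0.7,
        "max_output_tokens": 512,
        "use_stream": False,  # set True for streaming output
    }
    print(json.dumps(body, indent=2))
    ```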

    Python code example for model inference:

    import requests
    import json
    import base64
    
    
    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    def post_infer(prompt, image=None, chat_history=None, temperature=0.2, top_p=0.7, max_output_tokens=512, use_stream=True, url='http://127.0.0.1:7860', headers=None):
        # Avoid mutable default arguments and do not modify the caller's headers dict.
        headers = dict(headers or {})
        datas = {
            "prompt": prompt,
            "image": image,
            "chat_history": chat_history if chat_history is not None else [],
            "temperature": temperature,
            "top_p": top_p,
            "max_output_tokens": max_output_tokens,
            "use_stream": use_stream,
        }
    
        if use_stream:
            headers.update({'Accept': 'text/event-stream'})
    
            response = requests.post(f'{url}/infer_forward', json=datas, headers=headers, stream=True, timeout=1500)
    
            if response.status_code != 200:
                print(f"Request failed with status code {response.status_code}")
                return
            process_stream(response)
    
        else:
            r = requests.post(f'{url}/infer_forward', json=datas, headers=headers, timeout=1500)
            data = r.content.decode('utf-8')
    
            print(data)
    
    
    def image_to_base64(image_path):
        """
        Convert an image file to a Base64 encoded string.
    
        :param image_path: The file path to the image.
        :return: A Base64 encoded string representation of the image.
        """
        with open(image_path, "rb") as image_file:
            # Read the binary data of the image
            image_data = image_file.read()
            # Encode the binary data to Base64
            base64_encoded_data = base64.b64encode(image_data)
            # Convert bytes to string and remove any trailing newline characters
            base64_string = base64_encoded_data.decode('utf-8').replace('\n', '')
        return base64_string
    
    
    def process_stream(response, previous_text=""):
        MARK_RESPONSE_END = '##END'  # DO NOT CHANGE
        buffer = previous_text
        current_response = ""
    
        for chunk in response.iter_content(chunk_size=100):
            if chunk:
                text = chunk.decode('utf-8')
                current_response += text
    
                parts = current_response.split(MARK_RESPONSE_END)
                for part in parts[:-1]:
                    new_part = part[len(previous_text):]
                    if new_part:
                        print(new_part, end='', flush=True)
    
                    previous_text = part
    
                current_response = parts[-1]
    
        remaining_new_text = current_response[len(previous_text):]
        if remaining_new_text:
            print(remaining_new_text, end='', flush=True)
    
    
    if __name__ == '__main__':
        # Replace <service_url> with the service endpoint.
        hosts = '<service_url>'
        # Replace <token> with the service token.
        head = {
            'Authorization': '<token>'
        }
    
        # get chat history
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
    
        # The content of the question. A question in English is recommended.
        prompt = 'Please describe the image'
        # Replace path_to_your_image with the local path of the image.
        image_path = 'path_to_your_image'
        image_base_64 = image_to_base64(image_path)
    
        post_infer(prompt=prompt, image=image_base_64, chat_history=chat_history, use_stream=False, url=hosts, headers=head)
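    The image_to_base64 helper from the example above can be sanity-checked offline with arbitrary bytes standing in for an image file:

    ```python
    import base64
    import os
    import tempfile

    def image_to_base64(image_path):
        """Read a file and return its contents as a Base64 string (as in the example above)."""
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    # Round-trip check: encode a temporary file, then decode and compare.
    payload = b"\x89PNG-like fake image bytes"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        encoded = image_to_base64(tmp.name)
        assert base64.b64decode(encoded) == payload
    finally:
        os.unlink(tmp.name)
    ```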
    

    Get chat history

    Obtain chat history.

    • Replace the following parameters in the sample code:

      • hosts: The service endpoint obtained in Step 1.

      • authorization: The service token obtained in Step 1.

    • No input parameters required.

    • Output parameters:

      • chat_history (List[List]): The conversation history.
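    For illustration, the sketch below parses a hypothetical /get_history response; the question/answer pairing of each inner list is an assumption, not confirmed by this reference:

    ```python
    import json

    # Hypothetical /get_history response body; the [question, answer] pair
    # structure of each turn is an assumption for illustration only.
    raw = '{"chat_history": [["Please describe the image", "A cat sitting on a mat."]]}'
    chat_history = json.loads(raw)['chat_history']

    for turn in chat_history:
        question, answer = turn[0], turn[1]
        print(f"Q: {question}")
        print(f"A: {answer}")
    ```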

    Python code example for obtaining chat history:

    import requests
    import json
    
    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    if __name__ == '__main__':
        # Replace <service_url> with the service URL
        hosts = '<service_url>'
        # Replace <token> with the service token
        head = {
            'Authorization': '<token>'
        }
    
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
        print(chat_history)
    

    Clear chat history

    Clear chat history.

    • Replace the following parameters in the sample code:

      • hosts: The service endpoint obtained in Step 1.

      • authorization: The service token obtained in Step 1.

    • No input parameters required.

    • Returns: success.

    Python code example for clearing chat history:

    import requests
    import json
    
    
    def post_clear_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/clear_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    if __name__ == '__main__':
        # Replace <service_url> with the service endpoint.
        hosts = '<service_url>'
        # Replace <token> with the service token.
        head = {
            'Authorization': '<token>'
        }
        clear_info = post_clear_history(url=hosts, headers=head)
        print(clear_info)