
Platform For AI: Deploy a multimodal LLM in EAS

Last Updated: Mar 12, 2026

Deploy and call multimodal large language models through EAS for image and text processing.

Overview

Multimodal large language models (MLLMs) process text, images, and audio together, integrating the different data types to handle complex contexts and tasks. EAS lets you deploy an MLLM with one click in about 5 minutes.

Prerequisites

Deploy a model service

  1. Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. On the Custom Deployment page, configure the following parameters. For other parameters, see Parameters for custom deployment in the console.

    Environment Information

    • Deployment Method: Select Image-based Deployment and turn on Enable Web App.

    • Image Configuration: Select Alibaba Cloud Image > chat-mllm-webui > chat-mllm-webui:1.0. Note: Select the latest image version.

    • Command: Automatically configured after you select an image. Modify the model_type parameter to deploy a different model. For valid values, see the Models table.

    Resource Information

    • Deployment Resources: Select a GPU instance type. The ml.gu7i.c16m60.1-gu30 instance type is the most cost-effective choice.

    Models

    model_type                      Model link
    qwen_vl_chat                    qwen/Qwen-VL-Chat
    qwen_vl_chat_int4               qwen/Qwen-VL-Chat-Int4
    qwen_vl                         qwen/Qwen-VL
    glm4v_9b_chat                   ZhipuAI/glm-4v-9b
    llava1_5-7b-instruct            swift/llava-1___5-7b-hf
    llava1_5-13b-instruct           swift/llava-1___5-13b-hf
    internvl_chat_v1_5_int8         AI-ModelScope/InternVL-Chat-V1-5-int8
    internvl-chat-v1_5              AI-ModelScope/InternVL-Chat-V1-5
    mini-internvl-chat-2b-v1_5      OpenGVLab/Mini-InternVL-Chat-2B-V1-5
    mini-internvl-chat-4b-v1_5      OpenGVLab/Mini-InternVL-Chat-4B-V1-5
    internvl2-2b                    OpenGVLab/InternVL2-2B
    internvl2-4b                    OpenGVLab/InternVL2-4B
    internvl2-8b                    OpenGVLab/InternVL2-8B
    internvl2-26b                   OpenGVLab/InternVL2-26B
    internvl2-40b                   OpenGVLab/InternVL2-40B

  4. Click Deploy.

Call a service

Use WebUI for model inference

  1. On the Elastic Algorithm Service (EAS) page, click the service name, click Web application in the upper-right corner, and follow the instructions to open WebUI.

  2. On the WebUI page, perform model inference.

Use API for model inference

  1. Obtain the endpoint and token.

    1. On the Elastic Algorithm Service (EAS) page, click the service name. In the Basic Information section, click View Endpoint Information.

    2. In the Invocation Information pane, obtain the token and endpoint.
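    The endpoint and token obtained here are used as the request URL and the Authorization header in every API call. A minimal sketch of how they fit together (build_invocation is a hypothetical helper; <service_url> and <token> are placeholders for the values from the console):

    ```python
    def build_invocation(endpoint, token, api_path):
        """Combine the EAS endpoint, token, and API path into a URL and headers."""
        url = f"{endpoint.rstrip('/')}/{api_path.lstrip('/')}"
        headers = {'Authorization': token}
        return url, headers

    url, headers = build_invocation('http://<service_url>/', '<token>', '/infer_forward')
    print(url)      # http://<service_url>/infer_forward
    print(headers)  # {'Authorization': '<token>'}
    ```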

  2. Call APIs for model inference.

    Available APIs:

    Get inference result

    Obtain inference result.

    Note

    WebUI and API calls cannot be used at the same time. If you have already chatted through the WebUI, call the clear chat history API before calling the infer forward API.
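    To illustrate the required order, the sketch below builds the two requests with the standard library's urllib but does not send them (build_post is a hypothetical helper; <service_url> and <token> are placeholders):

    ```python
    import json
    from urllib import request

    def build_post(base_url, path, token, payload=None):
        """Build (but do not send) a POST request to an EAS API path."""
        data = json.dumps(payload).encode('utf-8') if payload is not None else None
        req = request.Request(f"{base_url.rstrip('/')}/{path}", data=data, method='POST')
        req.add_header('Authorization', token)
        return req

    # Clear the WebUI chat history first, then send the question:
    calls = [
        build_post('http://<service_url>', 'clear_history', '<token>'),
        build_post('http://<service_url>', 'infer_forward', '<token>',
                   {'prompt': 'Please describe the image'}),
    ]
    print([c.full_url for c in calls])
    ```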

    Replace the following parameters in the sample code:

    • hosts: The service endpoint obtained in Step 1.

    • authorization: The service token obtained in Step 1.

    • prompt: The question content. English is recommended.

    • image_path: The local path of the image.

    Request input and output descriptions:

    • Input parameters:

      • prompt (String, required): The question content. No default value.

      • image (String): The image, encoded in Base64. Default: None.

      • chat_history (List[List]): The chat history. Default: [].

      • temperature (Float): Controls the randomness of the model output. Higher values produce more random output; 0 produces deterministic output. Valid range: 0 to 1. Default: 0.2.

      • top_p (Float): The proportion of candidate tokens to sample from the generated results. Default: 0.7.

      • max_output_tokens (Int): The maximum number of output tokens. Default: 512.

      • use_stream (Bool): Specifies whether to enable streaming output. Valid values: True and False. Default: True.

    • Output: answer to the question (STRING type).
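    Taken together, the parameters above form a JSON request body like the following sketch (values are illustrative; image is omitted for brevity):

    ```python
    import json

    # Minimal infer_forward request body built from the parameters above.
    # Only prompt is required; the remaining fields show the documented defaults.
    body = {
        "prompt": "Please describe the image",
        "chat_history": [],
        "temperature": 0.2,
        "top_p": 0.7,
        "max_output_tokens": 512,
        "use_stream": False,  # set True for streaming output
    }
    print(json.dumps(body, indent=2))
    ```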

    Python code example for model inference:

    import requests
    import json
    import base64
    
    
    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    def post_infer(prompt, image=None, chat_history=None, temperature=0.2, top_p=0.7, max_output_tokens=512, use_stream=True, url='http://127.0.0.1:7860', headers=None):
        # Avoid mutable default arguments and do not modify the caller's headers dict.
        headers = dict(headers or {})
        datas = {
            "prompt": prompt,
            "image": image,
            "chat_history": chat_history if chat_history is not None else [],
            "temperature": temperature,
            "top_p": top_p,
            "max_output_tokens": max_output_tokens,
            "use_stream": use_stream,
        }
    
        if use_stream:
            headers.update({'Accept': 'text/event-stream'})
    
            response = requests.post(f'{url}/infer_forward', json=datas, headers=headers, stream=True, timeout=1500)
    
            if response.status_code != 200:
                print(f"Request failed with status code {response.status_code}")
                return
            process_stream(response)
    
        else:
            r = requests.post(f'{url}/infer_forward', json=datas, headers=headers, timeout=1500)
            data = r.content.decode('utf-8')
    
            print(data)
    
    
    def image_to_base64(image_path):
        """
        Convert an image file to a Base64 encoded string.
    
        :param image_path: The file path to the image.
        :return: A Base64 encoded string representation of the image.
        """
        with open(image_path, "rb") as image_file:
            # Read the binary data of the image
            image_data = image_file.read()
            # Encode the binary data to Base64
            base64_encoded_data = base64.b64encode(image_data)
            # Convert bytes to string and remove any trailing newline characters
            base64_string = base64_encoded_data.decode('utf-8').replace('\n', '')
        return base64_string
    
    
    def process_stream(response, previous_text=""):
        MARK_RESPONSE_END = '##END'  # DO NOT CHANGE
        buffer = previous_text
        current_response = ""
    
        for chunk in response.iter_content(chunk_size=100):
            if chunk:
                text = chunk.decode('utf-8')
                current_response += text
    
                parts = current_response.split(MARK_RESPONSE_END)
                for part in parts[:-1]:
                    new_part = part[len(previous_text):]
                    if new_part:
                        print(new_part, end='', flush=True)
    
                    previous_text = part
    
                current_response = parts[-1]
    
        remaining_new_text = current_response[len(previous_text):]
        if remaining_new_text:
            print(remaining_new_text, end='', flush=True)
    
    
    if __name__ == '__main__':
        # Replace <service_url> with the service endpoint.
        hosts = '<service_url>'
        # Replace <token> with the service token.
        head = {
            'Authorization': '<token>'
        }
    
        # get chat history
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
    
        # The content of the question. A question in English is recommended.
        prompt = 'Please describe the image'
        # Replace path_to_your_image with the local path of the image.
        image_path = 'path_to_your_image'
        image_base_64 = image_to_base64(image_path)
    
        post_infer(prompt=prompt, image=image_base_64, chat_history=chat_history, use_stream=False, url=hosts, headers=head)
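    The image_to_base64 helper from the example above can be sanity-checked offline with arbitrary bytes standing in for an image file:

    ```python
    import base64
    import os
    import tempfile

    def image_to_base64(image_path):
        """Read a file and return its contents as a Base64 string (as in the example above)."""
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    # Round-trip check: encode a temporary file, then decode and compare.
    payload = b"\x89PNG-like fake image bytes"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        encoded = image_to_base64(tmp.name)
        assert base64.b64decode(encoded) == payload
    finally:
        os.unlink(tmp.name)
    ```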
    

    Get chat history

    Obtain chat history.

    • Replace the following parameters in the sample code:

      • hosts: The service endpoint obtained in Step 1.

      • authorization: The service token obtained in Step 1.

    • No input parameters required.

    • Output parameters:

      • chat_history (List[List]): The conversation history.
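    For illustration, the sketch below parses a hypothetical /get_history response; the question/answer pairing of each inner list is an assumption, not confirmed by this reference:

    ```python
    import json

    # Hypothetical /get_history response body; the [question, answer] pair
    # structure of each turn is an assumption for illustration only.
    raw = '{"chat_history": [["Please describe the image", "A cat sitting on a mat."]]}'
    chat_history = json.loads(raw)['chat_history']

    for turn in chat_history:
        question, answer = turn[0], turn[1]
        print(f"Q: {question}")
        print(f"A: {answer}")
    ```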

    Python code example for obtaining chat history:

    import requests
    import json
    
    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    if __name__ == '__main__':
        # Replace <service_url> with the service URL
        hosts = '<service_url>'
        # Replace <token> with the service token
        head = {
            'Authorization': '<token>'
        }
    
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
        print(chat_history)
    

    Clear chat history

    Clear chat history.

    • Replace the following parameters in the sample code:

      • hosts: The service endpoint obtained in Step 1.

      • authorization: The service token obtained in Step 1.

    • No input parameters required.

    • Returns: success.

    Python code example for clearing chat history:

    import requests
    import json
    
    
    def post_clear_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/clear_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    if __name__ == '__main__':
        # Replace <service_url> with the service endpoint.
        hosts = '<service_url>'
        # Replace <token> with the service token.
        head = {
            'Authorization': '<token>'
        }
        clear_info = post_clear_history(url=hosts, headers=head)
        print(clear_info)