
Platform For AI: Quickly deploy an MLLM in EAS

Last Updated: Sep 06, 2024

Multimodal large language models (MLLMs) can process and integrate multiple forms of data, such as text, images, and audio, which allows them to comprehend complex scenarios and tasks. MLLMs are suitable for scenarios that require cross-modal comprehension and generation. You can use Elastic Algorithm Service (EAS) to deploy popular MLLMs as inference services with a few clicks. This topic describes how to deploy and call MLLM inference services by using EAS.

Background information

In recent years, large language models (LLMs) have achieved unprecedented results in language tasks. LLMs generate natural language text and demonstrate strong capabilities across multiple types of tasks, such as sentiment analysis, machine translation, and text summarization. However, these models are limited to text data and cannot process other forms of data, such as images, audio, or video. Only models with multimodal comprehension can come close to the all-round perception of the human brain.

MLLMs were introduced to address this limitation. As models such as GPT-4o gain wide adoption, MLLMs have become increasingly popular in the industry. MLLMs process and integrate multimodal data, such as text, images, and audio, which enables a more comprehensive understanding of complex scenarios and tasks.

EAS allows you to deploy popular MLLMs as inference services with a few clicks and call them immediately after deployment.

Prerequisites

Deploy a model service in EAS

  1. Go to the EAS page.

    1. Log on to the Platform for AI (PAI) console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace to which you want to deploy the model and click its name to go to the Workspace Details page.

    3. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Custom Model Deployment section, click Custom Deploy.

  3. On the Create Service page, configure the parameters. The following table describes the key parameters. For information about other parameters, see Deploy a model service in the PAI console.

    Parameter                 Description

    Model Service Information
    Deployment Method         Select Deploy Web App by Using Image.
    Select Image              Select chat-mllm-webui from the PAI Image drop-down list and select 1.0 from the version drop-down list.
                              Note: We recommend that you select the latest version of the image when you deploy the model service.
    Command to Run            After you select an image, the system automatically configures this parameter. You can modify the model_type parameter to deploy a different model. The following Models table lists the supported model types.

    Resource Deployment Information
    Resource Configuration    Select the GPU configuration. We recommend that you select the ml.gu7i.c16m60.1-gu30 instance type, which is the most cost-effective option.

    Models

    model_type                    Model link
    qwen_vl_chat                  qwen/Qwen-VL-Chat
    qwen_vl_chat_int4             qwen/Qwen-VL-Chat-Int4
    qwen_vl                       qwen/Qwen-VL
    glm4v_9b_chat                 ZhipuAI/glm-4v-9b
    llava1_5-7b-instruct          swift/llava-1___5-7b-hf
    llava1_5-13b-instruct         swift/llava-1___5-13b-hf
    internvl_chat_v1_5_int8       AI-ModelScope/InternVL-Chat-V1-5-int8
    internvl-chat-v1_5            AI-ModelScope/InternVL-Chat-V1-5
    mini-internvl-chat-2b-v1_5    OpenGVLab/Mini-InternVL-Chat-2B-V1-5
    mini-internvl-chat-4b-v1_5    OpenGVLab/Mini-InternVL-Chat-4B-V1-5
    internvl2-2b                  OpenGVLab/InternVL2-2B
    internvl2-4b                  OpenGVLab/InternVL2-4B
    internvl2-8b                  OpenGVLab/InternVL2-8B
    internvl2-26b                 OpenGVLab/InternVL2-26B
    internvl2-40b                 OpenGVLab/InternVL2-40B

  4. After you configure the parameters, click Deploy.

Call a service

Use the web UI to perform model inference

  1. Find the service that you want to manage and click View Web App in the Service Type column.

  2. On the web UI page, perform model inference.

Call API operations to perform model inference

  1. Obtain the endpoint and token of the service.

    1. Go to the Elastic Algorithm Service (EAS) page. For more information, see the Deploy a model service in EAS section of this topic.

    2. Click the name of the service. The details page of the service appears.

    3. On the Service Details tab, click View Endpoint Information in the Basic Information section. On the Public Endpoint tab of the Invocation Method dialog box, obtain the endpoint and token of the service.

  2. Call API operations to perform model inference.

    PAI provides the following APIs:

    infer forward

    Obtain the inference result. The following sample code provides an example of how to call this API in Python:

    import requests
    import json
    import base64
    
    
    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    def post_infer(prompt, image=None, chat_history=None, temperature=0.2, top_p=0.7, max_output_tokens=512, use_stream=True, url='http://127.0.0.1:7860', headers=None):
        # Avoid mutable default arguments and copy headers so the caller's dict is not modified.
        chat_history = chat_history if chat_history is not None else []
        headers = dict(headers) if headers else {}
        datas = {
            "prompt": prompt,
            "image": image,
            "chat_history": chat_history,
            "temperature": temperature,
            "top_p": top_p,
            "max_output_tokens": max_output_tokens,
            "use_stream": use_stream,
        }
    
        if use_stream:
            headers.update({'Accept': 'text/event-stream'})
    
            response = requests.post(f'{url}/infer_forward', json=datas, headers=headers, stream=True, timeout=1500)
    
            if response.status_code != 200:
                print(f"Request failed with status code {response.status_code}")
                return
            process_stream(response)
    
        else:
            r = requests.post(f'{url}/infer_forward', json=datas, headers=headers, timeout=1500)
            data = r.content.decode('utf-8')
    
            print(data)
    
    
    def image_to_base64(image_path):
        """
        Convert an image file to a Base64 encoded string.
    
        :param image_path: The file path to the image.
        :return: A Base64 encoded string representation of the image.
        """
        with open(image_path, "rb") as image_file:
            # Read the binary data of the image
            image_data = image_file.read()
            # Encode the binary data to Base64
            base64_encoded_data = base64.b64encode(image_data)
            # Convert bytes to string and remove any trailing newline characters
            base64_string = base64_encoded_data.decode('utf-8').replace('\n', '')
        return base64_string
    
    
    def process_stream(response, previous_text=""):
        MARK_RESPONSE_END = '##END'  # DO NOT CHANGE
        buffer = previous_text
        current_response = ""
    
        for chunk in response.iter_content(chunk_size=100):
            if chunk:
                text = chunk.decode('utf-8')
                current_response += text
    
                parts = current_response.split(MARK_RESPONSE_END)
                for part in parts[:-1]:
                    new_part = part[len(previous_text):]
                    if new_part:
                        print(new_part, end='', flush=True)
    
                    previous_text = part
    
                current_response = parts[-1]
    
        remaining_new_text = current_response[len(previous_text):]
        if remaining_new_text:
            print(remaining_new_text, end='', flush=True)
    
    
    if __name__ == '__main__':
        hosts = 'xxx'
        head = {
            'Authorization': 'xxx'
        }
    
        # get chat history
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
    
        prompt = 'Please describe the image'
        image_path = 'path_to_your_image'
        image_base_64 = image_to_base64(image_path)
    
        post_infer(prompt = prompt, image = image_base_64, chat_history = chat_history, use_stream=False, url=hosts, headers=head) 
    

    • The following table describes the key parameters.

      Parameter        Description
      hosts            The service endpoint that you obtained in Step 1.
      authorization    The service token that you obtained in Step 1.
      prompt           The content of the question. A question in English is recommended.
      image_path       The local path of the image file.

    • The following table describes the input parameters.

      Parameter            Data type     Default value       Description
      prompt               String        No default value    The content of the question. This parameter is required.
      image                Base64        No default value    The Base64-encoded image.
      chat_history         List[List]    []                  The chat history.
      temperature          Float         0.2                 The randomness of the model output. A larger value indicates higher randomness. The value 0 produces a fixed output. Valid values: 0 to 1.
      top_p                Float         0.7                 The nucleus sampling (top-p) threshold, which controls the proportion of candidate tokens considered during generation.
      max_output_tokens    Int           512                 The maximum number of output tokens.
      use_stream           Bool          True                Specifies whether to enable streaming output. Valid values: True and False.

    • The output is an answer to the question and is of the STRING type.
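
    • For reference, the following minimal sketch sends a single non-streaming request directly to the infer_forward endpoint without the helper functions above. The service endpoint, token, and image path are placeholders that you replace with the values obtained in Step 1; the request fields match the input parameters described above.

    import base64
    import requests

    # Placeholder values: replace with the endpoint and token obtained in Step 1.
    url = '<service-endpoint>'
    headers = {'Authorization': '<service-token>'}

    # Encode a local image as a Base64 string, as expected by the image field.
    with open('<path_to_your_image>', 'rb') as f:
        image_b64 = base64.b64encode(f.read()).decode('utf-8')

    payload = {
        'prompt': 'Please describe the image',
        'image': image_b64,
        'chat_history': [],
        'temperature': 0.2,
        'top_p': 0.7,
        'max_output_tokens': 512,
        'use_stream': False,  # request a single complete answer instead of a stream
    }

    # The service returns the answer as a plain string.
    response = requests.post(f'{url}/infer_forward', json=payload, headers=headers, timeout=1500)
    print(response.content.decode('utf-8'))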

    get chat history

    Obtain the chat history. The following sample code provides an example of how to call this API in Python:

    import requests
    import json
    
    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    if __name__ == '__main__':
        hosts = 'xxx'
        head = {
            'Authorization': 'xxx'
        }
    
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
        print(chat_history)
    

    • The following table describes the key parameters.

      Parameter        Description
      hosts            The service endpoint that you obtained in Step 1.
      authorization    The service token that you obtained in Step 1.

    • No input parameters are required.

    • The following table describes the output parameters.

      Parameter       Data type     Description
      chat_history    List[List]    The chat history.

    clear chat history

    Clear the chat history. The following sample code provides an example of how to call this API in Python:

    import requests
    import json
    
    
    def post_clear_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/clear_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    if __name__ == '__main__':
        hosts = 'xxx'
        head = {
            'Authorization': 'xxx'
        }
    
        clear_info = post_clear_history(url=hosts, headers=head)
        print(clear_info)
    

    • The following table describes the key parameters.

      Parameter        Description
      hosts            The service endpoint that you obtained in Step 1.
      authorization    The service token that you obtained in Step 1.

    • No input parameters are required.

    • If the request is successful, success is returned.
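
    • The three APIs can be chained into a simple session flow, as in the following sketch. The service endpoint, token, and image path are placeholders; the endpoint paths and request fields mirror the sample code above, and optional fields such as temperature fall back to the documented defaults.

    import base64
    import json
    import requests

    # Placeholder values: replace with the endpoint and token obtained in Step 1.
    url = '<service-endpoint>'
    headers = {'Authorization': '<service-token>'}

    # Encode a local image for the infer forward request.
    with open('<path_to_your_image>', 'rb') as f:
        image_b64 = base64.b64encode(f.read()).decode('utf-8')

    # 1. Ask a question about the image (non-streaming).
    payload = {
        'prompt': 'Please describe the image',
        'image': image_b64,
        'chat_history': [],
        'use_stream': False,
    }
    answer = requests.post(f'{url}/infer_forward', json=payload, headers=headers, timeout=1500)
    print(answer.content.decode('utf-8'))

    # 2. Read back the chat history that the service keeps.
    history = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
    print(json.loads(history.content.decode('utf-8'))['chat_history'])

    # 3. Clear the chat history. The service returns success.
    cleared = requests.post(f'{url}/clear_history', headers=headers, timeout=1500)
    print(cleared.content.decode('utf-8'))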