Platform for AI: Deploy an LLM as a service

Last Updated: Feb 06, 2025

Elastic Algorithm Service (EAS) of Platform for AI (PAI) provides a scenario-based deployment mode that allows you to deploy an open source large language model (LLM) by configuring parameters. This topic describes how to use EAS to deploy and call an LLM service.

Feature overview

Large language models (LLMs), such as ChatGPT and the TongYi Qianwen (Qwen) model series, have garnered significant attention, especially for inference tasks. EAS allows you to deploy an LLM in a convenient and efficient manner and supports the following deployment options:

  • Quick deployment of open-source models: EAS allows you to deploy various open source LLMs, including DeepSeek-R1, DeepSeek-V3, QVQ-72B-Preview, QwQ-32B-Preview, Llama, Qwen, Marco, internlm3, Qwen2-VL, and AlphaFold2. The following deployment modes are supported: standard deployment, BladeLLM-based accelerated deployment, and vLLM-based accelerated deployment.

  • High-performance deployment: The BladeLLM engine developed by PAI is used to implement LLM inference with low latency and high throughput. High-performance deployment supports both open source public models and custom models. To deploy a custom model, select this deployment option.

The following table describes the differences between the two deployment options.

| Type | Quick deployment of open-source models | High-performance deployment |
| --- | --- | --- |
| Model configuration | Open source public models | Open source public models; custom models |
| Accelerated framework | Accelerated deployment: BladeLLM; accelerated deployment: vLLM; standard deployment (without acceleration) | Accelerated deployment: BladeLLM |
| Calling method | Standard deployment: API calling and WebUI calling. Accelerated deployment: API calling. | API calling |

This topic uses quick deployment of open-source models as an example to describe how to deploy an LLM service. For information about how to perform high-performance deployment, see Get Started with BladeLLM.

Deploy an EAS service

  1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Elastic Algorithm Service (EAS).

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section of the Deploy Service page, select LLM Deployment.

  3. On the LLM Deployment page, configure the parameters described in the following table.

    | Section | Parameter | Description |
    | --- | --- | --- |
    | Basic Information | Service Name | Specify a name for the model service. |
    | Basic Information | Version | Select Open-source Model Quick Deployment. For information about how to perform high-performance deployment, see Get Started with BladeLLM. |
    | Basic Information | Model Type | Select a model category. |
    | Basic Information | Deployment Method | The supported deployment methods vary by model category: accelerated deployment with BladeLLM, accelerated deployment with vLLM, and standard deployment (no acceleration framework). You can view the deployment methods available for a specific model category when you deploy a service. Accelerated deployment supports only API inference. |
    | Resource Deployment | Resource Type | By default, Public Resources is selected. If you want to use dedicated resources to deploy a service, you can use EAS resource groups or resource quotas. For more information about how to purchase resource groups and create resource quotas, see Work with dedicated resource groups and Lingjun resource quotas. Note: You can use resource quotas only in the China (Ulanqab) and Singapore regions. |
    | Resource Deployment | Deployment Resources | When you use public resources, the system automatically selects an appropriate instance type after you select a model category. |

  4. Click Deploy.

Call an EAS service

The available calling methods depend on the deployment method. Use the method that matches the way your service was deployed.

Standard deployment

Call an EAS service by using WebUI

  1. Find the desired service and click View Web App in the Service Type column.

  2. Test the inference performance on the WebUI page.

    Enter the dialogue content in the text box on the ChatLLM-WebUI page. For example, you can enter What is the capital of Canada? and click Send to start a dialogue.

Call an EAS service by using API operations

  1. Retrieve the service endpoint and token.

    1. Log on to the PAI console, select a workspace, and go to the Elastic Algorithm Service (EAS) page.

    2. Click the name of the desired service to view its details page.

    3. In the Basic Information section, click View Call Information. On the Public Endpoint Call tab, retrieve the service token and endpoint.

  2. To call API operations to perform inference, use one of the following methods:

    Use HTTP

    • Non-streaming mode

      You can use curl to send the following types of standard HTTP requests.

      • STRING requests

        curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v

        Replace $authorization with the service token. Replace $host with the service endpoint. The chatllm_data.txt file is a plain text file that contains the prompt, such as what is the capital of Canada?

      • Structured requests

        curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"

        Use the chatllm_data.json file to configure inference parameters. The following sample code provides a format example of the chatllm_data.json file:

        {
          "max_new_tokens": 4096,
          "use_stream_chat": false,
          "prompt": "What is the capital of Canada?",
          "system_prompt": "Act like you are a knowledgeable assistant who can provide information on geography and related topics.",
          "history": [
            [
              "Can you tell me what's the capital of France?",
              "The capital of France is Paris."
            ]
          ],
          "temperature": 0.8,
          "top_k": 10,
          "top_p": 0.8,
          "do_sample": true,
          "use_cache": true
        }

        The following table describes the parameters in the preceding code. Configure the parameters based on your business requirements.

        | Parameter | Description | Default value |
        | --- | --- | --- |
        | max_new_tokens | The maximum number of output tokens. | 2048 |
        | use_stream_chat | Specifies whether to return the output tokens in streaming mode. | true |
        | prompt | The user prompt. | "" |
        | system_prompt | The system prompt. | "" |
        | history | The dialogue history. The value is in the List[Tuple(str, str)] format. | [()] |
        | temperature | The randomness of the model output. A larger value specifies a higher randomness. The value 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. | 0.95 |
        | top_k | The number of outputs selected from the generated results. | 30 |
        | top_p | The probability threshold of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. | 0.8 |
        | do_sample | Specifies whether to enable output sampling. | true |
        | use_cache | Specifies whether to enable KV cache. | true |
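
The request body described above can also be generated programmatically. The following is a minimal sketch rather than part of any EAS SDK: the helper name build_chatllm_payload is hypothetical, the defaults are copied from the preceding table, and an empty list is assumed for the history default. It rejects unknown parameter names and checks that temperature and top_p fall within the documented range of 0 to 1.

```python
import json

# Defaults copied from the parameter table above.
# Assumption: an empty list is used for "history" (the table lists [()]).
DEFAULTS = {
    "max_new_tokens": 2048,
    "use_stream_chat": True,
    "prompt": "",
    "system_prompt": "",
    "history": [],
    "temperature": 0.95,
    "top_k": 30,
    "top_p": 0.8,
    "do_sample": True,
    "use_cache": True,
}


def build_chatllm_payload(prompt, **overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    payload = {**DEFAULTS, **overrides, "prompt": prompt}
    # Copy the list so that callers never share the module-level default.
    payload["history"] = list(payload["history"])
    for name in ("temperature", "top_p"):
        if not 0 <= payload[name] <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    return payload


if __name__ == "__main__":
    body = build_chatllm_payload(
        "What is the capital of Canada?",
        use_stream_chat=False,
        temperature=0.8,
        top_k=10,
    )
    # Redirect this output to chatllm_data.json to use it with curl.
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Validating on the client side is a design choice made for this sketch; the service may apply its own checks.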

      • You can also implement your client based on the Python requests package. You can use the --prompt parameter to specify the request content, such as python xxx.py --prompt "What is the capital of Canada?".

        import argparse
        import json
        from typing import Iterable, List
        
        import requests
        
        def post_http_request(prompt: str,
                              system_prompt: str,
                              history: list,
                              host: str,
                              authorization: str,
                              max_new_tokens: int = 2048,
                              temperature: float = 0.95,
                              top_k: int = 1,
                              top_p: float = 0.8,
                              langchain: bool = False,
                              use_stream_chat: bool = False) -> requests.Response:
            headers = {
                "User-Agent": "Test Client",
                "Authorization": f"{authorization}"
            }
            if not history:
                history = [
                    (
                        "San Francisco is a",
                        "city located in the state of California in the United States. \
                        It is known for its iconic landmarks, such as the Golden Gate Bridge \
                        and Alcatraz Island, as well as its vibrant culture, diverse population, \
                        and tech industry. The city is also home to many famous companies and \
                        startups, including Google, Apple, and Twitter."
                    )
                ]
            pload = {
                "prompt": prompt,
                "system_prompt": system_prompt,
                "top_k": top_k,
                "top_p": top_p,
                "temperature": temperature,
                "max_new_tokens": max_new_tokens,
                "use_stream_chat": use_stream_chat,
                "history": history
            }
            if langchain:
                pload["langchain"] = langchain
            response = requests.post(host, headers=headers,
                                     json=pload, stream=use_stream_chat)
            return response
        
        # The server returns a JSON response that includes the inference result and dialogue history.
        def get_response(response: requests.Response) -> tuple:
            data = json.loads(response.content)
            output = data["response"]
            history = data["history"]
            return output, history
        
        if __name__ == "__main__":
            parser = argparse.ArgumentParser()
            parser.add_argument("--top-k", type=int, default=4)
            parser.add_argument("--top-p", type=float, default=0.8)
            parser.add_argument("--max-new-tokens", type=int, default=2048)
            parser.add_argument("--temperature", type=float, default=0.95)
            parser.add_argument("--prompt", type=str, default="How can I get there?")
            parser.add_argument("--langchain", action="store_true")
        
            args = parser.parse_args()
        
            prompt = args.prompt
            top_k = args.top_k
            top_p = args.top_p
            use_stream_chat = False
            temperature = args.temperature
            langchain = args.langchain
            max_new_tokens = args.max_new_tokens
        
            host = "<Public endpoint of the EAS service>"
            authorization = "<Public token of the EAS service>"
        
            print(f"Prompt: {prompt!r}\n", flush=True)
            # System prompts can be included in the requests. 
            system_prompt = "Act like you are a programmer with 5+ years of experience."
        
            # Dialogue history can be included in the client request. The client manages the dialogue history to implement multi-round dialogues. In most cases, information from the previous round of dialogue is used. The information is in the List[Tuple(str, str)] format. 
            history = []
            response = post_http_request(
                prompt, system_prompt, history,
                host, authorization,
                max_new_tokens, temperature, top_k, top_p,
                langchain=langchain, use_stream_chat=use_stream_chat)
            output, history = get_response(response)
            print(f" --- output: {output} \n --- history: {history}", flush=True)

        Take note of the following parameters:

        • Set the host parameter to the service endpoint.

        • Set the authorization parameter to the service token.
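
Because the client manages the dialogue history, a multi-round conversation amounts to feeding the history from one response into the next request. The following sketch illustrates that loop under stated assumptions: chat_round and fake_send are hypothetical names, and the response body is assumed to contain the "response" and "history" fields that the sample client reads. fake_send replaces the real HTTP call so that the sketch runs offline.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


def chat_round(send: Callable[[dict], dict], prompt: str,
               history: History) -> Tuple[str, History]:
    """Send one round and return the reply plus the updated history.

    `send` is a hypothetical callable that posts the payload and returns the
    parsed JSON body, assumed to contain "response" and "history".
    """
    payload = {"prompt": prompt, "history": history, "use_stream_chat": False}
    data = send(payload)
    # JSON has no tuple type, so convert the returned pairs back to tuples.
    new_history = [tuple(pair) for pair in data["history"]]
    return data["response"], new_history


def fake_send(payload: dict) -> dict:
    """Stand-in for the real HTTP call so that the sketch runs offline."""
    reply = f"echo: {payload['prompt']}"
    return {
        "response": reply,
        "history": list(payload["history"]) + [[payload["prompt"], reply]],
    }


if __name__ == "__main__":
    history: History = []
    for question in ("Hello!", "Please introduce yourself."):
        answer, history = chat_round(fake_send, question, history)
        print(answer, len(history))
```

To talk to a deployed service, fake_send could be replaced with a function that wraps requests.post(host, headers=headers, json=payload) and returns the parsed JSON body.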

    • Streaming mode

      In streaming mode, the HTTP SSE method is used. You can use the --prompt parameter to specify the request content, such as python xxx.py --prompt "What is the capital of Canada?".

      import argparse
      import json
      from typing import Iterable, List
      
      import requests
      
      
      def clear_line(n: int = 1) -> None:
          LINE_UP = '\033[1A'
          LINE_CLEAR = '\x1b[2K'
          for _ in range(n):
              print(LINE_UP, end=LINE_CLEAR, flush=True)
      
      
      def post_http_request(prompt: str,
                            system_prompt: str,
                            history: list,
                            host: str,
                            authorization: str,
                            max_new_tokens: int = 2048,
                            temperature: float = 0.95,
                            top_k: int = 1,
                            top_p: float = 0.8,
                            langchain: bool = False,
                            use_stream_chat: bool = False) -> requests.Response:
          headers = {
              "User-Agent": "Test Client",
              "Authorization": f"{authorization}"
          }
          if not history:
              history = [
                  (
                      "San Francisco is a",
                      "city located in the state of California in the United States. \
                      It is known for its iconic landmarks, such as the Golden Gate Bridge \
                      and Alcatraz Island, as well as its vibrant culture, diverse population, \
                      and tech industry. The city is also home to many famous companies and \
                      startups, including Google, Apple, and Twitter."
                  )
              ]
          pload = {
              "prompt": prompt,
              "system_prompt": system_prompt,
              "top_k": top_k,
              "top_p": top_p,
              "temperature": temperature,
              "max_new_tokens": max_new_tokens,
              "use_stream_chat": use_stream_chat,
              "history": history
          }
          if langchain:
              pload["langchain"] = langchain
          response = requests.post(host, headers=headers,
                                   json=pload, stream=use_stream_chat)
          return response
      
      
      def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
          for chunk in response.iter_lines(chunk_size=8192,
                                           decode_unicode=False,
                                           delimiter=b"\0"):
              if chunk:
                  data = json.loads(chunk.decode("utf-8"))
                  output = data["response"]
                  history = data["history"]
                  yield output, history
      
      
      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument("--top-k", type=int, default=4)
          parser.add_argument("--top-p", type=float, default=0.8)
          parser.add_argument("--max-new-tokens", type=int, default=2048)
          parser.add_argument("--temperature", type=float, default=0.95)
          parser.add_argument("--prompt", type=str, default="How can I get there?")
          parser.add_argument("--langchain", action="store_true")
          args = parser.parse_args()
      
          prompt = args.prompt
          top_k = args.top_k
          top_p = args.top_p
          use_stream_chat = True
          temperature = args.temperature
          langchain = args.langchain
          max_new_tokens = args.max_new_tokens
      
          host = "<Public endpoint of the EAS service>"
          authorization = "<Public token of the EAS service>"
      
          print(f"Prompt: {prompt!r}\n", flush=True)
          system_prompt = "Act like you are a programmer with 5+ years of experience."
          history = []
          response = post_http_request(
              prompt, system_prompt, history,
              host, authorization,
              max_new_tokens, temperature, top_k, top_p,
              langchain=langchain, use_stream_chat=use_stream_chat)
      
          for h, history in get_streaming_response(response):
              print(
                  f" --- stream line: {h} \n --- history: {history}", flush=True)
      

      Take note of the following parameters:

      • Set the host parameter to the service endpoint.

      • Set the authorization parameter to the service token.
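
The sample client above splits the streamed body on a NUL byte (b"\0") and parses each piece as JSON. The following sketch shows the same parsing step in isolation, assuming that framing and the "response" and "history" fields used in the sample. The function name iter_stream_messages is hypothetical. It buffers incomplete data so that a message split across two network chunks is still parsed correctly.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_stream_messages(byte_chunks: Iterable[bytes]) -> Iterator[Tuple[str, list]]:
    """Yield (response, history) pairs from NUL-delimited JSON messages."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # A network chunk may end in the middle of a message, so only the
        # parts that are terminated by the delimiter are parsed here.
        *complete, buffer = buffer.split(b"\0")
        for raw in complete:
            if raw:
                data = json.loads(raw.decode("utf-8"))
                yield data["response"], data["history"]
    if buffer.strip():
        # A trailing message that was not followed by a delimiter.
        data = json.loads(buffer.decode("utf-8"))
        yield data["response"], data["history"]


if __name__ == "__main__":
    simulated = [
        b'{"response": "Ot',
        b'tawa", "history": []}\0{"response": "Ottawa.", "history": []}\0',
    ]
    for output, history in iter_stream_messages(simulated):
        print(output, history)
```

With a real streaming response from the requests package, response.iter_content(chunk_size=8192) could be passed as byte_chunks.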

    Use WebSocket

    The WebSocket protocol can efficiently handle the dialogue history. You can use the WebSocket method to connect to the service and perform one or more rounds of dialogue. Sample code:

    import os
    import time
    import json
    import struct
    from multiprocessing import Process
    
    import websocket
    
    round = 5
    questions = 0
    
    
    def on_message_1(ws, message):
        if message == "<EOS>":
            print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
                  time.time(), message), flush=True)
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
        else:
            print("{}".format(time.time()))
            print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
                  time.time(), message), flush=True)
    
    
    def on_message_2(ws, message):
        global questions
        print('pid-{} --- message received: {}'.format(os.getpid(), message))
        # end the client-side streaming
        if message == "<EOS>":
            questions = questions + 1
            if questions == 5:
                ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    
    
    def on_message_3(ws, message):
        print('pid-{} --- message received: {}'.format(os.getpid(), message))
        # end the client-side streaming
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    
    
    def on_error(ws, error):
        print('error happened: ', str(error))
    
    
    def on_close(ws, a, b):
        print("### closed ###", a, b)
    
    
    def on_pong(ws, pong):
        print('pong:', pong)
    
    # stream chat validation test
    def on_open_1(ws):
        print('Opening Websocket connection to the server ... ')
        params_dict = {}
        params_dict['prompt'] = """Show me a golang code example: """
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['max_new_tokens'] = 2048
        params_dict['do_sample'] = True
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        # raw_req = f"""To open a Websocket connection to the server: """
    
        ws.send(raw_req)
        # end the client-side streaming
    
    
    # multi-round query validation test
    def on_open_2(ws):
        global round
        print('Opening Websocket connection to the server ... ')
        params_dict = {"max_new_tokens": 6144}
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['use_stream_chat'] = True
        params_dict['prompt'] = "Hello!"
        # Add the system prompt to the existing parameters instead of replacing them.
        params_dict['system_prompt'] = "Act like you are a programmer with 5+ years of experience."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please write a sorting algorithm in Python."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please convert the programming language to Java."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please introduce yourself."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please summarize the dialogue above."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
    
    
    # Langchain validation test.
    def on_open_3(ws):
        global round
        print('Opening Websocket connection to the server ... ')
    
        params_dict = {}
        # params_dict['prompt'] = """To open a Websocket connection to the server: """
        params_dict['prompt'] = """Can you tell me what's the MNN?"""
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['max_new_tokens'] = 2048
        params_dict['use_stream_chat'] = False
        params_dict['langchain'] = True
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
    
    
    authorization = "<Token of the EAS service>"
    host = "ws://" + "<Endpoint of the EAS service without the http:// prefix>"
    
    
    def single_call(on_open_func, on_message_func, on_close_func=on_close):
        ws = websocket.WebSocketApp(
            host,
            on_open=on_open_func,
            on_message=on_message_func,
            on_error=on_error,
            on_pong=on_pong,
            on_close=on_close_func,
            header=[
                'Authorization: ' + authorization],
        )
    
        # Set a ping interval to keep the long-lived connection alive.
        ws.run_forever(ping_interval=2)
    
    
    if __name__ == "__main__":
        for i in range(5):
            p1 = Process(target=single_call, args=(on_open_1, on_message_1))
            p2 = Process(target=single_call, args=(on_open_2, on_message_2))
            p3 = Process(target=single_call, args=(on_open_3, on_message_3))
    
            p1.start()
            p2.start()
            p3.start()
    
            p1.join()
            p2.join()
            p3.join()

    Take note of the following parameters:

    • Set the authorization parameter to the service token.

    • Set the host parameter to the service endpoint. Replace the http prefix in the endpoint with ws.

    • Use the use_stream_chat parameter to specify whether the service returns output in streaming mode. Default value: True.

    • Refer to the on_open_2 function in the preceding code to implement a multi-round dialogue.
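
The prefix replacement described above can be done with a small helper instead of manual string editing. This is a minimal sketch: the function name to_ws_endpoint and the example URL are hypothetical. Mapping https to wss is an assumption based on the standard WebSocket URI schemes; whether your endpoint accepts wss is not confirmed by this topic.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ws_endpoint(endpoint: str) -> str:
    """Convert an HTTP(S) service endpoint to its WebSocket form."""
    # Assumption: https maps to wss, the secure WebSocket scheme.
    scheme_map = {"http": "ws", "https": "wss"}
    parts = urlsplit(endpoint)
    if parts.scheme not in scheme_map:
        raise ValueError(f"Expected an http or https endpoint, got: {endpoint!r}")
    return urlunsplit(parts._replace(scheme=scheme_map[parts.scheme]))


if __name__ == "__main__":
    # The URL below is a placeholder, not a real service endpoint.
    print(to_ws_endpoint("http://example.com/api/predict/demo"))
```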

BladeLLM-based accelerated deployment

BladeLLM-based accelerated deployment allows you to call a service only by calling API operations. To call a service, perform the following steps:

  1. View the service endpoint and token:

    1. On the Elastic Algorithm Service (EAS) page, click the Service Method column of the desired service, and then select Call Information.

    2. In the Call Information dialog box, note the service endpoint and token.

  2. Run the following command in a terminal to call the service and receive the generated text.

    # Call EAS service
    curl -X POST \
        -H "Content-Type: application/json" \
        -H "Authorization: AUTH_TOKEN_FOR_EAS" \
        -d '{"prompt":"What is the capital of Canada?", "stream":"true"}' \
        <service_url>/v1/completions

    Take note of the following parameters:

    • AUTH_TOKEN_FOR_EAS: Replace this with the service token obtained in the previous step.

    • <service_url>: Replace this with the service endpoint obtained in the previous step.

    The command returns output similar to the following:

    data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" The"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":1,"total_tokens":8},"error_info":null}
    
    data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" capital"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":2,"total_tokens":9},"error_info":null}
    
    data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" of"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":3,"total_tokens":10},"error_info":null}
    
    data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" Canada"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":4,"total_tokens":11},"error_info":null}
    
    data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" is"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":5,"total_tokens":12},"error_info":null}
    
    data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" Ottawa"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":6,"total_tokens":13},"error_info":null}
    
    data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":"."}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":7,"total_tokens":14},"error_info":null}
    
    data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"text":""}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":8,"total_tokens":15},"error_info":null}
    
    data: [DONE]
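
Each event in the preceding output carries one text fragment in choices[0].text, and the stream ends with data: [DONE]. The following sketch shows one way to reassemble the full completion from such lines. The function name collect_completion_text is hypothetical, and the field names are taken from the sample output above, not from a formal API specification.

```python
import json
from typing import Iterable


def collect_completion_text(lines: Iterable[str]) -> str:
    """Join the text fragments of a streamed completion into one string."""
    pieces = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # Skip the blank lines that separate events.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # End-of-stream marker.
        event = json.loads(payload)
        pieces.append(event["choices"][0]["text"])
    return "".join(pieces)


if __name__ == "__main__":
    sample = [
        'data: {"choices":[{"finish_reason":"","index":0,"text":" Ottawa"}]}',
        "",
        'data: {"choices":[{"finish_reason":"","index":0,"text":"."}]}',
        "",
        'data: {"choices":[{"finish_reason":"stop","index":0,"text":""}]}',
        "",
        "data: [DONE]",
    ]
    print(collect_completion_text(sample))
```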

vLLM-based accelerated deployment

vLLM-based accelerated deployment allows you to call a service only by calling API operations. To call a service, perform the following steps:

  1. View the service endpoint and token:

    1. On the Elastic Algorithm Service (EAS) page, click the Service Method column of the desired service, and then select Call Information.

    2. In the Call Information dialog box, note the service endpoint and token.

  2. Use one of the following methods to call the service:

    Python

    from openai import OpenAI
    
    ##### API configuration #####
    openai_api_key = "<EAS API KEY>"
    openai_api_base = "<EAS API Endpoint>/v1"
    
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    
    models = client.models.list()
    model = models.data[0].id
    print(model)
    
    
    def main():
    
        stream = True
    
        chat_completion = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "What is the capital of Canada?",
                        }
                    ],
                }
            ],
            model=model,
            max_completion_tokens=2048,
            stream=stream,
        )
    
        if stream:
            for chunk in chat_completion:
                # delta.content can be None for some chunks, so print an empty string instead.
                print(chunk.choices[0].delta.content or "", end="")
        else:
            result = chat_completion.choices[0].message.content
            print(result)
    
    
    if __name__ == "__main__":
        main()
    

    Take note of the following parameters:

    • <EAS API KEY>: Set this parameter to the service token that you obtained.

    • <EAS API Endpoint>: Set this parameter to the service endpoint that you obtained.
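
When streaming is enabled, the reply arrives as a sequence of chunks whose delta.content holds a text fragment. The following sketch collects those fragments into one string and skips chunks that carry no text, which can happen, for example, on the final chunk. The names join_stream_content and fake_chunk are hypothetical; fake_chunk only imitates the attribute layout read by the sample code above so that the sketch runs without a deployed service.

```python
from types import SimpleNamespace
from typing import Iterable


def join_stream_content(chunks: Iterable) -> str:
    """Concatenate delta.content from streamed chat completion chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # Some chunks may carry no choices at all.
        content = chunk.choices[0].delta.content
        if content:  # Skip None and empty fragments.
            parts.append(content)
    return "".join(parts)


def fake_chunk(content):
    """Build an object with the same attribute layout as a streamed chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    simulated = [fake_chunk(None), fake_chunk("Ottawa"), fake_chunk("."), fake_chunk(None)]
    print(join_stream_content(simulated))
```

In the sample above, the chat_completion object returned with stream=True could be passed to join_stream_content in place of the simulated list.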

    CLI

    curl -X POST <service_url>/v1/chat/completions -d '{
        "model": "Qwen2.5-7B-Instruct",
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a helpful and harmless assistant."
                    }
                ]
            },
            {
                "role": "user",
                "content": "What is the capital of Canada?"
            }
        ]
    }' -H "Content-Type: application/json" -H "Authorization: <your-token>"

    Take note of the following parameters:

    • <service_url>: Set this parameter to the service endpoint that you obtained.

    • <your-token>: Set this parameter to the service token that you obtained.

References

You can use EAS to deploy a Retrieval-Augmented Generation (RAG)-based LLM chatbot. The chatbot supports information retrieval by using an on-premises knowledge base. After you use LangChain to integrate your business data, you can use WebUI or API operations to verify the inference capability of a model. For more information, see RAG-based LLM chatbot.