A large language model (LLM) is a neural network language model with billions or even hundreds of billions of parameters, such as Generative Pre-Trained Transformer 3 (GPT-3), GPT-4, Pathways Language Model (PaLM), and PaLM 2. When you need to process large amounts of natural language data or build complex language understanding systems, you can expose an LLM as an inference service and call its APIs to integrate advanced natural language processing (NLP) capabilities, such as text classification, sentiment analysis, and machine translation, into your applications. With the LLM-as-a-service mode, you avoid high upfront infrastructure costs and can respond quickly to market changes. Because the LLM runs on the cloud, you can also scale the service at any time to handle spikes in user requests and improve operational efficiency.
Step 1: Build a custom runtime
Build a custom runtime to serve the Hugging Face LLM with a prompt tuning configuration. In this example, the default values point to a pre-built custom runtime image and a pre-built prompt tuning configuration.
Implement a class that inherits from the MLModel class of MLServer.
The peft_model_server.py file contains all the code that serves the Hugging Face LLM with the prompt tuning configuration. The _load_model function in the file loads a pretrained LLM together with the trained PEFT prompt tuning configuration. It also defines a tokenizer that encodes and decodes raw string inputs from inference requests, so that users do not need to preprocess their input into tensor bytes.
The following code block shows the peft_model_server.py file:
from typing import List
from mlserver import MLModel, types
from mlserver.codecs import decode_args
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os

class PeftModelServer(MLModel):
    async def load(self) -> bool:
        # Load the base model, the PEFT prompt tuning configuration, and the tokenizer.
        self._load_model()
        self.ready = True
        return self.ready

    @decode_args
    async def predict(self, content: List[str]) -> List[str]:
        # MLServer decodes the request payload into a list of strings.
        return self._predict_outputs(content)

    def _load_model(self):
        model_name_or_path = os.environ.get("PRETRAINED_MODEL_PATH", "bigscience/bloomz-560m")
        peft_model_id = os.environ.get("PEFT_MODEL_ID", "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM")
        # The tokenizer is expected to be available locally, for example baked into the image.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, local_files_only=True)
        # Load the base model and wrap it with the trained prompt tuning weights.
        config = PeftConfig.from_pretrained(peft_model_id)
        self.model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
        self.model = PeftModel.from_pretrained(self.model, peft_model_id)
        self.text_column = os.environ.get("DATASET_TEXT_COLUMN_NAME", "Tweet text")
        return

    def _predict_outputs(self, content: List[str]) -> List[str]:
        output_list = []
        for input_text in content:
            # Build the prompt in the same format that was used during prompt tuning.
            inputs = self.tokenizer(
                f'{self.text_column} : {input_text} Label : ',
                return_tensors="pt",
            )
            with torch.no_grad():
                inputs = {k: v for k, v in inputs.items()}
                outputs = self.model.generate(
                    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=10, eos_token_id=3
                )
                # Decode the generated token IDs back into a string.
                outputs = self.tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
                output_list.append(outputs[0])
        return output_list
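Optionally, before you build the image, you can smoke-test the model class locally with MLServer. The following model-settings.json and commands are a minimal sketch that assumes peft_model_server.py sits in the current working directory; adjust the model name and paths to your environment.

{
  "name": "peft-model",
  "implementation": "peft_model_server.PeftModelServer"
}

pip install mlserver peft transformers datasets
mlserver start .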
Build a Docker image.
After the model class is implemented, package the class and its dependencies, including MLServer, into an image that can be used by a ServingRuntime resource. You can refer to the following Dockerfile to build the image:
FROM python:3.8-slim-buster
# Install MLServer and the libraries required to load the PEFT prompt-tuned model.
RUN pip install mlserver peft transformers datasets
COPY --chown=${USER} ./peft_model_server.py /opt/peft_model_server.py
ENV PYTHONPATH=/opt/
# Default MLServer settings; the ServingRuntime resource overrides most of them at deployment time.
ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
    MLSERVER_GRPC_PORT=8001 \
    MLSERVER_HTTP_PORT=8002 \
    MLSERVER_LOAD_MODELS_AT_STARTUP=false \
    MLSERVER_MODEL_NAME=peft-model
# Tell MLServer which class implements the model.
ENV MLSERVER_MODEL_IMPLEMENTATION=peft_model_server.PeftModelServer
CMD mlserver start ${MLSERVER_MODELS_DIR}
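For example, you can build and push the image with the following commands. The registry address below is a placeholder; replace it with your own image repository, such as the one referenced by the ServingRuntime resource in the next step.

docker build -t <your-registry>/peft-model-server:latest .
docker push <your-registry>/peft-model-server:latest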
Create a new ServingRuntime resource.
Use the following content to create a new ServingRuntime resource that points to the image you built:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: peft-model-server
  namespace: modelmesh-serving
spec:
  supportedModelFormats:
    - name: peft-model
      version: "1"
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  containers:
    - name: mlserver
      image: registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest
      env:
        - name: MLSERVER_MODELS_DIR
          value: "/models/_mlserver_models/"
        - name: MLSERVER_GRPC_PORT
          value: "8001"
        - name: MLSERVER_HTTP_PORT
          value: "8002"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          value: "true"
        - name: MLSERVER_MODEL_NAME
          value: peft-model
        - name: MLSERVER_HOST
          value: "127.0.0.1"
        - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
          value: "-1"
        - name: PRETRAINED_MODEL_PATH
          value: "bigscience/bloomz-560m"
        - name: PEFT_MODEL_ID
          value: "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM"
      resources:
        requests:
          cpu: 500m
          memory: 4Gi
        limits:
          cpu: "5"
          memory: 5Gi
  builtInAdapter:
    serverType: mlserver
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
Run the following command to deploy the ServingRuntime resource:
kubectl apply -f sample-runtime.yaml
After you create the ServingRuntime resource, you can see the new custom runtime in your ModelMesh deployment.
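For example, you can list the ServingRuntime resources in the namespace to confirm that the custom runtime is registered:

kubectl get servingruntimes -n modelmesh-serving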
Step 2: Deploy an LLM service
To deploy a model by using the newly created runtime, you must create an InferenceService resource to serve the model. This resource is the main interface used by KServe and ModelMesh to manage models. It represents the logical endpoint of the model for serving inferences.
Create an InferenceService resource to serve the model by using the following content:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: peft-demo
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: peft-model
      runtime: peft-model-server
      storage:
        key: localMinIO
        path: sklearn/mnist-svm.joblib
In the YAML file, the InferenceService resource is named peft-demo and its model format is declared as peft-model, the same format as the example custom runtime created in the previous step. The optional runtime field is also specified, explicitly instructing ModelMesh to use the peft-model-server runtime to deploy this model.
Run the following command to deploy the InferenceService resource:
kubectl apply -f ${Name of the YAML file}.yaml
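You can then query the InferenceService resource to check whether the model is ready. The model can serve requests after the READY column shows True:

kubectl get inferenceservice peft-demo -n modelmesh-serving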
Step 3: Perform an inference
Run the following curl command to send an inference request to the LLM service deployed in the previous step:
MODEL_NAME="peft-demo"
ASM_GW_IP="IP address of the ingress gateway"
curl -X POST -k http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer -d @./input.json
The input.json file in the curl command contains the request data:
{
  "inputs": [
    {
      "name": "content",
      "shape": [1],
      "datatype": "BYTES",
      "contents": {"bytes_contents": ["RXZlcnkgZGF5IGlzIGEgbmV3IGJpbm5pbmcsIGZpbGxlZCB3aXRoIG9wdGlvbnBpZW5pbmcgYW5kIGhvcGU="]}
    }
  ]
}
The value of bytes_contents is the Base64-encoded content of the string "Every day is a new beginning, filled with opportunities and hope".
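If you prefer to generate input.json programmatically, the following Python sketch builds the same payload using only the standard library; the file name input.json matches the curl command above.

import base64
import json

text = "Every day is a new beginning, filled with opportunities and hope"

# Build the inference payload; the raw string must be Base64-encoded
# before it is placed in bytes_contents.
payload = {
    "inputs": [
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "contents": {"bytes_contents": [base64.b64encode(text.encode("utf-8")).decode("utf-8")]},
        }
    ]
}

with open("input.json", "w") as f:
    json.dump(payload, f)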
The following code block shows the JSON response:
{
  "modelName": "peft-demo__isvc-5c5315c302",
  "outputs": [
    {
      "name": "output-0",
      "datatype": "BYTES",
      "shape": [
        "1",
        "1"
      ],
      "parameters": {
        "content_type": {
          "stringParam": "str"
        }
      },
      "contents": {
        "bytesContents": [
          "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50"
        ]
      }
    }
  ]
}
The following code block shows the Base64-decoded content of bytesContents. It indicates that the inference request was handled by the LLM service as expected.
Tweet text : Every day is a new binning, filled with optionpiening and hope Label : no complaint
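On the client side, you can recover this text by Base64-decoding the response. The following Python sketch assumes the JSON response shown above has been saved to a file named output.json; the file name is illustrative only.

import base64
import json

# Load the inference response returned by the LLM service.
with open("output.json") as f:
    response = json.load(f)

# Decode the first element of bytesContents back into a human-readable string.
encoded = response["outputs"][0]["contents"]["bytesContents"][0]
print(base64.b64decode(encoded).decode("utf-8"))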