A large language model (LLM) is a neural network language model with billions or even hundreds of billions of parameters, such as Generative Pre-Trained Transformer 3 (GPT-3), GPT-4, Pathways Language Model (PaLM), and PaLM 2. When you need to process large amounts of natural language data or build complex language understanding systems, you can expose an LLM as an inference service and call its APIs to integrate advanced natural language processing (NLP) capabilities, such as text classification, sentiment analysis, and machine translation, into your applications. With the LLM-as-a-service mode, you avoid high upfront infrastructure costs and can respond quickly to market changes. Because the LLM runs on the cloud, you can also scale the service at any time to handle spikes in user requests and improve operational efficiency.
Step 1: Build a custom runtime
Build a custom runtime to serve the Hugging Face LLM with a prompt tuning configuration. In this example, the default values point to a pre-built custom runtime image and a pre-built prompt tuning configuration.
Implement a class that inherits from the MLModel class of MLServer.
The peft_model_server.py file contains all the code that serves the Hugging Face LLM with the prompt tuning configuration. The _load_model function in the file loads a pretrained LLM together with the trained PEFT prompt tuning configuration. It also defines a tokenizer that encodes and decodes raw string inputs from inference requests, so that users do not need to preprocess their input into tensor bytes.
The following code block shows the peft_model_server.py file:
from typing import List
from mlserver import MLModel, types
from mlserver.codecs import decode_args
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os

class PeftModelServer(MLModel):
    async def load(self) -> bool:
        # Load the base model, the PEFT prompt tuning configuration, and the tokenizer.
        self._load_model()
        self.ready = True
        return self.ready

    @decode_args
    async def predict(self, content: List[str]) -> List[str]:
        # MLServer decodes the request payload into a list of strings.
        return self._predict_outputs(content)

    def _load_model(self):
        model_name_or_path = os.environ.get("PRETRAINED_MODEL_PATH", "bigscience/bloomz-560m")
        peft_model_id = os.environ.get("PEFT_MODEL_ID", "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM")
        # The tokenizer is expected to be available locally, for example baked into the image.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, local_files_only=True)
        # Load the base model and wrap it with the trained prompt tuning weights.
        config = PeftConfig.from_pretrained(peft_model_id)
        self.model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
        self.model = PeftModel.from_pretrained(self.model, peft_model_id)
        self.text_column = os.environ.get("DATASET_TEXT_COLUMN_NAME", "Tweet text")
        return

    def _predict_outputs(self, content: List[str]) -> List[str]:
        output_list = []
        for input_text in content:
            # Build the prompt in the same format that was used during prompt tuning.
            inputs = self.tokenizer(
                f'{self.text_column} : {input_text} Label : ',
                return_tensors="pt",
            )
            with torch.no_grad():
                inputs = {k: v for k, v in inputs.items()}
                outputs = self.model.generate(
                    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=10, eos_token_id=3
                )
                # Decode the generated token IDs back into a string.
                outputs = self.tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
                output_list.append(outputs[0])
        return output_list
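Optionally, before you build the image, you can smoke-test the model class locally with MLServer. The following model-settings.json and commands are a minimal sketch that assumes peft_model_server.py sits in the current working directory; adjust the model name and paths to your environment.

{
  "name": "peft-model",
  "implementation": "peft_model_server.PeftModelServer"
}

pip install mlserver peft transformers datasets
mlserver start .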
Build a Docker image.
After the model class is implemented, package the class and its dependencies, including MLServer, into an image that can be used by a ServingRuntime resource. You can refer to the following Dockerfile to build the image:
FROM python:3.8-slim-buster
# Install MLServer and the libraries required to load the PEFT prompt-tuned model.
RUN pip install mlserver peft transformers datasets
COPY --chown=${USER} ./peft_model_server.py /opt/peft_model_server.py
ENV PYTHONPATH=/opt/
# Default MLServer settings; the ServingRuntime resource overrides most of them at deployment time.
ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
    MLSERVER_GRPC_PORT=8001 \
    MLSERVER_HTTP_PORT=8002 \
    MLSERVER_LOAD_MODELS_AT_STARTUP=false \
    MLSERVER_MODEL_NAME=peft-model
# Tell MLServer which class implements the model.
ENV MLSERVER_MODEL_IMPLEMENTATION=peft_model_server.PeftModelServer
CMD mlserver start ${MLSERVER_MODELS_DIR}
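For example, you can build and push the image with the following commands. The registry address below is a placeholder; replace it with your own image repository, such as the one referenced by the ServingRuntime resource in the next step.

docker build -t <your-registry>/peft-model-server:latest .
docker push <your-registry>/peft-model-server:latest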
Create a new ServingRuntime resource.
Use the following content to create a new ServingRuntime resource that points to the image you built:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: peft-model-server
  namespace: modelmesh-serving
spec:
  supportedModelFormats:
    - name: peft-model
      version: "1"
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  containers:
    - name: mlserver
      image: registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest
      env:
        - name: MLSERVER_MODELS_DIR
          value: "/models/_mlserver_models/"
        - name: MLSERVER_GRPC_PORT
          value: "8001"
        - name: MLSERVER_HTTP_PORT
          value: "8002"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          value: "true"
        - name: MLSERVER_MODEL_NAME
          value: peft-model
        - name: MLSERVER_HOST
          value: "127.0.0.1"
        - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
          value: "-1"
        - name: PRETRAINED_MODEL_PATH
          value: "bigscience/bloomz-560m"
        - name: PEFT_MODEL_ID
          value: "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM"
      resources:
        requests:
          cpu: 500m
          memory: 4Gi
        limits:
          cpu: "5"
          memory: 5Gi
  builtInAdapter:
    serverType: mlserver
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
Run the following command to deploy the ServingRuntime resource:
kubectl apply -f sample-runtime.yaml
After you create the ServingRuntime resource, you can see the new custom runtime in your ModelMesh deployment.
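For example, you can list the ServingRuntime resources in the namespace to confirm that the custom runtime is registered:

kubectl get servingruntimes -n modelmesh-serving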
Step 2: Deploy an LLM service
To deploy a model by using the newly created runtime, you must create an InferenceService resource to serve the model. This resource is the main interface used by KServe and ModelMesh to manage models. It represents the logical endpoint of the model for serving inferences.
Create an InferenceService resource to serve the model by using the following content:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: peft-demo
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: peft-model
      runtime: peft-model-server
      storage:
        key: localMinIO
        path: sklearn/mnist-svm.joblib
In the YAML file, the InferenceService resource is named peft-demo and its model format is declared as peft-model, the same format as the example custom runtime created in the previous step. The optional runtime field is also specified, explicitly instructing ModelMesh to use the peft-model-server runtime to deploy this model.
Run the following command to deploy the InferenceService resource:
kubectl apply -f ${Name of the YAML file}.yaml
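You can then query the InferenceService resource to check whether the model is ready. The model can serve requests after the READY column shows True:

kubectl get inferenceservice peft-demo -n modelmesh-serving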
Step 3: Perform an inference
Run the following curl command to send an inference request to the LLM service deployed in the previous step:
MODEL_NAME="peft-demo"
ASM_GW_IP="IP address of the ingress gateway"
curl -X POST -k http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer -d @./input.json
The input.json file in the curl command contains the request data:
{
  "inputs": [
    {
      "name": "content",
      "shape": [1],
      "datatype": "BYTES",
      "contents": {"bytes_contents": ["RXZlcnkgZGF5IGlzIGEgbmV3IGJpbm5pbmcsIGZpbGxlZCB3aXRoIG9wdGlvbnBpZW5pbmcgYW5kIGhvcGU="]}
    }
  ]
}
The value of bytes_contents is the Base64-encoded content of the string "Every day is a new beginning, filled with opportunities and hope".
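If you prefer to generate input.json programmatically, the following Python sketch builds the same payload using only the standard library; the file name input.json matches the curl command above.

import base64
import json

text = "Every day is a new beginning, filled with opportunities and hope"

# Build the inference payload; the raw string must be Base64-encoded
# before it is placed in bytes_contents.
payload = {
    "inputs": [
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "contents": {"bytes_contents": [base64.b64encode(text.encode("utf-8")).decode("utf-8")]},
        }
    ]
}

with open("input.json", "w") as f:
    json.dump(payload, f)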
The following code block shows the JSON response:
{
  "modelName": "peft-demo__isvc-5c5315c302",
  "outputs": [
    {
      "name": "output-0",
      "datatype": "BYTES",
      "shape": [
        "1",
        "1"
      ],
      "parameters": {
        "content_type": {
          "stringParam": "str"
        }
      },
      "contents": {
        "bytesContents": [
          "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50"
        ]
      }
    }
  ]
}
The following code block shows the Base64-decoded content of bytesContents. It indicates that the inference request was handled by the LLM service as expected.
Tweet text : Every day is a new binning, filled with optionpiening and hope Label : no complaint
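On the client side, you can recover this text by Base64-decoding the response. The following Python sketch assumes the JSON response shown above has been saved to a file named output.json; the file name is illustrative only.

import base64
import json

# Load the inference response returned by the LLM service.
with open("output.json") as f:
    response = json.load(f)

# Decode the first element of bytesContents back into a human-readable string.
encoded = response["outputs"][0]["contents"]["bytesContents"][0]
print(base64.b64decode(encoded).decode("utf-8"))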