Alibaba Cloud Service Mesh: Turning a Large Language Model into an Inference Service

Last updated: Jun 30, 2024

A large language model (LLM) is a neural-network language model whose parameter count reaches the hundreds of millions or more, such as GPT-3, GPT-4, PaLM, and PaLM 2. When you need to process large volumes of natural-language data or want to build a sophisticated language-understanding system, you can turn an LLM into an inference service and integrate advanced NLP capabilities (such as text classification, sentiment analysis, and machine translation) into your applications through an API. By serving the LLM in this way, you avoid expensive infrastructure costs and can respond to market changes quickly; because the model runs in the cloud, you can also scale the service at any time to handle peaks in user requests, improving operational efficiency.

Prerequisites

Step 1: Build a custom runtime

Build a custom runtime that serves a HuggingFace LLM with a prompt-tuning configuration. The defaults in this example are set to a pre-built custom runtime image and a pre-built prompt-tuning configuration.

  1. Implement a class that inherits from MLServer's MLModel.

    The peft_model_server.py file contains all the code for serving a HuggingFace LLM with a prompt-tuning configuration. The _load_model function in this file selects the pre-trained LLM model together with the trained PEFT prompt-tuning configuration. It also defines the tokenizer that encodes and decodes the raw string inputs of inference requests, so that users do not need to pre-process their inputs into tensor bytes.

    peft_model_server.py:

    from typing import List
    
    from mlserver import MLModel, types
    from mlserver.codecs import decode_args
    
    from peft import PeftModel, PeftConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
    import os
    
    class PeftModelServer(MLModel):
        async def load(self) -> bool:
            self._load_model()
            self.ready = True
            return self.ready
    
        @decode_args
        async def predict(self, content: List[str]) -> List[str]:
            return self._predict_outputs(content)
    
        def _load_model(self):
            # The base model path and the PEFT prompt-tuning adapter ID can be
            # overridden through environment variables.
            model_name_or_path = os.environ.get("PRETRAINED_MODEL_PATH", "bigscience/bloomz-560m")
            peft_model_id = os.environ.get("PEFT_MODEL_ID", "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM")
            self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, local_files_only=True)
            config = PeftConfig.from_pretrained(peft_model_id)
            self.model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
            # Wrap the base model with the trained prompt-tuning weights.
            self.model = PeftModel.from_pretrained(self.model, peft_model_id)
            self.text_column = os.environ.get("DATASET_TEXT_COLUMN_NAME", "Tweet text")
            return
    
        def _predict_outputs(self, content: List[str]) -> List[str]:
            output_list = []
            for input in content:
                # Build the prompt in the same format that was used during prompt tuning.
                inputs = self.tokenizer(
                    f'{self.text_column} : {input} Label : ',
                    return_tensors="pt",
                )
                with torch.no_grad():
                    inputs = {k: v for k, v in inputs.items()}
                    outputs = self.model.generate(
                        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=10, eos_token_id=3
                    )
                    # Decode the generated token IDs back into a string.
                    outputs = self.tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
                output_list.append(outputs[0])
            return output_list
    
  2. Build the Docker image.

    After implementing the model class, you need to package its dependencies, including MLServer, into an image that is supported as a ServingRuntime resource. You can refer to the following Dockerfile to build the image.

    Dockerfile:

    # TODO: choose appropriate base image, install Python, MLServer, and
    # dependencies of your MLModel implementation
    FROM python:3.8-slim-buster
    RUN pip install mlserver peft transformers datasets
    # ...
    
    # The custom `MLModel` implementation should be on the Python search path
    # instead of relying on the working directory of the image. If using a
    # single-file module, this can be accomplished with:
    COPY --chown=${USER} ./peft_model_server.py /opt/peft_model_server.py
    ENV PYTHONPATH=/opt/
    
    # environment variables to be compatible with ModelMesh Serving
    # these can also be set in the ServingRuntime, but this is recommended for
    # consistency when building and testing
    ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
        MLSERVER_GRPC_PORT=8001 \
        MLSERVER_HTTP_PORT=8002 \
        MLSERVER_LOAD_MODELS_AT_STARTUP=false \
        MLSERVER_MODEL_NAME=peft-model
    
    # With this setting, the implementation field is not required in the model
    # settings which eases integration by allowing the built-in adapter to generate
    # a basic model settings file
    ENV MLSERVER_MODEL_IMPLEMENTATION=peft_model_server.PeftModelServer
    
    CMD mlserver start ${MLSERVER_MODELS_DIR}
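
    After the image is built, push it to an image registry that your cluster can pull from. The following is a minimal sketch; the image address registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest matches the one referenced in the ServingRuntime example below and should be replaced with your own repository.

    docker build -t registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest .
    docker push registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest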
    
  3. Create a new ServingRuntime resource.

    1. Save the following content as sample-runtime.yaml to create a new ServingRuntime resource, and point it to the image you just built.

      sample-runtime.yaml:

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: peft-model-server
        namespace: modelmesh-serving
      spec:
        supportedModelFormats:
          - name: peft-model
            version: "1"
            autoSelect: true
        multiModel: true
        grpcDataEndpoint: port:8001
        grpcEndpoint: port:8085
        containers:
          - name: mlserver
            image:  registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest
            env:
              - name: MLSERVER_MODELS_DIR
                value: "/models/_mlserver_models/"
              - name: MLSERVER_GRPC_PORT
                value: "8001"
              - name: MLSERVER_HTTP_PORT
                value: "8002"
              - name: MLSERVER_LOAD_MODELS_AT_STARTUP
                value: "true"
              - name: MLSERVER_MODEL_NAME
                value: peft-model
              - name: MLSERVER_HOST
                value: "127.0.0.1"
              - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
                value: "-1"
              - name: PRETRAINED_MODEL_PATH
                value: "bigscience/bloomz-560m"
              - name: PEFT_MODEL_ID
                value: "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM"
              # - name: "TRANSFORMERS_OFFLINE"
              #   value: "1"  
              # - name: "HF_DATASETS_OFFLINE"
              #   value: "1"    
            resources:
              requests:
                cpu: 500m
                memory: 4Gi
              limits:
                cpu: "5"
                memory: 5Gi
        builtInAdapter:
          serverType: mlserver
          runtimeManagementPort: 8001
          memBufferBytes: 134217728
          modelLoadingTimeoutMillis: 90000
      
    2. Run the following command to deploy the ServingRuntime resource.

      kubectl apply -f sample-runtime.yaml

      After the resource is created, you can see the new custom runtime in the ModelMesh deployment.
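
      You can verify that the new runtime has been registered, for example:

      kubectl get servingruntimes -n modelmesh-serving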

Step 2: Deploy the LLM service

To deploy a model with the newly created runtime, you need to create an InferenceService resource to serve the model. This resource is the main interface that KServe and ModelMesh use to manage models, and it represents the model's logical endpoint for inference.

  1. Use the following content to create an InferenceService resource that serves the model.

    YAML:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: peft-demo
      namespace: modelmesh-serving
      annotations:
        serving.kserve.io/deploymentMode: ModelMesh
    spec:
      predictor:
        model:
          modelFormat:
            name: peft-model
          runtime: peft-model-server
          storage:
            key: localMinIO
            path: sklearn/mnist-svm.joblib
    

    In this YAML, the InferenceService is named peft-demo and declares its model format as peft-model, the same format used by the sample custom runtime created earlier. An optional field, runtime, is also passed to explicitly tell ModelMesh to use the peft-model-server runtime to deploy this model.

  2. Run the following command to deploy the InferenceService resource.

    kubectl apply -f ${actual YAML file name}.yaml
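
    After deployment, you can check whether the InferenceService is ready, for example:

    kubectl get inferenceservice peft-demo -n modelmesh-serving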

Step 3: Run the inference service

Use curl to send an inference request to the LLM model service deployed above.

MODEL_NAME="peft-demo"
ASM_GW_IP="ASM gateway IP address"
curl -X POST -k http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer -d @./input.json

The input.json file in the curl command contains the request data:

{
    "inputs": [
        {
          "name": "content",
          "shape": [1],
          "datatype": "BYTES",
          "contents": {"bytes_contents": ["RXZlcnkgZGF5IGlzIGEgbmV3IGJpbm5pbmcsIGZpbGxlZCB3aXRoIG9wdGlvbnBpZW5pbmcgYW5kIGhvcGU="]}
        }
    ]
}

The bytes_contents field is the Base64 encoding of the string "Every day is a new beginning, filled with opportunities and hope".
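
To build input.json for your own text, you can generate the Base64 value with a standard tool, for example:

echo -n "Every day is a new beginning, filled with opportunities and hope" | base64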

The JSON response looks like the following:

{
 "modelName": "peft-demo__isvc-5c5315c302",
 "outputs": [
  {
   "name": "output-0",
   "datatype": "BYTES",
   "shape": [
    "1",
    "1"
   ],
   "parameters": {
    "content_type": {
     "stringParam": "str"
    }
   },
   "contents": {
    "bytesContents": [
     "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50"
    ]
   }
  }
 ]
}

The Base64-decoded content of bytesContents is shown below, indicating that the LLM model service above responds as expected.

Tweet text : Every day is a new binning, filled with optionpiening and hope Label : no complaint
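
You can decode the returned bytesContents in the same way, for example:

echo "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50" | base64 -d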