Quickly deploy Tongyi Qianwen by using PAI - Platform For AI

This topic describes how to deploy a web application based on the open source model Tongyi Qianwen (Qwen) in Elastic Algorithm Service (EAS) of Platform for AI (PAI), and how to perform model inference on the web page or by calling API operations.

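Background information

Tongyi Qianwen-7B (Qwen-7B) is the 7-billion-parameter model of the Tongyi Qianwen foundation model series developed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model (LLM) that is trained on ultra-large-scale pretraining data. The pretraining data covers a wide range of data types, including a large amount of text, professional books, and code. In addition, the LLM-based AI assistant Qwen-7B-Chat is developed based on Qwen-7B by using alignment mechanisms.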

Prerequisites

EAS is activated, and the default workspace and pay-as-you-go resources are created. For more information, see "Activate PAI and create a default workspace".

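Deploy Qwen-7B

To deploy Qwen-7B as an AI-powered web application, perform the following steps:

Log on to the PAI console. Select a region in the upper part of the page. Then, select the desired workspace and click [Enter Elastic Algorithm Service (EAS)].
Click [Deploy Service]. In the [Custom Model Deployment] section, click [Custom Deployment].

On the [Custom Deployment] page, configure the required parameters. The following table describes the key parameters.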

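Parameter	Description
Service Name	The name of the service. In this example, the service name qwen_demo is used.
Deployment Mode	Select [Deploy Web App by Using Image].
Image Settings	Click [PAI Image], select modelscope-inference from the image drop-down list, and then select 1.8.1 from the image version drop-down list.
Command	`python app.py`
Port Number	8000
Environment Variables	Click [Add] and configure the following environment variables: MODEL_ID: qwen/Qwen-7B-Chat, TASK: chat, REVISION: v1.0.5. For more information about these settings, see the description of Qwen-7B-Chat on the ModelScope website.
Resource Type	Select [Public Resources].
Deployment Resources	Click [GPU] and select the ml.gu7i.c16m60.1-gu30 instance type. Note: In this example, running the model requires a GPU-accelerated instance with at least 20 GB of memory. We recommend that you use ml.gu7i.c16m60.1-gu30 to reduce costs.
Additional System Disk	100 (unit: GB)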

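Click [Deploy]. The Elastic Algorithm Service (EAS) page appears. When [Service Status] changes to [Running], the model is deployed.
Note
In most cases, the deployment takes about 5 minutes to complete. The actual time depends on resource availability, service load, and configuration.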
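
The memory requirement listed for the deployment resources can be sanity-checked with a rough estimate. The following sketch assumes that the 20 GB figure refers to GPU memory and that the weights are loaded in 16-bit precision (2 bytes per parameter). Neither assumption is stated in this topic, and the function name is only illustrative. The estimate ignores activations and the attention cache, which need additional memory at inference time.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumptions (not stated in this topic): exactly 7 billion parameters,
# stored as 16-bit values (2 bytes each).

def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the size of the weights in GiB."""
    return num_params * bytes_per_param / 1024**3


if __name__ == "__main__":
    size = weight_memory_gib(7e9)
    print(f"Qwen-7B weights at 16-bit precision: about {size:.1f} GiB")
```

Under these assumptions the weights alone take about 13 GiB, which leaves some headroom on an instance with at least 20 GB and is consistent with the recommendation above.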

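Perform model inference

After the model is deployed, you can perform model inference in several ways.

Perform model inference on the web UI

Find the service that you want to view and click [View Web App] in the [Service Type] column.
Perform model inference on the web UI.

Perform model inference by using online debugging

In the [Actions] column of the service, click [Online Debugging]. The [Online Debugging] tab appears.

In the [Body] section, specify the request in JSON format and click [Send Request]. The response is returned in the [Debugging Info] section on the right.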

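Note

In this example, the request body and the response body are in JSON format. The input field contains the input text, and the history field contains the conversation history. The history field is a list of two-element lists: in each one, the first element is a question and the second element is the answer to that question.

You can start inference by sending a request without the history field. Example: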

{"input": "Where is the provincial capital of Zhejiang?"}

The service returns a result that contains the history field. Example:

Status Code: 200
Content-Type: application/json
Date: Mon, 14 Aug 2023 12:01:45 GMT
Server: envoy
Vary: Accept-Encoding
X-Envoy-Upstream-Service-Time: 511
Body: {"response":"The provincial capital of Zhejiang is Hangzhou. ","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."]]}

To continue the conversation, include the history field in the next request. Example:

{"input": "What about Jiangsu?", "history": [["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."]]}

The service returns the result. Example:

Status Code: 200
Content-Type: application/json
Date: Mon, 14 Aug 2023 12:01:23 GMT
Server: envoy
Vary: Accept-Encoding
X-Envoy-Upstream-Service-Time: 522
Body: {"response":"The provincial capital of Jiangsu is Nanjing.","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."],[ "What about Jiangsu?","The provincial capital of Jiangsu is Nanjing."]]}
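
The request and response bodies shown above can also be built and parsed in code. The following is a minimal sketch that uses only the input, response, and history fields that appear in the examples. The helper names build_request and parse_response are hypothetical and are not part of any PAI SDK.

```python
import json


def build_request(user_input, history=None):
    """Serialize a request body. The history field is omitted on the first turn."""
    body = {"input": user_input}
    if history:
        body["history"] = history
    return json.dumps(body)


def parse_response(body_text):
    """Return (response, history) from a JSON response body."""
    data = json.loads(body_text)
    return data["response"], data["history"]


if __name__ == "__main__":
    # First turn: no history field.
    print(build_request("Where is the provincial capital of Zhejiang?"))

    # Response body copied from the first example above.
    sample = (
        '{"response":"The provincial capital of Zhejiang is Hangzhou. ",'
        '"history":[["Where is the provincial capital of Zhejiang?",'
        '"The provincial capital of Zhejiang is Hangzhou."]]}'
    )
    answer, history = parse_response(sample)

    # Second turn: pass the returned history back to the service.
    print(build_request("What about Jiangsu?", history))
```

Passing the returned history back unchanged is what lets the service answer a follow-up such as "What about Jiangsu?" in context.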

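Perform model inference by calling API operations

You can call the service by using API operations.

On the service details page, click [View Endpoint Information]. In the [Invocation Method] dialog box, obtain the values of the [Public Endpoint] and [Token] parameters.

In a terminal, call the service based on the obtained information. Example: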

curl -d '{"input": "What about Jiangsu?", "history": [["Where is the provincial capital of Zhejiang?", "The provincial capital of Zhejiang is Hangzhou."]]}' -H "Authorization: xxx" http://xxxx.com

The service returns the result. Example:

{"response":"The provincial capital of Jiangsu is Nanjing.","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."],["What about Jiangsu?","The provincial capital of Jiangsu is Nanjing."]]}

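Send HTTP requests to the service based on your business requirements. For more information about debugging, see the SDKs provided by PAI in the [Deploy inference services] topic. Sample Python code: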

import requests
import json

data = {"input": "Who are you?"}
response = requests.post(url='http://qwen-demo.16623xxxxx.cn-hangzhou.pai-eas.aliyuncs.com/',
              headers={"Authorization": "yourtoken"},
              data=json.dumps(data))

print(response.text)

# Reuse the history returned by the first call to continue the conversation.
data = {"input": "What can you do?", "history": json.loads(response.text)["history"]}


response = requests.post(url='http://qwen-demo.16623xxxxx.cn-hangzhou.pai-eas.aliyuncs.com/',
              headers={"Authorization": "yourtoken"},
              data=json.dumps(data))

print(response.text)
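
For longer conversations, the history handling in the sample above can be wrapped in a small helper. The following is a sketch, not an SDK provided by PAI: the ChatSession class and the post_json function are hypothetical names, the Content-Type header is an assumption that the original sample does not set, and the standard-library urllib.request module is swapped in for requests so that the block has no third-party dependency.

```python
import json
import urllib.request


def post_json(url, token, payload, timeout=60):
    """POST a JSON payload with the service token and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        # Content-Type is an assumption; the original sample sends only Authorization.
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


class ChatSession:
    """Keeps the history returned by the service and resends it on every turn."""

    def __init__(self, url, token, transport=post_json):
        self.url = url
        self.token = token
        self.transport = transport  # injectable so the class can be tested offline
        self.history = []

    def ask(self, text):
        payload = {"input": text}
        if self.history:
            payload["history"] = self.history
        result = self.transport(self.url, self.token, payload)
        self.history = result["history"]
        return result["response"]


# Usage, with the endpoint and token obtained in Step 1:
#   session = ChatSession("<service_url>", "<token>")
#   print(session.ask("Who are you?"))
#   print(session.ask("What can you do?"))
```

Keeping the transport as a parameter is a design choice for illustration: it allows the history logic to be exercised with a stub function instead of a live service.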

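Perform model inference in streaming mode

On the service details page, click [View Endpoint Information]. In the [Invocation Method] dialog box, obtain the values of the [Public Endpoint] and [Token] parameters.

In a terminal, run the following Python code to send a streaming request based on the obtained information.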

#encoding=utf-8
from websockets.sync.client import connect
import os
import platform

def clear_screen():
    if platform.system() == "Windows":
        os.system("cls")
    else:
        os.system("clear")


def print_history(history):
    print("Welcome to the Qwen-7B model. Enter a message to start the conversation. Enter clear to clear the conversation history, or stop to terminate the program.")
    for pair in history:
        print(f"\nUser: {pair[0]}\nQwen-7B: {pair[1]}")


def main():
    history, response = [], ''
    clear_screen()
    print_history(history)
    with connect("<service_url>", additional_headers={"Authorization": "<token>"}) as websocket:

        while True:
            query = input("\nUser: ")
            if query.strip() == "stop":
                break
            websocket.send(query)
            while True:
                msg = websocket.recv()
                
                if msg == '<EOS>':
                    break
                clear_screen()
                print_history(history)
                print(f"\nUser: {query}")
                print("\nQwen-7B: ", end="")
                print(msg)
                response = msg
                
            history.append((query, response))


if __name__ == "__main__":
    main()

Replace <service_url> with the endpoint that you obtained in Step 1, and change http in the endpoint to ws.
Replace <token> with the service token that you obtained in Step 1.
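
The endpoint conversion and the receive loop of the streaming client can be isolated into two small functions. The following is a sketch under stated assumptions: to_ws_url and read_until_eos are hypothetical helper names, the https to wss mapping follows the standard WebSocket scheme pairing rather than anything stated in this topic, and the loop assumes, as the client above appears to, that each message carries the full response generated so far and that the stream ends with the '<EOS>' marker.

```python
def to_ws_url(endpoint):
    """Swap the http or https scheme of an endpoint for ws or wss."""
    if endpoint.startswith("https://"):
        return "wss://" + endpoint[len("https://"):]
    if endpoint.startswith("http://"):
        return "ws://" + endpoint[len("http://"):]
    raise ValueError(f"unexpected endpoint scheme: {endpoint}")


def read_until_eos(recv, eos="<EOS>"):
    """Call recv() until the end marker arrives; return the last message before it.

    Assumption: each message is the response generated so far, so the
    last message before the marker is the complete answer.
    """
    latest = ""
    while True:
        msg = recv()
        if msg == eos:
            return latest
        latest = msg
```

In the client above, read_until_eos(websocket.recv) would stand in for the inner while loop when intermediate output does not need to be displayed.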
