
Application Real-Time Monitoring Service: Connect LLM applications or inference services to ARMS

Last Updated: Mar 11, 2026

The Application Real-Time Monitoring Service (ARMS) Python agent provides OpenTelemetry-based automatic instrumentation for large language model (LLM) applications and inference services. After you connect an LLM application to ARMS, you can view its call chain to analyze information such as the input/output of different operation types and token consumption.

To explore collected traces, see LLM call chain analysis.

Supported frameworks

For supported LLM inference and application frameworks, see Python components and frameworks supported by Application Monitoring.

Install the Python agent

Choose an installation method based on the deployment environment of your LLM application.

Start the application

Prefix your start command with aliyun-instrument:

aliyun-instrument python llm_app.py

Replace llm_app.py with the entry point of your application. If you do not have an LLM application, use one of the demo applications.
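If your application is normally launched through a server command such as uvicorn rather than `python`, prefix that command instead. A hypothetical example, assuming `llm_app.py` exposes an ASGI object named `app`:

```shell
# Illustrative launch command; adjust the module path and options to your application.
aliyun-instrument uvicorn llm_app:app --host 0.0.0.0 --port 8000
```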

Application type auto-detection

The Python agent detects your application type based on installed dependencies:

Installed dependency | Detected type
openai, dashscope, llama_index, or langchain | LLM application
vllm or sglang | LLM inference service

To override auto-detection, set the APSARA_APM_APP_TYPE environment variable:

Value | Type
microservice | Regular microservice application
app | LLM application
model | LLM inference service
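For example, to force a service to be monitored as an LLM inference service regardless of its installed dependencies (a sketch; `llm_app.py` stands in for your entry point):

```shell
export APSARA_APM_APP_TYPE=model
aliyun-instrument python llm_app.py
```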

Verify the connection

After about one minute, open the ARMS console and go to LLM Application Monitoring > Application List. The connection is successful if your application appears in the list and reports data.

Application list showing a connected LLM application

Configure the Python agent

All settings use environment variables. The following table summarizes available options:

Environment variable | Default | Plug-in support | Description
OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT | True | Dify, LangChain | Collect input/output content
PROFILER_GENAI_SPLITAPP_ENABLE | False | Dify | Split LLM sub-applications into separate ARMS applications
OTEL_INSTRUMENTATION_GENAI_MESSAGE_CONTENT_MAX_LENGTH | 4,096 | Dify, LangChain | Cap message content field length (agent >= 1.8.3)
OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT | No limit | All OpenTelemetry-compatible plug-ins | Cap span attribute value length

Input/output content collection

Property | Value
Default | True (enabled)
Supported plug-ins | Dify, LangChain
Environment variable | OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT

When enabled, the agent collects the full content of input/output fields for models, tools, and knowledge bases. When disabled, only the size of these fields is collected.

To disable content collection:

export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=False

LLM application splitting

Property | Value
Default | False (disabled)
Supported plug-ins | Dify
Supported regions | Heyuan, Singapore
Environment variable | PROFILER_GENAI_SPLITAPP_ENABLE

When enabled, each LLM sub-application (such as a Dify Workflow, Agent, or Chat App) is reported as a separate ARMS application.

To enable application splitting:

export PROFILER_GENAI_SPLITAPP_ENABLE=True

Message content field length limit

Property | Value
Default | 4,096 characters
Supported plug-ins | Dify, LangChain
Minimum agent version | 1.8.3
Environment variable | OTEL_INSTRUMENTATION_GENAI_MESSAGE_CONTENT_MAX_LENGTH

This setting caps the length of LLM message content fields (input/output). Content beyond the limit is truncated.

To set a custom limit:

export OTEL_INSTRUMENTATION_GENAI_MESSAGE_CONTENT_MAX_LENGTH=<integer_value>

Replace <integer_value> with the desired character limit.
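The cap behaves as a simple character-level cutoff. A minimal sketch of the semantics (illustrative only, not the agent's actual implementation):

```python
# Illustrative truncation semantics for the message content length cap.
DEFAULT_MAX_LENGTH = 4096  # default of OTEL_INSTRUMENTATION_GENAI_MESSAGE_CONTENT_MAX_LENGTH

def cap_content(content: str, max_length: int = DEFAULT_MAX_LENGTH) -> str:
    """Return content unchanged if within the limit; otherwise truncate it."""
    return content if len(content) <= max_length else content[:max_length]

print(len(cap_content("x" * 10_000)))  # 4096
print(cap_content("short prompt"))     # unchanged
```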

Span attribute value length limit

Property | Value
Default | No limit
Supported plug-ins | All OpenTelemetry-compatible plug-ins (LangChain, DashScope, Dify)
Environment variable | OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT

This setting caps the length of span attribute values such as gen_ai.agent.description. Values beyond the limit are truncated.

To set a custom limit:

export OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT=<integer_value>

Replace <integer_value> with the desired character limit.

Demo applications

OpenAI demo

llm_app.py

import openai
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
def call_openai():
    client = openai.OpenAI(api_key="sk-xxx")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Write a haiku."}],
        max_tokens=20,
    )
    return {"data": f"{response}"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
openai >= 1.0.0
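To try the demo, install the dependencies, start it under the agent, and send one request to produce a trace (these commands assume the file layout above and a valid OpenAI API key in llm_app.py):

```shell
pip install -r requirements.txt
aliyun-instrument python llm_app.py
# In a second terminal, trigger one traced LLM call:
curl http://localhost:8000/
```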

DashScope demo

llm_app.py

from http import HTTPStatus
import dashscope
from dashscope import Generation
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
def call_dashscope():
    dashscope.api_key = 'YOUR-DASHSCOPE-API-KEY'
    responses = Generation.call(model=Generation.Models.qwen_turbo,
                                prompt='How is the weather today?')
    resp = ""
    if responses.status_code == HTTPStatus.OK:
        resp = f"Result is: {responses.output}"
    else:
        resp = f"Failed request_id: {responses.request_id}, status_code: {responses.status_code}, code: {responses.code}, message: {responses.message}"
    return {"data": f"{resp}"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
dashscope >= 1.0.0

LlamaIndex demo

Store knowledge base documents (PDF, TXT, or DOC files) in a data directory.
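For example, the directory can be seeded with a small text file before starting the demo (the file contents below are illustrative):

```shell
# Create the knowledge base directory the demo reads from.
mkdir -p data
printf 'The capital of France is Paris.\n' > data/sample.txt
```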

llm_app.py

import time

from fastapi import FastAPI
import uvicorn
import aiohttp

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.embeddings.dashscope import DashScopeEmbedding
import chromadb
import dashscope
import os
from dotenv import load_dotenv
from llama_index.core.llms import ChatMessage
from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels
import random

load_dotenv()

# Replace with your DashScope API key
os.environ["DASHSCOPE_API_KEY"] = 'sk-xxxxxx'
dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]
api_key = os.environ["DASHSCOPE_API_KEY"]

llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX, api_key=api_key)

# Create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("chapters")

# Define embedding function
embed_model = DashScopeEmbedding(model_name="text-embedding-v1", api_key=api_key)

# Load documents
filename_fn = lambda filename: {"file_name": filename}

# Automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader(
    "./data/", file_metadata=filename_fn
).load_data()

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=4,
    verbose=True
)

# Configure response synthesizer
response_synthesizer = get_response_synthesizer(llm=llm, response_mode="refine")

# Assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

SYSTEM_PROMPT = """
You are a general knowledge chatbot for children. Your task is to generate answers based on user questions by combining the most relevant content found in the knowledge base. Do not answer subjective questions.
"""

# Initialize the conversation with a system message
messages = [ChatMessage(role="system", content=SYSTEM_PROMPT)]

app = FastAPI()


async def fetch(question):
    url = "https://www.aliyun.com"
    call_url = os.environ.get("LLM_INFRA_URL")
    if call_url is None or call_url == "":
        call_url = url
    else:
        call_url = f"{call_url}?question={question}"
    print(call_url)
    async with aiohttp.ClientSession() as session:
        async with session.get(call_url) as response:
            print(f"GET Status: {response.status}")
            data = await response.text()
            print(f"GET Response JSON: {data}")
            return data


@app.get("/heartbeat")
def heartbeat():
    return {"msg": "ok"}


cnt = 0


@app.get("/query")
async def call(question: str = None):
    global cnt
    cnt += 1
    if cnt == 20:
        cnt = 0
        raise Exception("query count exceeded the limit of 20", 401)
    # Add user message to the conversation history
    message = ChatMessage(role="user", content=question)
    # Convert messages into a string
    message_string = f"{message.role}:{message.content}"

    search = await fetch(question)
    print(f"search:{search}")
    resp = query_engine.query(message_string)
    print(resp)
    return {"data": f"{resp}"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
numpy==1.23.5
llama-index==0.10.62
llama-index-core==0.10.28
llama-index-embeddings-dashscope==0.1.3
llama-index-llms-dashscope==0.1.2
llama-index-vector-stores-chroma==0.1.6
aiohttp
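Once the demo is running under the agent, a query such as the following produces a full RAG trace covering retrieval, response synthesis, and the outbound HTTP call made by fetch:

```shell
curl "http://localhost:8000/query?question=What+is+the+capital+of+France"
```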

LangChain demo

llm_app.py

from fastapi import FastAPI
from langchain_community.llms.fake import FakeListLLM
import uvicorn
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

app = FastAPI()
llm = FakeListLLM(responses=["I'll callback later.", "You 'console' them!"])

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

@app.get("/")
def call_langchain():
    res = llm_chain.run(question)
    return {"data": res}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
langchain
langchain_community

Dify demo

For instructions on building a Dify application, see Build a customized AI Q&A assistant for web pages using Dify.