
The Consumption of Tokens by Large Models Can Be Quite Ambiguous

This article discusses the challenges and strategies involved in managing resource consumption in large model applications.

By Wang Cheng

If you are deploying large model applications, give your CEO a heads-up in advance: large model applications are far less controllable in resource cost than web applications.

Classic web applications, such as e-commerce, gaming, travel, renewable energy, education, and healthcare, have controllable CPU consumption that is positively correlated with the number of online users and their session duration. If computing resources suddenly spike, the cause is usually an operational campaign or an unexpected traffic surge; after elastic scaling, the system stabilizes for a while and then scales back to normal. Backend resource consumption is traceable and manageable. Token consumption in large models is not.

Table of Contents

  1. What Factors Affect Token Consumption in Large Models
  2. Hidden Sources of Token Consumption in Large Models
  3. The Resource Consumption Ledger of Agents Is Even More Complex
  4. Initial Exploration on How to Control Abnormal Token Consumption
  5. Summary

1. What Factors Affect Token Consumption in Large Models

According to an article by Quantum Bit [1], given the input “the distance between two paths in a tree,” DeepSeek gets stuck in a seemingly endless thought process, spending 625 seconds thinking and generating 20,000 words of output. The prompt is neither complex nor nonsensical; it looks like a perfectly ordinary question, though one could nitpick that it is not fully specified.

This cycle of repetitive thinking wears the model out and wastes computational resources; if exploited by attackers, it is effectively a DDoS attack on inference models. So, beyond the number of online users and session duration, what other factors influence token consumption in large models?

This article uses DeepSeek as an example; the billing rules and the factors affecting billing for other large model API calls are similar.

According to the billing documentation provided by DeepSeek’s official website [2], the costs of API calls are related to the following parameters:

  • Model Type: The price per million tokens differs between V3 and R1; R1 is priced higher than V3 because of its reasoning capabilities.
  • Number of Input Tokens: Charged by the million tokens; the larger the usage, the higher the cost.
  • Number of Output Tokens: Charged by the million tokens; the larger the usage, the higher the cost, with output prices exceeding input prices.
  • Cache Hits: The unit price is lower for cache hits than for misses.
  • Peak and Off-Peak: The unit price is lower during off-peak hours.
  • Chain of Thought: The tokens generated during the model's reasoning are counted as output tokens.

In addition, the search requests issued during online (web) search and the processing of the returned results (the steps prior to content generation) also count toward token usage. Any operation that invokes the large model consumes tokens.
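To see how these parameters combine into a bill, here is a minimal cost-estimation sketch. All prices and discount factors in it are placeholder assumptions for illustration, not DeepSeek's published rates.

```python
# A minimal cost-estimation sketch. All prices below are placeholder
# assumptions for illustration, not DeepSeek's published rates.

PRICE_PER_MILLION = {
    # (model, "input_hit" | "input_miss" | "output") -> price per 1M tokens
    ("v3", "input_hit"): 0.5,
    ("v3", "input_miss"): 2.0,
    ("v3", "output"): 8.0,
    ("r1", "input_hit"): 1.0,
    ("r1", "input_miss"): 4.0,
    ("r1", "output"): 16.0,
}

OFF_PEAK_DISCOUNT = 0.5  # assumed off-peak multiplier


def estimate_cost(model: str, cache_hit_tokens: int, cache_miss_tokens: int,
                  output_tokens: int, off_peak: bool = False) -> float:
    """Estimate the cost of a single API call from its token counts."""
    cost = (
        cache_hit_tokens / 1_000_000 * PRICE_PER_MILLION[(model, "input_hit")]
        + cache_miss_tokens / 1_000_000 * PRICE_PER_MILLION[(model, "input_miss")]
        + output_tokens / 1_000_000 * PRICE_PER_MILLION[(model, "output")]
    )
    if off_peak:
        cost *= OFF_PEAK_DISCOUNT
    return cost


# Example: an R1 call with 2,000 cached input tokens, 6,000 uncached input
# tokens, and 4,000 output tokens (chain of thought plus final answer),
# made during off-peak hours.
print(f"{estimate_cost('r1', 2_000, 6_000, 4_000, off_peak=True):.4f}")
```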

According to this billing rule, the resource consumption of large models will be related to the following factors:

  • Length of User Input Text: The longer the user's input, the more tokens consumed. As a rough rule of thumb, one Chinese word, one English word, one number, or one symbol counts as about one token.
  • Length of Model Output Text: The longer the output, the more tokens consumed; DeepSeek's output price per token is four times its input price.
  • Size of the Conversation Context: In multi-turn conversations, the model re-reads the previous rounds of dialogue before generating content, which significantly increases the input and therefore the token count.
  • Complexity of the Task: More complex tasks require more tokens. For example, generating long texts (such as translating and interpreting papers) or performing complex reasoning (such as mathematics and science questions) consumes more tokens. Multi-modal workloads and complex agent setups also usually consume more tokens than a dialogue chatbot.
  • Special Characters, Formatting, and Markup: These can increase token consumption. For instance, HTML tags, Markdown formatting, or special symbols may be split into multiple tokens.
  • Different Languages and Encoding Methods: These can affect token consumption. For example, Chinese typically consumes more tokens than English because Chinese characters may require more encoding space.
  • The Model Itself: A larger-parameter version of the same model tends to produce more detailed output and therefore consumes tokens more readily, much as a taller, heavier person burns more energy per unit of movement. In addition, an unoptimized or insufficiently optimized inference layer is more likely to generate invalid, low-quality content, which also increases token consumption, just as a trained athlete who controls their breathing rhythm expends less energy during exercise.
  • Whether the Deep Thinking Function Is Used: The output token count includes all tokens from the thought chain and the final answer, so enabling the deep thinking feature results in higher token consumption.
  • Whether Online Search Is Used: Web search requires the model to query external knowledge bases or websites, and the retrieved material is consumed as input tokens; output that incorporates external links and external knowledge is likewise billed as output tokens.
  • Whether Semantic Caching Is Used: Since cache hits and misses are priced differently, semantic caching can reduce resource consumption, and further optimizing the caching algorithm reduces it even more.
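In practice, the cleanest way to attribute consumption to these factors is to log the usage block returned with every response. Below is a minimal sketch assuming DeepSeek's OpenAI-compatible API (check the base URL and model name against the current documentation); the cache-hit fields are provider-specific, so they are read defensively.

```python
# A minimal per-call token-accounting sketch, assuming DeepSeek's
# OpenAI-compatible API. Verify endpoint and model names before use.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize the billing factors above."}],
)

usage = response.usage
record = {
    "prompt_tokens": usage.prompt_tokens,
    "completion_tokens": usage.completion_tokens,
    "total_tokens": usage.total_tokens,
    # Cache-hit breakdown is provider-specific; read it defensively.
    "prompt_cache_hit_tokens": getattr(usage, "prompt_cache_hit_tokens", None),
    "prompt_cache_miss_tokens": getattr(usage, "prompt_cache_miss_tokens", None),
}
print(record)  # in production, ship this record to your metrics pipeline
```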

2. Hidden Sources of Token Consumption in Large Models

In addition to the previously mentioned factors, there are many hidden factors that can cause abnormal resource consumption in large model applications.

Code Logic Vulnerabilities

  • Uncontrolled Loop Calls: Misconfigured retry mechanisms can cause a single user session to generate duplicate calls.
  • Lack of a Caching Mechanism: High-frequency, repetitive questions that bypass caching burn tokens regenerating near-identical answers (a sketch of both guards follows this list).
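A minimal sketch of the two corresponding guards, a hard retry cap and a simple answer cache, is shown below; the call_model function is a hypothetical wrapper around your LLM client.

```python
# A minimal sketch of two guards against the pitfalls above: a hard cap on
# retries and a cache for repeated questions. `call_model` is a hypothetical
# wrapper around your LLM client.
import hashlib
import time

_answer_cache: dict[str, str] = {}

def cached_call(prompt: str, call_model, max_retries: int = 3) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _answer_cache:                      # repeated question: no tokens spent
        return _answer_cache[key]

    for attempt in range(max_retries):            # bounded retries, never an open loop
        try:
            answer = call_model(prompt)
            _answer_cache[key] = answer
            return answer
        except Exception:
            time.sleep(2 ** attempt)              # back off instead of hammering the API
    raise RuntimeError(f"model call failed after {max_retries} retries")
```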

Prompt Engineering Deficiencies

  • Redundant Context Carrying: Sending the complete conversation history significantly increases the number of tokens in each request; the longer the dialogue, the more tokens consumed (see the trimming sketch after this list).
  • Inefficient Instruction Design: Unstructured prompts reduce generation efficiency and tend to yield longer, less focused output.
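Below is a minimal sketch of trimming the conversation history to a token budget before each request; the count_tokens helper is a crude stand-in for a real tokenizer, and the 4,000-token budget is an arbitrary example.

```python
# A minimal sketch of trimming conversation history before each request,
# keeping only the newest turns that fit a token budget.
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: assume ~4 characters per token.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int = 4000) -> list[dict]:
    """Keep the system prompt plus the newest turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]

    kept, used = [], sum(count_tokens(m["content"]) for m in system)
    for message in reversed(dialogue):            # newest turns first
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + list(reversed(kept))
```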

Ecosystem Dependency Risks

  • Plugin Call Black Hole: Failing to limit plugin call depth can trigger repeated chained calls for a single query (see the sketch after this list).
  • Third-Party Service Fluctuations: Slow vector database responses can lead to timeout retries, indirectly increasing token consumption.
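A minimal sketch of capping plugin-call depth and per-query call count is shown below; invoke_plugin is a hypothetical entry point into your plugin runtime, and the specific limits and timeout are illustrative.

```python
# A minimal sketch of capping plugin-call depth and total calls per query,
# so one question cannot fan out into an unbounded chain. `invoke_plugin`
# is a hypothetical function provided by your plugin runtime.
MAX_DEPTH = 3
MAX_CALLS_PER_QUERY = 10

def call_plugin(name: str, args: dict, invoke_plugin, depth: int = 0,
                budget: list[int] | None = None):
    budget = budget if budget is not None else [MAX_CALLS_PER_QUERY]
    if depth >= MAX_DEPTH:
        raise RuntimeError(f"plugin chain exceeded depth {MAX_DEPTH}")
    if budget[0] <= 0:
        raise RuntimeError("plugin call budget for this query exhausted")
    budget[0] -= 1

    result = invoke_plugin(name, args, timeout=5)  # bound third-party latency too
    # If the plugin asks for a follow-up call, recurse with the same budget.
    if isinstance(result, dict) and "next_plugin" in result:
        return call_plugin(result["next_plugin"], result.get("next_args", {}),
                           invoke_plugin, depth + 1, budget)
    return result
```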

Data Pipeline Deficiencies

  • Defects Introduced During Preprocessing: Data cleaning, preprocessing, and normalization are standard ways to improve input quality, for example by correcting typos, missing values, and noisy data in user inputs. However, these corrections and completions can themselves introduce new defects into the input, leading to abnormal resource consumption.

3. The Resource Consumption Ledger of Agents Is Even More Complex

Speaking of agents, we cannot overlook the recently popular MCP.

In January, we introduced "MCP Ten Questions | Quickly Understand the Model Context Protocol."

We will also release "An Overview of MCP Monetization," so stay tuned to the Higress official account.

For interactions between large models and third-party data, APIs, and systems, MCP replaces fragmented integration methods with a single standard protocol [3]. This is the evolution from N x N to One for All: it eliminates the need to repeatedly code and maintain interfaces to various external systems, allowing AI systems to acquire the data they need more simply and reliably.

Before MCP emerged, agents had to use tools to connect to external systems, and the more complex the planning task, the greater the number and frequency of calls to those systems, resulting in high engineering costs. For example, in the Higress AI Agent's processing flow, when a user asks "I want to have coffee near Wudaokou in Beijing, please recommend something," the agent must use tools to call the Gaode Maps and Dianping APIs. If a model self-correction step is introduced, the call frequency increases further, and every extra round re-sends the growing context to the model, as the sketch below illustrates.
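To make that token growth concrete, here is a minimal sketch of an agent-style tool-calling loop, assuming DeepSeek's OpenAI-compatible tool-calling interface; the single search_nearby tool and its stub implementation are illustrative placeholders, not the actual Gaode Maps or Dianping APIs.

```python
# A minimal sketch of why agent-style tool use multiplies token consumption:
# every tool round re-sends the whole growing message list to the model.
# The tool schema and its stub implementation are illustrative assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_nearby",
        "description": "Search for places near a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"},
                           "keyword": {"type": "string"}},
            "required": ["location", "keyword"],
        },
    },
}]

def run_tool(call) -> str:
    # Stub: a real agent would call a maps or review API here.
    return '{"results": ["Cafe A", "Cafe B"]}'

messages = [{"role": "user",
             "content": "I want coffee near Wudaokou in Beijing, please recommend something."}]
total_prompt_tokens = 0

for _ in range(5):                                    # hard cap on tool rounds
    response = client.chat.completions.create(
        model="deepseek-chat", messages=messages, tools=TOOLS)
    total_prompt_tokens += response.usage.prompt_tokens   # grows every round
    message = response.choices[0].message
    if not message.tool_calls:                        # final answer, loop ends
        break
    messages.append(message)                          # the tool request joins the context
    for call in message.tool_calls:
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": run_tool(call)})  # so does every tool result

print("prompt tokens across all rounds:", total_prompt_tokens)
```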

With the emergence of MCP, a wave of MCP server providers will rapidly arise.

For example, Firecrawl officially adopted the MCP protocol in January of this year through an integration with the Cline platform, letting users access its fully automated web scraping capabilities via Firecrawl's MCP server instead of connecting to each target page individually, which accelerates agent development. Yesterday, OpenAI released the Responses API and open-sourced the Agents SDK. It is believed that MCP and OpenAI will be the two main storylines reshaping the labor market for agents.

We increasingly appreciate the view that "AI targets corporate operational expenses rather than the budget for traditional software"; see our earlier article on forward-looking perspectives on AI in 2025 for more.

Returning to agents: compared with conversational bots, an agent's planning and execution processes are more complex and consume more tokens. A diagram by Zhihu author @tgt shows that from the input phase onward, the agent's planning, memory, calls to external systems, and output execution all invoke the large model and thus consume tokens. If a self-correction pass is added before the final output to improve the result, token costs rise further.

The recently popular Manus showcases many use cases with impressive execution results, but significant computational costs sit behind them. In general, as agents mature, consumption of foundation model calls will increase sharply.

4. Initial Exploration on How to Control Abnormal Token Consumption

Because the factors driving model resource consumption are numerous and complex, no single product or solution can address the issue. A complete engineering system is needed, covering prevention beforehand, handling in the moment, and recovery afterward. Since large-scale token consumption is still in its early days, what follows is only a preliminary discussion; we expect to see many more practices around lean large model costs.

(1) Before Abnormal Calls Occur: Preventive Measures

a. Establish a Real-Time Monitoring and Threshold Alert System

  • Monitoring System: Deploy a resource monitoring dashboard to track metrics, logs, traces, and tokens in real-time. In the event of an abnormal call, faults can be quickly traced and rate-limiting can be applied. [4]
  • Access Control: Apply permission tiers and access control to user identities (such as API keys), with consumer authentication, and limit high-frequency calls to prevent sudden resource occupation from malicious or careless use (see the quota sketch after this list). [5]
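As a complement to gateway-level authentication, the sketch below enforces per-consumer daily token quotas in application code; the quota values, key names, and in-memory store are illustrative assumptions.

```python
# A minimal sketch of per-key quota enforcement at the application layer,
# complementing consumer authentication at the gateway. Quotas, key names,
# and the in-memory store are illustrative assumptions.
from collections import defaultdict
from datetime import date

DAILY_TOKEN_QUOTA = {"internal-tools": 2_000_000, "public-demo": 200_000}
_usage_today: dict[tuple[str, date], int] = defaultdict(int)

def check_and_record(api_key_name: str, tokens_used: int) -> None:
    """Record usage and raise once a consumer exceeds its daily token quota."""
    key = (api_key_name, date.today())
    _usage_today[key] += tokens_used
    quota = DAILY_TOKEN_QUOTA.get(api_key_name, 100_000)   # default for unknown keys
    if _usage_today[key] > quota:
        raise PermissionError(
            f"{api_key_name} exceeded its daily quota of {quota} tokens")

# Example: record the usage returned by each call; further calls are blocked
# once the quota is exhausted.
check_and_record("public-demo", tokens_used=1_500)
```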

b. Data Preprocessing

  • Format Checking: Before calling the model, check user inputs for format, length, sensitive words, and so on, filtering out invalid or abnormal requests (e.g., excessively long texts or special-character attacks) to reduce wasted tokens (see the sketch after this list).
  • RAG Effect Optimization Techniques: Use metadata for structured searches prior to vector retrieval to precisely locate target documents and extract relevant information, shortening input length and reducing token usage.
  • Semantic Caching: Caching large model responses in an in-memory database, implemented as a gateway plugin, reduces inference latency and cost. The gateway layer automatically caches each user's conversation history and reapplies it as context in subsequent turns, improving the model's grasp of the conversation while reducing the token cost of cache misses. [6]
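A minimal sketch of the pre-call format check is shown below; the length limit, blocked-term list, and whitespace rule are illustrative policies to adapt to your own application.

```python
# A minimal sketch of a pre-call input check. The length limit, blocked-term
# list, and whitespace rule are illustrative policy assumptions.
import re

MAX_INPUT_CHARS = 8_000
BLOCKED_TERMS = {"drop table", "ignore previous instructions"}

def validate_input(text: str) -> str:
    """Reject or clean a request before it is ever sent to the model."""
    cleaned = re.sub(r"\s+", " ", text).strip()    # collapse noisy whitespace
    if not cleaned:
        raise ValueError("empty request")
    if len(cleaned) > MAX_INPUT_CHARS:
        raise ValueError(f"input longer than {MAX_INPUT_CHARS} characters")
    lowered = cleaned.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        raise ValueError("input contains a blocked term")
    return cleaned
```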

c. Parameter Tuning

  • Temperature Tuning: Adjust model parameters to shape output behavior; lowering the temperature makes output more deterministic and reduces unnecessary token generation. DeepSeek officially suggests a temperature of 0.0 for code generation and math problem-solving and 1.3 for general dialogue (see the sketch after this list).
  • Output Length Presetting: When calling the model, set a maximum output length in advance and, based on the task, tell the model the approximate range expected. For example, when generating summaries, cap the output at 4K tokens to avoid excessively long texts; DeepSeek supports a maximum output length of 8K.
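Here is a minimal sketch of applying both settings on a single call, again assuming DeepSeek's OpenAI-compatible API; the 1,024-token cap is an arbitrary choice for a short summary task.

```python
# A minimal sketch of output-shaping parameters on a single call, assuming
# DeepSeek's OpenAI-compatible API. The 1,024-token cap is an arbitrary
# choice for a short summary task.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "Answer in at most three sentences."},
        {"role": "user", "content": "Summarize why output tokens cost more than input tokens."},
    ],
    temperature=1.3,      # DeepSeek's suggested value for general dialogue
    max_tokens=1024,      # hard cap on output length (and thus output cost)
)
print(response.choices[0].message.content)
```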

(2) When Abnormal Calls Occur: Real-Time Processing

a. Alerting and Rate-Limiting Mechanisms

  • Alerting: Set dynamic baseline thresholds for key indicators such as token consumption, call frequency, and failure rate, and trigger an alert whenever a threshold is exceeded (see the sketch after this list).
  • Rate Limiting and Circuit Breaking: When a surge in token consumption or an abnormal failure rate is detected, automatic rate limiting keyed on URL parameters, HTTP request headers, client IP addresses, consumer names, or cookie keys can be triggered, and traffic can even be blocked outright to safeguard core functionality and contain the blast radius. [7]
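Below is a minimal sketch of a dynamic baseline detector for token consumption, where the threshold follows a moving average rather than a fixed number; the window size, multiplier, and alert hook are illustrative assumptions.

```python
# A minimal sketch of a dynamic baseline alert for token consumption: the
# threshold tracks a moving average instead of a fixed number. Window size,
# multiplier, and the alert hook are illustrative assumptions.
from collections import deque

class TokenSpikeDetector:
    def __init__(self, window: int = 60, multiplier: float = 3.0):
        self.samples: deque[int] = deque(maxlen=window)  # e.g. tokens per minute
        self.multiplier = multiplier

    def observe(self, tokens_this_minute: int) -> bool:
        """Return True (and alert) when consumption jumps above the baseline."""
        if len(self.samples) == self.samples.maxlen:
            baseline = sum(self.samples) / len(self.samples)
            if tokens_this_minute > baseline * self.multiplier:
                self.alert(tokens_this_minute, baseline)
                return True           # spike excluded from the baseline window
        self.samples.append(tokens_this_minute)
        return False

    def alert(self, current: int, baseline: float) -> None:
        # Replace with a call to your alerting or rate-limiting system.
        print(f"token spike: {current} tokens/min vs baseline {baseline:.0f}")
```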

b. Tracing and Isolation of Abnormal Calls

  • Temporary Blocking: Analyze logs to locate the source of abnormal calls (such as specific users, IPs, or API interfaces) and temporarily block those abnormal requesters to prevent further resource waste.

(3) After Abnormal Calls Occur: Recovery and Optimization

a. Data Compensation and Code Fixing

  • Reducing Statistical Errors: Track and recalibrate statistics for errors caused by delays in data updates (such as missing token consumption records) through offline computational tasks, ensuring the accuracy of billing and monitoring systems.
  • Code Review and Repair: Audit the code that calls large models to fix potential logical errors or vulnerabilities. For example, check for any occurrences of looping calls to the model to avoid abnormal token consumption caused by infinite loops.

b. Attack Traceability and Defense Strategy Upgrades

  • Analyzing Abnormal Call Logs: Identify whether the calls are from adversarial attacks (such as poisoning attacks or malicious generation requests), update blacklist rules, and deploy input filtering models.
  • Enhancing Identity Authentication Mechanisms: Implement dual-factor authentication to prevent resource abuse resulting from API key leaks.
  • Improving Automated Alerting and Processing: Strengthen automated alerting and handling so the system responds faster to abnormal token consumption; for example, refine alert rules so alerts are more accurate and timely, and streamline the exception-handling workflow.

c. Long-term Optimization Measures

  • Tiered Token Management: Grant different permissions to the tokens used by different businesses to reduce the exposure risk of core-service tokens.
  • Automated Testing and Drills: Regularly simulate token abnormal scenarios (such as expiration or failure) to validate the effectiveness of fault tolerance mechanisms.

5. Summary

In the past, we invested significant time and effort in improving infrastructure resource utilization. Today, every enterprise working on AI infrastructure is optimizing resource utilization, from the underlying hardware up through the model layer, the inference optimization layer, and even the gateway entry layer. This will be a long race in which engineering and algorithms advance in tandem.
