Leap in Log Collection Efficiency: Comprehensive Upgrade from iLogtail to LoongCollector

This article introduces LoongCollector, a comprehensive upgrade from iLogtail that boosts log collection efficiency, pipeline flexibility, and overall performance in observable data management.

Review of iLogtail Development History

● In 2013, the first version of iLogtail was introduced alongside the Apsara 5K system. The Apsara 5K project is a distributed operating system project that schedules 5,000 computers to work together as a supercomputer. At that time, the purpose of developing iLogtail was simple: to collect log data distributed across thousands of machines into a central data warehouse for easy access and analysis. In this phase, iLogtail focused on basic log collection. Its typical technical features included real-time log collection via inotify-driven change detection, Apsara log parsing and structuring, real-time transmission of logs to remote storage, and basic self-monitoring log reporting.

● In 2015, Alibaba began to migrate its group businesses to the cloud, and by Double 11 in 2019 it announced that its core systems had been fully migrated. During this period, the user base of iLogtail expanded from Alibaba Cloud to the entire group, which placed higher demands on the richness and stability of its processing capabilities. In this context, iLogtail developed capabilities such as multi-level feedback queues for fault isolation, checkpoint mechanisms to prevent log loss, multi-tenant management, and more comprehensive log processing.

● In 2017, with the official commercialization of SLS and the launch of ACK, the number of iLogtail users showed geometric growth, and new requirements sprang up. Meanwhile, the usage scenarios of iLogtail expanded from hosts to containers. To adapt to the changes in the new environment, iLogtail evolved a Go plugin system. With the support of this subsystem, iLogtail quickly supported features such as container log collection and automatic tagging of Kubernetes metadata. iLogtail also began to expand support for access to time series and tracing data.

● In 2022, iLogtail was fully open-sourced and upgraded to v1.0.0, marking its maturity. iLogtail had transformed from a log-only collector into a full-featured observable data collector. The 1.0 series delivered complete support for common container runtimes, making it suitable for cloud-native environments. Thanks to contributions from the open-source community, it enriched its output support for downstream ecosystems. It also added support for ingesting profiling data, the fourth pillar of observability.

● In 2024, on its 2nd open-source anniversary, iLogtail released v2.0.0. This version combined community contributions and adapted to market changes. Compared with the 1.0 series, this version significantly improved in usability, performance, and reliability.

From iLogtail to LoongCollector: Beyond Renaming

In 2025, iLogtail was officially upgraded to LoongCollector, marking a new era in the field of log collection and processing. LoongCollector has achieved comprehensive upgrades in log scenarios and has been deeply optimized in terms of functionality, performance, and stability, thus providing users with more efficient, flexible, and reliable log management solutions. Next, we will introduce the upgrades of LoongCollector in detail.

Consolidated Foundation - High Performance and Flexible Pipeline

Overview

The overall architecture of iLogtail was upgraded, especially the main program written in C++. With the introduction of the pipeline concept, the input, processing, and output capabilities are all implemented as plugins that can be freely combined to meet different collection requirements.

In LoongCollector, each collection task corresponds to a collection configuration, which describes how to collect, process, and send the required observable data. In terms of code implementation, each configuration maps to a pipeline in memory, and its general form is as follows:

LoongCollector supports various pipeline forms, aiming to meet the needs of different users for log collection and processing and flexibly adapt to various application scenarios. The specific supported pipeline forms include:

● C++ Input plugin + C++ native plugin

This combination allows users to take advantage of the high-performance features of C++ for log data input and processing. This solution is particularly suitable for scenarios where a large number of logs need to be processed in real time. It can significantly reduce latency and improve performance.

● C++ Input plugin + SPL plugin

The SPL (SLS Processing Language) plugin provides an intuitive and powerful way to analyze and process data. This combination not only improves the ability to handle complex data but also simplifies the user experience.

● C++ Input plugin + Golang extended plugin

By combining the C++ Input plugin with the Golang extended plugin, users can take full advantage of both. The C++ plugin provides high-performance collection capabilities during data collection, while the Golang plugin adds flexibility to data processing.

● Golang Input plugin + Golang extended plugin

The biggest advantage of the Golang Input plugin is its support for multiple data sources, including Systemd, Kafka, and Win event. This combination adapts to a wide variety of data sources to the greatest extent.
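To make the pipeline idea more concrete, the following is a minimal sketch of how one collection configuration could map to an in-memory pipeline made of input, processing, and output plugins. The Pipeline class, the plugin functions, and their names are hypothetical and are not taken from the LoongCollector code base.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Event = Dict[str, str]


@dataclass
class Pipeline:
    """Hypothetical in-memory form of one collection configuration."""
    name: str
    inputs: List[Callable[[], Iterable[Event]]]
    processors: List[Callable[[Event], Event]] = field(default_factory=list)
    flushers: List[Callable[[Event], None]] = field(default_factory=list)

    def run_once(self) -> int:
        """Pull events from every input, process them in order, then flush."""
        count = 0
        for read in self.inputs:
            for event in read():
                for process in self.processors:
                    event = process(event)
                for flush in self.flushers:
                    flush(event)
                count += 1
        return count


def file_input() -> Iterable[Event]:
    # Stand-in for an input plugin such as a file reader.
    yield {"content": "GET /index.html 200"}
    yield {"content": "POST /login 500"}


def split_fields(event: Event) -> Event:
    # Stand-in for a parsing processor plugin.
    method, path, status = event["content"].split(" ")
    return {"method": method, "path": path, "status": status}


sent: List[Event] = []  # stand-in for an output plugin's destination
pipeline = Pipeline("demo", [file_input], [split_fields], [sent.append])
processed = pipeline.run_once()
```

Because each stage is just a list of interchangeable callables, swapping a native processor for an SPL or Golang one would only change the contents of the list, which is the kind of free combination the pipeline forms above describe.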

Hot Load Isolation for Pipeline

LoongCollector uses a bus mode to divide threads by function. Specifically, according to the pipeline architecture, LoongCollector has three worker threads: the Input Runner thread, the Processor Runner thread, and the Flusher Runner thread, which are responsible for running the input plugins, processing plugins, and output plugins of all pipelines, respectively. These threads are connected through buffer queues. To ensure fairness and isolation between pipelines, LoongCollector further adopts the following designs:

● Within each worker thread, each pipeline is allocated a corresponding time slice according to priority.

● Each pipeline has its own independent processing and send queues.
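The two designs above can be sketched roughly as follows: a single shared runner serves every pipeline's own queue in turn, and a higher priority earns a longer slice per round. This is only an illustration; the ProcessorRunner class is hypothetical, and the time slice is modeled as a number of items rather than wall-clock time.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class ProcessorRunner:
    """One shared worker that serves each pipeline's own queue in turn."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[str]] = {}
        self.priority: Dict[str, int] = {}

    def add_pipeline(self, name: str, priority: int) -> None:
        # Each pipeline gets an independent queue, so a backlog in one
        # pipeline never sits in front of another pipeline's data.
        self.queues[name] = deque()
        self.priority[name] = priority

    def push(self, name: str, item: str) -> None:
        self.queues[name].append(item)

    def run_round(self) -> List[Tuple[str, str]]:
        """Serve every pipeline once; the slice length equals its priority."""
        handled = []
        for name, queue in self.queues.items():
            for _ in range(self.priority[name]):
                if not queue:
                    break
                handled.append((name, queue.popleft()))
        return handled


runner = ProcessorRunner()
runner.add_pipeline("high", priority=2)
runner.add_pipeline("low", priority=1)
for i in range(3):
    runner.push("high", f"h{i}")
    runner.push("low", f"l{i}")
first_round = runner.run_round()
```

In this sketch the first round handles two items from the high-priority pipeline and one from the low-priority pipeline, so neither pipeline can starve the other even though they share one worker.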

Based on the preceding description, the bus mode diagram of LoongCollector is as follows:

However, the bus mode inevitably makes isolation harder in multi-tenancy scenarios. Isolating pipelines with dedicated threads is the simplest approach, but the extra threads multiply resource usage, which is unacceptable for an observable data collector.

LoongCollector is deeply optimized in the overall scheduling of the collection configuration and the scheduling of the Flusher thread resource allocation, ensuring the multi-tenancy capability to the greatest extent in the bus mode.

When collection configurations are changed, iLogtail uses the Stop The World method. All collection configurations are suspended, reloaded, and then restarted. If multiple teams or businesses share the same iLogtail instance, they will interfere with each other. For example, if Service A and Service B share the same iLogtail instance, continuous debugging of collection configurations by Service A will inevitably affect the collection of Service B during the debugging phase.

LoongCollector further optimizes the lifecycle management of pipelines. Only pipelines that have changed are replaced, and unchanged pipelines remain intact. This minimizes the impact of collection configuration changes and avoids the impact of Stop The World on the whole.
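The idea of replacing only the changed pipelines can be illustrated with a simple configuration diff. The sketch below is an assumption about how such a comparison could look, not the actual LoongCollector implementation; the function and field names are hypothetical.

```python
from typing import Dict, List, NamedTuple


class ConfigDiff(NamedTuple):
    added: List[str]
    modified: List[str]
    removed: List[str]
    unchanged: List[str]


def diff_configs(running: Dict[str, dict], desired: Dict[str, dict]) -> ConfigDiff:
    """Compare running pipelines with the newly loaded configurations."""
    added = sorted(n for n in desired if n not in running)
    removed = sorted(n for n in running if n not in desired)
    modified = sorted(
        n for n in desired if n in running and desired[n] != running[n]
    )
    unchanged = sorted(
        n for n in desired if n in running and desired[n] == running[n]
    )
    return ConfigDiff(added, modified, removed, unchanged)


# Service A keeps editing its configuration; Service B does not change.
running = {
    "service-a": {"FilePaths": ["/a/*.log"]},
    "service-b": {"FilePaths": ["/b/*.log"]},
}
desired = {
    "service-a": {"FilePaths": ["/a/debug/*.log"]},
    "service-b": {"FilePaths": ["/b/*.log"]},
}
diff = diff_configs(running, desired)
```

Only the pipelines listed in added, modified, and removed would need to be started, swapped, or stopped. Service B lands in unchanged and keeps running while Service A is being debugged.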

Continuous Breakthrough - Steady Improvement in Collection Performance for Core Scenarios

CPU Reduced by an Average of 35% and Memory by 10%

In the single-line mode, resource usage for the same traffic is compared. The lower the bar value, the better.

You can see that LoongCollector has clear advantages over iLogtail in both CPU and memory usage. In particular, the CPU usage of LoongCollector is lower than that of iLogtail by 0.15 cores on average. In terms of memory, the advantage of LoongCollector is not obvious in low-traffic scenarios, but it achieves approximately a 10% reduction in memory usage in high-traffic scenarios.

Maximum Collection Rate in Core Scenarios Increased by an Average of 80%

File collection

As shown in the figure, in typical file collection scenarios, LoongCollector significantly outperforms iLogtail in both single-threaded and multi-threaded scenarios, with an average improvement of 40% in single-threaded scenarios and 80% in single-line scenarios. In multi-threaded scenarios, the improvement is more significant, with an average increase of 80%.

Standard output

● The standard output collection plugin has been refactored into a new plugin, input_container_stdio. This plugin supports log rotation queues, significantly enhancing the stability of standard output collection.

● In terms of performance, the new plugin performs well. In containerd scenarios, the collection performance improves by 200%. In Docker scenarios, the collection performance improves by 100%.

● Regarding resource usage, in containerd scenarios, the new plugin reduces the CPU usage by 20% and memory by 25%. In Docker scenarios, the CPU usage is decreased by 25% and the memory by 20%.

Upgraded Stability - Better Self-monitoring

LoongCollector Instance Monitoring

LoongCollector provides comprehensive instance monitoring features to ensure that users can grasp the usage of system resources and instance exception information. Through intuitive monitoring dashboards, users can view the usage of resources such as CPU, memory, and network in real time, making the running status of each instance transparent. The system can also automatically trigger alerts to promptly notify users of exceptions, helping them quickly locate and handle problems.

File Collection Monitoring

In file collection scenarios, LoongCollector is equipped with powerful data monitoring capabilities, including comprehensive monitoring of the collection directory usage and collection latency. Users can quickly check key information, such as the current number of files in each directory on the overview page, so as to find potential file backlog issues in time. Additionally, on the details page, the self-monitoring system also provides a more in-depth analysis feature to help users identify latency issues during file collection.

Pipeline Details Monitoring

For multi-collection configuration scenarios, the pipeline details monitoring feature of LoongCollector comprehensively displays key data such as the duration and exception information of each collection configuration. Users can clearly see the processing duration of each pipeline stage in this interface, easily identifying performance bottlenecks and potential optimization points. Meanwhile, self-monitoring also records exceptions that occur in each step to help users quickly locate problems and make adjustments. Through in-depth monitoring and analysis of the pipeline process, users can more effectively optimize log collection and processing strategies to improve the performance and reliability of the entire log processing system.

Network Exception Isolation - Tolerating Single-zone Network Exceptions

Another isolation issue in bus mode is the sending exception isolation of pipelines. For example, if a network exception occurs in a region, all pipelines configured with an SLS output plugin and sent to the region will experience sending failures. In the bus mode, as the Flusher Runner thread is globally shared, it will retrieve the data to be sent from the send queue and push it to the sink queue regardless of whether the pipeline has sending exceptions. As a result, requests from the pipeline with sending exceptions are repeatedly sent, which occupies limited network I/O resources and affects the sending of other normal pipelines.

To isolate pipeline sending exceptions in the bus mode, we add a traffic distribution mechanism to LoongCollector to control the behavior of Flusher Runner threads to retrieve data from the send queue.

Traffic control is performed based on three dimensions: Zone, Project, and Logstore.

● Zone throttling: handles network/server issues

● Project throttling: addresses project quota issues

● Logstore throttling: manages shard quota issues

Each throttler uses an adaptive throttling algorithm based on the network congestion control algorithm AIMD (Additive Increase, Multiplicative Decrease). When a sending failure occurs, the sending concurrency is quickly reduced; when sending succeeds, the concurrency is gradually increased. To avoid overreacting to network jitter, the throttler aggregates the sending results of a batch of data over a period before adjusting, which prevents frequent concurrency fluctuations.
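As a rough illustration of AIMD applied to sending concurrency, the sketch below halves the limit when a window of results is mostly failures and raises it by one when a whole window succeeds. The class name, window size, failure ratio, and the floor of one concurrent send are all assumptions made for the example. The real parameters are not given in this article, and the hibernation behavior is not modeled.

```python
class AimdLimiter:
    """Adjusts allowed send concurrency from batched success/failure results."""

    def __init__(self, max_concurrency: int = 16, window: int = 4,
                 failure_ratio: float = 0.5) -> None:
        self.max_concurrency = max_concurrency
        self.limit = max_concurrency
        self.window = window              # results aggregated per decision
        self.failure_ratio = failure_ratio
        self._results = []

    def record(self, success: bool) -> None:
        """Record one send result; adjust the limit once per full window."""
        self._results.append(success)
        if len(self._results) < self.window:
            return  # not enough samples yet, so jitter cannot move the limit
        failures = self._results.count(False)
        if failures / len(self._results) >= self.failure_ratio:
            # Multiplicative decrease: back off quickly from a failing target.
            self.limit = max(1, self.limit // 2)
        elif failures == 0:
            # Additive increase: probe for capacity one step at a time.
            self.limit = min(self.max_concurrency, self.limit + 1)
        self._results.clear()


limiter = AimdLimiter(max_concurrency=16, window=4)
for _ in range(8):          # two full windows of failures
    limiter.record(False)
after_failures = limiter.limit
for _ in range(8):          # two full windows of successes
    limiter.record(True)
after_recovery = limiter.limit
```

With these assumed parameters, two failing windows cut the limit from 16 to 4, while two successful windows only raise it to 6. That asymmetry is what lets a failing target release shared sending resources quickly and regain them cautiously.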

This strategy ensures that when a network exception occurs at a sending target, the number of data packets allowed to be sent to that target drops quickly, minimizing the impact of the problematic target on other normal targets. If the network is interrupted entirely, a hibernation period minimizes unnecessary sending attempts, and data sending resumes promptly when the network is restored.
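One way to picture the three throttling dimensions is a gate that admits a send only when the zone, the project, and the logstore all have spare capacity. The sketch below uses fixed limits for brevity, whereas the article describes adaptive ones. The SendGate class and all target names are hypothetical.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]


class SendGate:
    """Admits a send only if its zone, project, and logstore have capacity."""

    def __init__(self, default_limit: int = 2) -> None:
        self.default_limit = default_limit
        self.limits: Dict[Key, int] = {}
        self.in_flight: Dict[Key, int] = {}

    @staticmethod
    def _keys(zone: str, project: str, logstore: str) -> List[Key]:
        return [("zone", zone), ("project", project),
                ("logstore", f"{project}/{logstore}")]

    def set_limit(self, dimension: str, name: str, limit: int) -> None:
        self.limits[(dimension, name)] = limit

    def try_acquire(self, zone: str, project: str, logstore: str) -> bool:
        keys = self._keys(zone, project, logstore)
        # Refuse if any one of the three dimensions is already saturated.
        if any(self.in_flight.get(k, 0) >= self.limits.get(k, self.default_limit)
               for k in keys):
            return False
        for k in keys:
            self.in_flight[k] = self.in_flight.get(k, 0) + 1
        return True

    def release(self, zone: str, project: str, logstore: str) -> None:
        for k in self._keys(zone, project, logstore):
            self.in_flight[k] -= 1


gate = SendGate(default_limit=2)
gate.set_limit("zone", "zone-b", 0)   # zone-b is failing and throttled to zero
b_allowed = gate.try_acquire("zone-b", "proj-b", "store-b")
a_first = gate.try_acquire("zone-a", "proj-a", "store-a")
a_second = gate.try_acquire("zone-a", "proj-a", "store-a")
a_third = gate.try_acquire("zone-a", "proj-a", "store-a")
gate.release("zone-a", "proj-a", "store-a")
a_after_release = gate.try_acquire("zone-a", "proj-a", "store-a")
```

Sends to the throttled zone are rejected before they reach the network, so they no longer compete with sends to the healthy zone for the shared sending resources.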

In the following example, when LoongCollector simultaneously sends collected data to regions A and B, the network exception in region B does not affect the data collection and sending of region A.

Automatic Detection of Network Quality - Coping with Network Fluctuations

SLS endpoints are divided into internal endpoints and public endpoints. If the internal endpoint is fixed, data sending will be blocked once the internal network fails. Considering this situation, LoongCollector automatically detects the network quality of SLS endpoints. Once the network quality is detected to be poor, it will automatically switch to another endpoint.

As shown in the figure, LoongCollector sends data over the internal network. If an internal network exception occurs, LoongCollector automatically switches to the public endpoint for data sending. Once the internal network is restored, LoongCollector automatically reverts to the internal endpoint. This ensures data sending stability in the case of a single-network exception. The traffic graph shows that endpoint switching causes no traffic fluctuation.
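The switching behavior can be sketched as a small selector that falls back to the public endpoint after several consecutive failed probes of the internal endpoint and reverts as soon as a probe succeeds. The probe mechanism, the threshold, and the endpoint names are assumptions for illustration, since the article does not describe how network quality is measured.

```python
class EndpointSelector:
    """Prefers the internal endpoint and falls back to the public one."""

    def __init__(self, internal: str, public: str,
                 failure_threshold: int = 3) -> None:
        self.internal = internal
        self.public = public
        self.failure_threshold = failure_threshold
        self._internal_failures = 0

    def report_probe(self, internal_ok: bool) -> None:
        """Record one health probe result for the internal endpoint."""
        if internal_ok:
            self._internal_failures = 0
        else:
            self._internal_failures += 1

    def current(self) -> str:
        """Return the endpoint that data should be sent to right now."""
        if self._internal_failures >= self.failure_threshold:
            return self.public
        return self.internal


# Placeholder endpoint names, not real SLS endpoints.
selector = EndpointSelector("internal.example", "public.example")
initial = selector.current()
for _ in range(3):
    selector.report_probe(internal_ok=False)
during_outage = selector.current()
selector.report_probe(internal_ok=True)
after_recovery = selector.current()
```

Requiring several consecutive failures before switching is a design choice in this sketch that keeps a single lost probe from causing the sender to flap between endpoints.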

Seamless Migration - Complete Existing Migration Solution Without Interruptions

Seamless Upgrade from iLogtail to LoongCollector in Host Scenarios

For more information about the seamless upgrade from iLogtail to LoongCollector, see the upgrade documentation. Existing collection configurations and checkpoints are preserved, so the upgrade behaves like a restart. LoongCollector is fully compatible with all the configurations of iLogtail.

Upgrade in Kubernetes Scenarios Without Interruptions

Component resource-level upgrade management

● Uninstalling and reinstalling at the component level will definitely result in a period of service unavailability.

● Resource-level control can minimize the duration of service unavailability.

Adopting affinity control to implement seamless switching of the DaemonSet from logtail-ds to loongcollector-ds

The following figure shows the effect.

Ensuring zero data loss or interruption on a single node

logtail-ds has a checkpoint mechanism. When logtail-ds stops on a node, the offset it has reached in each collected file is persisted to the node's checkpoint. When loongcollector-ds starts, it first reads the offset from the checkpoint and then continues to collect data from that offset. This ensures that data is neither duplicated nor lost.
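The handover can be illustrated with a minimal checkpoint sketch: one collector persists the offset it has read up to, and the next collector resumes from that offset. The JSON layout and function names are hypothetical. A real checkpoint would also need to identify files robustly across rotation, which is not modeled here.

```python
import json
import os
import tempfile
from typing import List


def save_checkpoint(path: str, log_file: str, offset: int) -> None:
    """Persist the read offset; write-then-rename keeps the file consistent."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"file": log_file, "offset": offset}, f)
    os.replace(tmp, path)


def load_offset(path: str, log_file: str) -> int:
    """Return the saved offset for log_file, or 0 if there is none."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except FileNotFoundError:
        return 0
    return data["offset"] if data.get("file") == log_file else 0


def collect(log_file: str, checkpoint: str) -> List[str]:
    """Read everything after the checkpointed offset, then advance it."""
    offset = load_offset(checkpoint, log_file)
    with open(log_file, "rb") as f:
        f.seek(offset)
        data = f.read()
        new_offset = f.tell()
    save_checkpoint(checkpoint, log_file, new_offset)
    return data.decode("utf-8").splitlines()


workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "app.log")
ckpt_path = os.path.join(workdir, "checkpoint.json")

with open(log_path, "w", encoding="utf-8") as f:
    f.write("line-1\nline-2\n")
old_agent_lines = collect(log_path, ckpt_path)   # old agent reads, then stops

with open(log_path, "a", encoding="utf-8") as f:
    f.write("line-3\n")
new_agent_lines = collect(log_path, ckpt_path)   # new agent resumes from offset
```

The second call returns only the line written after the first collector stopped, which corresponds to the "neither duplicated nor lost" behavior described above.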

Kubernetes component update

It can be seen that before and after the component upgrade, both standard output collection and file collection show very stable data trends, with no data interruption or duplication.

New Tag Processing Capability - From Disorder to Unified Management

Tags are important data for iLogtail to identify log metadata in log collection. However, iLogtail has some problems in the processing of tag data.

● Inconsistent tag data sources: iLogtail adds most metadata to tags by default. For a small amount of other metadata (such as inode), it uses the Boolean parameter in the pipeline configuration to determine whether to add them to tags.

● Users cannot rename or delete tags.

● C++ and Go have completely different mechanisms for handling tags, with no unified solution.

LoongCollector optimizes the overall tag processing to address the above issues.

● Input-level tag data is separately processed and controlled by each Input plugin.

● In container scenarios, all tags of the new standard output collection plugin are the same as those of the file collection.

● The tag processing plugins can be used to add, delete, and rename metadata at the instance level. The tag processing features of C++ are the same as those of Go. This ensures that all pipeline types can use the full tag processing capabilities.

The sample code is used to process file Input tags and tags at the instance level.

{
  "configName": "taiye-file-test-new",
  "inputs": [{
    "Type": "input_file",
    "FilePaths": [
      "/home/**/test.log"
    ],
    "EnableContainerDiscovery": true,
    "CollectingContainersMeta": true,
    "ContainerFilters": {
      "IncludeEnv": {
        "aliyun_logs_taiye-file-test": "/home/test.log"
      }
    },
    // Handle the Input tags
    "Tags": {
      "K8sNamespaceTagKey": "my-namespace",
      "ContainerIpTagKey": ""
    }
  }],
  "flushers": [{
    "Type": "flusher_sls",
    "Endpoint": "cn-hangzhou-intranet.log.aliyuncs.com",
    "Logstore": "taiye-file-test-new",
    "Region": "cn-hangzhou",
    "TelemetryType": "logs"
  }],
  "global": {
    // Rename the HOST_NAME tag
    "PipelineMetaTagKey": {
      "HOST_NAME": "taiye-123"
    },
    // Enable instance tag processing
    "EnableProcessorTag": true
  }
}

With this configuration, the __tag__:namespace and __tag__:hostname tags are renamed as configured, and the __tag__:_container_ip tag is deleted.
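The rename and delete semantics of the sample configuration can be summarized in a few lines: a non-empty value renames a tag key, an empty string drops the tag, and tags without a rule pass through unchanged. The function below is a hypothetical sketch of that rule, and the short tag names are simplified stand-ins for the real default keys.

```python
from typing import Dict


def apply_tag_keys(tags: Dict[str, str], tag_keys: Dict[str, str]) -> Dict[str, str]:
    """Rename or drop tags according to a mapping of old key to new key."""
    result = {}
    for key, value in tags.items():
        if key not in tag_keys:
            result[key] = value              # no rule: keep the tag as-is
        elif tag_keys[key] != "":
            result[tag_keys[key]] = value    # non-empty: rename the key
        # empty string: the tag is dropped
    return result


default_tags = {
    "namespace": "prod",
    "hostname": "node-1",
    "container_ip": "10.0.0.7",
    "pod_name": "web-0",
}
rules = {
    "namespace": "my-namespace",   # mirrors K8sNamespaceTagKey in the sample
    "hostname": "taiye-123",       # mirrors HOST_NAME in PipelineMetaTagKey
    "container_ip": "",            # mirrors ContainerIpTagKey set to ""
}
final_tags = apply_tag_keys(default_tags, rules)
```

Applying the same rule function to every pipeline type is one way to get the consistent behavior between C++ and Go that this section describes.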

What's More -- LoongCollector Will Bring a New Experience of Full-stack Collection of Observable Data

LoongCollector, built on iLogtail's high-performance pipeline, integrates Prometheus metric collection and eBPF data collection into that pipeline to fully upgrade its collection capabilities and realize OneAgent-based observable data collection. Stay tuned for more upcoming features.

Community