Future Direction of Observability in Cloud-Native: A Case Study of Autonomous Driving

Preface

I was honored to attend the cloud-native community meetup in Beijing, where I had the opportunity to discuss cloud-native technologies and applications with industry experts. At the meetup, I gave a talk on observability in cloud-native environments. This article is a written summary of that presentation, and I welcome readers to leave comments for discussion.

Origin of Observability

The concept of observability originated in electrical engineering. As systems grew larger and more complex, engineers needed a mechanism for understanding a system's internal operating state so that they could monitor it and repair problems. To that end, they designed sensors and dashboards that expose the internal state of the system.

A system is said to be observable if, for any possible evolution of state and control vectors, the current state can be estimated using only the information from outputs.
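
For the linear time-invariant case, this definition reduces to a standard rank test. The block below states it in the usual state-space notation. The symbols A, B, C, x, u, y, and n do not appear in the original text and are introduced here only for illustration.

```latex
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t), \qquad x(t) \in \mathbb{R}^{n}

\mathcal{O} =
\begin{bmatrix}
C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1}
\end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n
```

In other words, the outputs y carry enough information to reconstruct every component of the internal state x, which is the same goal that sensors and dashboards serve in practice.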

Electrical engineering has been developing for more than a hundred years, and observability has been continuously improved in each of its sub-fields. Vehicles such as cars and planes are good examples. Leaving aside a project as large as an airplane, even a small car contains hundreds of sensors that monitor conditions inside and outside the vehicle so that it stays stable, comfortable, and safe.

Future of Observability

After more than a hundred years of development, observability in electrical engineering is no longer used only to help people check for and locate problems. In automotive engineering, observability has gone through several stages:

Blindness (no observability): On January 29, 1886, the German engineer Karl Benz applied for a patent on what is widely regarded as the first automobile. At that time, the car had only the most basic ability to drive, and nothing related to observability.
Sensors: As cars entered the market, drivers needed to know whether they were running out of fuel or water. Basic sensors and dashboard gauges were invented for this purpose.
Alarms: To make cars safer, self-inspection and real-time alarm systems were introduced to actively warn the driver of abnormal conditions such as a dead battery, high water temperature, low tire pressure, or brake pad wear.
Auxiliary: Even when an alarm is raised immediately, the driver may react too late or may not want to deal with the issue. Auxiliary systems help here by providing features such as cruise control, active safety, and automatic parking. These systems combine sensors with automatic control and partially handle tasks that drivers cannot or do not want to perform.
Autonomous driving: All of the functions above still require a human driver, whereas autonomous driving does not. The observability system and the control system together allow the car to drive itself.

Core Elements of Autonomous Driving

As the peak of observability in electrical engineering, autonomous driving makes full use of the internal and external data collected by the car. It has several core elements:

Rich data sources: The car has multiple lidar and imaging radar units mounted around its body, which provide high-frame-rate, 360° real-time observation of surrounding objects and their state. Internally, the car knows its current speed, wheel angle, tire pressure, and other information.
Data centralization: Compared with auxiliary driving, the core breakthrough of autonomous driving is that all data from inside and outside the vehicle is processed together. Only then can the value of the data be fully realized, instead of each module working on its own data in isolation.
Powerful computing power: Centralizing data also sharply increases the data volume. Every autonomous driving product is backed by powerful chips, because only sufficient computing power makes it possible to complete the required calculations in the shortest possible time.
Software iteration: Computing power combined with algorithms delivers the intelligence, but no algorithm is flawless. The algorithms are upgraded continuously based on accumulated driving data, so the software system keeps improving and delivers better autonomous driving results.

Observability in IT

After decades of development, monitoring and troubleshooting in IT systems have gradually evolved into observability engineering. Today, the most popular approach is to combine Metrics, Logging, and Tracing.

The preceding figure may already be familiar to you. It comes from a blog post that Peter Bourgon published after attending the 2017 Distributed Tracing Summit, and it briefly describes the definitions of Metrics, Tracing, and Logging and the relationships among them. Each of the three data types plays its own role in observability, and none can be completely replaced by the others.

The following is a typical troubleshooting process, as described by Grafana Loki:

First, exceptions are detected through preset alerts, which are usually based on Metrics or Logging.
After an exception is found, open the monitoring dashboard, find the abnormal curve, and identify the affected module through queries and statistics (Metrics).
Query and analyze the logs of this module and its associated components to find the core error message (Logging).
Finally, locate the code that caused the exception based on the detailed Tracing data.
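
The four steps above can be sketched as a small drill-down over in-memory data. This is only an illustration under stated assumptions: the sample metrics, logs, and traces, the 10% error-rate threshold, and all function names are hypothetical and do not come from Grafana Loki or any real backend.

```python
from collections import Counter

# Hypothetical, hand-made telemetry; a real system would query its
# metrics, logging, and tracing backends instead.
METRICS = [  # (service, minute, error_rate)
    ("checkout", 1, 0.01), ("checkout", 2, 0.02), ("checkout", 3, 0.35),
    ("search", 1, 0.01), ("search", 2, 0.01), ("search", 3, 0.02),
]
LOGS = [  # (service, minute, level, message, trace_id)
    ("checkout", 3, "ERROR", "payment gateway timeout", "t-102"),
    ("checkout", 3, "ERROR", "payment gateway timeout", "t-103"),
    ("checkout", 3, "INFO", "order created", "t-101"),
    ("search", 3, "INFO", "query served", "t-201"),
]
TRACES = {  # trace_id -> spans as (name, duration in ms, status), root span first
    "t-102": [("POST /checkout", 5030, "ERROR"),
              ("PaymentClient.charge", 5000, "ERROR"),
              ("OrderRepo.save", 12, "OK")],
    "t-103": [("POST /checkout", 5021, "ERROR"),
              ("PaymentClient.charge", 5000, "ERROR")],
}

ERROR_RATE_THRESHOLD = 0.10  # assumed alert threshold


def find_alerts(metrics, threshold):
    """Steps 1-2: return (service, minute) pairs whose error rate breaches the threshold."""
    return [(svc, minute) for svc, minute, rate in metrics if rate > threshold]


def top_error_message(logs, service, minute):
    """Step 3: most frequent ERROR message for the module, plus the trace IDs that carry it."""
    errors = [(msg, tid) for svc, m, lvl, msg, tid in logs
              if svc == service and m == minute and lvl == "ERROR"]
    message, _ = Counter(msg for msg, _ in errors).most_common(1)[0]
    return message, [tid for msg, tid in errors if msg == message]


def slowest_failing_span(traces, trace_id):
    """Step 4: name of the slowest non-root span with ERROR status in the trace."""
    failing = [span for span in traces[trace_id][1:] if span[2] == "ERROR"]
    return max(failing, key=lambda span: span[1])[0]


def troubleshoot():
    service, minute = find_alerts(METRICS, ERROR_RATE_THRESHOLD)[0]
    message, trace_ids = top_error_message(LOGS, service, minute)
    span = slowest_failing_span(TRACES, trace_ids[0])
    return service, message, span


print(troubleshoot())
# prints ('checkout', 'payment gateway timeout', 'PaymentClient.charge')
```

The trace ID stored on each log line is what lets the search move from Logging to Tracing. Without a shared identifier, the last step would require guessing which trace belongs to which error.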

The preceding example shows how Metrics, Tracing, and Logging are used together for troubleshooting. Different combinations suit different scenarios. For example, a simple system can trigger alerts and locate problems directly from error messages in its logs, or it can trigger alerts based on metrics (latency and error codes) extracted from traces. On the whole, however, a system with good observability needs all three types of data.
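
As a minimal sketch of the second combination, the code below derives latency and error-rate metrics from raw spans and raises alerts from them. The `Span` fields, the sample data, and both thresholds are assumptions made for illustration, not part of any tracing product.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    duration_ms: float
    status_code: int


def derive_metrics(spans):
    """Aggregate raw spans into per-service request count, error rate, and max latency."""
    grouped = {}
    for span in spans:
        grouped.setdefault(span.service, []).append(span)
    metrics = {}
    for service, items in grouped.items():
        errors = sum(1 for s in items if s.status_code >= 500)
        metrics[service] = {
            "requests": len(items),
            "error_rate": errors / len(items),
            "max_latency_ms": max(s.duration_ms for s in items),
        }
    return metrics


def alerts(metrics, max_error_rate=0.05, max_latency_ms=1000.0):
    """Return the services that breach either (assumed) threshold, in sorted order."""
    return sorted(
        service for service, m in metrics.items()
        if m["error_rate"] > max_error_rate or m["max_latency_ms"] > max_latency_ms
    )


spans = [
    Span("cart", 40.0, 200), Span("cart", 55.0, 200),
    Span("payment", 1800.0, 504), Span("payment", 90.0, 200),
]
metrics = derive_metrics(spans)
print(alerts(metrics))  # prints ['payment']
```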

Observability in Cloud-native

Cloud native brings more than the ability to deploy applications on the cloud. It upgrades the entire IT architecture, including the development model, system architecture, deployment model, and the evolution and iteration of the infrastructure suite. These changes also raise new challenges for observability:

Higher efficiency requirements: As DevOps becomes widespread, the efficiency requirements for planning, development, testing, and delivery keep rising. As a result, we need to know quickly whether a release succeeded, what problems appeared, where they are, and how to solve them.
More complex systems: Architectures have evolved from monolithic to layered and then to today's microservices. Each upgrade improves development efficiency, release efficiency, flexibility, and robustness, but it also makes the system more complex and makes problems harder to locate.
More dynamic environments: Microservice architectures and containerized deployment make the environment more dynamic and shorten the lifecycle of each instance. By the time a problem is noticed, the instance where it occurred has often been destroyed, so logging on to the machine to investigate is no longer an option.
More upstream and downstream dependencies: Locating a problem requires examining upstream and downstream components. Microservice, cloud, and Kubernetes environments involve more of them, including various products and middleware, Kubernetes clusters, container runtimes, and virtual machines.

Savior: OpenTelemetry

Many readers will be deeply familiar with the preceding problems. The industry has launched a variety of observability products in response, including many open-source and commercial projects. For example:

Metrics: Zabbix, Nagios, Prometheus, InfluxDB, OpenFalcon, OpenCensus
Tracing: Jaeger, Zipkin, SkyWalking, OpenTracing, OpenCensus
Logging: ELK (Elasticsearch, Logstash, Kibana), Splunk, SumoLogic, Loki, Loggly

Combining these projects can solve one or several specific problems, but in practice you will run into various issues:

Interleaving multiple solutions: You may need at least three separate solutions, one each for Metrics, Logging, and Tracing, and the maintenance cost is huge.
Data interoperability: Although the data is generated by the same business components and systems, it is hard to correlate across different solutions, so its full value cannot be realized.
Vendor lock-in: An observability system may be tied to one vendor for data collection, transmission, storage, computing, visualization, and alerting. Replacing it after it goes online is very costly.
Poor cloud-native fit: Many of these solutions were designed for traditional systems. They do not support cloud-native environments well, are costly to deploy and use, and do not offer the one-click deployment and out-of-the-box experience expected of cloud-native tooling.

In this context, the OpenTelemetry project was born under the Cloud Native Computing Foundation (CNCF). It aims to unify Logging, Tracing, and Metrics to achieve data interoperability.

Create and collect telemetry data from your services and software, then forward them to a variety of analysis tools.

The core function of OpenTelemetry is to generate and collect observability data and pass it on to various analysis tools. The following figure shows the architecture. The Library generates observability data in a unified format, and the Collector receives the data and forwards it to various backend systems.
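
The division of labor between the Library and the Collector can be illustrated with a toy model. The sketch below does not use the OpenTelemetry API. Every class, field, and backend in it is hypothetical and exists only to show records in one shared shape being fanned out to different backends.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TelemetryRecord:
    """One record in a shared shape for metrics, logs, and spans (hypothetical schema)."""
    kind: str            # "metric", "log", or "span"
    name: str
    body: object
    resource: dict       # metadata shared by all records of a service
    trace_id: str = ""   # lets logs and spans be correlated
    timestamp: float = field(default_factory=time.time)


class InMemoryBackend:
    """Stand-in for an analysis backend; it only remembers what it receives."""
    def __init__(self, accepts):
        self.accepts = set(accepts)
        self.received = []

    def export(self, record):
        self.received.append(record)


class Collector:
    """Receives records and forwards each one to every backend that accepts its kind."""
    def __init__(self, backends):
        self.backends = backends

    def receive(self, record):
        for backend in self.backends:
            if record.kind in backend.accepts:
                backend.export(record)


class Library:
    """Instrumentation-side helper that stamps every record with the same resource metadata."""
    def __init__(self, collector, service):
        self.collector = collector
        self.resource = {"service.name": service}

    def emit(self, kind, name, body, trace_id=""):
        self.collector.receive(
            TelemetryRecord(kind, name, body, self.resource, trace_id))


metrics_store = InMemoryBackend(["metric"])
trace_and_log_store = InMemoryBackend(["span", "log"])
lib = Library(Collector([metrics_store, trace_and_log_store]), service="checkout")

lib.emit("metric", "http.requests", 1)
lib.emit("span", "POST /checkout", {"duration_ms": 120}, trace_id="t-1")
lib.emit("log", "order created", {"level": "INFO"}, trace_id="t-1")

print(len(metrics_store.received), len(trace_and_log_store.received))  # prints 1 2
```

Because the application only talks to the library, swapping a backend means changing the collector's configuration rather than the instrumented code.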

The revolutionary progress that OpenTelemetry brings to cloud-native observability includes:

Unified protocol: OpenTelemetry provides a unified standard for Metrics, Tracing, and Logging. The standard is still under development, and the LogModel has already been defined. All three share the same metadata structure and can easily be correlated with one another.
Unified agent: With one agent, all observability data can be collected and transmitted. There is no need to deploy a separate agent for each system, which greatly reduces resource consumption and simplifies the architecture of the overall observability system.
Cloud-native friendliness: OpenTelemetry is developed under CNCF and offers friendlier support for cloud-native systems. In addition, many cloud vendors have announced support for OpenTelemetry, which will make it more convenient to use in the future.
Vendor independence: The project is completely neutral and does not favor any vendor, so everyone is free to choose or replace service providers without being locked in by a particular vendor.
Compatibility: OpenTelemetry is supported by the various observability projects under CNCF and is expected to be highly compatible with OpenTracing, OpenCensus, Prometheus, and Fluentd, which makes it easier to migrate to OpenTelemetry seamlessly.

Limits of OpenTelemetry

From the above analysis, you can see that OpenTelemetry positions itself as the infrastructure for observability. It solves the problems of data specification and data acquisition, and leaves everything downstream to vendors. Ideally, a unified engine would store all Metrics, Logging, and Tracing data, and a unified platform would analyze, display, and correlate it. Currently, no vendor offers a unified backend that supports OpenTelemetry well, so we still need to combine products from several vendors. This in turn makes data correlation more complex, because the correlation across vendors has to be handled separately. We expect this problem to be solved within one to two years, as many vendors are now working on unified solutions for all OpenTelemetry data types.

Future Direction of Observability

Our team has been responsible for monitoring, logging, distributed tracing, and other observability work since we started the Apsara 5K project in 2009. We have lived through architecture changes from minicomputers to distributed systems and then to microservices and cloud services, and the observability solutions have evolved along the way. We believe that the overall development of observability closely matches the levels defined for autonomous driving.

Autonomous driving is divided into six levels. At Levels 0 to 2, decisions are made mainly by the human driver. From Level 3 upward, the driver can temporarily take their hands and eyes off the driving task. At Level 5, people are completely freed from the chore of driving and can move about freely in the car.

The observability of an IT system can also be divided into six levels:

Level 0: Manual analysis. Alerting and analysis are done manually using basic dashboards, alerts, log queries, distributed tracing, and similar tools. This is where most companies are today.
Level 1: Intelligent alerts. All observability data is scanned automatically, and anomalies are identified and alerts triggered through machine learning, so there is no need to manually set and tune baseline alerts.
Level 2: Exception association and a unified view. The automatically identified exceptions are correlated to form a unified business view, which makes it easier to locate problems quickly.
Level 3: Root cause analysis and problem self-repairing. The root cause of a problem is located automatically based on the exceptions and the Configuration Management Database (CMDB) information of the system. Once the root cause is accurately located, the problem can be self-repaired. This stage is a qualitative leap: in some scenarios, self-repairing can be achieved without human involvement.
Level 4: Fault prediction. Faults always cause losses, so the best outcome is to avoid them. Fault prediction uses accumulated information about past faults to predict new ones and thereby better ensure system reliability.
Level 5: Change impact prediction. Most faults are caused by changes. If we can simulate the impact of each change on the system and the problems it may cause, we can evaluate in advance whether the change should be allowed.
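
To make the step from Level 0 to Level 1 concrete, the sketch below flags anomalies against a baseline learned from recent data instead of a hand-set threshold. It uses a simple trailing-window z-score, which is a plain statistical stand-in for the machine learning mentioned above. The window size, the z-score threshold, and the sample latencies are all assumptions.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, z_threshold=3.0):
    """Return indexes whose value deviates from the trailing window by more than z_threshold standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


latency_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101, 99]
print(detect_anomalies(latency_ms))  # prints [7]
```

The baseline adapts as the window slides, so nobody has to pick a fixed latency limit per service. A production system would also need to handle seasonality and avoid learning from the anomalies themselves, which this sketch does not attempt.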

Observability Related Work of Alibaba Cloud Log Service

Log Service (SLS) is currently working on cloud-native observability. Based on OpenTelemetry, the future cloud-native observability standard, SLS collects all types of observability data, covers all kinds of data sources and data types, and provides multi-language support, multi-device support, and a unified data type. We will provide unified storage and computing capabilities for all kinds of observability data, including PB-level storage, Extract, Transform, and Load (ETL), stream processing, and analysis of tens of billions of data records within seconds. This gives upper-layer algorithms strong computing power.

IT system problems are very complex, especially when different scenarios and architectures are involved. Therefore, we combine algorithms with experience for anomaly analysis. The algorithms include basic statistics and logical algorithms as well as Algorithmic IT Operations (AIOps) algorithms. Experience includes manually entered expert knowledge as well as problems, solutions, and events accumulated on the internet. The top layer provides functions that assist decision-making, such as alert notifications, data visualization, and webhooks.

In addition, the top layer provides rich external integration capabilities, such as integration with third-party visualization, analysis or alerting systems. It also provides OpenAPI to facilitate integration among different applications.

Summary

As the most active CNCF project after Kubernetes, OpenTelemetry has received attention from major cloud vendors and related solution companies. We believe that it will become the observability standard for cloud-native environments. Although it is not yet production-ready, the Software Development Kits (SDKs) for various languages and the Collector are basically stable, and a production-ready version is expected in 2021, which is worth looking forward to.

OpenTelemetry only defines the first half of observability. A lot of complicated work remains, so there is still a long way to go.
