Service monitoring is becoming increasingly important as cloud-native architectures and microservices gain adoption. Even in a medium-sized microservice deployment, O&M staff can no longer rely on logs alone to reconstruct a request's call path and the execution time of each service along that path, let alone locate and analyze the root causes of service exceptions. R&D and O&M staff need a monitoring tool that restores each request's service call path and per-service execution time and displays them graphically. Distributed tracing systems were born for this purpose.
In recent years, vendors have introduced various commercial products for monitoring cloud-native applications. These systems are usually called Application Performance Monitoring (APM) solutions. Commercial companies in China that offer such products include Alibaba Cloud ARMS, Tingyun, BriAir, and Cloudwise; foreign vendors with quality APM solutions include AppDynamics and Dynatrace. Their products cover a wide range of scenarios. There are also excellent open-source APM projects, such as CNCF Jaeger, Apache SkyWalking, CAT, and Pinpoint. As a graduated CNCF project, Jaeger has become a preferred tracing solution for O&M staff in cloud-native scenarios.
Jaeger was created at Uber in 2015. In 2017, it was accepted as an incubating project by the Cloud Native Computing Foundation (CNCF), and in 2019 it graduated. The following figure shows the Jaeger architecture. The figure contains two deployment architectures; they are largely the same, except that the second adds Kafka as a buffer to absorb traffic peaks. Jaeger consists of the Client, Agent, Collector, DB, UI, and other components. In addition, Jaeger supports a variety of storage backends, including memory, Badger, Cassandra, Elasticsearch, and gRPC plug-ins.
Let's discuss the powerful yet often-overlooked gRPC plug-in. Simply put, the gRPC plug-in provides the capability to export Trace data out of the Jaeger system. With this capability, developers can connect Traces to back-end services capable of storing and analyzing Trace data. These services can perform secondary analysis and processing on Traces, such as root cause analysis of exceptions, anomaly detection, and alerting, to help O&M and development staff find and locate potential problems in the system.
To better understand how to develop a Jaeger plug-in, you need to understand its underlying implementation. The Jaeger gRPC plug-in is implemented on top of the HashiCorp go-plugin framework. Next, we will introduce Go Plug-in and its development process.
HashiCorp's Go Plug-in is open source. Its design follows the open-closed principle: the upper-layer business logic is fixed behind interfaces, and the business is extended by dispatching to different RPC service implementations. Currently, Go Plug-in supports two plug-in types: the RPC plug-in and the gRPC plug-in. The two differ in their underlying call mechanism: the RPC plug-in communicates through net/rpc, while the gRPC plug-in communicates through gRPC calls. Both plug-in types provide two methods, Server and Client. The Server method acts as the server-side stub; after the server receives a request, it invokes the actual implementation of the interface on the server side. The Client method acts as a factory method that produces the interface implementation object used by the client.
During startup, Go Plug-in launches a subprocess, which in turn starts the RPC/gRPC service. The main process then reaches the plug-in through the RPC/gRPC interface. Go Plug-in supports the coexistence of multiple service versions, which we will discuss later, but it does not provide any high-availability mechanism for plug-in services. Let's move on to the development process of the Go Plug-in.
The following content walks through the KV example shipped with Go Plug-in. The KV example defines two methods, Put and Get, and supports multiple protocol versions; this article uses the gRPC version.
// KV is the business interface defined by the KV plug-in.
type KV interface {
	Put(key string, value []byte) error
	Get(key string) ([]byte, error)
}
// GRPCClient is the client-side implementation of the KV interface.
// It encapsulates the gRPC client generated from the .proto file.
type GRPCClient struct {
	client proto.KVClient
}

func (m *GRPCClient) Put(key string, value []byte) error {
	// Call the KV gRPC service interface
	_, err := m.client.Put(context.Background(), &proto.PutRequest{
		Key:   key,
		Value: value,
	})
	return err
}

func (m *GRPCClient) Get(key string) ([]byte, error) {
	// Call the KV gRPC service
	resp, err := m.client.Get(context.Background(), &proto.GetRequest{
		Key: key,
	})
	if err != nil {
		return nil, err
	}
	return resp.Value, nil
}
// GRPCServer is the server-side stub that implements the generated
// KV gRPC service and delegates to the real KV implementation.
type GRPCServer struct {
	Impl KV
}

func (m *GRPCServer) Put(ctx context.Context, req *proto.PutRequest) (*proto.Empty, error) {
	// After receiving the request, call the server-side implementation of the interface.
	return &proto.Empty{}, m.Impl.Put(req.Key, req.Value)
}

func (m *GRPCServer) Get(ctx context.Context, req *proto.GetRequest) (*proto.GetResponse, error) {
	// After receiving the request, call the server-side implementation of the interface.
	v, err := m.Impl.Get(req.Key)
	return &proto.GetResponse{Value: v}, err
}
// KV is the concrete business implementation of the interface. (In the
// example it lives in the plug-in executable's own package, so its name
// does not clash with the KV interface in the shared package.)
type KV struct{}

func (KV) Put(key string, value []byte) error {
	// Specific business implementation goes here.
	return nil
}

func (KV) Get(key string) ([]byte, error) {
	// Specific business implementation goes here.
	return nil, nil
}
// KVGRPCPlugin implements the go-plugin GRPCPlugin interface.
type KVGRPCPlugin struct {
	plugin.Plugin
	Impl KV // Implementation of the KV interface
}

func (p KVGRPCPlugin) GRPCClient(ctx context.Context, broker *plugin.GRPCBroker, c *grpc.ClientConn) (interface{}, error) {
	// Note: what is returned is the client-side implementation of the interface.
	return &GRPCClient{client: proto.NewKVClient(c)}, nil
}

func (p KVGRPCPlugin) GRPCServer(broker *plugin.GRPCBroker, s *grpc.Server) error {
	// Register the gRPC service
	proto.RegisterKVServer(s, &GRPCServer{Impl: p.Impl})
	return nil
}
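For reference, the proto.KVClient used above is generated by protoc from the example's .proto file. A sketch of what the generated Go client interface looks like is shown below (the names follow the go-plugin KV example; treat the exact signatures as assumptions about the generated code):
// Sketch of the protoc-generated client interface behind proto.KVClient.
type KVClient interface {
	Get(ctx context.Context, in *GetRequest, opts ...grpc.CallOption) (*GetResponse, error)
	Put(ctx context.Context, in *PutRequest, opts ...grpc.CallOption) (*Empty, error)
}
GRPCClient simply adapts this transport-level interface to the plain KV business interface, which is what keeps the gRPC details invisible to the plug-in's callers.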
The previous section described how to develop the plug-in. In this section, let's look at how to use it. Usage is divided into the plug-in server side and the plug-in client side.
As mentioned above, Go Plug-in starts a local subprocess during startup. This subprocess is the plug-in server, which must be an executable file containing a main method. Let's take a look at how the plug-in server is used.
1. Write a main function and register the plug-in, together with its interface implementation, in Go Plug-in as follows:
func main() {
	plugin.Serve(&plugin.ServeConfig{
		// HandshakeConfig carries the version and authentication information.
		HandshakeConfig: shared.Handshake,
		Plugins: map[string]plugin.Plugin{
			// The name of the plug-in
			"kv_grpc": &shared.KVGRPCPlugin{Impl: &KV{}},
		},
		GRPCServer: plugin.DefaultGRPCServer,
	})
}
2. Use go build to compile it into an executable file (for example, go build -o kv-plugin).
The plug-in client workflow consists of creating the plug-in client, starting the plug-in server, obtaining the plug-in's interface implementation, and calling the service interface:
client := plugin.NewClient(&plugin.ClientConfig{
	// HandshakeConfig carries the plug-in version and authentication information.
	HandshakeConfig: shared.Handshake,
	// The mapping between plug-in names and plug-in instances
	Plugins: shared.PluginMap,
	// The path of the plug-in executable file goes here
	Cmd: exec.Command("sh", "-c", os.Getenv("KV_PLUGIN")),
	// The protocols supported by the plug-in
	AllowedProtocols: []plugin.Protocol{plugin.ProtocolGRPC, plugin.ProtocolNetRPC},
})
// Obtain the protocol client of the plug-in. In this step, go-plugin starts the
// subprocess specified by Cmd and verifies the plug-in version and authentication
// information. (Error handling is omitted for brevity.)
rpcClient, err := client.Client()
// Obtain the client-side object of the interface
raw, err := rpcClient.Dispense("kv_grpc")
kv := raw.(shared.KV)
// Execute the command
result, err := kv.Get(os.Args[1])
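Both snippets reference a shared package that the host and the plug-in import. As a minimal sketch, assuming the names used in the go-plugin KV example (Handshake and PluginMap), its contents look roughly like this:
// Handshake lets the host and the plug-in verify that they speak the same
// protocol version. The magic cookie is a basic sanity check to avoid
// launching an unrelated executable, not a security mechanism.
var Handshake = plugin.HandshakeConfig{
	ProtocolVersion:  1,
	MagicCookieKey:   "BASIC_PLUGIN",
	MagicCookieValue: "hello",
}

// PluginMap maps plug-in names to plug-in instances for the client side.
var PluginMap = map[string]plugin.Plugin{
	"kv_grpc": &KVGRPCPlugin{},
}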
We know that Jaeger has already implemented the plug-in client, the plug-in server, and the client side of the interface. To finish developing a gRPC plug-in, we only need to implement the server side of the interface. Jaeger reserves two plug-in interfaces in the gRPC plug-in: the Store plug-in and the ArchiveStore plug-in. The difference between the two is that the DependencyReader interface, which is used to query the dependencies between services, is defined only in the Store plug-in. Both plug-in interfaces expose the SpanReader and SpanWriter interfaces for the read and write operations of Trace/Span, whose methods are listed below (a sketch of a plug-in entry point follows the list).
// Read all operation names
func GetOperations(ctx context.Context, query spanstore.OperationQueryParameters) ([]spanstore.Operation, error)
// Read all service names
func GetServices(ctx context.Context) ([]string, error)
// Find Traces that meet the query conditions
func FindTraces(ctx context.Context, query *spanstore.TraceQueryParameters) ([]*model.Trace, error)
// Find Trace IDs that meet the query conditions
func FindTraceIDs(ctx context.Context, query *spanstore.TraceQueryParameters) ([]model.TraceID, error)
// Obtain the details of a specific Trace by Trace ID
func GetTrace(ctx context.Context, traceID model.TraceID) (*model.Trace, error)
// Write a Span
func WriteSpan(ctx context.Context, span *model.Span) error
// Read the dependencies between services, which are used to draw the service topology diagram (DAG)
func GetDependencies(ctx context.Context, endTs time.Time, lookback time.Duration) ([]model.DependencyLink, error)
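Putting the pieces together, a minimal entry point for a Jaeger gRPC storage plug-in looks roughly like the sketch below. Here, myStore is a hypothetical type whose SpanReader, SpanWriter, and DependenciesReader methods return implementations of the interfaces listed above; the exact grpc.Serve and shared.PluginServices signatures may differ slightly across Jaeger versions.
package main

import (
	"github.com/jaegertracing/jaeger/plugin/storage/grpc"
	"github.com/jaegertracing/jaeger/plugin/storage/grpc/shared"
)

func main() {
	// myStore is a hypothetical storage backend implementing the Jaeger
	// storage plug-in interfaces (SpanReader/SpanWriter/DependenciesReader).
	store := &myStore{}
	// grpc.Serve starts the go-plugin server so that the Jaeger binaries
	// can launch this executable and talk to it over gRPC.
	grpc.Serve(&shared.PluginServices{
		Store: store,
	})
}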
SLS provides a unified storage and analysis solution for distributed tracing. It currently supports ingesting various types of Trace data, such as Jaeger, Apache SkyWalking, OpenTelemetry, and Zipkin. For more information, see the SLS Trace documentation.
This article does not walk through the code of the SLS Jaeger plug-in. The plug-in is open source on GitHub at https://github.com/aliyun/aliyun-log-jaeger; stars are welcome. The repository provides a demo that you can run with one click, along with usage documentation. The rest of this article looks at the significance behind developing the Jaeger plug-in.
With the plug-in development complete, we should also consider what benefits the plug-in brings. Trace data collection is only the beginning of system monitoring; mining the information hidden in Trace data is the most crucial capability when building a monitoring system. While exploiting the informational value that Trace brings us, we should also think about how to sustain that capability.
Heinrich's Law states that for every accident that causes a major injury, there are 29 accidents that cause minor injuries, 300 accidents that cause no injuries, and 1,000 near misses. It tells us that every safety accident seems accidental but is in fact the inevitable result of various factors accumulating to a certain degree. A business system produces a large amount of Trace data every day. Without analysis, we cannot even tell the running state of the system from this data, let alone uncover its hidden problems. This is where the system needs analysis capabilities for big-data scenarios. SLS is a platform for log services: it processes petabytes of logs every day and provides a series of log analysis operators and processing tools that help users analyze the issues behind their systems. A Trace can be understood as a special kind of log that carries an associated context (TraceID, ParentSpanID, SpanID), so we believe SLS can process Trace logs with ease.
As an observability/monitoring component, Jaeger is an important data source for locating and discovering business system problems. We need to ensure that the monitoring system outlives the business system it monitors: if the monitoring system goes down before the business system does, monitoring during that downtime is meaningless. As an open-source project, Jaeger provides the software but does not provide deployment-scale evaluation guidance to guarantee high service availability. How, then, can we provide high-availability, high-performance back-end services? Who provides the last layer of protection for the monitoring system? As a cloud service, SLS is characterized by high performance, elasticity, and freedom from O&M, allowing users to easily cope with traffic surges or inaccurate scale estimates. The SLS service itself provides 99.9% availability and 99.999999999% data reliability.
Building a complete monitoring system requires not only availability of the monitoring system itself but also strong analysis capabilities, which help O&M staff quickly find and locate faults and improve system availability. The Jaeger plug-in gives us the extensibility to connect to a variety of analysis systems. This extensibility lets monitoring experts provide professional analysis capabilities while O&M and development teams focus on business operations.
References
1. https://github.com/jaegertracing/jaeger/issues/422
2. https://www.jaegertracing.io/docs/1.24/architecture/
3. https://github.com/hashicorp/go-plugin
Other References
Using Log Service Trace to Implement a Reliable Deployment Solution for Jaeger