Container Service for Kubernetes: Fluid dashboard parameters

Last Updated: Nov 22, 2024

This topic describes the variables and panels in the Fluid control plane dashboard and the Fluid JindoRuntime cache dashboard. The dashboard variables let you filter Fluid observability metrics by different dimensions, such as the monitoring period and the namespace and name of a dataset. The panels help you understand the health and performance of Fluid components. You can use the dashboards to detect and resolve potential problems at the earliest opportunity and to identify optimization opportunities for the cache system in specific business scenarios.

Prerequisites

Managed Service for Prometheus is enabled for the Fluid component. For more information, see Step 2: View the Fluid dashboard.
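
If you want to verify that Fluid metrics are being collected before you open the dashboard, you can run an instant query against the Prometheus HTTP API. The following Python sketch is a minimal example, assuming a reachable Prometheus query endpoint and that the Fluid control plane runs in the fluid-system namespace; replace the placeholder URL and adjust the label selector to match your Managed Service for Prometheus instance.

```python
import requests

# Placeholder endpoint: replace with the HTTP API URL of your Managed Service
# for Prometheus instance (or any Prometheus endpoint that scrapes Fluid).
PROM_URL = "http://<your-prometheus-endpoint>"

def instant_query(promql: str):
    """Run an instant query against the standard Prometheus HTTP API."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Assumed label selector: the Fluid control plane components are commonly
# deployed in the fluid-system namespace; adjust this to your cluster.
for series in instant_query('up{namespace="fluid-system"}'):
    print(series["metric"].get("pod", series["metric"]), "=>", series["value"][1])
```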

Fluid control plane dashboard

Dashboard variables

The monitoring data displayed in the panels of the Fluid dashboard varies based on the values of the dashboard variables. You can modify the values of the variables based on your business requirements. For example, after you change the value of the runtime variable from AlluxioRuntime to JindoRuntime, all runtime-related panels in the dashboard switch to display data about JindoRuntime.

Variable: interval
Valid values: 1m, 5m, 10m, 30m, 1h, and 6h
Description: The duration of a monitoring cycle.

Variable: quantile
Valid values: 0.5, 0.75, 0.90, 0.95, and 0.99
Description: The quantile that certain panels use when they visualize metrics. For example, a value of 0.90 indicates P90.

Variable: runtime
Valid values: JindoRuntime, AlluxioRuntime, and JuiceFSRuntime
Description: The type of runtime used in Fluid. After you modify the value of this variable, the change applies to all runtime-related panels.

  • JindoRuntime: the execution engine of JindoFS, which is developed by the Alibaba Cloud Elastic MapReduce (EMR) team. JindoRuntime is based on C++, provides dataset management and caching, and supports Object Storage Service (OSS).

  • AlluxioRuntime: the execution engine of open source Alluxio. AlluxioRuntime supports dataset management and caching and accelerates access to persistent volume claims (PVCs), Ceph, and Cloud Parallel File System (CPFS). You can use AlluxioRuntime in hybrid cloud scenarios.

  • JuiceFSRuntime: a distributed cache acceleration engine based on JuiceFS. JuiceFSRuntime supports scenario-specific data caching and acceleration. For more information about JuiceFS, see Introduction to JuiceFS.
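
The runtime variable maps to the runtime object that is bound to each dataset in your cluster. The following Python sketch is a minimal example, not a production configuration: it creates a Dataset backed by an OSS bucket and a matching JindoRuntime by using the Kubernetes Python client. The OSS mount point, namespace, replica count, and cache quota are placeholder values, and the field layout is based on the commonly documented data.fluid.io/v1alpha1 CRDs, so verify it against the CRD version installed in your cluster.

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod
api = client.CustomObjectsApi()

GROUP, VERSION, NAMESPACE = "data.fluid.io", "v1alpha1", "default"

# Minimal Dataset backed by an OSS bucket (placeholder mount point).
dataset = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "Dataset",
    "metadata": {"name": "demo-dataset", "namespace": NAMESPACE},
    "spec": {"mounts": [{"mountPoint": "oss://<your-bucket>/<path>", "name": "demo"}]},
}

# Matching JindoRuntime with a small in-memory cache tier.
jindo_runtime = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "JindoRuntime",
    "metadata": {"name": "demo-dataset", "namespace": NAMESPACE},  # must match the Dataset name
    "spec": {
        "replicas": 2,
        "tieredstore": {"levels": [{"mediumtype": "MEM", "quota": "2Gi"}]},
    },
}

api.create_namespaced_custom_object(GROUP, VERSION, NAMESPACE, "datasets", dataset)
api.create_namespaced_custom_object(GROUP, VERSION, NAMESPACE, "jindoruntimes", jindo_runtime)
```

After the runtime is bound to the dataset, set the runtime variable to JindoRuntime to display the corresponding runtime-related panels.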

Panels

Panel group: Component running status

  • Dataset Controller Ready Replicas: The number of dataset controller pods that are in the Running state in the cluster.

  • History of Dataset controller restarts: The number of times that the dataset controller pods have restarted in the cluster.

  • Runtime Number of ready copies of controller: The number of runtime controller pods that are in the Running state in the cluster.

  • History Runtime Controller Restart Times: The number of times that the runtime controller pods have restarted in the cluster.

  • Fluid Webhook ready copies: The number of Fluid webhook pods that are in the Running state in the cluster.

  • Number of historical fluid Webhook restarts: The number of times that the Fluid webhook pods have restarted in the cluster.

  • Fluid CSI Plug-in Ready Copies: The number of Fluid CSI plug-in pods that are in the Running state in the cluster.

  • Historical Fluid CSI plug-in restarts: The number of times that the Fluid CSI plug-in pods have restarted in the cluster.

  • Fluid Component Restart: The top five Fluid components that have restarted most frequently within a monitoring cycle of 2 minutes.

Panel group: Fluid Controller Detailed Indicator

  • Runtime Controller processing time: The amount of time that the runtime controller spends handling runtime resources within a monitoring cycle. The panel displays percentile values. See the query sketch after this list for how the interval and quantile variables are substituted.

  • Number of Runtime controller processing failures: The types and number of failures that occur when the runtime controller handles runtime resources. The following types of failures are displayed: runtime deployment failures and runtime health check failures.

  • Runtime Number of controller threads: The current number of active threads of the runtime controller and the maximum number of threads supported by the runtime controller.

  • DataLoad Controller Threads: The current number of active threads of the DataLoad controller and the maximum number of threads supported by the DataLoad controller.

  • Controller Queue Length: The length of each Fluid controller workqueue in the cluster.

  • Total number of Kubernetes API requests: The total number of requests sent by the Fluid controller pods to the Kubernetes API server within a monitoring cycle.

  • Runtime Controller Kubernetes API requests: The number of requests sent by the runtime controller to the Kubernetes API server within a monitoring cycle. The requests are classified and displayed by the returned status code.

  • Total time consumed by unfinished processing of controller: The total amount of time that each Fluid controller spends on ongoing tasks.

Panel group: Fluid Webhook Detailed Indicators

  • Fluid Webhook Pod CPU Usage: The CPU utilization of each Fluid webhook pod within a monitoring cycle.

  • Fluid Webhook Pod Memory Usage: The memory usage of each Fluid webhook pod within a monitoring cycle.

  • Total number of requests processed in Fluid Webhook: The total number of requests processed by the Fluid webhook within a monitoring cycle.

  • The number of requests processed in each Fluid Webhook Pod: The number of requests processed by each Fluid webhook pod within a monitoring cycle.

  • Fluid Webhook Request Processing Delay: The request processing latency of the Fluid webhook within a monitoring cycle. The latency is a percentile value.

  • Request processing delay of each Fluid Webhook Pod: The request processing latency of each Fluid webhook pod within a monitoring cycle. The latency is a percentile value.

Panel group: Resource usage

  • CPU usage: The CPU utilization of each Fluid controller pod within a monitoring cycle.

  • Memory usage: The memory usage of each Fluid controller pod within a monitoring cycle.

  • Network Send Rate per Pod: The network transmit rate of each Fluid controller pod within a monitoring cycle.

  • Network Receive Rate per Pod: The network receive rate of each Fluid controller pod within a monitoring cycle.
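
The percentile panels in this list, such as Runtime Controller processing time and Fluid Webhook Request Processing Delay, combine the interval and quantile variables in a histogram quantile query. The following Python sketch shows the general pattern, assuming the same placeholder Prometheus endpoint as in the prerequisites example; the metric name controller_runtime_reconcile_time_seconds_bucket is an assumption based on the standard controller-runtime histogram and may differ from the metric that the dashboard actually queries.

```python
import requests

PROM_URL = "http://<your-prometheus-endpoint>"  # placeholder, see the prerequisites sketch

# Values that correspond to the interval and quantile dashboard variables.
INTERVAL = "5m"
QUANTILE = 0.90  # P90

# Assumed metric: the standard controller-runtime reconcile-latency histogram.
promql = (
    f"histogram_quantile({QUANTILE}, "
    f"sum(rate(controller_runtime_reconcile_time_seconds_bucket[{INTERVAL}])) "
    f"by (controller, le))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("controller", "all"), "=>", series["value"][1], "seconds")
```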

Fluid JindoRuntime cache dashboard

Dashboard variables

The Fluid JindoRuntime cache dashboard allows you to select a specific dataset based on the dashboard variables and view the relevant metrics of the JindoRuntime cache system associated with that dataset.

Variable: namespace
Description: The namespace in the cluster that contains the Fluid dataset.

Variable: fluid_dataset
Description: The name of the Fluid dataset in the cluster.
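
To see which values are available for the namespace and fluid_dataset variables, you can list the Dataset objects in the cluster. The following Python sketch uses the Kubernetes Python client and assumes that the Fluid CRDs are installed under the data.fluid.io/v1alpha1 API group and that your kubeconfig points at the target cluster.

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod
api = client.CustomObjectsApi()

# List all Fluid Dataset objects; their namespaces and names are the values
# that the namespace and fluid_dataset dashboard variables can take.
datasets = api.list_cluster_custom_object("data.fluid.io", "v1alpha1", "datasets")
for item in datasets.get("items", []):
    meta = item["metadata"]
    print(f'namespace={meta["namespace"]}  fluid_dataset={meta["name"]}')
```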

Panels

Panel group: Dataset Overview

  • Ready Pod Num: The number of ready pods in each component of the selected cache system, including the master, worker, and FUSE components of the cache system.

  • Pod Overview: The basic information about the pods in each component of the selected cache system, including the number of restarts in the last hour, CPU resource requests and limits, and memory resource requests and limits.

Panel group: Cache System Metrics

  • Cache Capacity Usage (%): The proportion of the cache capacity that is used by the selected cache system.

  • Cache Capacity Usage: The maximum available cache capacity and the current capacity usage of the selected cache system.

  • Cache Hit Ratio Per Minute: The data access cache hit ratio of the selected cache system per minute.

  • Read Bytes Per Minute: The amount of data read per minute from the selected cache system, including the amount of data served from the cache on cache hits (Cache Hit) and the amount of data read from the backend storage on cache misses (From Backend).

  • Cache System Aggregated Bandwidth: The aggregate bandwidth provided by the selected cache system for the application. The aggregate bandwidth is the sum of the outbound traffic of each network interface controller of the worker pods. If the worker pods run on the host network, the value may be inflated. To obtain the actual aggregate bandwidth of the cache system, make sure that the worker pods run on the container network. See the query sketch at the end of this topic for one way to reproduce this panel.

  • Cache Worker Pod Network I/O: The network I/O status of each worker pod in the selected cache system. If the worker pods run on the host network, the value may be inflated. To obtain the actual aggregate bandwidth data of the cache system, make sure that the worker pods run on the container network.

  • Cache System Pod Memory Usage: The memory usage of the master and worker pods in the selected cache system. If the process memory of the worker pod is used as the cache medium, the cache capacity occupied by each worker component is included in the pod memory usage.

  • Cache System Pod CPU Usage by Cores: The CPU usage of the master and worker pods in the selected cache system.

  • Aggregated File Operation Requests: The request frequency of file metadata operations counted by the selected cache system. Only the GetAttr and ReadDir metadata operations are counted.

Panel group: FUSE Metrics (via CSI)

  • FUSE Network I/O: The network I/O status of each FUSE pod in the selected cache system. If a FUSE pod runs on the host network, the value may be inflated. To obtain the actual aggregate bandwidth data of the cache system, make sure that the FUSE pods run on the container network.

  • FUSE Memory Usage/Limit (%): The percentage of the current memory usage of each FUSE pod relative to its memory resource limit in the selected cache system. If no memory resource limit is specified for the FUSE pods, the value is left empty.

  • FUSE CPU Throttled Percent: The percentage of CPU throttling in each FUSE pod in the selected cache system. If no CPU resource limit is specified for the FUSE pods, the value is left empty.

  • Meta Ops Per Second: The frequency of file metadata operations per second for each FUSE pod in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are counted.

  • Meta Ops P99 Latency: The P99 latency of metadata operations on each FUSE pod in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are counted.

  • Read/Write Ops Per Second: The frequency of file read and write operations per second for each FUSE pod in the selected cache system.

  • Read/Write Ops P99 Latency: The P99 latency of file read and write operations for each FUSE pod in the selected cache system.

Panel group: FUSE Metrics (via Sidecar)

  • FUSE Memory Usage/Limit (%): The percentage of the current memory usage of each FUSE sidecar container relative to its memory resource limit in the selected cache system. If no memory resource limit is specified for the FUSE sidecar containers, the value is left empty.

  • FUSE CPU Throttled Percent: The percentage of CPU throttling in each FUSE sidecar container in the selected cache system. If no CPU resource limit is specified for the FUSE sidecar containers, the value is left empty.

  • Meta Ops Per Second: The frequency of file metadata operations per second for each FUSE sidecar container in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are counted.

  • Meta Ops P99 Latency: The P99 latency of metadata operations on each FUSE sidecar container in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are counted.

  • Read/Write Ops Per Second: The frequency of file read and write operations per second for each FUSE sidecar container in the selected cache system.

  • Read/Write Ops P99 Latency: The P99 latency of file read and write operations for each FUSE sidecar container in the selected cache system.
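
As noted in the Cache System Aggregated Bandwidth panel description, the aggregate bandwidth is the sum of the outbound traffic of the worker pods. The following Python sketch approximates that panel with the standard cAdvisor metric container_network_transmit_bytes_total; the Prometheus endpoint, the namespace, and the worker pod name pattern (<dataset>-jindofs-worker-.*) are assumptions that you should adjust to your cluster, and the query only reflects real traffic when the worker pods run on the container network.

```python
import requests

PROM_URL = "http://<your-prometheus-endpoint>"  # placeholder

NAMESPACE = "default"        # corresponds to the namespace dashboard variable
DATASET = "demo-dataset"     # corresponds to the fluid_dataset dashboard variable

# Assumed pod name pattern for JindoRuntime worker pods; verify the actual
# pod names in your cluster before relying on this selector.
promql = (
    "sum(rate(container_network_transmit_bytes_total{"
    f'namespace="{NAMESPACE}", pod=~"{DATASET}-jindofs-worker-.*"'
    "}[1m]))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    print("aggregated bandwidth:", float(result[0]["value"][1]), "bytes/second")
else:
    print("no matching worker pods or no network metrics scraped")
```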