This topic describes the variables and panels in the Fluid control plane dashboard and the Fluid JindoRuntime cache dashboard. The dashboard variables select the dimensions by which Fluid observability metrics are displayed, such as the duration of a monitoring cycle and the namespace and name of a dataset. The panels help you understand the health and performance of Fluid components, so that you can detect and resolve issues early and identify opportunities to optimize the cache system for specific business scenarios.
Prerequisites
Managed Service for Prometheus is enabled for the Fluid component. For more information, see Step 2: View the Fluid dashboard.
Fluid control plane dashboard
Dashboard variables
The monitoring data displayed in the panels of the Fluid dashboard varies based on the values of the dashboard variables. You can modify the values of the variables based on your business requirements. For example, after you change the value of the Runtime variable from AlluxioRuntime to JindoRuntime, all related panels in the dashboard switch to display data related to JindoRuntime.
Variable | Valid value | Description |
interval | 1m, 5m, 10m, 30m, 1h, and 6h | The duration of a monitoring cycle. |
quantile | 0.5, 0.75, 0.90, 0.95, and 0.99 | The quantile used by certain panels when the panels visualize metrics. For example, a value of 0.90 indicates P90. |
runtime | For example, AlluxioRuntime or JindoRuntime | The type of runtime used in Fluid. After you modify the value of this variable, the change applies to all runtime-related panels. |
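To illustrate what the quantile variable selects, the following Python sketch computes percentile values from samples collected within one monitoring cycle. It is a minimal illustration under stated assumptions, not the query that the dashboard runs: the function name `nearest_rank_quantile` and the sample values are hypothetical, and the nearest-rank method is used only for simplicity. The dashboard may estimate quantiles differently, for example from histogram buckets.

```python
import math


def nearest_rank_quantile(samples, q):
    """Return the q-quantile (0 < q <= 1) of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    ordered = sorted(samples)
    # 1-based rank of the smallest sample that covers a fraction q of all samples.
    rank = math.ceil(q * len(ordered))
    return ordered[rank - 1]


# Hypothetical processing times (in seconds) observed within one monitoring cycle.
processing_times = [0.12, 0.15, 0.11, 0.95, 0.14, 0.13, 0.18, 0.16, 0.17, 0.10]

# The same values that the quantile variable accepts.
for q in (0.5, 0.75, 0.90, 0.95, 0.99):
    value = nearest_rank_quantile(processing_times, q)
    print(f"P{round(q * 100)}: {value:.2f}s")
```

In this sample, a single slow request dominates the high percentiles while the median stays low, which is why panels that display percentile values are worth checking at more than one quantile.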
Panels
Panel group | Panel | Description |
Component running status | Dataset Controller Ready Replicas | The number of dataset controller pods that are in the Running state in the cluster. |
Component running status | History of Dataset controller restarts | The number of times that the dataset controller pods have restarted in the cluster. |
Component running status | Runtime Number of ready copies of controller | The number of runtime controller pods that are in the Running state in the cluster. |
Component running status | History Runtime Controller Restart Times | The number of times that the runtime controller pods have restarted in the cluster. |
Component running status | Fluid Webhook ready copies | The number of Fluid webhook pods that are in the Running state in the cluster. |
Component running status | Number of historical fluid Webhook restarts | The number of times that the Fluid webhook pods have restarted in the cluster. |
Component running status | Fluid CSI Plug-in Ready Copies | The number of Fluid CSI plug-in pods that are in the Running state in the cluster. |
Component running status | Historical Fluid CSI plug-in restarts | The number of times that the Fluid CSI plug-in pods have restarted in the cluster. |
Component running status | Fluid Component Restart | The top five Fluid components that have restarted most frequently within a monitoring cycle of 2 minutes. |
Fluid Controller Detailed Indicator | Runtime Controller processing time | The amount of time that the runtime controller spends handling runtime resources within a monitoring cycle. The panel displays percentile values. |
Fluid Controller Detailed Indicator | Number of Runtime controller processing failures | The types and number of failures that occur when the runtime controller handles runtime resources. Failures are displayed by type. |
Fluid Controller Detailed Indicator | Runtime Number of controller threads | The current number of active threads of the runtime controller and the maximum number of threads supported by the runtime controller. |
Fluid Controller Detailed Indicator | DataLoad Controller Threads | The current number of active threads of the DataLoad controller and the maximum number of threads supported by the DataLoad controller. |
Fluid Controller Detailed Indicator | Controller Queue Length | The length of each Fluid controller workqueue in the cluster. |
Fluid Controller Detailed Indicator | Total number of Kubernetes API requests | The total number of requests sent by the Fluid controller pods to the Kubernetes API server within a monitoring cycle. |
Fluid Controller Detailed Indicator | Runtime Controller Kubernetes API requests | The number of requests sent by the runtime controller to the Kubernetes API server within a monitoring cycle. The requests are classified and displayed by the returned status code. |
Fluid Controller Detailed Indicator | Total time consumed by unfinished processing of controller | The total amount of time that each Fluid controller has spent on ongoing tasks. |
Fluid Webhook Detailed Indicators | Fluid Webhook Pod CPU Usage | The CPU utilization of each Fluid webhook pod within a monitoring cycle. |
Fluid Webhook Detailed Indicators | Fluid Webhook Pod Memory Usage | The memory usage of each Fluid webhook pod within a monitoring cycle. |
Fluid Webhook Detailed Indicators | Total number of requests processed in Fluid Webhook | The total number of requests processed by the Fluid webhook within a monitoring cycle. |
Fluid Webhook Detailed Indicators | The number of requests processed in each Fluid Webhook Pod | The number of requests processed by each Fluid webhook pod within a monitoring cycle. |
Fluid Webhook Detailed Indicators | Fluid Webhook Request Processing Delay | The request processing latency of the Fluid webhook within a monitoring cycle. The latency is a percentile value. |
Fluid Webhook Detailed Indicators | Request processing delay of each Fluid Webhook Pod | The request processing latency of each Fluid webhook pod within a monitoring cycle. The latency is a percentile value. |
Resource usage | CPU usage | The CPU utilization of each Fluid controller pod within a monitoring cycle. |
Resource usage | Memory usage | The memory usage of each Fluid controller pod within a monitoring cycle. |
Resource usage | Network Send Rate per Pod | The network transmit rate of each Fluid controller pod within a monitoring cycle. |
Resource usage | Network Receive Rate per Pod | The network receive rate of each Fluid controller pod within a monitoring cycle. |
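The Fluid Component Restart panel ranks components by the number of restarts within a monitoring cycle. The following Python sketch shows one way such a ranking can be derived, assuming that restart counts are cumulative counters sampled at the start and end of the cycle. The function `top_restarts`, the pod names, and the counts are hypothetical and do not reflect the query that the dashboard actually uses.

```python
from typing import Dict, List, Tuple


def top_restarts(
    start: Dict[str, int], end: Dict[str, int], limit: int = 5
) -> List[Tuple[str, int]]:
    """Rank pods by the number of restarts between two counter samples."""
    increases = {}
    for pod, end_count in end.items():
        # A pod that first appears inside the window is counted from zero.
        start_count = start.get(pod, 0)
        # Clamp at zero: a lower end value would mean that the counter was reset.
        increases[pod] = max(end_count - start_count, 0)
    # Sort by restart count in descending order, then by name for a stable order.
    ranked = sorted(increases.items(), key=lambda item: (-item[1], item[0]))
    # Keep only the pods that actually restarted, up to the limit.
    return [(pod, count) for pod, count in ranked if count > 0][:limit]


# Hypothetical cumulative restart counts at the start and end of a 2-minute cycle.
counts_at_start = {
    "dataset-controller-0": 3,
    "jindoruntime-controller-0": 1,
    "fluid-webhook-0": 0,
    "csi-nodeplugin-0": 7,
}
counts_at_end = {
    "dataset-controller-0": 4,
    "jindoruntime-controller-0": 4,
    "fluid-webhook-0": 0,
    "csi-nodeplugin-0": 7,
}

for pod, restarts in top_restarts(counts_at_start, counts_at_end):
    print(f"{pod}: {restarts} restart(s)")
```

Ranking by the increase within the cycle rather than by the lifetime total keeps a pod with many old restarts, such as the hypothetical `csi-nodeplugin-0`, from hiding a component that is failing right now.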
Fluid JindoRuntime cache dashboard
Dashboard variables
The Fluid JindoRuntime cache dashboard allows you to select a specific dataset based on the dashboard variables and view the relevant metrics of the JindoRuntime cache system associated with that dataset.
Variable | Description |
namespace | The namespace that exists in the cluster. |
fluid_dataset | The name of the Fluid dataset that exists in the cluster. |
Panels
Panel group | Panel | Description |
Dataset Overview | Ready Pod Num | The number of ready pods of each component in the selected cache system, including the master, worker, and FUSE components. |
Dataset Overview | Pod Overview | The basic information about the pods of each component in the selected cache system, including the number of restarts in the last hour, CPU resource requests and limits, and memory resource requests and limits. |
Cache System Metrics | Cache Capacity Usage (%) | The proportion of cache capacity used by the selected cache system. |
Cache System Metrics | Cache Capacity Usage | The maximum available cache capacity and the current capacity usage of the selected cache system. |
Cache System Metrics | Cache Hit Ratio Per Minute | The data access cache hit rate of the selected cache system per minute. |
Cache System Metrics | Read Bytes Per Minute | The amount of data read per minute, as counted by the selected cache system, including the amount of data read on cache hits (Cache Hit) and the amount of data read on cache misses (From Backend). |
Cache System Metrics | Cache System Aggregated Bandwidth | The aggregate bandwidth provided by the selected cache system for the application. The aggregate bandwidth is the sum of the outbound traffic of each network interface controller of the worker pods. If the worker pods run on the host network, the value may be inflated. To obtain the actual aggregate bandwidth data of the cache system, make sure that the worker pods run on the container network. |
Cache System Metrics | Cache Worker Pod Network I/O | The network I/O status of each worker pod in the selected cache system. If the worker pods run on the host network, the value may be inflated. To obtain the actual network I/O data of the worker pods, make sure that the worker pods run on the container network. |
Cache System Metrics | Cache System Pod Memory Usage | The memory usage of the master and worker pods in the selected cache system. If the process memory of the worker pod is used as the cache medium, the cache capacity occupied by each worker component is included in the pod memory usage. |
Cache System Metrics | Cache System Pod CPU Usage by Cores | The CPU usage of the master and worker pods in the selected cache system. |
Cache System Metrics | Aggregated File Operation Requests | The request frequency of file metadata operations counted by the selected cache system. Only the GetAttr and ReadDir metadata operations are counted. |
FUSE Metrics (via CSI) | FUSE Network I/O | The network I/O status of each FUSE pod in the selected cache system. If a FUSE pod runs on the host network, the value may be inflated. To obtain the actual network I/O data of a FUSE pod, make sure that the FUSE pod runs on the container network. |
FUSE Metrics (via CSI) | FUSE Memory Usage/Limit (%) | The percentage of the current memory usage of each FUSE pod relative to the memory resource limit in the selected cache system. If no memory limit is specified for FUSE pods, the value is left empty. |
FUSE Metrics (via CSI) | FUSE CPU Throttled Percent | The percentage of CPU throttling in each FUSE pod in the selected cache system. If no CPU resource limit is specified for FUSE pods, the value is left empty. |
FUSE Metrics (via CSI) | Meta Ops Per Second | The frequency of file metadata operations per second for each FUSE pod in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are counted. |
FUSE Metrics (via CSI) | Meta Ops P99 Latency | The P99 latency of metadata operations on each FUSE pod in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are included. |
FUSE Metrics (via CSI) | Read/Write Ops Per Second | The frequency of file read and write operations per second for each FUSE pod in the selected cache system. |
FUSE Metrics (via CSI) | Read/Write Ops P99 Latency | The P99 latency of file read and write operations for each FUSE pod in the selected cache system. |
FUSE Metrics (via Sidecar) | FUSE Memory Usage/Limit (%) | The percentage of the current memory usage of each FUSE sidecar container relative to the memory resource limit in the selected cache system. If no memory resource limit is specified for the FUSE sidecar container, the value is left empty. |
FUSE Metrics (via Sidecar) | FUSE CPU Throttled Percent | The percentage of CPU throttling in each FUSE sidecar container in the selected cache system. If no CPU resource limit is specified for the FUSE sidecar container, the value is left empty. |
FUSE Metrics (via Sidecar) | Meta Ops Per Second | The frequency of file metadata operations per second for each FUSE sidecar container in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are counted. |
FUSE Metrics (via Sidecar) | Meta Ops P99 Latency | The P99 latency of metadata operations on each FUSE sidecar container in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are included. |
FUSE Metrics (via Sidecar) | Read/Write Ops Per Second | The frequency of file read and write operations per second for each FUSE sidecar container in the selected cache system. |
FUSE Metrics (via Sidecar) | Read/Write Ops P99 Latency | The P99 latency of file read and write operations for each FUSE sidecar container in the selected cache system. |
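The relationship between the Read Bytes Per Minute, Cache Hit Ratio Per Minute, and Cache System Aggregated Bandwidth panels can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the dashboard's actual computation: it assumes that the hit ratio is calculated from bytes read (Cache Hit versus From Backend) rather than from request counts, and the function names, pod names, interface names, and numbers are all hypothetical.

```python
from typing import Dict, Optional


def cache_hit_ratio(hit_bytes: float, backend_bytes: float) -> Optional[float]:
    """Return the share of bytes served from the cache, or None if nothing was read."""
    total = hit_bytes + backend_bytes
    if total == 0:
        # No reads in this minute, so the ratio is undefined.
        return None
    return hit_bytes / total


def aggregated_bandwidth(tx_rate_by_pod: Dict[str, Dict[str, float]]) -> float:
    """Sum the outbound rate (bytes per second) over every interface of every worker pod."""
    return sum(
        rate
        for interfaces in tx_rate_by_pod.values()
        for rate in interfaces.values()
    )


# Hypothetical bytes read within one minute.
ratio = cache_hit_ratio(hit_bytes=750.0, backend_bytes=250.0)
print(f"cache hit ratio: {ratio:.0%}")

# Hypothetical outbound rates (bytes per second) per worker pod and interface.
worker_tx = {
    "worker-0": {"eth0": 100.0},
    "worker-1": {"eth0": 50.0, "eth1": 25.0},
}
print(f"aggregated bandwidth: {aggregated_bandwidth(worker_tx)} B/s")
```

Because the bandwidth is a plain sum over the interfaces of each worker pod, a worker pod on the host network would contribute all traffic on the node's interfaces, which is consistent with the note that the value may be inflated in that case.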