Container Service for Kubernetes: Fluid dashboard parameters

Last Updated: Nov 22, 2024

This topic describes the variables and panels in the Fluid control plane dashboard and the Fluid JindoRuntime cache dashboard. The dashboard variables let you filter Fluid observability metrics by different dimensions, such as the monitoring period and the namespace and name of a dataset. The panels help you understand the health and performance of Fluid components. You can use the dashboards to detect and resolve potential problems at the earliest opportunity and to identify optimization opportunities for the cache system in specific business scenarios.

Prerequisites

Managed Service for Prometheus is enabled for the Fluid component. For more information, see Step 2: View the Fluid dashboard.
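
If you want to verify that Fluid metrics are being collected before you open the dashboard, you can run an instant query against the Prometheus HTTP API. The following Python sketch is a minimal example, assuming a reachable Prometheus query endpoint and that the Fluid control plane runs in the fluid-system namespace; replace the placeholder URL and adjust the label selector to match your Managed Service for Prometheus instance.

```python
import requests

# Placeholder endpoint: replace with the HTTP API URL of your Managed Service
# for Prometheus instance (or any Prometheus endpoint that scrapes Fluid).
PROM_URL = "http://<your-prometheus-endpoint>"

def instant_query(promql: str):
    """Run an instant query against the standard Prometheus HTTP API."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Assumed label selector: the Fluid control plane components are commonly
# deployed in the fluid-system namespace; adjust this to your cluster.
for series in instant_query('up{namespace="fluid-system"}'):
    print(series["metric"].get("pod", series["metric"]), "=>", series["value"][1])
```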

Fluid control plane dashboard

Dashboard variables

The monitoring data displayed in the panels of the Fluid dashboard varies based on the values of the dashboard variables. You can modify the values of the variables based on your business requirements. For example, after you change the value of the runtime variable from AlluxioRuntime to JindoRuntime, all runtime-related panels in the dashboard switch to display data about JindoRuntime.

Variable: interval
Valid values: 1m, 5m, 10m, 30m, 1h, and 6h
Description: The duration of a monitoring cycle.

Variable: quantile
Valid values: 0.5, 0.75, 0.90, 0.95, and 0.99
Description: The quantile that certain panels use when they visualize metrics. For example, a value of 0.90 indicates P90.

Variable: runtime
Valid values: JindoRuntime, AlluxioRuntime, and JuiceFSRuntime
Description: The type of runtime used in Fluid. After you modify the value of this variable, the change applies to all runtime-related panels.

  • JindoRuntime: the execution engine of JindoFS, which is developed by the Alibaba Cloud Elastic MapReduce (EMR) team. JindoRuntime is based on C++, provides dataset management and caching, and supports Object Storage Service (OSS).

  • AlluxioRuntime: the execution engine of open source Alluxio. AlluxioRuntime supports dataset management and caching and accelerates access to persistent volume claims (PVCs), Ceph, and Cloud Parallel File System (CPFS). You can use AlluxioRuntime in hybrid cloud scenarios.

  • JuiceFSRuntime: a distributed cache acceleration engine based on JuiceFS. JuiceFSRuntime supports scenario-specific data caching and acceleration. For more information about JuiceFS, see Introduction to JuiceFS.
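
The runtime variable maps to the runtime object that is bound to each dataset in your cluster. The following Python sketch is a minimal example, not a production configuration: it creates a Dataset backed by an OSS bucket and a matching JindoRuntime by using the Kubernetes Python client. The OSS mount point, namespace, replica count, and cache quota are placeholder values, and the field layout is based on the commonly documented data.fluid.io/v1alpha1 CRDs, so verify it against the CRD version installed in your cluster.

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod
api = client.CustomObjectsApi()

GROUP, VERSION, NAMESPACE = "data.fluid.io", "v1alpha1", "default"

# Minimal Dataset backed by an OSS bucket (placeholder mount point).
dataset = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "Dataset",
    "metadata": {"name": "demo-dataset", "namespace": NAMESPACE},
    "spec": {"mounts": [{"mountPoint": "oss://<your-bucket>/<path>", "name": "demo"}]},
}

# Matching JindoRuntime with a small in-memory cache tier.
jindo_runtime = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "JindoRuntime",
    "metadata": {"name": "demo-dataset", "namespace": NAMESPACE},  # must match the Dataset name
    "spec": {
        "replicas": 2,
        "tieredstore": {"levels": [{"mediumtype": "MEM", "quota": "2Gi"}]},
    },
}

api.create_namespaced_custom_object(GROUP, VERSION, NAMESPACE, "datasets", dataset)
api.create_namespaced_custom_object(GROUP, VERSION, NAMESPACE, "jindoruntimes", jindo_runtime)
```

After the runtime is bound to the dataset, set the runtime variable to JindoRuntime to display the corresponding runtime-related panels.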

Panels

Panel group: Component running status

  • Dataset Controller Ready Replicas: The number of dataset controller pods that are in the Running state in the cluster.

  • History of Dataset controller restarts: The number of times that the dataset controller pods have restarted in the cluster.

  • Runtime Number of ready copies of controller: The number of runtime controller pods that are in the Running state in the cluster.

  • History Runtime Controller Restart Times: The number of times that the runtime controller pods have restarted in the cluster.

  • Fluid Webhook ready copies: The number of Fluid webhook pods that are in the Running state in the cluster.

  • Number of historical fluid Webhook restarts: The number of times that the Fluid webhook pods have restarted in the cluster.

  • Fluid CSI Plug-in Ready Copies: The number of Fluid CSI plug-in pods that are in the Running state in the cluster.

  • Historical Fluid CSI plug-in restarts: The number of times that the Fluid CSI plug-in pods have restarted in the cluster.

  • Fluid Component Restart: The top five Fluid components that have restarted most frequently within a monitoring cycle of 2 minutes.

Panel group: Fluid Controller Detailed Indicator

  • Runtime Controller processing time: The amount of time that the runtime controller spends handling runtime resources within a monitoring cycle. The panel displays percentile values. See the query sketch after this list for how the interval and quantile variables are substituted.

  • Number of Runtime controller processing failures: The types and number of failures that occur when the runtime controller handles runtime resources. The following types of failures are displayed: runtime deployment failures and runtime health check failures.

  • Runtime Number of controller threads: The current number of active threads of the runtime controller and the maximum number of threads supported by the runtime controller.

  • DataLoad Controller Threads: The current number of active threads of the DataLoad controller and the maximum number of threads supported by the DataLoad controller.

  • Controller Queue Length: The length of each Fluid controller workqueue in the cluster.

  • Total number of Kubernetes API requests: The total number of requests sent by the Fluid controller pods to the Kubernetes API server within a monitoring cycle.

  • Runtime Controller Kubernetes API requests: The number of requests sent by the runtime controller to the Kubernetes API server within a monitoring cycle. The requests are classified and displayed by the returned status code.

  • Total time consumed by unfinished processing of controller: The total amount of time that each Fluid controller spends on ongoing tasks.

Panel group: Fluid Webhook Detailed Indicators

  • Fluid Webhook Pod CPU Usage: The CPU utilization of each Fluid webhook pod within a monitoring cycle.

  • Fluid Webhook Pod Memory Usage: The memory usage of each Fluid webhook pod within a monitoring cycle.

  • Total number of requests processed in Fluid Webhook: The total number of requests processed by the Fluid webhook within a monitoring cycle.

  • The number of requests processed in each Fluid Webhook Pod: The number of requests processed by each Fluid webhook pod within a monitoring cycle.

  • Fluid Webhook Request Processing Delay: The request processing latency of the Fluid webhook within a monitoring cycle. The latency is a percentile value.

  • Request processing delay of each Fluid Webhook Pod: The request processing latency of each Fluid webhook pod within a monitoring cycle. The latency is a percentile value.

Panel group: Resource usage

  • CPU usage: The CPU utilization of each Fluid controller pod within a monitoring cycle.

  • Memory usage: The memory usage of each Fluid controller pod within a monitoring cycle.

  • Network Send Rate per Pod: The network transmit rate of each Fluid controller pod within a monitoring cycle.

  • Network Receive Rate per Pod: The network receive rate of each Fluid controller pod within a monitoring cycle.
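
The percentile panels in this list, such as Runtime Controller processing time and Fluid Webhook Request Processing Delay, combine the interval and quantile variables in a histogram quantile query. The following Python sketch shows the general pattern, assuming the same placeholder Prometheus endpoint as in the prerequisites example; the metric name controller_runtime_reconcile_time_seconds_bucket is an assumption based on the standard controller-runtime histogram and may differ from the metric that the dashboard actually queries.

```python
import requests

PROM_URL = "http://<your-prometheus-endpoint>"  # placeholder, see the prerequisites sketch

# Values that correspond to the interval and quantile dashboard variables.
INTERVAL = "5m"
QUANTILE = 0.90  # P90

# Assumed metric: the standard controller-runtime reconcile-latency histogram.
promql = (
    f"histogram_quantile({QUANTILE}, "
    f"sum(rate(controller_runtime_reconcile_time_seconds_bucket[{INTERVAL}])) "
    f"by (controller, le))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("controller", "all"), "=>", series["value"][1], "seconds")
```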

Fluid JindoRuntime cache dashboard

Dashboard variables

The Fluid JindoRuntime cache dashboard allows you to select a specific dataset based on the dashboard variables and view the relevant metrics of the JindoRuntime cache system associated with that dataset.

Variable: namespace
Description: The namespace in the cluster that contains the Fluid dataset.

Variable: fluid_dataset
Description: The name of the Fluid dataset in the cluster.
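
To see which values are available for the namespace and fluid_dataset variables, you can list the Dataset objects in the cluster. The following Python sketch uses the Kubernetes Python client and assumes that the Fluid CRDs are installed under the data.fluid.io/v1alpha1 API group and that your kubeconfig points at the target cluster.

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod
api = client.CustomObjectsApi()

# List all Fluid Dataset objects; their namespaces and names are the values
# that the namespace and fluid_dataset dashboard variables can take.
datasets = api.list_cluster_custom_object("data.fluid.io", "v1alpha1", "datasets")
for item in datasets.get("items", []):
    meta = item["metadata"]
    print(f'namespace={meta["namespace"]}  fluid_dataset={meta["name"]}')
```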

Panels

Panel group: Dataset Overview

  • Ready Pod Num: The number of ready pods in each component of the selected cache system, including the master, worker, and FUSE components of the cache system.

  • Pod Overview: The basic information about the pods in each component of the selected cache system, including the number of restarts in the last hour, CPU resource requests and limits, and memory resource requests and limits.

Panel group: Cache System Metrics

  • Cache Capacity Usage (%): The proportion of the cache capacity that is used by the selected cache system.

  • Cache Capacity Usage: The maximum available cache capacity and the current capacity usage of the selected cache system.

  • Cache Hit Ratio Per Minute: The data access cache hit ratio of the selected cache system per minute.

  • Read Bytes Per Minute: The amount of data read per minute from the selected cache system, including the amount of data served from the cache on cache hits (Cache Hit) and the amount of data read from the backend storage on cache misses (From Backend).

  • Cache System Aggregated Bandwidth: The aggregate bandwidth provided by the selected cache system for the application. The aggregate bandwidth is the sum of the outbound traffic of each network interface controller of the worker pods. If the worker pods run on the host network, the value may be inflated. To obtain the actual aggregate bandwidth of the cache system, make sure that the worker pods run on the container network. See the query sketch at the end of this topic for one way to reproduce this panel.

  • Cache Worker Pod Network I/O: The network I/O status of each worker pod in the selected cache system. If the worker pods run on the host network, the value may be inflated. To obtain the actual aggregate bandwidth data of the cache system, make sure that the worker pods run on the container network.

  • Cache System Pod Memory Usage: The memory usage of the master and worker pods in the selected cache system. If the process memory of the worker pod is used as the cache medium, the cache capacity occupied by each worker component is included in the pod memory usage.

  • Cache System Pod CPU Usage by Cores: The CPU usage of the master and worker pods in the selected cache system.

  • Aggregated File Operation Requests: The request frequency of file metadata operations counted by the selected cache system. Only the GetAttr and ReadDir metadata operations are counted.

Panel group: FUSE Metrics (via CSI)

  • FUSE Network I/O: The network I/O status of each FUSE pod in the selected cache system. If a FUSE pod runs on the host network, the value may be inflated. To obtain the actual aggregate bandwidth data of the cache system, make sure that the FUSE pods run on the container network.

  • FUSE Memory Usage/Limit (%): The percentage of the current memory usage of each FUSE pod relative to its memory resource limit in the selected cache system. If no memory resource limit is specified for the FUSE pods, the value is left empty.

  • FUSE CPU Throttled Percent: The percentage of CPU throttling in each FUSE pod in the selected cache system. If no CPU resource limit is specified for the FUSE pods, the value is left empty.

  • Meta Ops Per Second: The frequency of file metadata operations per second for each FUSE pod in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are counted.

  • Meta Ops P99 Latency: The P99 latency of metadata operations on each FUSE pod in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are counted.

  • Read/Write Ops Per Second: The frequency of file read and write operations per second for each FUSE pod in the selected cache system.

  • Read/Write Ops P99 Latency: The P99 latency of file read and write operations for each FUSE pod in the selected cache system.

Panel group: FUSE Metrics (via Sidecar)

  • FUSE Memory Usage/Limit (%): The percentage of the current memory usage of each FUSE sidecar container relative to its memory resource limit in the selected cache system. If no memory resource limit is specified for the FUSE sidecar containers, the value is left empty.

  • FUSE CPU Throttled Percent: The percentage of CPU throttling in each FUSE sidecar container in the selected cache system. If no CPU resource limit is specified for the FUSE sidecar containers, the value is left empty.

  • Meta Ops Per Second: The frequency of file metadata operations per second for each FUSE sidecar container in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are counted.

  • Meta Ops P99 Latency: The P99 latency of metadata operations on each FUSE sidecar container in the selected cache system. Only the GetAttr, ReadDir, and Open metadata operations are counted.

  • Read/Write Ops Per Second: The frequency of file read and write operations per second for each FUSE sidecar container in the selected cache system.

  • Read/Write Ops P99 Latency: The P99 latency of file read and write operations for each FUSE sidecar container in the selected cache system.
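
As noted in the Cache System Aggregated Bandwidth panel description, the aggregate bandwidth is the sum of the outbound traffic of the worker pods. The following Python sketch approximates that panel with the standard cAdvisor metric container_network_transmit_bytes_total; the Prometheus endpoint, the namespace, and the worker pod name pattern (<dataset>-jindofs-worker-.*) are assumptions that you should adjust to your cluster, and the query only reflects real traffic when the worker pods run on the container network.

```python
import requests

PROM_URL = "http://<your-prometheus-endpoint>"  # placeholder

NAMESPACE = "default"        # corresponds to the namespace dashboard variable
DATASET = "demo-dataset"     # corresponds to the fluid_dataset dashboard variable

# Assumed pod name pattern for JindoRuntime worker pods; verify the actual
# pod names in your cluster before relying on this selector.
promql = (
    "sum(rate(container_network_transmit_bytes_total{"
    f'namespace="{NAMESPACE}", pod=~"{DATASET}-jindofs-worker-.*"'
    "}[1m]))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    print("aggregated bandwidth:", float(result[0]["value"][1]), "bytes/second")
else:
    print("no matching worker pods or no network metrics scraped")
```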