Fluid is a Kubernetes-native distributed dataset orchestration and acceleration engine for data-intensive applications, such as big data and AI applications, in cloud-native scenarios. Fluid provides an application-oriented dataset abstraction, scalable data engine plug-ins, automated data operations, data acceleration, and runtime platform independence. You can install the Fluid monitoring component on Prometheus instances of Prometheus Service with a few clicks and use the out-of-the-box dashboards provided by Prometheus Service to monitor Fluid. This topic describes how to enable Prometheus Service for Fluid.
Prerequisites
Prometheus Service is enabled for your Container Service for Kubernetes (ACK) cluster or ACK Serverless cluster. For more information, see Managed Service for Prometheus.
The cloud-native AI suite is deployed and Fluid data acceleration is enabled. The version of the ack-fluid component is 0.9.7 or later. For more information, see Deploy the cloud-native AI suite.
Limits
You can install the Fluid monitoring component only on Prometheus instances whose type is Prometheus for Container Service.
You can monitor only Fluid control plane components, such as the Fluid controllers and Fluid webhook.
Step 1: Install the Fluid monitoring component
Log on to the ARMS console.
In the left-side navigation pane, click Integration Center. Then, click the Fluid card in the AI section.
If the Fluid component is already installed, skip the preceding step.
In the Select a Kubernetes cluster section of the Fluid panel, select the Kubernetes cluster.
In the Configuration Information section, configure the parameters and click OK.
Parameter | Description |
Exporter Name | The unique name of the Fluid exporter. |
Metrics scrape interval (seconds) | The interval at which monitoring data is collected. |
After the Fluid monitoring component is installed, the Fluid card displays Installed 1 Exporter. Click the Fluid card. In the panel that appears, you can view Targets, Metrics, Dashboards, Alerts, Service Discovery Configurations, and Exporter.
Step 2: View the Fluid dashboard
View the Fluid dashboard from the ACK console (recommended)
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the ACK cluster or ACK Serverless cluster in which the Fluid monitoring component is installed. In the left-side navigation pane, choose .
On the Prometheus Monitoring page, choose
to view the monitoring data.
In the Fluid dashboard, you can view detailed information about the Fluid control plane components, such as the status of the components, the Fluid controller processing time, the QPS of the Fluid webhook, the request processing latency, and the resource usage of each component. For more information, see Panels.
In the Component running status section, you can view the number of Fluid control plane pods that are in the Running state, the number of restarts, and the time of each restart.
In the Fluid Controller Detailed Indicators section, you can check whether the Fluid controllers are busy and view information about processing failures and Kubernetes API requests.
In the Fluid Webhook Detailed Indicators section, you can view the resource usage of the Fluid webhook, the number of processed requests, and the request processing latency.
In the Resource usage section, you can view the resource usage of each Fluid control plane component, the network transmit rate, and the network receive rate.
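The panels described above can be approximated with PromQL expressions over the metrics listed in the Metrics section. The following Python sketch only assembles query strings. The 5-minute window, the aggregation choices, and the dictionary keys are illustrative assumptions and are not the definitions used by the built-in dashboard.

```python
# Illustrative PromQL expressions for the Fluid control plane panels.
# Assumption: the window and the aggregations are examples only; the
# built-in dashboard may define its panels differently.
WINDOW = "5m"


def build_queries(window: str = WINDOW) -> dict[str, str]:
    """Map a hypothetical panel name to a PromQL expression."""
    return {
        # Share of reconcile workers in use per controller (1.0 = fully busy).
        "controller_saturation": (
            "controller_runtime_active_workers"
            " / controller_runtime_max_concurrent_reconciles"
        ),
        # Reconciliation errors per second across all controllers.
        "reconcile_error_rate": (
            f"sum(rate(controller_runtime_reconcile_errors_total[{window}]))"
        ),
        # Webhook requests per second (QPS).
        "webhook_qps": (
            f"sum(rate(controller_runtime_webhook_requests_total[{window}]))"
        ),
        # 99th percentile webhook latency in seconds.
        "webhook_p99_latency": (
            "histogram_quantile(0.99, sum by (le) "
            f"(rate(controller_runtime_webhook_latency_seconds_bucket[{window}])))"
        ),
    }


if __name__ == "__main__":
    for name, expr in build_queries().items():
        print(f"{name}: {expr}")
```

Choose a window that covers several scrape intervals, because `rate` needs at least two samples inside the window to return a value.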
View the Fluid dashboard from ARMS
Log on to the ARMS console.
In the left-side navigation pane, click Integration Center. On the Integration Management page, click the Fluid card. Click the Dashboards tab, and then click Fluid Control Plane at the bottom of the panel to view the monitoring data.
In the Fluid dashboard, you can view detailed information about the Fluid control plane components, such as the status of the components, the Fluid controller processing time, the QPS of the Fluid webhook, the request processing latency, and the resource usage of each component. For more information, see Panels.
In the Component running status section, you can view the number of Fluid control plane pods that are in the Running state, the number of restarts, and the time of each restart.
In the Fluid Controller Detailed Indicators section, you can check whether the Fluid controllers are busy and view information about processing failures and Kubernetes API requests.
In the Fluid Webhook Detailed Indicators section, you can view the resource usage of the Fluid webhook, the number of processed requests, and the request processing latency.
In the Resource usage section, you can view the resource usage of each Fluid control plane component, the network transmit rate, and the network receive rate.
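Figures such as the webhook QPS are derived from Counter metrics, which only increase until the process restarts. The following minimal sketch shows how a per-second rate can be computed from raw counter samples. It is a simplification of what PromQL's `rate` does (no extrapolation to the window boundaries), and the function name and sample values are hypothetical.

```python
def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase over (timestamp_seconds, value) samples.

    A drop in value is treated as a counter reset (for example, a pod
    restart), so the value after the reset counts as new increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    samples = sorted(samples)
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero.
        increase += cur - prev if cur >= prev else cur
    return increase / elapsed


if __name__ == "__main__":
    # Hypothetical samples of controller_runtime_webhook_requests_total,
    # scraped every 15 seconds.
    scraped = [(0.0, 100.0), (15.0, 130.0), (30.0, 160.0)]
    print(counter_rate(scraped))
```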
Metrics
The following table describes the monitoring metrics for the Fluid control plane components.
Metric | Type | Description |
dataset_ufs_total_size | Gauge | The total size of the datasets that are mounted to the existing Dataset objects in the current cluster. |
dataset_ufs_file_num | Gauge | The number of files in the datasets that are mounted to the existing Dataset objects in the current cluster. |
runtime_setup_error_total | Counter | The number of failures to start up the runtime when the controller reconciles. |
runtime_sync_healthcheck_error_total | Counter | The number of runtime health check failures that occur when the controller reconciles. |
controller_runtime_reconcile_time_seconds_bucket | Histogram | The duration of the reconciliation process. |
controller_runtime_reconcile_errors_total | Counter | The number of reconciliation failures. |
controller_runtime_reconcile_total | Counter | The total number of reconciliations, including both successful and failed ones. |
controller_runtime_max_concurrent_reconciles | Gauge | The maximum number of concurrent reconciliations supported by the controller. |
controller_runtime_active_workers | Gauge | The number of active reconciliations of the controller. |
workqueue_adds_total | Counter | The number of Adds events processed by the controller workqueue. |
workqueue_depth | Gauge | The length of the controller workqueue. |
workqueue_queue_duration_seconds_bucket | Histogram | The amount of time that the pending object has been waiting in the controller workqueue. |
workqueue_work_duration_seconds_bucket | Histogram | The distribution of the durations of the tasks that have been completed by the controller. |
workqueue_unfinished_work_seconds | Gauge | The total duration of all tasks that are being processed in the controller workqueue. |
workqueue_longest_running_processor_seconds | Gauge | The time, in seconds, that the longest-running task in the controller workqueue has been in processing. |
rest_client_requests_total | Counter | The number of HTTP requests, partitioned by status code, method, and host. |
rest_client_request_duration_seconds_bucket | Histogram | The HTTP request latency, in seconds, partitioned by verb and URL. |
controller_runtime_webhook_requests_in_flight | Gauge | The number of requests that are being processed by the webhook. |
controller_runtime_webhook_requests_total | Counter | The total number of requests that are processed by the webhook. |
controller_runtime_webhook_latency_seconds_bucket | Histogram | The request processing latency of the webhook. |
process_cpu_seconds_total | Counter | The total user and system CPU time consumed by the process, in seconds. |
process_resident_memory_bytes | Gauge | The resident memory size of the process, in bytes. |
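The Histogram metrics in the table above (names that end in `_bucket`) expose cumulative counts per upper bound rather than individual observations, so latency percentiles have to be estimated. The following sketch illustrates the linear interpolation idea behind PromQL's `histogram_quantile`. Edge cases are simplified, and the function name and bucket values are hypothetical.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls into the open-ended bucket: report the
                # highest finite upper bound instead of extrapolating.
                return prev_bound
            if count == prev_count:
                # Empty bucket (only reachable for q == 0).
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical buckets of controller_runtime_webhook_latency_seconds_bucket:
    # (le upper bound in seconds, cumulative request count).
    latency = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
    print(estimate_quantile(0.95, latency))
```

Because the result is interpolated within a bucket, its precision is limited by the bucket boundaries that the component exports.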
References
For more information about Fluid, see Overview of Fluid.
For more information about Fluid panels, see Panels.