Container Service for Kubernetes: Enable Managed Service for Prometheus for the Fluid component

Updated: Nov 15, 2024

Fluid is a Kubernetes-native distributed dataset orchestration and acceleration engine that serves data-intensive applications, such as big data and AI applications, in cloud-native scenarios. Fluid provides application-oriented dataset abstraction, a scalable data engine plug-in framework, automated data operations, data acceleration, and runtime platform independence. You can install the Fluid monitoring component on a Prometheus instance of Managed Service for Prometheus with a few clicks and use the out-of-the-box dashboards provided by Managed Service for Prometheus to monitor Fluid. This topic describes how to enable Managed Service for Prometheus for Fluid.

Prerequisites

  • Managed Service for Prometheus is enabled for a Container Service for Kubernetes (ACK) cluster or ACK Serverless cluster. For more information, see Managed Service for Prometheus.

  • The cloud-native AI suite is deployed and Fluid data acceleration is enabled. For more information, see Deploy the cloud-native AI suite.

    • To use all features of the Fluid control plane dashboard, make sure that ack-fluid 0.9.7 or later is installed in the cluster.

    • To use all features of the Fluid JindoRuntime cache system dashboard, make sure that ack-fluid 1.0.11 or later is installed. A sketch for checking the installed version is shown after this list.
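
You can check which ack-fluid version is installed from the ACK console or from the Helm release in the cluster. The following is a minimal sketch that uses the official Kubernetes Python client to read the image tag of a Fluid controller Deployment; the namespace (fluid-system) and Deployment name (dataset-controller) are assumptions and may differ in your cluster.

    # Minimal sketch: infer the installed ack-fluid version from the image tag
    # of a Fluid controller Deployment. The namespace and Deployment name are
    # assumptions and may differ in your cluster.
    from kubernetes import client, config

    NAMESPACE = "fluid-system"          # assumed namespace of the Fluid components
    DEPLOYMENT = "dataset-controller"   # assumed name of a Fluid controller Deployment

    def main() -> None:
        config.load_kube_config()       # uses your local kubeconfig
        apps = client.AppsV1Api()
        dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
        image = dep.spec.template.spec.containers[0].image
        # The tag after the colon usually reflects the component version, for example 1.0.11.
        print(f"{DEPLOYMENT} image: {image}")

    if __name__ == "__main__":
        main()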

Limits

The cache system dashboard supports only cache runtime components of the JindoRuntime type (JindoCache engine).

Step 1: Integrate Fluid with ARMS

  1. Log on to the ARMS console.

  2. In the left-side navigation pane, click Integration Center. In the AI section, click the Fluid card.

  3. In the Select a Kubernetes cluster section of the Fluid page, select the desired cluster. If the page displays that Fluid is already installed, skip the steps in this section.

  4. In the Configuration Information section, configure parameters and click OK.

    • Name (optional): The unique name of the Fluid exporter. You can leave this parameter empty.

    • Metrics collection interval (seconds): The interval, in seconds, at which monitoring data is collected.

  5. You can view integrated components on the Integration Management page of the ARMS console.

    1. Log on to the ARMS console.

    2. In the left-side navigation pane, click Integration Management. On the Integrated Addons tab, click the Fluid card.

    3. On the Environments tab, click View Details in the Actions column to view the Fluid component in the cluster and alert rules. To confirm programmatically that Fluid metrics are being collected, see the sketch after these steps.
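
After the integration is complete, you can optionally verify that Fluid metrics are arriving in your Prometheus instance by calling the standard Prometheus HTTP API. The following is a minimal sketch; the endpoint URL and token are placeholders that you must replace with the HTTP API URL and authentication settings of your own Managed Service for Prometheus instance.

    # Minimal sketch: query the Prometheus HTTP API to confirm that Fluid
    # metrics are being collected. PROM_URL and TOKEN are placeholders for the
    # HTTP API endpoint and authentication token of your Prometheus instance.
    import requests

    PROM_URL = "https://<your-prometheus-http-api-endpoint>"  # placeholder
    TOKEN = "<your-token-if-required>"                        # placeholder

    def query(expr: str) -> list:
        resp = requests.get(
            f"{PROM_URL}/api/v1/query",
            params={"query": expr},
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["data"]["result"]

    if __name__ == "__main__":
        # dataset_ufs_total_size is one of the Fluid control plane metrics that
        # are listed later in this topic; a non-empty result indicates that
        # monitoring data is flowing.
        print(query("dataset_ufs_total_size"))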

Step 2: View the Fluid dashboard

View the Fluid dashboard from the ACK console (recommended)

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the ACK cluster or ACK Serverless cluster in which the Fluid component is installed. In the left-side pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, choose Others > Fluid Control Plane to view the monitoring data of Fluid control planes.

    In the Fluid control plane dashboard, you can view detailed information about the Fluid control plane components, such as the status of the components, the Fluid controller processing time, the QPS of the Fluid webhook, the request processing latency, and the resource usage of each component. For more information, see Panels.

    • In the Component running status section, you can view the number of Fluid control plane pods that are in the Running state, the number of restarts, and the time of each restart.

    • In the Fluid Controller Detailed Indicators section, you can check whether the Fluid controllers are busy and view information about processing failures and Kubernetes API requests.

    • In the Fluid Webhook Detailed Indicators section, you can view the resource usage of the Fluid webhook, the number of processed requests, and the request processing latency.

    • In the Resource usage section, you can view the resource usage of each Fluid control plane component, the network transmit rate, and the network receive rate.

  4. On the Prometheus Monitoring page, choose Others > Fluid JindoRuntime Dashboard to view the monitoring data of the Fluid JindoRuntime cache system.

    In the Fluid JindoRuntime cache system dashboard, you can view an overview of Fluid Datasets, metrics of the cache system, and metrics of the FUSE client. For more information, see Panels.

    • In the Dataset Overview section, you can view the overview information of a Fluid Dataset, including the number of healthy pods for the master, worker, and FUSE components and the resource configuration of each pod.

    • In the Cache System Metrics section, you can view the metrics of the selected cache system, including the current cache usage, cache hit ratio, aggregated bandwidth, and QPS of the cache system for processing file metadata.

    • In the FUSE Metrics (via CSI) section, you can view the metrics of FUSE clients (FUSE pods) that are mounted through the Fluid CSI plug-in, including the input/output of each FUSE pod, the latency and QPS of metadata operations, and the latency and QPS of read and write operations.

    • In the FUSE Metrics (via Sidecar) section, you can view the metrics of FUSE clients that are mounted through the Fluid FUSE sidecar, including the latency and QPS of metadata operations and the latency and QPS of read and write operations.

View the Fluid dashboard from the ARMS console

  1. Log on to the ARMS console.

  2. In the left-side navigation pane, click Integration Management. Click the Query Dashboards tab, select Fluid from the component drop-down list, and click Fluid Control Plane in the lower part of the page. You can view the monitoring data of the Fluid control planes in the dashboard.

    In the Fluid dashboard, you can view detailed information about the Fluid control plane components, such as the status of the components, the Fluid controller processing time, the QPS of the Fluid webhook, the request processing latency, and the resource usage of each component. For more information, see Panels.

    • In the Component running status section, you can view the number of Fluid control plane pods that are in the Running state, the number of restarts, and the time of each restart.

    • In the Fluid Controller Detailed Indicators section, you can check whether the Fluid controllers are busy and view information about processing failures and Kubernetes API requests.

    • In the Fluid Webhook Detailed Indicators section, you can view the resource usage of the Fluid webhook, the number of processed requests, and the request processing latency.

    • In the Resource usage section, you can view the resource usage of each Fluid control plane component, the network transmit rate, and the network receive rate.

  3. In the left-side navigation pane, click Integration Management. Click the Query Dashboards tab, select Fluid from the component drop-down list, and click Fluid JindoRuntime Dashboard in the lower part of the page. You can view the monitoring data of the Fluid JindoRuntime cache system in the dashboard.

    • In the Dataset Overview section, you can view the overview information of a Fluid Dataset, including the number of healthy pods for the master, worker, and FUSE components and the resource configuration of each pod.

    • In the Cache System Metrics section, you can view the metrics of the selected cache system, including the current cache usage, cache hit ratio, aggregated bandwidth, and QPS of the cache system for processing file metadata.

    • In the FUSE Metrics (via CSI) section, you can view the metrics of FUSE clients (FUSE pods) that are mounted through the Fluid CSI plug-in, including the input/output of each FUSE pod, the latency and QPS of metadata operations, and the latency and QPS of read and write operations.

    • In the FUSE Metrics (via Sidecar) section, you can view the metrics of FUSE clients that are mounted through the Fluid FUSE sidecar, including the latency and QPS of metadata operations and the latency and QPS of read and write operations.

Introduction to metrics

The following list describes the monitoring metrics for the Fluid control plane components.

  • dataset_ufs_total_size (Gauge): The size of the datasets that are mounted to the existing Dataset objects in the current cluster.
  • dataset_ufs_file_num (Gauge): The number of files in the datasets that are mounted to the existing Dataset objects in the current cluster.
  • runtime_setup_error_total (Counter): The number of runtime startup failures that occur when the controller reconciles.
  • runtime_sync_healthcheck_error_total (Counter): The number of runtime health check failures that occur when the controller reconciles.
  • controller_runtime_reconcile_time_seconds_bucket (Histogram): The duration of the reconciliation process.
  • controller_runtime_reconcile_errors_total (Counter): The number of reconciliation failures.
  • controller_runtime_reconcile_total (Counter): The number of successful reconciliations.
  • controller_runtime_max_concurrent_reconciles (Gauge): The maximum number of concurrent reconciliations supported by the controller.
  • controller_runtime_active_workers (Gauge): The number of active reconciliations of the controller.
  • workqueue_adds_total (Counter): The number of add events processed by the controller workqueue.
  • workqueue_depth (Gauge): The length of the controller workqueue.
  • workqueue_queue_duration_seconds_bucket (Histogram): The amount of time that a pending object has been waiting in the controller workqueue.
  • workqueue_work_duration_seconds_bucket (Histogram): The distribution of the durations of the tasks that have been completed by the controller.
  • workqueue_unfinished_work_seconds (Gauge): The total duration of all tasks that are being processed in the controller workqueue.
  • workqueue_longest_running_processor_seconds (Gauge): The longest duration that the controller has spent processing a single task.
  • rest_client_requests_total (Counter): The number of HTTP requests, partitioned by status code, method, and host.
  • rest_client_request_duration_seconds_bucket (Histogram): The HTTP request latency, partitioned by verb and URL.
  • controller_runtime_webhook_requests_in_flight (Gauge): The number of requests that are being processed by the webhook.
  • controller_runtime_webhook_requests_total (Counter): The total number of requests that are processed by the webhook.
  • controller_runtime_webhook_latency_seconds_bucket (Histogram): The request processing latency of the webhook.
  • process_cpu_seconds_total (Counter): The total user and system CPU time consumed by the process, in seconds.
  • process_resident_memory_bytes (Gauge): The amount of resident memory used by the process, in bytes.
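
These control plane metrics follow common controller-runtime and Prometheus client conventions, so you can reuse the usual PromQL patterns when you build custom views or alert rules. The expressions below are a sketch; any label filters (such as a controller or job label) are omitted and depend on your installation.

    # Sketch: example PromQL expressions over the Fluid control plane metrics.
    # Label filters are omitted and depend on your installation.

    # Reconciliation failures per second over the last 5 minutes.
    RECONCILE_ERROR_RATE = "sum(rate(controller_runtime_reconcile_errors_total[5m]))"

    # P99 webhook request latency, computed from the histogram buckets.
    WEBHOOK_P99_LATENCY = (
        "histogram_quantile(0.99, "
        "sum(rate(controller_runtime_webhook_latency_seconds_bucket[5m])) by (le))"
    )

    # Current backlog of the controller work queues.
    WORKQUEUE_DEPTH = "sum(workqueue_depth)"

    if __name__ == "__main__":
        for name, expr in [
            ("reconcile error rate", RECONCILE_ERROR_RATE),
            ("webhook p99 latency", WEBHOOK_P99_LATENCY),
            ("workqueue depth", WORKQUEUE_DEPTH),
        ]:
            print(f"{name}: {expr}")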

The following list describes the metrics of the JindoCache servers that are displayed in the Fluid JindoRuntime cache system dashboard.

  • jindocache_server_total_stsnodes_num (Gauge): The number of alive replicas of the worker component in the current distributed cache system.
  • jindocache_server_total_disk_cap (Gauge): The maximum size of the disk cache (including RAM disks such as tmpfs) in the distributed cache system.
  • jindocache_server_total_used_disk_cap (Gauge): The amount of disk cache (including RAM disks such as tmpfs) used in the distributed cache system.
  • jindocache_server_total_mem_cap (Gauge): The maximum size of the RAM cache in the distributed cache system.
  • jindocache_server_total_used_mem_cap (Gauge): The amount of RAM cache used in the distributed cache system.
  • jindocache_server_total_used_rocksdb_cap (Gauge): The amount of RocksDB storage used in the distributed cache system.
  • jindocache_server_backend_read_bytes_total (Gauge): The total amount of data read from the underlying storage system, in bytes. Data is read from the underlying storage system when it is not found in JindoCache.
  • jindocache_server_backend_read_time_total (Gauge): The total duration of reading data from the underlying storage system, in microseconds.
  • jindocache_server_backend_readop_num_total (Gauge): The total number of times that data is read from the underlying storage system, which equals the number of blocks in JindoCache.
  • jindocache_server_backend_read_bytes_time_total_window (Gauge): The duration of reading data from the underlying storage system within one minute, in microseconds.
  • jindocache_server_backend_read_bytes_total_window (Gauge): The total amount of data read from the underlying storage system within one minute, in bytes.
  • jindocache_server_remote_read_bytes_total (Gauge): The total amount of data read through remote cache hits in the same cluster, in bytes. A remote cache hit occurs when the requested data is found in JindoCache but the cached data and the application reside on different nodes.
  • jindocache_server_remote_read_time_total (Gauge): The total duration of remote cache hits in the same cluster, in microseconds.
  • jindocache_server_remote_readop_num_total (Gauge): The number of remote cache hits in the same cluster.
  • jindocache_server_remote_read_bytes_time_total_window (Gauge): The total duration of remote cache hits within one minute, in microseconds.
  • jindocache_server_remote_read_bytes_total_window (Gauge): The total amount of data read through remote cache hits within one minute, in bytes.
  • jindocache_server_local_read_bytes_total (Gauge): The total amount of data read through local cache hits in the same cluster, in bytes. A local cache hit occurs when the requested data is found in JindoCache and the cached data and the application reside on the same node.
  • jindocache_server_local_read_time_total (Gauge): The total duration of local cache hits in the same cluster, in microseconds.
  • jindocache_server_local_readop_num_total (Gauge): The total number of local cache hits in the same cluster.
  • jindocache_server_local_read_bytes_time_total_window (Gauge): The total duration of local cache hits within one minute, in microseconds.
  • jindocache_server_local_read_bytes_total_window (Gauge): The total amount of data read through local cache hits within one minute, in bytes.
  • jindocache_server_ns_filelet_op_count_total (Gauge): The total number of file metadata operations, such as getAttr and listStatus, performed on the JindoCache master component.
  • jindocache_server_ns_filelet_op_time_total (Gauge): The total duration of file metadata operations, such as getAttr and listStatus, performed on the JindoCache master component.
  • jindocache_server_ns_get_attr_op_total (Gauge): The number of getAttr operations performed on the JindoCache master component.
  • jindocache_server_ns_get_attr_time_total (Gauge): The duration of getAttr operations performed on the JindoCache master component.
  • jindocache_server_ns_get_attr_fallback_op_total (Gauge): The number of times that the JindoCache master component reads file metadata from the underlying storage system.
  • jindocache_server_ns_list_status_op_total (Gauge): The number of listStatus operations performed on the JindoCache master component.
  • jindocache_server_ns_list_status_time_total (Gauge): The duration of listStatus operations performed on the JindoCache master component.
  • jindocache_server_ns_list_status_fallback_op_total (Gauge): The number of times that the JindoCache master component reads the file list from the underlying storage system.
  • jindocache_server_dist_get_attr_op_num_total (Gauge): The number of getAttr operations performed on the JindoCache client.
  • jindocache_server_dist_get_attr_time_total (Gauge): The duration of getAttr operations performed on the JindoCache client.
  • jindocache_server_dist_list_dir_op_num_total (Gauge): The number of listStatus operations performed on the JindoCache client.
  • jindocache_server_dist_list_dir_time_total (Gauge): The duration of listStatus operations performed on the JindoCache client.
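
Because local, remote, and backend reads are reported as separate byte totals, the byte-level cache hit ratio shown in the dashboard can be approximated in PromQL. The following is a sketch; these totals are exported as gauges, label filters for a specific Dataset or namespace are omitted, and the exact expression depends on your installation.

    # Sketch: approximate the byte-level cache hit ratio of a JindoCache system.
    # Local and remote reads are cache hits; backend reads are misses that are
    # served by the underlying storage system. Label filters are omitted.
    CACHE_HIT_RATIO = (
        "(sum(rate(jindocache_server_local_read_bytes_total[5m]))"
        " + sum(rate(jindocache_server_remote_read_bytes_total[5m])))"
        " / "
        "(sum(rate(jindocache_server_local_read_bytes_total[5m]))"
        " + sum(rate(jindocache_server_remote_read_bytes_total[5m]))"
        " + sum(rate(jindocache_server_backend_read_bytes_total[5m])))"
    )

    if __name__ == "__main__":
        print(CACHE_HIT_RATIO)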

The following list describes the metrics of the JindoCache FUSE client that are displayed in the Fluid JindoRuntime dashboard.

  • jindo_fuse_open_count (Gauge): The number of open operations performed by the Jindo FUSE client.
  • jindo_fuse_open_latency (Gauge): The P50 latency of open operations performed by the Jindo FUSE client.
  • jindo_fuse_open_latency_80 (Gauge): The P80 latency of open operations performed by the Jindo FUSE client.
  • jindo_fuse_open_latency_90 (Gauge): The P90 latency of open operations performed by the Jindo FUSE client.
  • jindo_fuse_open_latency_99 (Gauge): The P99 latency of open operations performed by the Jindo FUSE client.
  • jindo_fuse_open_latency_999 (Gauge): The P99.9 latency of open operations performed by the Jindo FUSE client.
  • jindo_fuse_open_latency_9999 (Gauge): The P99.99 latency of open operations performed by the Jindo FUSE client.
  • jindo_fuse_getattr_count (Gauge): The number of getAttr operations performed by the Jindo FUSE client.
  • jindo_fuse_getattr_latency (Gauge): The P50 latency of getAttr operations performed by the Jindo FUSE client.
  • jindo_fuse_getattr_latency_80 (Gauge): The P80 latency of getAttr operations performed by the Jindo FUSE client.
  • jindo_fuse_getattr_latency_90 (Gauge): The P90 latency of getAttr operations performed by the Jindo FUSE client.
  • jindo_fuse_getattr_latency_99 (Gauge): The P99 latency of getAttr operations performed by the Jindo FUSE client.
  • jindo_fuse_getattr_latency_999 (Gauge): The P99.9 latency of getAttr operations performed by the Jindo FUSE client.
  • jindo_fuse_getattr_latency_9999 (Gauge): The P99.99 latency of getAttr operations performed by the Jindo FUSE client.
  • jindo_fuse_readdir_count (Gauge): The number of readDir operations performed by the Jindo FUSE client.
  • jindo_fuse_readdir_latency (Gauge): The P50 latency of readDir operations performed by the Jindo FUSE client.
  • jindo_fuse_readdir_latency_80 (Gauge): The P80 latency of readDir operations performed by the Jindo FUSE client.
  • jindo_fuse_readdir_latency_90 (Gauge): The P90 latency of readDir operations performed by the Jindo FUSE client.
  • jindo_fuse_readdir_latency_99 (Gauge): The P99 latency of readDir operations performed by the Jindo FUSE client.
  • jindo_fuse_readdir_latency_999 (Gauge): The P99.9 latency of readDir operations performed by the Jindo FUSE client.
  • jindo_fuse_readdir_latency_9999 (Gauge): The P99.99 latency of readDir operations performed by the Jindo FUSE client.
  • jindo_fuse_read_count (Gauge): The number of read operations performed by the Jindo FUSE client.
  • jindo_fuse_read_latency (Gauge): The P50 latency of read operations performed by the Jindo FUSE client.
  • jindo_fuse_read_latency_80 (Gauge): The P80 latency of read operations performed by the Jindo FUSE client.
  • jindo_fuse_read_latency_90 (Gauge): The P90 latency of read operations performed by the Jindo FUSE client.
  • jindo_fuse_read_latency_99 (Gauge): The P99 latency of read operations performed by the Jindo FUSE client.
  • jindo_fuse_read_latency_999 (Gauge): The P99.9 latency of read operations performed by the Jindo FUSE client.
  • jindo_fuse_read_latency_9999 (Gauge): The P99.99 latency of read operations performed by the Jindo FUSE client.
  • jindo_fuse_write_count (Gauge): The number of write operations performed by the Jindo FUSE client.
  • jindo_fuse_write_latency (Gauge): The P50 latency of write operations performed by the Jindo FUSE client.
  • jindo_fuse_write_latency_80 (Gauge): The P80 latency of write operations performed by the Jindo FUSE client.
  • jindo_fuse_write_latency_90 (Gauge): The P90 latency of write operations performed by the Jindo FUSE client.
  • jindo_fuse_write_latency_99 (Gauge): The P99 latency of write operations performed by the Jindo FUSE client.
  • jindo_fuse_write_latency_999 (Gauge): The P99.9 latency of write operations performed by the Jindo FUSE client.
  • jindo_fuse_write_latency_9999 (Gauge): The P99.99 latency of write operations performed by the Jindo FUSE client.
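
The per-operation FUSE gauges can be aggregated across FUSE pods in the same way. The following sketch shows example expressions for read throughput and for the highest P99 read latency across all FUSE pods; because the count metrics are exported as gauges, delta() over a one-minute window is used here to estimate throughput, and any pod or Dataset label filters are omitted and depend on your installation.

    # Sketch: example PromQL expressions over the Jindo FUSE client metrics.
    # The *_count metrics are cumulative but exported as gauges, so delta()
    # over a one-minute window is used to estimate operation throughput.
    # Label filters (pod, Dataset) are omitted and depend on your installation.

    # Read operations per minute across all FUSE pods.
    FUSE_READ_OPS_PER_MIN = "sum(delta(jindo_fuse_read_count[1m]))"

    # Highest P99 read latency currently observed across FUSE pods.
    FUSE_READ_P99_LATENCY_MAX = "max(jindo_fuse_read_latency_99)"

    if __name__ == "__main__":
        print(FUSE_READ_OPS_PER_MIN)
        print(FUSE_READ_P99_LATENCY_MAX)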
