Metrics and alert rule configurations for CPFS for Lingjun

You can view the capacity and performance information of a CPFS for Lingjun file system to understand its storage usage, read/write throughput, and read/write IOPS. By setting alert rules for important metrics, you can receive prompt notifications about exceptions and handle them quickly. This topic describes the metrics that CPFS for Lingjun supports and how to configure alert rules for them.

Background information

CloudMonitor is a service that monitors Alibaba Cloud resources and internet applications. You can use CloudMonitor to monitor metrics of various cloud resources and set alerts for specific metrics. This gives you a complete picture of your resource usage and application status on Alibaba Cloud and lets you handle faults promptly so that your services run smoothly. For more information, see "What is CloudMonitor?"

Retention policy of monitoring data

Monitoring data is retained for 90 days. After the retention period expires, the monitoring data is automatically cleared. The retention period starts when data is generated.
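The retention rule can be illustrated with a small sketch. It is not part of CloudMonitor; the function names `earliest_retained` and `is_still_retained` are hypothetical, and treating a data point that is exactly 90 days old as expired is an assumption.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # retention period stated in this topic


def earliest_retained(now: datetime) -> datetime:
    """Return the oldest generation time that is still inside the retention window."""
    return now - timedelta(days=RETENTION_DAYS)


def is_still_retained(generated_at: datetime, now: datetime) -> bool:
    """Return True if a data point generated at `generated_at` is still retained at `now`.

    Assumption: a point that is exactly RETENTION_DAYS old counts as expired.
    """
    return generated_at > earliest_retained(now)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("Oldest queryable data point:", earliest_retained(now).isoformat())
```

When you plan a historical query, a check like this helps you avoid requesting a time range that has already been cleared.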

Monitoring metrics

CPFS for Lingjun supports comprehensive monitoring of file system capacity, instance performance, and client performance through CloudMonitor. Two sets of monitoring metrics are available: a new version (recommended) and an old version. The new metrics address issues in the old version, such as inconsistent naming and unclear structure, and offer improved usability and maintainability.

New customers: You can use the new metrics directly.
Existing customers: You can continue to use the old metrics to ensure business continuity. However, we recommend that you gradually migrate to the new version.

Important

If you are an existing customer who wants to switch to the new metrics, you must first test them in a test environment.

New version metrics (recommended)

The new monitoring metrics are currently available in the following region: China (Beijing).

Capacity monitoring

Type	Metric	Metric name	Unit	Description
File system - Standard	BmStdCapacity	Total storage capacity of a standard CPFS for Lingjun file system	Bytes (B)	The total storage space of the file system.
	BmStdCapacityUsed	Data usage of a standard CPFS for Lingjun file system	Bytes (B)	The amount of storage space that is currently used by the file system.
	BmStdInodeLimit	Maximum number of files in a standard CPFS for Lingjun file system	Count	The maximum total number of files and directories that the file system can hold.
	BmStdInodeAlloc	Number of allocated files in a standard CPFS for Lingjun file system	Count	The total number of files and directories that are currently allocated (created) in the file system.
	BmStdInodeUsed	Number of used files in a standard CPFS for Lingjun file system	Count	The total number of files and directories that are currently used in the file system.
File system - Large (available only to specific users; if you do not use a large-specification file system, ignore the related metrics)	BmLargeCapacity	Total storage capacity of a large CPFS for Lingjun file system	Bytes (B)	The total storage space of the file system.
	BmLargeCapacityUsed	Data usage of a large CPFS for Lingjun file system	Bytes (B)	The amount of storage space that is currently used by the file system.
	BmLargeInodeLimit	Maximum number of files in a large CPFS for Lingjun file system	Count	The maximum total number of files and directories that the file system can hold.
	BmLargeInodeAlloc	Number of allocated files in a large CPFS for Lingjun file system	Count	The total number of files and directories that are currently allocated (created) in the file system.
	BmLargeInodeUsed	Number of used files in a large CPFS for Lingjun file system	Count	The total number of files and directories that are currently used in the file system.
Fileset - Standard	BmStdFsetCapacityLimit	Capacity quota of a standard CPFS for Lingjun fileset	Bytes (B)	The maximum capacity quota set for a single fileset.
	BmStdFsetCapacityUsed	Used capacity of a standard CPFS for Lingjun fileset	Bytes (B)	The capacity that is currently used by a single fileset.
	BmStdFsetInodeLimit	File count quota of a standard CPFS for Lingjun fileset	Count	The maximum quota for the number of files and directories set for a single fileset.
	BmStdFsetInodeAlloc	Number of pre-allocated files in a standard CPFS for Lingjun fileset	Count	The total number of files and directories that are currently pre-allocated for a single fileset.
	BmStdFsetInodeUsed	Number of used files in a standard CPFS for Lingjun fileset	Count	The number of files and directories that are currently used by a single fileset.
Fileset - Large (available only to specific users; if you do not use a large-specification file system, ignore the related metrics)	BmLargeFsetCapacityLimit	Capacity quota of a large CPFS for Lingjun fileset	Bytes (B)	The maximum capacity quota set for a single fileset.
	BmLargeFsetCapacityUsed	Used capacity of a large CPFS for Lingjun fileset	Bytes (B)	The capacity that is currently used by a single fileset.
	BmLargeFsetInodeLimit	File count quota of a large CPFS for Lingjun fileset	Count	The maximum quota for the number of files and directories set for a single fileset.
	BmLargeFsetInodeAlloc	Number of pre-allocated files in a large CPFS for Lingjun fileset	Count	The total number of files and directories that are currently pre-allocated for a single fileset.
	BmLargeFsetInodeUsed	Number of used files in a large CPFS for Lingjun fileset	Count	The number of files and directories that are currently used by a single fileset.
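Capacity metrics are usually most useful as ratios. The following sketch shows one way to derive capacity and file-count utilization from a set of metric values. The metric names come from the table above, but the `sample` dictionary, its values, and the helper names `usage_ratio` and `summarize` are hypothetical. How you obtain the values (console export or a CloudMonitor API call) is outside the scope of the sketch.

```python
from typing import Dict, Optional


def usage_ratio(used: float, limit: float) -> Optional[float]:
    """Return used/limit, or None when the limit is missing or not positive."""
    if limit <= 0:
        return None
    return used / limit


def summarize(sample: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Compute capacity and inode utilization for a standard file system."""
    return {
        "capacity": usage_ratio(sample["BmStdCapacityUsed"], sample["BmStdCapacity"]),
        "inode": usage_ratio(sample["BmStdInodeUsed"], sample["BmStdInodeLimit"]),
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    sample = {
        "BmStdCapacity": 100 * 1024**4,      # 100 TiB in bytes
        "BmStdCapacityUsed": 82 * 1024**4,   # 82 TiB in bytes
        "BmStdInodeLimit": 1_000_000_000,
        "BmStdInodeUsed": 250_000_000,
    }
    for name, ratio in summarize(sample).items():
        print(name, "n/a" if ratio is None else f"{ratio:.1%}")
```

The same calculation applies to the fileset metrics if you substitute the `BmStdFset*` or `BmLargeFset*` names.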

Performance monitoring

Type	Metric	Metric name	Unit	Description
File system - Standard	BmStdReadThroughput	Read throughput of a standard CPFS for Lingjun file system	Bytes/s	The average read throughput of the file system in bytes per second during a statistical period.
	BmStdWriteThroughput	Write throughput of a standard CPFS for Lingjun file system	Bytes/s	The average write throughput of the file system in bytes per second during a statistical period.
	BmStdReadIops	Read IOPS of a standard CPFS for Lingjun file system	Count/s (IOPS)	The average number of read operations per second for the file system during a statistical period.
	BmStdWriteIops	Write IOPS of a standard CPFS for Lingjun file system	Count/s (IOPS)	The average number of write operations per second for the file system during a statistical period.
	BmStdReadLatency	Read latency of a standard CPFS for Lingjun file system	ms	The average read latency of the file system during a statistical period.
	BmStdWriteLatency	Write latency of a standard CPFS for Lingjun file system	ms	The average write latency of the file system during a statistical period.
	BmStdMetaQps	Metadata QPS of a standard CPFS for Lingjun file system	Count/s (IOPS)	The average number of metadata requests per second for the file system during a statistical period.
	BmStdMetaLatency	Metadata latency of a standard CPFS for Lingjun file system	ms	The average latency of metadata operations for the file system during a statistical period.
File system - Large (available only to specific users; if you do not use a large-specification file system, ignore the related metrics)	BmLargeReadThroughput	Read throughput of a large CPFS for Lingjun file system	Bytes/s	The average read throughput of the file system in bytes per second during a statistical period.
	BmLargeWriteThroughput	Write throughput of a large CPFS for Lingjun file system	Bytes/s	The average write throughput of the file system in bytes per second during a statistical period.
	BmLargeReadIops	Read IOPS of a large CPFS for Lingjun file system	Count/s (IOPS)	The average number of read operations per second for the file system during a statistical period.
	BmLargeWriteIops	Write IOPS of a large CPFS for Lingjun file system	Count/s (IOPS)	The average number of write operations per second for the file system during a statistical period.
	BmLargeReadLatency	Read latency of a large CPFS for Lingjun file system	ms	The average read latency of the file system during a statistical period.
	BmLargeWriteLatency	Write latency of a large CPFS for Lingjun file system	ms	The average write latency of the file system during a statistical period.
	BmLargeMetaQps	Metadata QPS of a large CPFS for Lingjun file system	Count/s (IOPS)	The average number of metadata requests per second for the file system during a statistical period.
	BmLargeMetaLatency	Metadata latency of a large CPFS for Lingjun file system	Microsecond (μs)	The average latency of metadata operations for the file system during a statistical period.
Client	ClientReadThroughput	Client read throughput of CPFS for Lingjun	Bytes/s	The average read throughput in bytes per second for the client during a statistical period.
	ClientWriteThroughput	Client write throughput of CPFS for Lingjun	Bytes/s	The average write throughput in bytes per second for the client during a statistical period.
	ClientReadIops	Client read IOPS of CPFS for Lingjun	Count/s (IOPS)	The average number of read operations per second for the client during a statistical period.
	ClientWriteIops	Client write IOPS of CPFS for Lingjun	Count/s (IOPS)	The average number of write operations per second for the client during a statistical period.
	ClientReadLatency	Average client read latency of CPFS for Lingjun	Microsecond (μs)	The average read latency for the client during a statistical period.
	ClientWriteLatency	Average client write latency of CPFS for Lingjun	Microsecond (μs)	The average write latency for the client during a statistical period.
	ClientMetaLatency	Client metadata latency of CPFS for Lingjun	ms	The average latency for a client to complete a single metadata operation.
	ClientMetaQps	Client metadata QPS of CPFS for Lingjun	Count/s (IOPS)	The average number of metadata requests per second for the client during a statistical period.
Connections	VpcClientCount	Number of VPC clients of CPFS for Lingjun	Count	The total number of clients connected to the file system through a VPC.
Connections	RdmaClientCount	Number of RDMA clients of CPFS for Lingjun	Count	The total number of clients connected to the file system through RDMA.
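The performance metrics above use raw units: throughput is reported in bytes per second, and latency is reported in milliseconds for some metrics and microseconds for others. The following sketch shows one way to normalize values before you compare or chart them. The helper names and the `LATENCY_UNITS` mapping are hypothetical; the mapping only restates the units listed in the table for three example metrics.

```python
BYTES_PER_MIB = 1024 * 1024

# Units copied from the table above for a few example metrics.
LATENCY_UNITS = {
    "BmStdReadLatency": "ms",
    "ClientReadLatency": "us",
    "ClientMetaLatency": "ms",
}


def to_mib_per_s(bytes_per_s: float) -> float:
    """Convert a throughput value from bytes per second to MiB per second."""
    return bytes_per_s / BYTES_PER_MIB


def latency_in_ms(metric: str, value: float) -> float:
    """Convert a latency value to milliseconds based on the metric's listed unit."""
    unit = LATENCY_UNITS[metric]
    if unit == "us":
        return value / 1000.0
    return value


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    print(to_mib_per_s(3 * BYTES_PER_MIB))            # throughput in MiB/s
    print(latency_in_ms("ClientReadLatency", 1500))   # latency in ms
```

Normalizing to a single latency unit avoids misreading a microsecond value as milliseconds when file system and client metrics appear on the same chart.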

Note

The elastic file client is a client installed by the CPFS team on compute nodes. It connects the compute nodes to the CPFS for Lingjun file system.
You can view client performance only in the CloudMonitor console or by calling CloudMonitor API operations. For more information, see View CPFS performance monitoring.
When you use a CPFS for Lingjun file system on ECS or PAI Lingjun AI Computing Service (single-tenant) resources, the hostname reported for a client is the hostname of the node.
When you use a CPFS for Lingjun file system on PAI general computing resources or Lingjun resources, the hostname reported for a client is the pod ID of the task.
For more information about the new monitoring metrics, see CloudMonitor Metric Query.

Old version metrics

Capacity monitoring

Type	Metric	Metric name	Unit	Description
File system	CPFSCapacity	Total storage space	Bytes	The total storage space of the file system during a statistical period.
	CPFSCapacityUsed	Data volume	Bytes	The amount of data that is actually used by the file system during a statistical period.
	CPFSInode Limit	Maximum number of files	Count	The maximum number of files that the file system can hold during a statistical period.
	CPFSInode Alloc	Number of allocated files	Count	The number of files that are allocated by the file system during a statistical period.
	CPFSInode Used	Number of used files	Count	The number of files that are used by the file system during a statistical period.
Fileset	BMCPFSFsetCapacityLimit	Fileset allocated capacity	Bytes	The maximum storage space that a fileset can use. After the quota is reached, no more data can be written.
	BMCPFSFsetCapacityUsed	Fileset used capacity	Bytes	The storage space that is actually used by the fileset.
	BMCPFSFsetInodeLimit	Number of files allocated by fileset	Count	The maximum number of files and directories that a fileset can contain. After the quota is reached, no more files or directories can be created.
	BMCPFSFsetInodeUsed	Number of files used by fileset	Count	The number of files that are actually used by the fileset.

Performance monitoring

Type	Metric	Metric name	Unit	Description
File system	ThruputRead	Read throughput	Bytes/s	The average read throughput of the file system in bytes per second during a statistical period.
	ThruputWrite	Write throughput	Bytes/s	The average write throughput of the file system in bytes per second during a statistical period.
	IopsRead	Read IOPS	Count/s	The average number of read operations per second for the file system during a statistical period.
	IopsWrite	Write IOPS	Count/s	The average number of write operations per second for the file system during a statistical period.
Dataflow	ThroughputImport	Import throughput	Bytes/s	The average throughput in bytes per second for a dataflow import task during a statistical period.
	ThroughputExport	Export throughput	Bytes/s	The average throughput in bytes per second for a dataflow export task during a statistical period.
	QPSImportMeta	Import metadata QPS	Count/s	The average number of metadata requests per second for a dataflow import task during a statistical period.
	QPSExportMeta	Export metadata QPS	Count/s	The average number of metadata requests per second for a dataflow export task during a statistical period.
	IOPSImport	Import IOPS	Count/s	The average number of I/O operations per second for a dataflow import task during a statistical period.
	IOPSEXport	Export IOPS	Count/s	The average number of I/O operations per second for a dataflow export task during a statistical period.
	LatencyImport	Import latency	us	The average latency of a dataflow import task during a statistical period.
	LatencyExport	Export latency	us	The average latency of a dataflow export task during a statistical period.
Client	ClientReadIops	Client read IOPS	Count/s	The average number of read IOPS per second for the client during a statistical period.
	ClientWriteIops	Client write IOPS	Count/s	The average number of write IOPS per second for the client during a statistical period.
	ClientReadLatency	Client average read latency	us	The average read latency for the client during a statistical period.
	ClientWriteLatency	Client average write latency	us	The average write latency for the client during a statistical period.
	ClientReadThroughput	Client read throughput	Bytes/s	The average read throughput in bytes per second for the client during a statistical period.
	ClientWriteThroughput	Client write throughput	Bytes/s	The average write throughput in bytes per second for the client during a statistical period.

Note

The elastic file client is a client installed by the CPFS team on compute nodes. It connects the compute nodes to the CPFS for Lingjun file system.
You can view client performance only in the CloudMonitor console or by calling CloudMonitor API operations. For more information, see View CPFS performance monitoring.
When you use a CPFS for Lingjun file system on ECS or PAI Lingjun AI Computing Service (single-tenant) resources, the hostname reported for a client is the hostname of the node.
When you use a CPFS for Lingjun file system on PAI general computing resources or Lingjun resources, the hostname reported for a client is the pod ID of the task.
For more information about the old monitoring metrics, see CloudMonitor Metric Query.

Alert rule description

In the CloudMonitor console, you can set alert rules for different metrics. If a metric for a resource meets the specified alert condition, CloudMonitor automatically sends an alert notification. The following table describes the alert levels, notification mechanisms, and alert conditions.

Alert level	Notification mechanism	Alert condition
Critical	Phone call, text message, email, and DingTalk Robot	The average value of the metric meets the specified condition for N consecutive statistical periods. Set the value of N based on the alert level.
Warning	Text message, email, and DingTalk Robot	Same as for Critical.
Info	Email and DingTalk Robot	Same as for Critical.

Note

The alert condition varies based on the selected metric type. The condition displayed in the console prevails.
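The "N consecutive statistical periods" condition can be illustrated with a short sketch. This is not CloudMonitor's implementation, which this topic does not describe. The function name `should_alert`, the threshold, and the sample values are hypothetical.

```python
from typing import Callable, Sequence


def should_alert(
    period_averages: Sequence[float],
    condition: Callable[[float], bool],
    n: int,
) -> bool:
    """Return True if the most recent `n` period averages all meet `condition`.

    `period_averages` is ordered from oldest to newest, one value per
    statistical period.
    """
    if n <= 0 or len(period_averages) < n:
        return False
    return all(condition(value) for value in period_averages[-n:])


if __name__ == "__main__":
    # Hypothetical capacity utilization averages, one per statistical period.
    utilization = [0.70, 0.85, 0.90, 0.95]
    print(should_alert(utilization, lambda v: v > 0.80, n=3))
```

A larger N makes a rule less sensitive to short spikes, which is why lower-urgency levels are often paired with longer runs. That pairing is a common practice, not a requirement stated in this topic.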
