Hologres exposes monitoring metrics through the console and Cloud Monitor so you can track resource usage, query execution, and system health in real time.
Metrics overview
| Category | Metric | Description | Supported instance types | Notes |
|---|---|---|---|---|
| CPU | Instance CPU Usage (%) | CPU usage of the instance. | General-purpose, follower, compute group | -- |
| CPU | Worker Node CPU Usage (%) | CPU usage of each Worker node. | General-purpose, follower, compute group | V1.1+ |
| CPU | Cluster CPU Usage (%) | CPU usage of each Cluster in the compute group. | Compute group | V4.0+ |
| Memory | Instance Memory Usage (%) | Total memory usage of the instance. | General-purpose, follower, compute group | -- |
| Memory | Worker Node Memory Usage (%) | Memory usage of each Worker node. | General-purpose, follower, compute group | V1.1+ |
| Memory | Detailed Compute Group Memory Usage (%) | Memory usage broken down by System, Meta, Cache, Query, and Background. | General-purpose, follower, compute group | V2.0+ |
| Memory | QE Query Memory Usage (bytes) | Memory used by QE engine queries. | General-purpose, follower, compute group | V2.0.44+ / V2.1.22+ |
| Memory | QE Query Memory Usage (%) | Percentage of memory used by QE engine queries. | General-purpose, follower, compute group | V2.0.44+ / V2.1.22+ |
| Memory | Cluster Memory Usage (%) | Memory usage of each Cluster in the compute group. | Compute group | V4.0+ |
| Query QPS and RPS | Query QPS (count/s) | Total queries per second. Query QPS >= QE QPS + FixedQE QPS. | General-purpose, follower, compute group, shared cluster | -- |
| Query QPS and RPS | QE Query QPS (count/s) | Queries per second executed by the QE engine. | General-purpose, follower, compute group | V2.2+ |
| Query QPS and RPS | FixedQE Query QPS (count/s) | Queries per second executed by the FixedQE engine. | General-purpose, follower, compute group | V2.2+ |
| Query QPS and RPS | DML RPS (count/s) | Total rows per second for DML operations. DML RPS = QE RPS + FixedQE RPS. | General-purpose, compute group | -- |
| Query QPS and RPS | QE DML RPS (count/s) | DML rows per second by the QE engine. | General-purpose, compute group | V2.2+ |
| Query QPS and RPS | FixedQE DML RPS (count/s) | DML rows per second by the FixedQE engine. | General-purpose, compute group | V2.2+ |
| Query Latency | Query Latency (milliseconds) | Average latency of all queries. Query Latency >= MAX(QE Latency, FixedQE Latency). | General-purpose, follower, compute group, shared cluster | -- |
| Query Latency | QE Query Latency (milliseconds) | Average latency of QE engine queries. | General-purpose, follower, compute group | V2.2+ |
| Query Latency | FixedQE Query Latency (milliseconds) | Average latency of FixedQE engine queries. | General-purpose, follower, compute group | V2.2+ |
| Query Latency | Optimization Phase Duration (milliseconds) | Time spent in the query optimization phase. | General-purpose, follower, compute group, shared cluster | V2.0.44+ / V2.1.22+ |
| Query Latency | Start Query Phase Duration (milliseconds) | Time spent in query initialization (locking, schema alignment). | General-purpose, follower, compute group, shared cluster | V2.0.44+ / V2.1.22+ |
| Query Latency | Get Next Phase Duration (milliseconds) | Time from initialization to result delivery. | General-purpose, follower, compute group, shared cluster | V2.0.44+ / V2.1.22+ |
| Query Latency | Query P99 Latency (milliseconds) | 99th percentile latency of all queries. | General-purpose, follower, compute group, shared cluster | -- |
| Query Latency | Longest Running Query Duration in This Instance (milliseconds) | Duration of the longest-running active query. | General-purpose, follower, compute group, shared cluster | V1.1+ |
| Failed Query QPS | Failed Query QPS (count/s) | Total failed queries per second. Failed QPS >= QE Failed QPS + FixedQE Failed QPS. | General-purpose, follower, compute group, shared cluster | -- |
| Failed Query QPS | QE Failed Query QPS (count/s) | Failed queries per second by the QE engine. | General-purpose, follower, compute group | V2.2+ |
| Failed Query QPS | FixedQE Failed Query QPS (count/s) | Failed queries per second by the FixedQE engine. | General-purpose, compute group | V2.2+ |
| Locks | Maximum FE Lock Wait Time (milliseconds) | DDL lock wait time on FE nodes. | General-purpose, follower, compute group | V2.0.44+ / V2.1.22+ |
| Locks | FixedQE Backend Lock Wait Time (milliseconds) | Lock wait time for FixedQE (typically HQE locks). | General-purpose, follower, compute group | V2.0.44+ / V2.1.22+ |
| Locks | Total Backend Lock Wait Time for Instance (milliseconds) | Total HQE lock wait time, including FixedQE lock waits. | General-purpose, follower, compute group | V2.0.44+ / V2.1.22+ |
| Connection | Total Connections (count) | Total active connections in the instance. | General-purpose, follower, compute group, shared cluster | -- |
| Connection | Connections by Database (count) | Connections aggregated by database. | General-purpose, follower, compute group | -- |
| Connection | Connections by FE (count) | Connections aggregated by FE node. | General-purpose, follower, compute group | -- |
| Connection | Connection Usage Rate of FE with Highest Usage (%) | Peak connection usage rate across all FE nodes. | General-purpose, follower, compute group | -- |
| Query Queue | Queued Queries Count (count) | Queries waiting to be executed. | General-purpose, follower, compute group | V3.0+ |
| Query Queue | Query Queue Entry QPS (count/s) | Queries submitted to the queue per second. | General-purpose, follower, compute group | V3.0+ |
| Query Queue | Queries Transitioned from Queued to Running QPS (count/s) | Queries moving from waiting to running per second. | General-purpose, follower, compute group | V3.0+ |
| Query Queue | QPS by State for Queries That Started Running (count/s) | Per-second count of queries grouped by execution state. | General-purpose, follower, compute group | V3.0+ |
| Query Queue | Average Query Queue Wait Time (milliseconds) | Average time from queue entry to processing start. | General-purpose, follower, compute group | V3.0+ |
| Query Queue | Query Queue Auto-Rate-Limit Max Concurrency (count) | Maximum concurrency for auto-rate-limited queues. | Compute group | V3.1+ |
| I/O | Standard I/O Read Throughput (bytes/s) | Read throughput for Standard storage. | General-purpose, follower, compute group | -- |
| I/O | Standard I/O Write Throughput (bytes/s) | Write throughput for Standard storage. | General-purpose, compute group | -- |
| I/O | Low-Frequency I/O Read Throughput (bytes/s) | Read throughput for IA storage. | General-purpose, follower, compute group | -- |
| I/O | Low-Frequency I/O Write Throughput (bytes/s) | Write throughput for IA storage. | General-purpose, compute group | -- |
| Storage | Standard Storage Used Capacity (bytes) | Capacity used in Standard storage. | General-purpose, compute group | -- |
| Storage | Standard Storage Usage (%) | Usage percentage of Standard storage. | General-purpose, compute group | -- |
| Storage | IA Storage Used Capacity (bytes) | Capacity used in IA storage. | General-purpose, compute group | -- |
| Storage | IA Storage Usage (%) | Usage percentage of IA storage. | General-purpose, compute group | -- |
| Storage | Recycle Bin Storage Usage (bytes) | Storage consumed by the recycle bin. | General-purpose, compute group | V3.1+ |
| Framework | FE Replay Delay (milliseconds) | Replay delay for each FE node. | General-purpose, follower, compute group | V2.2+ |
| Framework | Shard Multi-Replica Sync Delay (milliseconds) | Sync delay between Shard replicas. | General-purpose, follower, compute group | -- |
| Framework | Primary-Follower Sync Delay (milliseconds) | Data sync delay from primary to follower instance. | General-purpose, follower, compute group | -- |
| Framework | Cross-Instance File Sync Delay (milliseconds) | File sync delay between disaster recovery instances. | General-purpose | -- |
| Auto Analyze | Tables Missing Statistics per Database (count) | Tables lacking statistics in each database. | General-purpose, compute group | V2.2+ |
| Serverless Computing | Longest Running Serverless Computing Query Duration (milliseconds) | Longest-running Serverless Computing query. | General-purpose, compute group | V2.1+ |
| Serverless Computing | Serverless Computing Query Queue Count | Queries queued in the Serverless Computing pool. | General-purpose, compute group | V2.2+ |
| Serverless Computing | Serverless Computing Resource Quota Usage (%) | Ratio of used to maximum allocatable Serverless Computing resources. | General-purpose, compute group | V2.2+ |
| Binary Logging | Binlog Consumption Rate (count/s) | Binlog entries consumed per second. | General-purpose, follower, compute group | V2.2+ |
| Binary Logging | Binlog Consumption Rate (bytes/s) | Bytes consumed from Binlog per second. | General-purpose, follower, compute group | V2.2+ |
| Binary Logging | WAL Sender Count per FE (count) | WAL senders used per FE node. | General-purpose, follower, compute group | V2.2+ |
| Binary Logging | WAL Sender Usage Rate of FE with Highest Usage (%) | Peak WAL sender usage across FE nodes. | General-purpose, follower, compute group | V2.2+ |
| Computing Resource | Elastic Core Count for Compute Groups | Cores added by time-based scaling. | Compute group | V2.2.21+ |
| Computing Resource | Compute Group Auto-Elastic Core Count (count) | Cores added by auto-scaling. | Compute group | V4.0+ |
| Gateway | Gateway CPU Usage (%) | CPU usage of each Gateway. | Compute group | V2.0+ |
| Gateway | Gateway Memory Usage (%) | Memory usage of each Gateway. | Compute group | V2.0+ |
| Gateway | Gateway New Connection Requests per Second (count/s) | New connections established per second. | Compute group | V2.1.12+ |
| Gateway | Gateway Inbound Traffic Rate (B/s) | Data entering through the Gateway per second. | Compute group | V2.1+ |
| Gateway | Gateway Outbound Traffic Rate (B/s) | Data sent from the Gateway per second. | Compute group | V2.1+ |
| Dynamic Table | Instance-Level Dynamic Table Refresh Failure QPS (count/s) | Refresh failure rate across all Dynamic Tables. | General-purpose, compute group | V4.0.8+ |
| Dynamic Table | Dynamic Table Data Latency (seconds) | Latency relative to the latest upstream data. | General-purpose, compute group | V4.0.8+ |
| Dynamic Table | Dynamic Table Current Refresh Duration (milliseconds) | Duration of the ongoing refresh task. | General-purpose, compute group | V4.0.8+ |
| Dynamic Table | Dynamic Table Refresh Failure QPM (count/minute) | Refresh failures per minute per Dynamic Table. | General-purpose, compute group | V4.0.8+ |
Cloud Monitor metric IDs
Each metric has a unique ID in Cloud Monitor. The ID prefix varies by instance type:
| Instance type | Prefix | Metric reference |
|---|---|---|
| General-purpose instance | standard_ | General-purpose instance metrics |
| Follower instance | follower_ | Follower instance metrics |
| Compute group instance | warehouse_ | Compute group instance metrics |
| Lakehouse Acceleration (Shared Cluster) | shared_ | Shared cluster metrics |
Engine categories and command types
Engine categories in monitoring metrics:
QE is a collective term for the Hologres proprietary vector compute engines (HQE and SQE) in the XQE engine family. In slow query logs, queries with Engine Type={XQE} map to the QE category.
FixedQE refers to queries that use the Fixed Plan path. In slow query logs, queries with Engine Type={FixedQE} (or SDK in versions earlier than V2.2) map to the FixedQE category.
Command Type classification:
Command Type matches the SQL statement type. For example, both INSERT xxx and INSERT xxx ON CONFLICT DO UPDATE/NOTHING are classified as INSERT.
UNKNOWN: SQL statements that the DPI engine cannot recognize due to syntax errors.
UTILITY: Administrative, definition, and control commands other than INSERT, UPDATE, DELETE, and SELECT, including:
DDL: CREATE, ALTER, DROP, TRUNCATE, COMMENT
TCL: BEGIN, COMMIT, ROLLBACK, SAVEPOINT
Administration and maintenance: ANALYZE, VACUUM, EXPLAIN, SET, SHOW, COPY, REFRESH
Execution and procedural control: PREPARE, EXECUTE, DEALLOCATE, CALL, DECLARE CURSOR
Others: LOCK TABLE, LISTEN, NOTIFY
Access control
The Hologres console monitoring page retrieves data from Cloud Monitor. Resource Access Management (RAM) users need one of the following permissions to view monitoring information:
| Permission policy | Access level |
|---|---|
| AliyunCloudMonitorFullAccess | Full management permissions for Cloud Monitor |
| AliyunCloudMonitorReadOnlyAccess | Read-only access to Cloud Monitor |
For details on granting permissions, see Grant permissions to RAM users.
General notes
If a metric shows no data, the instance version may not support it, or there has been no activity for an extended period.
Monitoring data is retained for up to 30 days.
Metrics are reported every minute.
CPU
Instance CPU Usage (%)
The overall CPU load of the instance.
Background processes and asynchronous compaction tasks consume CPU even without active queries, so some usage during idle periods is normal. Hologres uses multi-core parallel computing, which means a single query can push CPU usage to 100% -- this indicates full utilization of compute resources, not necessarily an issue.
When to investigate: If CPU usage remains near 100% for 3 hours or above 90% for 12 hours, the instance is heavily loaded and CPU is likely the bottleneck. Consider whether:
Large offline data imports (INSERT) are running with growing data volumes.
High-QPS queries or writes are consuming all CPU resources.
Hybrid workloads combine the above scenarios.
If sustained high CPU is expected for your business, scale up the instance to handle larger workloads.
For more information, see FAQ for monitoring metrics.
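The thresholds above can be expressed as a simple alert rule. A minimal sketch, assuming one CPU sample per minute (matching the one-minute reporting interval); the function names and thresholds-as-parameters are illustrative, not a Hologres API:

```python
def cpu_sustained_high(samples, threshold, hours):
    """True if every per-minute CPU sample in the trailing window
    of `hours` hours is at or above `threshold` percent."""
    window = hours * 60  # metrics are reported once per minute
    if len(samples) < window:
        return False
    return all(s >= threshold for s in samples[-window:])

def should_investigate(samples):
    # Investigate if CPU stayed near 100% for 3 hours
    # or above 90% for 12 hours.
    return cpu_sustained_high(samples, 99, 3) or cpu_sustained_high(samples, 90, 12)
```

A rule like this can feed a Cloud Monitor alert or an in-house watchdog.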
Worker Node CPU Usage (%)
The CPU load on each Worker node. The number of Worker nodes varies by instance type. For more information, see Instance management.
Version: V1.1+
If all Worker nodes show sustained CPU usage near 100%, the instance is heavily loaded. Optimize queries or scale up the instance.
If only some Worker nodes show high CPU usage, a resource skew exists. For common causes and troubleshooting, see FAQ for monitoring metrics.
Cluster CPU Usage (%)
The CPU usage of each Cluster in the compute group.
Version: V4.0+. Compute group instances only.
Memory
Instance Memory Usage (%)
The overall memory consumption of the instance.
Hologres reserves memory for metadata, indexes, and data caches to accelerate queries. Idle memory usage of 30% to 40% is typical. If memory usage steadily climbs toward 80%, memory may become a bottleneck.
Use memory distribution metrics together with QPS and other indicators to identify high-memory consumers. For more information, see Troubleshooting guide for out-of-memory issues.
Worker Node Memory Usage (%)
The memory load on each Worker node. The number of Worker nodes varies by instance type. For more information, see Instance management.
Version: V1.1+
If all Worker nodes show sustained memory usage near 80%, the instance is heavily loaded. Optimize queries or scale up the instance.
If only some Worker nodes show high memory usage, a resource skew exists. For common causes and troubleshooting, see FAQ for monitoring metrics.
Detailed Compute Group Memory Usage (%)
Version: V2.0+ (memory distribution metrics available from V2.0.15)
Hologres divides memory into six categories:
| Category | What it tracks | Typical behavior |
|---|---|---|
| System | Holohub, Gateway, and FE (FE Master + FE Query) | Fluctuates with query activity |
| Cache | SQL caches (result cache, block cache) and Meta cache (schema/file metadata) | Fixed size, typically ~30% of total instance memory. Some usage persists when idle (mainly Meta cache). Higher cache hit rates improve query performance -- smaller Physical read bytes values in EXPLAIN ANALYZE indicate better hit rates. |
| Meta | Metadata and files. Uses lazy open mode -- frequently accessed metadata stays in memory, infrequently accessed metadata does not. | Keep under 30% of total memory. High Meta usage suggests many files or partitioned tables. Use Table statistics overview and analysis to investigate. |
| Query | Memory consumed during SQL execution, including Fixed Plan, HQE, and SQE. | Elastic allocation: minimum 20 GB per Worker, maximum depends on available free memory. High usage in other categories reduces Query memory. |
| Background | Compaction and flush tasks. | Typically under 5%. Temporarily increases during index changes, bulk writes, or updates. |
| Memtable | In-memory tables for real-time writes, updates, and deletes. | Typically under 5%. |
Troubleshooting: High Query memory usage or out-of-memory (OOM) events typically indicate complex queries or high concurrency. For optimization guidance, see Optimize query performance.
QE Query Memory Usage (bytes)
The memory (in bytes) used by queries executed by HQE, SQE, or other XQE engines.
Version: V2.0.44+ / V2.1.22+
In memory breakdowns, Query memory usage exceeds QE Query memory usage because Query includes all engine types. Higher QE Query memory usage indicates more complex queries that require more memory.
QE Query Memory Usage (%)
The percentage of memory used by QE engine queries.
Version: V2.0.44+ / V2.1.22+
High usage may lead to OOM errors. Optimize queries or scale up the instance.
Cluster Memory Usage (%)
The memory usage of each Cluster in the compute group.
Version: V4.0+. Compute group instances only.
Query QPS and RPS
Query QPS (count/s)
The average number of SQL statements executed per second across the instance, including SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements.
Relationship: Query QPS >= QE Query QPS + FixedQE Query QPS
The total QPS includes all queries (such as UNKNOWN, UTILITY, and Engine Type={PG}), so it is greater than or equal to the sum of QE and FixedQE QPS.
QE Query QPS (count/s)
Queries per second executed by the QE engine, including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
FixedQE Query QPS (count/s)
Queries per second executed by the FixedQE engine (Fixed Plan path, formerly SDK), including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
DML RPS (count/s)
The average number of data records imported or updated per second, including INSERT, UPDATE, and DELETE statements.
Relationship: DML RPS = QE DML RPS + FixedQE DML RPS
QE DML RPS (count/s)
Data records imported or updated per second by the QE engine, including INSERT, UPDATE, and DELETE statements.
Version: V2.2+
Common QE scenarios:
Batch import or update from MaxCompute or OSS external tables
Batch write or update using COPY
Batch import between Hologres tables
FixedQE DML RPS (count/s)
Data records imported or updated per second by the FixedQE engine (formerly SDK), including INSERT, UPDATE, and DELETE statements.
Version: V2.2+
Common FixedQE scenarios:
Offline writes using Data Integration (DataX)
Writes using SQL or JDBC with INSERT INTO VALUES()
Query Latency
Query Latency (milliseconds)
The average latency of all queries in the instance, including SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements.
Relationship: Query Latency >= MAX(QE Query Latency, FixedQE Query Latency)
QE Query Latency (milliseconds)
The average latency of queries executed by the QE engine, including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
To troubleshoot elevated QE Query latency, check the Optimization Phase Duration, Start Query Phase Duration, Get Next Phase Duration, and QE QPS metrics.
FixedQE Query Latency (milliseconds)
The average latency of queries executed by the FixedQE engine, including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
Troubleshooting:
Occasional spikes: May indicate HQE locks. Check whether the FixedQE Backend Lock Wait Time has increased. If so, use Query Insight to identify the locking queries.
Persistent high latency: May result from a suboptimal table design or interference from complex queries. See Common issues and diagnostics for Blink and Flink.
Optimization Phase Duration (milliseconds)
The time spent in the Optimization phase, where the optimizer parses the SQL statement and generates a physical plan.
Version: V2.0.44+ / V2.1.22+
Long Optimization durations suggest complex queries. If queries differ only in their parameters, use Prepared Statements to reduce optimization overhead. For more information, see JDBC.
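The Prepared Statement pattern plans a parameterized query once and reuses the plan for each execution, skipping repeated optimization. A minimal sketch of the PostgreSQL-style PREPARE/EXECUTE pattern as plain SQL strings (the statement name and table are illustrative; over JDBC the same effect comes from PreparedStatement):

```python
# Plan once: the optimizer parses and plans the parameterized statement.
PREPARE_SQL = "PREPARE get_order (int) AS SELECT * FROM orders WHERE id = $1;"

def execute_sql(order_id):
    # Each execution reuses the prepared plan,
    # skipping the Optimization phase.
    return "EXECUTE get_order({});".format(order_id)
```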
Start Query Phase Duration (milliseconds)
The time spent in the Start Query phase -- the initialization before actual query execution, including locking and schema version alignment.
Version: V2.0.44+ / V2.1.22+
Long Start Query durations often result from lock waits or high CPU usage. Use execution plans for deeper analysis.
Get Next Phase Duration (milliseconds)
The time from the end of the Start Query phase until all results are returned, including computation and result delivery.
Version: V2.0.44+ / V2.1.22+
Long Get Next durations often reflect complex computations. Correlate with QE memory usage and QE QPS. If no anomalies exist in those metrics, the client may simply be slow to consume the results.
Query P99 Latency (milliseconds)
The 99th percentile latency of all queries in the instance, including SELECT, INSERT, UPDATE, UTILITY, and system queries.
Longest Running Query Duration in This Instance (milliseconds)
The duration of the longest-running query currently executing in the instance, covering SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements.
Version: V1.1+
Hologres is a distributed system with multiple Worker nodes. Queries are distributed across Workers, and this metric reports the longest-running query across all Workers. For example, if three Workers are running queries that have been executing for 10 minutes, 5 minutes, and 30 seconds, the reported value is 10 minutes.
Combine this metric with active queries or slow query logs to diagnose long-running queries and resolve deadlocks.
Metrics are reported every minute, so the "current running duration" starts slightly after the query begins. This metric is useful for anomaly detection but does not provide precise timing.
Failed Query QPS
Failed Query QPS (count/s)
The average number of failed SQL statements per second in the instance, including SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements.
Relationship: Failed Query QPS >= QE Failed Query QPS + FixedQE Failed Query QPS
The total failed QPS includes all failed queries (such as UNKNOWN, UTILITY, and Engine Type={PG}), so it is greater than or equal to the sum of QE and FixedQE failed QPS.
Use the failed query type and frequency to find failing queries in the slow query logs, then analyze root causes to improve availability.
QE Failed Query QPS (count/s)
Failed queries per second executed by the QE engine, including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
FixedQE Failed Query QPS (count/s)
Failed queries per second executed by the FixedQE engine, including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
Locks
Maximum FE Lock Wait Time (milliseconds)
Hologres has multiple FE nodes that parse, dispatch, and route SQL statements. When multiple connections on the same FE perform DDL operations on the same table (such as CREATE or DROP), FE locks occur. This metric shows the DDL lock wait time per FE.
Version: V2.2+
If the FE lock wait time exceeds 5 minutes and the FE Replay Delay also spikes, a DDL operation may be stuck. Use Manage queries to find and terminate long-running queries.
FixedQE Backend Lock Wait Time (milliseconds)
INSERT, DELETE, or UPDATE queries using HQE take table locks, while FixedPlan queries take row locks. This metric increases when FixedPlan queries wait for row locks while HQE queries hold table locks on the same table.
Version: V2.2+
If this value is high, check slow query logs for slow FixedQE queries, then use Query Insight to identify the locking HQE queries.
Total Backend Lock Wait Time for Instance (milliseconds)
The total lock wait time for INSERT, DELETE, or UPDATE queries in the instance, including both FixedQE and HQE lock waits.
Version: V2.2+
If this value is high, check slow query logs for slow INSERT, DELETE, or UPDATE queries, then use Query Insight to identify the locking HQE queries.
Connection
Total Connections (count)
All active connections in the instance, including those in active, idle, and idle-in-transaction states. Hologres sets default connection limits based on instance type. For more information, see Instance management.
Use Manage queries to view current usage. Kill idle connections if available connections are low.
Connections by Database (count)
Connections aggregated by database, for assessing per-database connection usage.
Default connection limit per database: 128. For more information, see Instance management.
If connections approach the limit, review idle versus business connections. Use Connection management to clean up idle connections or scale up.
If connection load skews across Workers, use Connection management to rebalance.
Connections by FE (count)
Connections aggregated by FE node, for assessing per-FE connection usage.
Default connection limit per FE node: 128. For more information, see Instance management.
If connections approach the limit, review idle versus business connections. Use Connection management to clean up idle connections or scale up.
If connection load skews across Workers, use Connection management to rebalance.
Connection Usage Rate of FE with Highest Usage (%)
Reports the highest connection usage rate among all FE nodes: Max(frontend_connection_used_rate). FE nodes use round-robin load balancing to distribute new connections evenly.
Use Manage queries to view current usage. Kill idle connections if available connections are low.
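The reported value is the maximum of the per-FE usage rates. A minimal sketch of that aggregation (the function name and limit parameter are illustrative; the default per-FE limit of 128 is described under Connections by FE):

```python
def peak_fe_connection_usage(connections_per_fe, limit_per_fe=128):
    """Max(frontend_connection_used_rate): the highest connection
    usage rate (%) across all FE nodes."""
    return max(100.0 * used / limit_per_fe for used in connections_per_fe)
```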
Query Queue
Queued Queries Count (count)
The number of query requests waiting to be executed.
Version: V3.0+
Query Queue Entry QPS (count/s)
Queries submitted to the system queue per second. Use this to gauge the system load and query frequency.
Version: V3.0+
Queries Transitioned from Queued to Running QPS (count/s)
Queries moving from the waiting state to the running state per second.
Version: V3.0+
QPS by State for Queries That Started Running (count/s)
Per-second count of queries in the query queue, grouped by state:
kReadyToRun -- qualified to run
kQueueTimeout -- failed due to queue timeout
kCanceled -- failed due to cancellation
kExceedConcurrencyLimit -- failed due to concurrency limit
Version: V3.0+
Average Query Queue Wait Time (milliseconds)
The average time from queue entry to processing start. This does not include actual query execution time.
Version: V3.0+
Query Queue Auto-Rate-Limit Max Concurrency (count)
The maximum concurrency for auto-rate-limited query queues.
Version: V3.1+. Compute group instances only.
I/O
I/O throughput reflects disk I/O activity. Note: 1 GiB = 1024 MiB = 1024 x 1024 KiB.
I/O throughput limits:
Standard storage (hot): I/O throughput is not fixed. It depends primarily on CPU load.
IA storage (cold): Maximum I/O throughput is 80 MB/s x (number of cores / 16).
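The IA throughput cap above can be computed directly from the core count. A minimal sketch (the function name is illustrative):

```python
def ia_max_throughput_mb_per_s(core_count):
    """Maximum IA (cold) storage I/O throughput:
    80 MB/s x (number of cores / 16)."""
    return 80 * (core_count / 16)
```

For example, a 32-core instance is capped at 160 MB/s of IA storage I/O.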
Standard I/O Read Throughput (bytes/s)
Read throughput for Standard storage data.
Standard I/O Write Throughput (bytes/s)
Write throughput for Standard storage data.
Low-Frequency I/O Read Throughput (bytes/s)
Read throughput for IA storage data.
Low-Frequency I/O Write Throughput (bytes/s)
Write throughput for IA storage data.
Storage
The logical disk space used by instance data -- the sum of all database storage, including the recycle bin. Note: 1 GiB = 1024 MiB = 1024 x 1024 KiB. Hologres storage grows continuously with no hard cap.
For subscription instances, storage exceeding the purchased amount is automatically billed on a pay-as-you-go basis and does not affect system stability or usability. If usage exceeds the purchased capacity, promptly upgrade storage or delete unused data to avoid unnecessary costs.
Use pg_relation_size to view table and database storage sizes. Use Table Info for fine-grained table management.
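Capacity metrics are reported in bytes, and the usage percentage is simply used over purchased capacity. A minimal sketch of both conversions (function names are illustrative), using the binary units stated above (1 GiB = 1024 MiB = 1024 x 1024 KiB):

```python
def bytes_to_gib(n):
    """Convert a byte count to GiB using binary units."""
    return n / (1024 ** 3)

def storage_usage_percent(used_bytes, purchased_bytes):
    """Storage Usage (%) as used capacity over purchased capacity."""
    return 100.0 * used_bytes / purchased_bytes
```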
Standard Storage Used Capacity (bytes)
The capacity used in Standard storage. Scale up storage if usage exceeds the purchased capacity.
Standard Storage Usage (%)
The usage percentage of Standard storage capacity. Scale up storage if usage exceeds the purchased capacity.
IA Storage Used Capacity (bytes)
The capacity used in IA storage. Scale up storage if usage exceeds the purchased capacity.
IA Storage Usage (%)
The usage percentage of IA storage capacity. Scale up storage if usage exceeds the purchased capacity.
Recycle Bin Storage Usage (bytes)
Version: V3.1+
Hologres supports a table recycle bin starting in V3.1. Tables dropped with DROP remain in the recycle bin for a retention period, allowing recovery of accidentally dropped tables. These tables still consume instance storage.
Monitor recycle bin usage per database. If frequent table drops cause high recycle bin usage, configure tables to skip the recycle bin upon deletion.
Framework
FE Replay Delay (milliseconds)
Version: V2.2+
Hologres has multiple FE nodes. For DDL operations, Hologres executes the operation on one FE and replays it on the others. Millisecond- or second-level replay delays are normal.
If an FE's replay delay exceeds several minutes, too many DDL operations may be overwhelming the replay process. If the delay continues to increase, a query may be stuck. Use hg_stat_activity to find and terminate long-running queries.
Shard Multi-Replica Sync Delay (milliseconds)
The sync delay between Shard replicas after Replication is enabled.
The typical Shard replica delay is in milliseconds. Heavy data writes, updates, or frequent DDL operations may increase the sync delay.
Primary-Follower Sync Delay (milliseconds)
The delay when a follower instance reads data from the primary instance. This metric appears only for follower instances, not primary instances.
Data appears only after a follower instance is bound to a primary instance (0 ms initially). The sync delay fluctuates when the primary instance receives writes.
Normal sync delay is in milliseconds. Occasional jitter from primary DDL operations is safe to ignore. Persistent high delay of more than a few seconds may indicate a high instance load or resource shortage -- check CPU and memory usage and scale up if needed.
Sync delay may spike to several minutes during restarts or upgrades and then recovers automatically.
Cross-Instance File Sync Delay (milliseconds)
The file sync delay between disaster recovery instances. This metric appears only on follower instances (read-only followers).
Auto Analyze
Tables Missing Statistics per Database (count)
The number of tables lacking statistics in each database.
Version: V2.2+
For Hologres V2.0 and later, Auto Analyze runs by default. After table creation or bulk writes/updates, statistics may temporarily lag -- observe for a short period first.
If a database consistently lacks statistics for hours or days, Auto Analyze may not have been triggered. Use the HG_STATS_MISSING view to list affected tables, then manually run ANALYZE. For more information, see ANALYZE and AUTO ANALYZE.
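The remediation step above can be scripted: list the affected tables from the HG_STATS_MISSING view, then run ANALYZE on each. A minimal sketch that only builds the statements (the helper name is illustrative; executing them would require a live database connection):

```python
def build_analyze_statements(missing_tables):
    """Given (schema, table) pairs, e.g. fetched from the
    HG_STATS_MISSING view, build one ANALYZE statement per table."""
    return ['ANALYZE "{}"."{}";'.format(schema, table)
            for schema, table in missing_tables]
```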
Serverless Computing
Longest Running Serverless Computing Query Duration (milliseconds)
The duration of the longest-running query in Serverless Computing. Serverless Computing runs specific queries in a dedicated resource pool, isolated from the main instance.
Version: V2.1+
Use hg_stat_activity to inspect the status of Serverless Computing queries.
Serverless Computing Query Queue Count (count)
The number of queries queued in the Serverless Computing resource pool.
Version: V2.2+
Serverless Computing Resource Quota Usage (%)
The ratio of actual Serverless Computing resources used to the maximum allocatable resources.
Version: V2.2+
Binary Logging
Binlog Consumption Rate (count/s)
The number of Binlog entries consumed per second. Hologres supports subscribing to Hologres Binlog for real-time data tiering and accelerated data forwarding.
Version: V2.2+
Binlog Consumption Rate (bytes/s)
The bytes consumed from Binlog per second. Larger fields or higher data volumes increase the byte count.
Version: V2.2+
WAL Sender Count per FE (count)
The number of WAL senders used per FE node. Each shard of each table consumes one WAL sender connection when consuming Binlog using JDBC. WAL sender connections are independent of regular connections and have a default limit.
Version: V2.2+
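Because each shard of each subscribed table consumes one WAL sender, the expected sender count can be estimated from shard counts. A minimal sketch (function names are illustrative; the even-spread assumption in the usage-rate helper is a simplification):

```python
def estimated_wal_senders(shard_counts):
    """One WAL sender per shard per table consumed over JDBC Binlog:
    total senders = sum of shard counts of the subscribed tables."""
    return sum(shard_counts)

def wal_sender_usage_percent(shard_counts, sender_limit_per_fe, fe_count):
    # Assumes senders spread evenly across FE nodes (illustrative).
    return 100.0 * sum(shard_counts) / (sender_limit_per_fe * fe_count)
```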
WAL Sender Usage Rate of FE with Highest Usage (%)
The peak WAL sender utilization across all FE nodes.
Version: V2.2+
If WAL sender usage reaches the limit, see Consume Hologres Binlog via JDBC for troubleshooting.
Computing Resource
Elastic Core Count for Compute Groups
The number of cores added by time-based scaling in the compute group. For more information, see Time-based elasticity (Beta).
Version: V2.2.21+. Compute group instances only.
Compute Group Auto-Elastic Core Count (count)
The number of cores added by auto-scaling in the compute group. For more information, see Multi-cluster and auto-elasticity (Beta).
Version: V4.0+. Compute group instances only.
Gateway
Gateway CPU Usage (%)
The CPU usage of each Gateway in the instance.
Version: V2.0+. Compute group instances only.
Gateways use round-robin traffic forwarding, so CPU usage occurs even without new connections. Starting in V2.2.22, Gateways launch more worker threads by default to improve connection handling, which increases baseline CPU usage.
Gateway Memory Usage (%)
The memory usage of each Gateway in the instance.
Version: V2.0+. Compute group instances only.
Gateway New Connection Requests per Second (count/s)
The number of new connections that the system accepts and successfully establishes per second.
Version: V2.1.12+. Compute group instances only.
A single Gateway handles approximately 100 new connections per second. If new connection requests approach 100 x Gateway count, the Gateways are the bottleneck. Configure a connection pool or scale up the number of Gateways.
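The bottleneck check above can be sketched as a small rule. The function name and the 90% "approaching the limit" margin are illustrative choices, not Hologres defaults:

```python
def gateway_connection_bottleneck(new_conn_qps, gateway_count,
                                  per_gateway_capacity=100):
    """True when new connection requests approach the approximate
    capacity of 100 new connections per second per Gateway."""
    return new_conn_qps >= 0.9 * per_gateway_capacity * gateway_count
```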
Gateway Inbound Traffic Rate (B/s)
The volume of data entering through the Gateway per second.
Version: V2.1+. Compute group instances only.
If inbound traffic approaches 200 MiB/s x Gateway count, the Gateway network capacity is the bottleneck. Scale up the number of Gateways.
Gateway Outbound Traffic Rate (B/s)
The volume of data sent from the Gateway per second.
Version: V2.1+. Compute group instances only.
If outbound traffic approaches 200 MiB/s x Gateway count, the Gateway network capacity is the bottleneck. Scale up the number of Gateways.
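The same 200 MiB/s-per-Gateway rule applies to both inbound and outbound traffic. A minimal sketch (the function name and the 90% margin are illustrative; the metric itself reports bytes per second):

```python
MIB = 1024 * 1024

def gateway_traffic_bottleneck(traffic_bytes_per_s, gateway_count,
                               per_gateway_limit_mib=200):
    """True when inbound or outbound traffic approaches the
    200 MiB/s-per-Gateway network limit."""
    return traffic_bytes_per_s >= 0.9 * per_gateway_limit_mib * MIB * gateway_count
```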
Dynamic Table monitoring and alerting
Starting in Hologres V4.0.8, Dynamic Tables offer monitoring metrics for managing refresh tasks. For more information, see Monitoring and alerting.
Common monitoring metric issues
The FAQ for monitoring metrics topic covers common issues, root causes, and fixes.
Monitoring metric alerting
Set alerts for monitoring metrics in Cloud Monitor to detect anomalies early. For more information, see Cloud Monitor.