Hologres exposes monitoring metrics through the console and Cloud Monitor so you can track resource usage, query execution, and system health in real time.
Metrics overview
| Category | Metric | Description | Supported instance types | Notes |
|---|---|---|---|---|
| CPU | Instance CPU Usage (%) | CPU usage of the instance. | General-purpose, follower, compute group | -- |
| CPU | Worker Node CPU Usage (%) | CPU usage of each Worker node. | General-purpose, follower, compute group | V1.1+ |
| CPU | Cluster CPU Usage (%) | CPU usage of each Cluster in the compute group. | Compute group | V4.0+ |
| Memory | Instance Memory Usage (%) | Total memory usage of the instance. | General-purpose, follower, compute group | -- |
| Memory | Worker Node Memory Usage (%) | Memory usage of each Worker node. | General-purpose, follower, compute group | V1.1+ |
| Memory | Detailed Compute Group Memory Usage (%) | Memory usage broken down by System, Meta, Cache, Query, and Background. | General-purpose, follower, compute group | V2.0+ |
| Memory | QE Query Memory Usage (bytes) | Memory used by QE engine queries. | General-purpose, follower, compute group | V2.0.44+ / V2.1.22+ |
| Memory | QE Query Memory Usage (%) | Percentage of memory used by QE engine queries. | General-purpose, follower, compute group | V2.0.44+ / V2.1.22+ |
| Memory | Cluster Memory Usage (%) | Memory usage of each Cluster in the compute group. | Compute group | V4.0+ |
| Query QPS and RPS | Query QPS (count/s) | Total queries per second. Query QPS >= QE QPS + FixedQE QPS. | General-purpose, follower, compute group, shared cluster | -- |
| Query QPS and RPS | QE Query QPS (count/s) | Queries per second executed by the QE engine. | General-purpose, follower, compute group | V2.2+ |
| Query QPS and RPS | FixedQE Query QPS (count/s) | Queries per second executed by the FixedQE engine. | General-purpose, follower, compute group | V2.2+ |
| Query QPS and RPS | DML RPS (count/s) | Total rows per second for DML operations. DML RPS = QE RPS + FixedQE RPS. | General-purpose, compute group | -- |
| Query QPS and RPS | QE DML RPS (count/s) | DML rows per second by the QE engine. | General-purpose, compute group | V2.2+ |
| Query QPS and RPS | FixedQE DML RPS (count/s) | DML rows per second by the FixedQE engine. | General-purpose, compute group | V2.2+ |
| Query Latency | Query Latency (milliseconds) | Average latency of all queries. Query Latency >= MAX(QE Latency, FixedQE Latency). | General-purpose, follower, compute group, shared cluster | -- |
| Query Latency | QE Query Latency (milliseconds) | Average latency of QE engine queries. | General-purpose, follower, compute group | V2.2+ |
| Query Latency | FixedQE Query Latency (milliseconds) | Average latency of FixedQE engine queries. | General-purpose, follower, compute group | V2.2+ |
| Query Latency | Optimization Phase Duration (milliseconds) | Time spent in the query optimization phase. | General-purpose, follower, compute group, shared cluster | V2.0.44+ / V2.1.22+ |
| Query Latency | Start Query Phase Duration (milliseconds) | Time spent in query initialization (locking, schema alignment). | General-purpose, follower, compute group, shared cluster | V2.0.44+ / V2.1.22+ |
| Query Latency | Get Next Phase Duration (milliseconds) | Time from initialization to result delivery. | General-purpose, follower, compute group, shared cluster | V2.0.44+ / V2.1.22+ |
| Query Latency | Query P99 Latency (milliseconds) | 99th percentile latency of all queries. | General-purpose, follower, compute group, shared cluster | -- |
| Query Latency | Longest Running Query Duration in This Instance (milliseconds) | Duration of the longest-running active query. | General-purpose, follower, compute group, shared cluster | V1.1+ |
| Failed Query QPS | Failed Query QPS (count/s) | Total failed queries per second. Failed QPS >= QE Failed QPS + FixedQE Failed QPS. | General-purpose, follower, compute group, shared cluster | -- |
| Failed Query QPS | QE Failed Query QPS (count/s) | Failed queries per second by the QE engine. | General-purpose, follower, compute group | V2.2+ |
| Failed Query QPS | FixedQE Failed Query QPS (count/s) | Failed queries per second by the FixedQE engine. | General-purpose, compute group | V2.2+ |
| Locks | Maximum FE Lock Wait Time (milliseconds) | DDL lock wait time on FE nodes. | General-purpose, follower, compute group | V2.0.44+ / V2.1.22+ |
| Locks | FixedQE Backend Lock Wait Time (milliseconds) | Lock wait time for FixedQE (typically HQE locks). | General-purpose, follower, compute group | V2.0.44+ / V2.1.22+ |
| Locks | Total Backend Lock Wait Time for Instance (milliseconds) | Total HQE lock wait time, including FixedQE lock waits. | General-purpose, follower, compute group | V2.0.44+ / V2.1.22+ |
| Connection | Total Connections (count) | Total active connections in the instance. | General-purpose, follower, compute group, shared cluster | -- |
| Connection | Connections by Database (count) | Connections aggregated by database. | General-purpose, follower, compute group | -- |
| Connection | Connections by FE (count) | Connections aggregated by FE node. | General-purpose, follower, compute group | -- |
| Connection | Connection Usage Rate of FE with Highest Usage (%) | Peak connection usage rate across all FE nodes. | General-purpose, follower, compute group | -- |
| Query Queue | Queued Queries Count (count) | Queries waiting to be executed. | General-purpose, follower, compute group | V3.0+ |
| Query Queue | Query Queue Entry QPS (count/s) | Queries submitted to the queue per second. | General-purpose, follower, compute group | V3.0+ |
| Query Queue | Queries Transitioned from Queued to Running QPS (count/s) | Queries moving from waiting to running per second. | General-purpose, follower, compute group | V3.0+ |
| Query Queue | QPS by State for Queries That Started Running (count/s) | Per-second count of queries grouped by execution state. | General-purpose, follower, compute group | V3.0+ |
| Query Queue | Average Query Queue Wait Time (milliseconds) | Average time from queue entry to processing start. | General-purpose, follower, compute group | V3.0+ |
| Query Queue | Query Queue Auto-Rate-Limit Max Concurrency (count) | Maximum concurrency for auto-rate-limited queues. | Compute group | V3.1+ |
| I/O | Standard I/O Read Throughput (bytes/s) | Read throughput for Standard storage. | General-purpose, follower, compute group | -- |
| I/O | Standard I/O Write Throughput (bytes/s) | Write throughput for Standard storage. | General-purpose, compute group | -- |
| I/O | Low-Frequency I/O Read Throughput (bytes/s) | Read throughput for IA storage. | General-purpose, follower, compute group | -- |
| I/O | Low-Frequency I/O Write Throughput (bytes/s) | Write throughput for IA storage. | General-purpose, compute group | -- |
| Storage | Standard Storage Used Capacity (bytes) | Capacity used in Standard storage. | General-purpose, compute group | -- |
| Storage | Standard Storage Usage (%) | Usage percentage of Standard storage. | General-purpose, compute group | -- |
| Storage | IA Storage Used Capacity (bytes) | Capacity used in IA storage. | General-purpose, compute group | -- |
| Storage | IA Storage Usage (%) | Usage percentage of IA storage. | General-purpose, compute group | -- |
| Storage | Recycle Bin Storage Usage (bytes) | Storage consumed by the recycle bin. | General-purpose, compute group | V3.1+ |
| Framework | FE Replay Delay (milliseconds) | Replay delay for each FE node. | General-purpose, follower, compute group | V2.2+ |
| Framework | Shard Multi-Replica Sync Delay (milliseconds) | Sync delay between Shard replicas. | General-purpose, follower, compute group | -- |
| Framework | Primary-Follower Sync Delay (milliseconds) | Data sync delay from primary to follower instance. | General-purpose, follower, compute group | -- |
| Framework | Cross-Instance File Sync Delay (milliseconds) | File sync delay between disaster recovery instances. | General-purpose | -- |
| Auto Analyze | Tables Missing Statistics per Database (count) | Tables lacking statistics in each database. | General-purpose, compute group | V2.2+ |
| Serverless Computing | Longest Running Serverless Computing Query Duration (milliseconds) | Longest-running Serverless Computing query. | General-purpose, compute group | V2.1+ |
| Serverless Computing | Serverless Computing Query Queue Count | Queries queued in the Serverless Computing pool. | General-purpose, compute group | V2.2+ |
| Serverless Computing | Serverless Computing Resource Quota Usage (%) | Ratio of used to maximum allocatable Serverless Computing resources. | General-purpose, compute group | V2.2+ |
| Binary Logging | Binlog Consumption Rate (count/s) | Binlog entries consumed per second. | General-purpose, follower, compute group | V2.2+ |
| Binary Logging | Binlog Consumption Rate (bytes/s) | Bytes consumed from Binlog per second. | General-purpose, follower, compute group | V2.2+ |
| Binary Logging | WAL Sender Count per FE (count) | WAL senders used per FE node. | General-purpose, follower, compute group | V2.2+ |
| Binary Logging | WAL Sender Usage Rate of FE with Highest Usage (%) | Peak WAL sender usage across FE nodes. | General-purpose, follower, compute group | V2.2+ |
| Computing Resource | Elastic Core Count for Compute Groups | Cores added by time-based scaling. | Compute group | V2.2.21+ |
| Computing Resource | Compute Group Auto-Elastic Core Count (count) | Cores added by auto-scaling. | Compute group | V4.0+ |
| Gateway | Gateway CPU Usage (%) | CPU usage of each Gateway. | Compute group | V2.0+ |
| Gateway | Gateway Memory Usage (%) | Memory usage of each Gateway. | Compute group | V2.0+ |
| Gateway | Gateway New Connection Requests per Second (count/s) | New connections established per second. | Compute group | V2.1.12+ |
| Gateway | Gateway Inbound Traffic Rate (B/s) | Data entering through the Gateway per second. | Compute group | V2.1+ |
| Gateway | Gateway Outbound Traffic Rate (B/s) | Data sent from the Gateway per second. | Compute group | V2.1+ |
| Dynamic Table | Instance-Level Dynamic Table Refresh Failure QPS (count/s) | Refresh failure rate across all Dynamic Tables. | General-purpose, compute group | V4.0.8+ |
| Dynamic Table | Dynamic Table Data Latency (seconds) | Latency relative to the latest upstream data. | General-purpose, compute group | V4.0.8+ |
| Dynamic Table | Dynamic Table Current Refresh Duration (milliseconds) | Duration of the ongoing refresh task. | General-purpose, compute group | V4.0.8+ |
| Dynamic Table | Dynamic Table Refresh Failure QPM (count/minute) | Refresh failures per minute per Dynamic Table. | General-purpose, compute group | V4.0.8+ |
Cloud Monitor metric IDs
Each metric has a unique ID in Cloud Monitor. The ID prefix varies by instance type:
| Instance type | Prefix | Metric reference |
|---|---|---|
| General-purpose instance | standard_ | General-purpose instance metrics |
| Follower instance | follower_ | Follower instance metrics |
| Compute group instance | warehouse_ | Compute group instance metrics |
| Lakehouse Acceleration (Shared Cluster) | shared_ | Shared cluster metrics |
Engine categories and command types
Engine categories in monitoring metrics:
QE is a collective term for the Hologres proprietary vector compute engines (HQE and SQE) in the XQE engine family. In slow query logs, queries with Engine Type={XQE} map to the QE category.
FixedQE refers to queries that use the Fixed Plan path. In slow query logs, queries with Engine Type={FixedQE} (or SDK in versions earlier than V2.2) map to the FixedQE category.
Command Type classification:
Command Type matches the SQL statement type. For example, both INSERT xxx and INSERT xxx ON CONFLICT DO UPDATE/NOTHING are classified as INSERT.
UNKNOWN: SQL statements that the DPI engine cannot recognize due to syntax errors.
UTILITY: Administrative, definition, and control commands other than INSERT, UPDATE, DELETE, and SELECT, including:
DDL: CREATE, ALTER, DROP, TRUNCATE, COMMENT
TCL: BEGIN, COMMIT, ROLLBACK, SAVEPOINT
Administration and maintenance: ANALYZE, VACUUM, EXPLAIN, SET, SHOW, COPY, REFRESH
Execution and procedural control: PREPARE, EXECUTE, DEALLOCATE, CALL, DECLARE CURSOR
Others: LOCK TABLE, LISTEN, NOTIFY
Access control
The Hologres console monitoring page retrieves data from Cloud Monitor. Resource Access Management (RAM) users need one of the following permissions to view monitoring information:
| Permission policy | Access level |
|---|---|
| AliyunCloudMonitorFullAccess | Full management permissions for Cloud Monitor |
| AliyunCloudMonitorReadOnlyAccess | Read-only access to Cloud Monitor |
For details on granting permissions, see Grant permissions to RAM users.
General notes
If a metric shows no data, the instance version may not support it, or there has been no activity for an extended period.
Monitoring data is retained for up to 30 days.
Metrics are reported every minute.
CPU
Instance CPU Usage (%)
The overall CPU load of the instance.
Background processes and asynchronous compaction tasks consume CPU even without active queries, so some usage during idle periods is normal. Hologres uses multi-core parallel computing, which means a single query can push CPU usage to 100% -- this indicates full utilization of compute resources, not necessarily an issue.
When to investigate: If CPU usage remains near 100% for 3 hours or above 90% for 12 hours, the instance is heavily loaded and CPU is likely the bottleneck. Consider whether:
Large offline data imports (INSERT) are running with growing data volumes.
High-QPS queries or writes are consuming all CPU resources.
Hybrid workloads combine the above scenarios.
If sustained high CPU is expected for your business, scale up the instance to handle larger workloads.
For more information, see FAQ for monitoring metrics.
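The thresholds above can be expressed as a simple alert rule. A minimal sketch, assuming one CPU sample per minute (matching the one-minute reporting interval); the function names and thresholds-as-parameters are illustrative, not a Hologres API:

```python
def cpu_sustained_high(samples, threshold, hours):
    """True if every per-minute CPU sample in the trailing window
    of `hours` hours is at or above `threshold` percent."""
    window = hours * 60  # metrics are reported once per minute
    if len(samples) < window:
        return False
    return all(s >= threshold for s in samples[-window:])

def should_investigate(samples):
    # Investigate if CPU stayed near 100% for 3 hours
    # or above 90% for 12 hours.
    return cpu_sustained_high(samples, 99, 3) or cpu_sustained_high(samples, 90, 12)
```

A rule like this can feed a Cloud Monitor alert or an in-house watchdog.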
Worker Node CPU Usage (%)
The CPU load on each Worker node. The number of Worker nodes varies by instance type. For more information, see Instance management.
Version: V1.1+
If all Worker nodes show sustained CPU usage near 100%, the instance is heavily loaded. Optimize queries or scale up the instance.
If only some Worker nodes show high CPU usage, a resource skew exists. For common causes and troubleshooting, see FAQ for monitoring metrics.
Cluster CPU Usage (%)
The CPU usage of each Cluster in the compute group.
Version: V4.0+. Compute group instances only.
Memory
Instance Memory Usage (%)
The overall memory consumption of the instance.
Hologres reserves memory for metadata, indexes, and data caches to accelerate queries. Idle memory usage of 30% to 40% is typical. If memory usage steadily climbs toward 80%, memory may become a bottleneck.
Use memory distribution metrics together with QPS and other indicators to identify high-memory consumers. For more information, see Troubleshooting guide for out-of-memory issues.
Worker Node Memory Usage (%)
The memory load on each Worker node. The number of Worker nodes varies by instance type. For more information, see Instance management.
Version: V1.1+
If all Worker nodes show sustained memory usage near 80%, the instance is heavily loaded. Optimize queries or scale up the instance.
If only some Worker nodes show high memory usage, a resource skew exists. For common causes and troubleshooting, see FAQ for monitoring metrics.
Detailed Compute Group Memory Usage (%)
Version: V2.0+ (memory distribution metrics available from V2.0.15)
Hologres divides memory into six categories:
| Category | What it tracks | Typical behavior |
|---|---|---|
| System | Holohub, Gateway, and FE (FE Master + FE Query) | Fluctuates with query activity |
| Cache | SQL caches (result cache, block cache) and Meta cache (schema/file metadata) | Fixed size, typically ~30% of total instance memory. Some usage persists when idle (mainly Meta cache). Higher cache hit rates improve query performance -- smaller Physical read bytes values in EXPLAIN ANALYZE indicate better hit rates. |
| Meta | Metadata and files. Uses lazy open mode -- frequently accessed metadata stays in memory, infrequently accessed metadata does not. | Keep under 30% of total memory. High Meta usage suggests many files or partitioned tables. Use Table statistics overview and analysis to investigate. |
| Query | Memory consumed during SQL execution, including Fixed Plan, HQE, and SQE. | Elastic allocation: minimum 20 GB per Worker, maximum depends on available free memory. High usage in other categories reduces Query memory. |
| Background | Compaction and flush tasks. | Typically under 5%. Temporarily increases during index changes, bulk writes, or updates. |
| Memtable | In-memory tables for real-time writes, updates, and deletes. | Typically under 5%. |
Troubleshooting: High Query memory usage or out-of-memory (OOM) events typically indicate complex queries or high concurrency. For optimization guidance, see Optimize query performance.
QE Query Memory Usage (bytes)
The memory (in bytes) used by queries executed by HQE, SQE, or other XQE engines.
Version: V2.0.44+ / V2.1.22+
In memory breakdowns, Query memory usage exceeds QE Query memory usage because Query includes all engine types. Higher QE Query memory usage indicates more complex queries that require more memory.
QE Query Memory Usage (%)
The percentage of memory used by QE engine queries.
Version: V2.0.44+ / V2.1.22+
High usage may lead to OOM errors. Optimize queries or scale up the instance.
Cluster Memory Usage (%)
The memory usage of each Cluster in the compute group.
Version: V4.0+. Compute group instances only.
Query QPS and RPS
Query QPS (count/s)
The average number of SQL statements executed per second across the instance, including SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements.
Relationship: Query QPS >= QE Query QPS + FixedQE Query QPS
The total QPS includes all queries (such as UNKNOWN, UTILITY, and Engine Type={PG}), so it is greater than or equal to the sum of QE and FixedQE QPS.
QE Query QPS (count/s)
Queries per second executed by the QE engine, including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
FixedQE Query QPS (count/s)
Queries per second executed by the FixedQE engine (Fixed Plan path, formerly SDK), including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
DML RPS (count/s)
The average number of data records imported or updated per second, including INSERT, UPDATE, and DELETE statements.
Relationship: DML RPS = QE DML RPS + FixedQE DML RPS
QE DML RPS (count/s)
Data records imported or updated per second by the QE engine, including INSERT, UPDATE, and DELETE statements.
Version: V2.2+
Common QE scenarios:
Batch import or update from MaxCompute or OSS external tables
Batch write or update using COPY
Batch import between Hologres tables
FixedQE DML RPS (count/s)
Data records imported or updated per second by the FixedQE engine (formerly SDK), including INSERT, UPDATE, and DELETE statements.
Version: V2.2+
Common FixedQE scenarios:
Offline writes using Data Integration (DataX)
Writes using SQL or JDBC with INSERT INTO VALUES()
Query Latency
Query Latency (milliseconds)
The average latency of all queries in the instance, including SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements.
Relationship: Query Latency >= MAX(QE Query Latency, FixedQE Query Latency)
QE Query Latency (milliseconds)
The average latency of queries executed by the QE engine, including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
To troubleshoot elevated QE Query latency, check the Optimization Phase Duration, Start Query Phase Duration, Get Next Phase Duration, and QE QPS metrics.
FixedQE Query Latency (milliseconds)
The average latency of queries executed by the FixedQE engine, including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
Troubleshooting:
Occasional spikes: May indicate HQE locks. Check whether the FixedQE Backend Lock Wait Time has increased. If so, use Query Insight to identify the locking queries.
Persistent high latency: May result from a suboptimal table design or interference from complex queries. See Common issues and diagnostics for Blink and Flink.
Optimization Phase Duration (milliseconds)
The time spent in the Optimization phase, where the optimizer parses the SQL statement and generates a physical plan.
Version: V2.0.44+ / V2.1.22+
Long Optimization durations suggest complex queries. If queries differ only in their parameters, use Prepared Statements to reduce optimization overhead. For more information, see JDBC.
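The Prepared Statement pattern plans a parameterized query once and reuses the plan for each execution, skipping repeated optimization. A minimal sketch of the PostgreSQL-style PREPARE/EXECUTE pattern as plain SQL strings (the statement name and table are illustrative; over JDBC the same effect comes from PreparedStatement):

```python
# Plan once: the optimizer parses and plans the parameterized statement.
PREPARE_SQL = "PREPARE get_order (int) AS SELECT * FROM orders WHERE id = $1;"

def execute_sql(order_id):
    # Each execution reuses the prepared plan,
    # skipping the Optimization phase.
    return "EXECUTE get_order({});".format(order_id)
```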
Start Query Phase Duration (milliseconds)
The time spent in the Start Query phase -- the initialization before actual query execution, including locking and schema version alignment.
Version: V2.0.44+ / V2.1.22+
Long Start Query durations often result from lock waits or high CPU usage. Use execution plans for deeper analysis.
Get Next Phase Duration (milliseconds)
The time from the end of the Start Query phase until all results are returned, including computation and result delivery.
Version: V2.0.44+ / V2.1.22+
Long Get Next durations often reflect complex computations. Correlate with QE memory usage and QE QPS. If no anomalies exist in those metrics, the client may simply be slow to consume the results.
Query P99 Latency (milliseconds)
The 99th percentile latency of all queries in the instance, including SELECT, INSERT, UPDATE, UTILITY, and system queries.
Longest Running Query Duration in This Instance (milliseconds)
The duration of the longest-running query currently executing in the instance, covering SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements.
Version: V1.1+
Hologres is a distributed system with multiple Worker nodes. Queries are distributed across Workers, and this metric reports the longest-running query across all Workers. For example, if three Workers are running queries that have been executing for 10 minutes, 5 minutes, and 30 seconds, the reported value is 10 minutes.
Combine this metric with active queries or slow query logs to diagnose long-running queries and resolve deadlocks.
Metrics are reported every minute, so the "current running duration" starts slightly after the query begins. This metric is useful for anomaly detection but does not provide precise timing.
Failed Query QPS
Failed Query QPS (count/s)
The average number of failed SQL statements per second in the instance, including SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements.
Relationship: Failed Query QPS >= QE Failed Query QPS + FixedQE Failed Query QPS
The total failed QPS includes all failed queries (such as UNKNOWN, UTILITY, and Engine Type={PG}), so it is greater than or equal to the sum of QE and FixedQE failed QPS.
Use the failed query type and frequency to find failing queries in the slow query logs, then analyze root causes to improve availability.
QE Failed Query QPS (count/s)
Failed queries per second executed by the QE engine, including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
FixedQE Failed Query QPS (count/s)
Failed queries per second executed by the FixedQE engine, including SELECT, INSERT, UPDATE, and DELETE statements.
Version: V2.2+
Locks
Maximum FE Lock Wait Time (milliseconds)
Hologres has multiple FE nodes that parse, dispatch, and route SQL statements. When multiple connections on the same FE perform DDL operations on the same table (such as CREATE or DROP), FE locks occur. This metric shows the DDL lock wait time per FE.
Version: V2.2+
If the FE lock wait time exceeds 5 minutes and the FE Replay Delay also spikes, a DDL operation may be stuck. Use Manage queries to find and terminate long-running queries.
FixedQE Backend Lock Wait Time (milliseconds)
INSERT, DELETE, or UPDATE queries using HQE take table locks, while FixedPlan queries take row locks. This metric increases when FixedPlan queries wait for row locks while HQE queries hold table locks on the same table.
Version: V2.2+
If this value is high, check slow query logs for slow FixedQE queries, then use Query Insight to identify the locking HQE queries.
Total Backend Lock Wait Time for Instance (milliseconds)
The total lock wait time for INSERT, DELETE, or UPDATE queries in the instance, including both FixedQE and HQE lock waits.
Version: V2.2+
If this value is high, check slow query logs for slow INSERT, DELETE, or UPDATE queries, then use Query Insight to identify the locking HQE queries.
Connection
Total Connections (count)
All active connections in the instance, including those in active, idle, and idle-in-transaction states. Hologres sets default connection limits based on instance type. For more information, see Instance management.
Use Manage queries to view current usage. Kill idle connections if available connections are low.
Connections by Database (count)
Connections aggregated by database, for assessing per-database connection usage.
Default connection limit per database: 128. For more information, see Instance management.
If connections approach the limit, review idle versus business connections. Use Connection management to clean up idle connections or scale up.
If connection load skews across Workers, use Connection management to rebalance.
Connections by FE (count)
Connections aggregated by FE node, for assessing per-FE connection usage.
Default connection limit per FE node: 128. For more information, see Instance management.
If connections approach the limit, review idle versus business connections. Use Connection management to clean up idle connections or scale up.
If connection load skews across Workers, use Connection management to rebalance.
Connection Usage Rate of FE with Highest Usage (%)
Reports the highest connection usage rate among all FE nodes: Max(frontend_connection_used_rate). FE nodes use round-robin load balancing to distribute new connections evenly.
Use Manage queries to view current usage. Kill idle connections if available connections are low.
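The reported value is the maximum of the per-FE usage rates. A minimal sketch of that aggregation (the function name and limit parameter are illustrative; the default per-FE limit of 128 is described under Connections by FE):

```python
def peak_fe_connection_usage(connections_per_fe, limit_per_fe=128):
    """Max(frontend_connection_used_rate): the highest connection
    usage rate (%) across all FE nodes."""
    return max(100.0 * used / limit_per_fe for used in connections_per_fe)
```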
Query Queue
Queued Queries Count (count)
The number of query requests waiting to be executed.
Version: V3.0+
Query Queue Entry QPS (count/s)
Queries submitted to the system queue per second. Use this to gauge the system load and query frequency.
Version: V3.0+
Queries Transitioned from Queued to Running QPS (count/s)
Queries moving from the waiting state to the running state per second.
Version: V3.0+
QPS by State for Queries That Started Running (count/s)
Per-second count of queries in the query queue, grouped by state:
kReadyToRun -- qualified to run
kQueueTimeout -- failed due to queue timeout
kCanceled -- failed due to cancellation
kExceedConcurrencyLimit -- failed due to concurrency limit
Version: V3.0+
Average Query Queue Wait Time (milliseconds)
The average time from queue entry to processing start. This does not include actual query execution time.
Version: V3.0+
Query Queue Auto-Rate-Limit Max Concurrency (count)
The maximum concurrency for auto-rate-limited query queues.
Version: V3.1+. Compute group instances only.
I/O
I/O throughput reflects disk I/O activity. Note: 1 GiB = 1024 MiB = 1024 x 1024 KiB.
I/O throughput limits:
Standard storage (hot): I/O throughput is not fixed. It depends primarily on CPU load.
IA storage (cold): Maximum I/O throughput is 80 MB/s x (number of cores / 16).
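The IA throughput cap above can be computed directly from the core count. A minimal sketch (the function name is illustrative):

```python
def ia_max_throughput_mb_per_s(core_count):
    """Maximum IA (cold) storage I/O throughput:
    80 MB/s x (number of cores / 16)."""
    return 80 * (core_count / 16)
```

For example, a 32-core instance is capped at 160 MB/s of IA storage I/O.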
Standard I/O Read Throughput (bytes/s)
Read throughput for Standard storage data.
Standard I/O Write Throughput (bytes/s)
Write throughput for Standard storage data.
Low-Frequency I/O Read Throughput (bytes/s)
Read throughput for IA storage data.
Low-Frequency I/O Write Throughput (bytes/s)
Write throughput for IA storage data.
Storage
The logical disk space used by instance data -- the sum of all database storage, including the recycle bin. Note: 1 GiB = 1024 MiB = 1024 x 1024 KiB. Hologres storage grows continuously with no hard cap.
For subscription instances, storage exceeding the purchased amount is automatically billed on a pay-as-you-go basis and does not affect system stability or usability. If usage exceeds the purchased capacity, promptly upgrade storage or delete unused data to avoid unnecessary costs.
Use pg_relation_size to view table and database storage sizes. Use Table Info for fine-grained table management.
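Capacity metrics are reported in bytes, and the usage percentage is simply used over purchased capacity. A minimal sketch of both conversions (function names are illustrative), using the binary units stated above (1 GiB = 1024 MiB = 1024 x 1024 KiB):

```python
def bytes_to_gib(n):
    """Convert a byte count to GiB using binary units."""
    return n / (1024 ** 3)

def storage_usage_percent(used_bytes, purchased_bytes):
    """Storage Usage (%) as used capacity over purchased capacity."""
    return 100.0 * used_bytes / purchased_bytes
```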
Standard Storage Used Capacity (bytes)
The capacity used in Standard storage. Scale up storage if usage exceeds the purchased capacity.
Standard Storage Usage (%)
The usage percentage of Standard storage capacity. Scale up storage if usage exceeds the purchased capacity.
IA Storage Used Capacity (bytes)
The capacity used in IA storage. Scale up storage if usage exceeds the purchased capacity.
IA Storage Usage (%)
The usage percentage of IA storage capacity. Scale up storage if usage exceeds the purchased capacity.
Recycle Bin Storage Usage (bytes)
Version: V3.1+
Hologres supports a table recycle bin starting in V3.1. Tables dropped with DROP remain in the recycle bin for a retention period, allowing recovery of accidentally dropped tables. These tables still consume instance storage.
Monitor recycle bin usage per database. If frequent table drops cause high recycle bin usage, configure tables to skip the recycle bin upon deletion.
Framework
FE Replay Delay (milliseconds)
Version: V2.2+
Hologres has multiple FE nodes. For DDL operations, Hologres executes the operation on one FE and replays it on the others. Millisecond- or second-level replay delays are normal.
If an FE's replay delay exceeds several minutes, too many DDL operations may be overwhelming the replay process. If the delay continues to increase, a query may be stuck. Use hg_stat_activity to find and terminate long-running queries.
Shard Multi-Replica Sync Delay (milliseconds)
The sync delay between Shard replicas after Replication is enabled.
The typical Shard replica delay is in milliseconds. Heavy data writes, updates, or frequent DDL operations may increase the sync delay.
Primary-Follower Sync Delay (milliseconds)
The delay when a follower instance reads data from the primary instance. This metric appears only for follower instances, not primary instances.
Data appears only after a follower instance is bound to a primary instance (0 ms initially). The sync delay fluctuates when the primary instance receives writes.
Normal sync delay is in milliseconds. Occasional jitter from primary DDL operations is safe to ignore. Persistent high delay of more than a few seconds may indicate a high instance load or resource shortage -- check CPU and memory usage and scale up if needed.
Sync delay may spike to several minutes during restarts or upgrades and then recovers automatically.
Cross-Instance File Sync Delay (milliseconds)
The file sync delay between disaster recovery instances. This metric appears only on follower instances (read-only followers).
Auto Analyze
Tables Missing Statistics per Database (count)
The number of tables lacking statistics in each database.
Version: V2.2+
For Hologres V2.0 and later, Auto Analyze runs by default. After table creation or bulk writes/updates, statistics may temporarily lag -- observe for a short period first.
If a database consistently lacks statistics for hours or days, Auto Analyze may not have been triggered. Use the HG_STATS_MISSING view to list affected tables, then manually run ANALYZE. For more information, see ANALYZE and AUTO ANALYZE.
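The remediation step above can be scripted: list the affected tables from the HG_STATS_MISSING view, then run ANALYZE on each. A minimal sketch that only builds the statements (the helper name is illustrative; executing them would require a live database connection):

```python
def build_analyze_statements(missing_tables):
    """Given (schema, table) pairs, e.g. fetched from the
    HG_STATS_MISSING view, build one ANALYZE statement per table."""
    return ['ANALYZE "{}"."{}";'.format(schema, table)
            for schema, table in missing_tables]
```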
Serverless Computing
Longest Running Serverless Computing Query Duration (milliseconds)
The duration of the longest-running query in Serverless Computing. Serverless Computing runs specific queries in a dedicated resource pool, isolated from the main instance.
Version: V2.1+
Use hg_stat_activity to inspect the status of Serverless Computing queries.
Serverless Computing Query Queue Count (count)
The number of queries queued in the Serverless Computing resource pool.
Version: V2.2+
Serverless Computing Resource Quota Usage (%)
The ratio of actual Serverless Computing resources used to the maximum allocatable resources.
Version: V2.2+
Binary Logging
Binlog Consumption Rate (count/s)
The number of Binlog entries consumed per second. Hologres supports subscribing to Hologres Binlog for real-time data tiering and accelerated data forwarding.
Version: V2.2+
Binlog Consumption Rate (bytes/s)
The bytes consumed from Binlog per second. Larger fields or higher data volumes increase the byte count.
Version: V2.2+
WAL Sender Count per FE (count)
The number of WAL senders used per FE node. Each shard of each table consumes one WAL sender connection when consuming Binlog using JDBC. WAL sender connections are independent of regular connections and have a default limit.
Version: V2.2+
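Because each shard of each subscribed table consumes one WAL sender, the expected sender count can be estimated from shard counts. A minimal sketch (function names are illustrative; the even-spread assumption in the usage-rate helper is a simplification):

```python
def estimated_wal_senders(shard_counts):
    """One WAL sender per shard per table consumed over JDBC Binlog:
    total senders = sum of shard counts of the subscribed tables."""
    return sum(shard_counts)

def wal_sender_usage_percent(shard_counts, sender_limit_per_fe, fe_count):
    # Assumes senders spread evenly across FE nodes (illustrative).
    return 100.0 * sum(shard_counts) / (sender_limit_per_fe * fe_count)
```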
WAL Sender Usage Rate of FE with Highest Usage (%)
The peak WAL sender utilization across all FE nodes.
Version: V2.2+
If WAL sender usage reaches the limit, see Consume Hologres Binlog via JDBC for troubleshooting.
Computing Resource
Elastic Core Count for Compute Groups
The number of cores added by time-based scaling in the compute group. For more information, see Time-based elasticity (Beta).
Version: V2.2.21+. Compute group instances only.
Compute Group Auto-Elastic Core Count (count)
The number of cores added by auto-scaling in the compute group. For more information, see Multi-cluster and auto-elasticity (Beta).
Version: V4.0+. Compute group instances only.
Gateway
Gateway CPU Usage (%)
The CPU usage of each Gateway in the instance.
Version: V2.0+. Compute group instances only.
Gateways use round-robin traffic forwarding, so CPU usage occurs even without new connections. Starting in V2.2.22, Gateways launch more worker threads by default to improve connection handling, which increases baseline CPU usage.
Gateway Memory Usage (%)
The memory usage of each Gateway in the instance.
Version: V2.0+. Compute group instances only.
Gateway New Connection Requests per Second (count/s)
The number of new connections that the system accepts and successfully establishes per second.
Version: V2.1.12+. Compute group instances only.
A single Gateway handles approximately 100 new connections per second. If new connection requests approach 100 x Gateway count, the Gateways are the bottleneck. Configure a connection pool or scale up the number of Gateways.
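The bottleneck check above can be sketched as a small rule. The function name and the 90% "approaching the limit" margin are illustrative choices, not Hologres defaults:

```python
def gateway_connection_bottleneck(new_conn_qps, gateway_count,
                                  per_gateway_capacity=100):
    """True when new connection requests approach the approximate
    capacity of 100 new connections per second per Gateway."""
    return new_conn_qps >= 0.9 * per_gateway_capacity * gateway_count
```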
Gateway Inbound Traffic Rate (B/s)
The volume of data entering through the Gateway per second.
Version: V2.1+. Compute group instances only.
If inbound traffic approaches 200 MiB/s x Gateway count, the Gateway network capacity is the bottleneck. Scale up the number of Gateways.
Gateway Outbound Traffic Rate (B/s)
The volume of data sent from the Gateway per second.
Version: V2.1+. Compute group instances only.
If outbound traffic approaches 200 MiB/s x Gateway count, the Gateway network capacity is the bottleneck. Scale up the number of Gateways.
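The same 200 MiB/s-per-Gateway rule applies to both inbound and outbound traffic. A minimal sketch (the function name and the 90% margin are illustrative; the metric itself reports bytes per second):

```python
MIB = 1024 * 1024

def gateway_traffic_bottleneck(traffic_bytes_per_s, gateway_count,
                               per_gateway_limit_mib=200):
    """True when inbound or outbound traffic approaches the
    200 MiB/s-per-Gateway network limit."""
    return traffic_bytes_per_s >= 0.9 * per_gateway_limit_mib * MIB * gateway_count
```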
Dynamic Table monitoring and alerting
Starting in Hologres V4.0.8, Dynamic Tables offer monitoring metrics for managing refresh tasks. For more information, see Monitoring and alerting.
Common monitoring metric issues
The FAQ for monitoring metrics topic covers common issues, root causes, and fixes.
Monitoring metric alerting
Set alerts for monitoring metrics in Cloud Monitor to detect anomalies early. For more information, see Cloud Monitor.