Fluctuations in the performance or status of a Hardware Security Module (HSM) can affect business stability and may cause service interruptions if not detected promptly. Cloud HSM provides monitoring and alerting features to track key metrics in real time, including the health, performance, and load of instances and clusters. You can set alert rules to receive immediate notifications about anomalies to ensure business continuity and stability.
Function overview
HSM monitoring covers both instances and clusters. It supports two groups of core metrics: Basic Monitoring (CPU usage, memory usage, TCP connections, HSM health, and cluster synchronization status) and TPS monitoring. Together, these metrics give insight into HSM resource usage and operational status, and supply the data needed for proactive risk alerts and capacity planning. Key benefits include the following:
Proactive risk alerts: You can analyze metric trends to promptly identify potential operational risks. The integrated alerting mechanism notifies the relevant personnel of faults to help ensure business continuity and stability.
Fault diagnosis: When an anomaly occurs, detailed monitoring data helps you quickly find the root cause and reduce troubleshooting time.
View instance or cluster monitoring metrics
You can view metric data for the last 30 days.
Procedure
Go to the Security Audit page of the Cloud Hardware Security Module console. In the top navigation bar, select a region.
Click an instance ID. On the details page, view the Instance Monitoring Information and Cluster Monitoring Information.
Select a time range. The data granularity of HSM metric data varies with the selected time range.
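Time range: 1 hour, 3 hours, 6 hours, or 12 hours. Statistical period: 5 minutes.
Time range: 1 day. Statistical period: 10 minutes.
Time range: 3 days. Statistical period: 30 minutes.
Time range: 7 days. Statistical period: 60 minutes.
Time range: More than 7 days to 30 days. Statistical period: 120 minutes.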
(Optional) In the upper-right corner, turn on the Auto Refresh switch. HSM automatically retrieves the latest metric data every minute.
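The mapping between the selected time range and the statistical period can be sketched as a small lookup. This is only an illustration of the table in the procedure, not console code: the names `PERIOD_TABLE`, `statistical_period`, and `expected_points` are hypothetical, and treating each listed time range as an inclusive upper bound (so that, for example, 2 days falls into the 3-day row) is an assumption.

```python
from datetime import timedelta

# (upper bound of the time range, statistical period), copied from the table.
# Treating each listed range as an inclusive upper bound is an assumption.
PERIOD_TABLE = [
    (timedelta(hours=12), timedelta(minutes=5)),
    (timedelta(days=1), timedelta(minutes=10)),
    (timedelta(days=3), timedelta(minutes=30)),
    (timedelta(days=7), timedelta(minutes=60)),
    (timedelta(days=30), timedelta(minutes=120)),
]


def statistical_period(time_range: timedelta) -> timedelta:
    """Return the statistical period for a selected time range."""
    if time_range <= timedelta(0):
        raise ValueError("time range must be positive")
    for upper_bound, period in PERIOD_TABLE:
        if time_range <= upper_bound:
            return period
    raise ValueError("metric data is available for the last 30 days only")


def expected_points(time_range: timedelta) -> int:
    """Approximate number of data points per metric for a time range."""
    return time_range // statistical_period(time_range)
```

For example, a 1-hour range at a 5-minute period yields about 12 points per metric, which is useful to know when you export or post-process the data.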
Monitoring information details
Instance Monitoring Information
Basic Monitoring
Note: Basic Monitoring provides alerting capabilities for all of its metrics. It supports both out-of-the-box Proactive Alerting rules and custom alert rules.
The statistical period for the Proactive Alerting policy is five minutes by default.
The following metrics are available. For each metric, the alert level and trigger rule describe its Proactive Alerting rule (default alert policy).

CPU usage
Description: The CPU utilization percentage of the HSM instance.
Alert level: Warning (WARN)
Trigger rule: The CPU usage is greater than 85% for five consecutive statistical periods.

Memory usage
Description: The memory utilization percentage of the HSM instance.
Alert level: Warning (WARN)
Trigger rule: The memory usage is greater than 85% for five consecutive statistical periods.

TCP connections
Description: The total number of established TCP connections for the HSM instance.
Alert level: Info
Trigger rule: The number of TCP connections is greater than 200 for five consecutive statistical periods.

HSM health
Description: Indicates the running status of the HSM instance. Valid values:
1: Normal
0: Abnormal. For more information, see HSM health is 0 (HSM instance is abnormal).
Alert level: Warning (WARN)
Trigger rule: The health status is 0 for five consecutive statistical periods.
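The "five consecutive statistical periods" condition used by the default rules can be illustrated with a short sketch. The thresholds and alert levels below are taken from the rules above. The metric keys, the `DEFAULT_RULES` structure, and the `triggered` function are hypothetical names for illustration, and evaluating only the most recent five samples is an assumption about how the rule is applied.

```python
from typing import Callable, Dict, Sequence, Tuple

# Alert level and breach condition per metric, from the default alert policy.
# The dictionary keys are hypothetical identifiers, not official metric names.
DEFAULT_RULES: Dict[str, Tuple[str, Callable[[float], bool]]] = {
    "cpu_usage": ("WARN", lambda v: v > 85),
    "memory_usage": ("WARN", lambda v: v > 85),
    "tcp_connections": ("INFO", lambda v: v > 200),
    "hsm_health": ("WARN", lambda v: v == 0),
}


def triggered(metric: str, samples: Sequence[float], periods: int = 5) -> bool:
    """Return True if the latest `periods` samples all breach the rule.

    `samples` holds one value per statistical period, oldest first.
    """
    _level, is_breach = DEFAULT_RULES[metric]
    if len(samples) < periods:
        return False  # not enough history to satisfy the consecutive condition
    return all(is_breach(value) for value in samples[-periods:])
```

A single sample at or below the threshold inside the window resets the condition, which is why short spikes do not raise an alert under these rules.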
TPS monitoring
Important: This feature is available only for HSMs in the Chinese mainland.
Proactive Alerting is not supported for TPS metrics. You must log on to the CloudMonitor console to set custom alert rules.
The following TPS metrics are available:
Symmetric algorithms: The performance data of symmetric algorithm operations performed by the instance. This includes AES, SM1, and SM4 operations.
SM2: The performance data of SM2 algorithm operations performed by the instance. This includes key generation, encryption/decryption, and signing/signature verification.
RSA: The performance data of RSA algorithm operations performed by the instance. This includes key pair generation, public key operations, and private key operations.
ECC: The performance data of ECC algorithm operations performed by the instance. This includes key pair generation and signing/signature verification.
Hash algorithm: The performance data of hash operations performed by the instance.
Cluster Monitoring Information
Basic Monitoring
Note: Basic Monitoring provides alerting capabilities for all of its metrics. It supports both out-of-the-box Proactive Alerting rules and custom alert rules in CloudMonitor.
The statistical period for the Proactive Alerting policy is five minutes by default.
The following metric is available. The alert level and trigger rule describe its Proactive Alerting rule (default alert policy).

Synchronization Status
Description: Indicates whether the cluster is synchronized. Valid values:
1: The cluster is normal. The digests of the primary and secondary HSMs are consistent.
0: The cluster is not synchronized. This includes inconsistencies in digests or configurations between primary and secondary HSMs, or cluster synchronization failures.
Alert level: Info
Trigger rule: The value is 0 for five consecutive statistical periods.
TPS monitoring
Important: This feature is available only when all instances in the cluster support TPS monitoring (that is, all instances are HSMs in the Chinese mainland).
Proactive Alerting is not supported for TPS metrics. You must set custom alert rules in CloudMonitor.
The following TPS metrics are available:
Symmetric algorithms: The sum of Transactions Per Second (TPS) for symmetric algorithm operations performed by all instances in the cluster. This includes AES, SM1, and SM4 operations.
SM2: The sum of TPS for SM2 algorithm operations performed by all instances in the cluster. This includes key generation, encryption/decryption, and signing/signature verification.
RSA: The sum of TPS for RSA algorithm operations performed by all instances in the cluster. This includes key pair generation, public key operations, and private key operations.
ECC: The sum of TPS for ECC algorithm operations performed by all instances in the cluster. This includes key pair generation and signing/signature verification.
Hash algorithm: The sum of TPS for hash (digest) operations performed by all instances in the cluster.
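Because each cluster TPS metric is the sum of the corresponding per-instance values, the aggregation can be sketched as follows. The instance IDs, algorithm keys, and the `cluster_tps` function are hypothetical. The sketch shows only the arithmetic and makes no claim about how the service collects the data.

```python
from typing import Dict


def cluster_tps(per_instance_tps: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Sum per-instance TPS values into cluster-level totals per algorithm.

    `per_instance_tps` maps an instance ID to {algorithm: TPS}.
    """
    totals: Dict[str, float] = {}
    for metrics in per_instance_tps.values():
        for algorithm, tps in metrics.items():
            totals[algorithm] = totals.get(algorithm, 0.0) + tps
    return totals


# Hypothetical sample data for a two-instance cluster.
sample = {
    "hsm-instance-a": {"RSA": 100.0, "SM2": 50.0},
    "hsm-instance-b": {"RSA": 150.0},
}
totals = cluster_tps(sample)
```

With the sample data, the RSA total is the sum of both instances, while SM2 reflects only the instance that reported it.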
Set alerts for monitoring metrics
Method 1: Enable one-click alerting in HSM
HSM has built-in default alert rules for Basic Monitoring, which you enable with the Proactive Alerting switch (one-click alerting). For more information about the alert rules, see Monitoring information details.
After you enable one-click alerting, the alert rules apply to all HSM instances under your Alibaba Cloud account.
If you previously enabled one-click alerting and modified the alert rules, enabling it again restores the rules to the system defaults.
Go to the Security Audit page of the Cloud Hardware Security Module console. In the top navigation bar, select a region.
Click an instance ID. In the upper-right corner of the Instance Monitoring Information and Cluster Monitoring Information tabs, click Proactive Alerting.
Configure alert rules
Turn on the Proactive Alerting switch.
(Optional) Modify the rule content: If you want to set alerts for only specific monitoring metrics or need more fine-grained alert rules, you can disable or modify the alert rules.
Note: The default alert recipient for one-click alerting rules is the system-created Default Alert Contacts. To modify the member information, go to the CloudMonitor console. For more information, see Modify an alert contact or a contact group.
(Optional) Enable Send alert notifications
Click Configure Alert Rules to go to the CloudMonitor console and find the target default alert rule.
In the Actions column, click Modify.
Set NoDataPolicy to Send alert notifications.
Method 2: Set alerts in CloudMonitor
In the upper-right corner of the Instance Monitoring Information and Cluster Monitoring Information tabs, click Configure Alert Rules to go to the CloudMonitor console.
On the Alert Rule page, configure the rule as described in Create an alert rule. Note the following configurations:
Product: HSM Instance or HSM Cluster.
NoDataPolicy: We recommend that you select Send alert notifications. When metric data is empty, thresholds cannot be evaluated. This option sends a notification in that case, so that alerts remain timely and accurate.
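The effect of the no-data setting can be illustrated with a simplified model. This is not CloudMonitor's actual evaluation logic: the `evaluate` function, its return values, and the use of `None` to mark a period without data are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def evaluate(
    samples: Sequence[Optional[float]],
    threshold: float,
    periods: int = 5,
    notify_on_no_data: bool = True,
) -> str:
    """Classify the latest `periods` samples of a metric.

    A sample of None stands for a statistical period with no data.
    """
    recent = list(samples[-periods:])
    if not recent or all(value is None for value in recent):
        # No data at all: the threshold cannot be evaluated.
        return "NO_DATA_ALERT" if notify_on_no_data else "NO_ACTION"
    if len(recent) == periods and all(
        value is not None and value > threshold for value in recent
    ):
        return "THRESHOLD_ALERT"
    return "OK"
```

In this model, an instance that stops reporting metrics stays silent unless `notify_on_no_data` is set, which mirrors the reason for the recommendation above.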
Handle alert notifications
The following are common methods for handling alerts:
HSM health is 0 (HSM instance is abnormal)
Common causes:
Hardware failure: Internal physical components, such as processors, memory, or cryptographic cards, are damaged or malfunctioning.
Note: In this scenario, the system automatically isolates the faulty instance to ensure service continuity and security.
Software/firmware bugs: Errors (bugs) in the device firmware, driver, or management software cause functional abnormalities or unresponsiveness.
Network connectivity issues: The connection to the application server or network device is interrupted, unstable, or has high latency.
Power supply issues: Power interruptions, unstable voltage, or faulty power supply equipment cause the device to fail to start or shut down unexpectedly.
Abnormal operating environment: The device is operating at an excessively high temperature, improper humidity, or with poor ventilation, affecting its performance and stability.
Solutions:
Initial diagnosis: Immediately check whether the instance status is "Running" in the Cloud HSM console. Also, check the Alibaba Cloud status page or internal messages to determine whether any service failures or scheduled maintenance are occurring in the current region.
Network troubleshooting: Check the security group and network ACL rules of the VPC where the application server and HSM instance reside. Ensure that network access to the service port is allowed.
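To complement the security group and network ACL checks, you can test from the application server whether a TCP connection to the HSM service port can be established. The following is a minimal sketch that uses only the Python standard library. The function name `can_connect` is hypothetical, and the address and port in the usage comment are placeholders that you must replace with the values for your own instance.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Usage (placeholder values, replace with your HSM instance address and port):
# print(can_connect("192.0.2.10", 8013))
```

A result of False indicates only that the TCP handshake failed from this host. It does not distinguish between a blocked port, a routing problem, and an instance that is down.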
High CPU/memory usage
Association analysis: On the monitoring page, compare the CPU usage curve with the TPS monitoring curve for the time period when the issue occurred.
If CPU usage and TPS increase at the same time, this is usually caused by a peak in service traffic and is considered normal.
If CPU usage is high but TPS is not, the application may be performing many complex key generation or asymmetric encryption/decryption operations.
Categorized handling:
Short-term response: If the issue is caused by a sudden increase in service traffic, evaluate whether to temporarily add nodes to the cluster to distribute the load.
Long-term optimization: If the issue is due to application call logic, perform code optimization. If you have a long-term capacity shortage, scale out your resources promptly.
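The association analysis above can be expressed as a simple decision rule. This is a sketch for illustration only: the `diagnose` function and its labels are hypothetical, the 85% CPU threshold is borrowed from the default alert rule, and the 1.2x ratio used to decide whether TPS "increased" is an arbitrary assumption that you should tune for your workload.

```python
def diagnose(
    cpu_during: float,
    tps_before: float,
    tps_during: float,
    cpu_high: float = 85.0,
    tps_rise_ratio: float = 1.2,
) -> str:
    """Classify a high-CPU period by comparing it with the TPS trend.

    cpu_during: CPU usage (%) in the period under investigation.
    tps_before / tps_during: TPS before and during that period.
    """
    if cpu_during <= cpu_high:
        return "cpu-normal"
    tps_rose = tps_during > tps_before and tps_during >= tps_before * tps_rise_ratio
    if tps_rose:
        # CPU and TPS rose together: usually a peak in service traffic.
        return "traffic-peak"
    # CPU is high without a matching TPS rise: look for costly operations
    # such as key generation or asymmetric encryption/decryption.
    return "expensive-operations"
```

A "traffic-peak" result points to the short-term response (adding nodes), while "expensive-operations" points to reviewing the application call logic.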
Abnormal cluster synchronization status
For manually synchronized HSM clusters in the Chinese mainland, go to the instance list page and click Synchronize Cluster to perform a manual synchronization.
Quotas and limits
Data retention period: Metric data is retained and can be viewed for up to 30 days.
Regional and feature limitations:
TPS monitoring: Available only for HSM instances in the Chinese mainland.
Cluster TPS monitoring: Available only when all instances in the cluster are HSMs in the Chinese mainland.
Alert configuration: TPS monitoring metrics do not support one-click alerting. You must set custom alert rules in CloudMonitor.