Key Management Service: HSM monitoring and alerts

Last Updated: Dec 16, 2025

Fluctuations in the performance or status of a Hardware Security Module (HSM) can affect business stability and may cause service interruptions if not detected promptly. Cloud HSM provides monitoring and alerting features to track key metrics in real time, including the health, performance, and load of instances and clusters. You can set alert rules to receive immediate notifications about anomalies to ensure business continuity and stability.

Function overview

HSM monitoring covers both instances and clusters. It includes Basic Monitoring (CPU usage, memory usage, TCP connections, HSM health, and cluster synchronization status) and TPS monitoring. Together, these metrics show HSM resource usage and operational status, and supply the data needed for proactive risk alerts and capacity planning. Key benefits include the following:

  • Proactive risk alerts: You can analyze metric trends to promptly identify potential operational risks. The integrated alerting mechanism notifies the relevant personnel of faults to help ensure business continuity and stability.

  • Fault diagnosis: When an anomaly occurs, detailed monitoring data helps you quickly find the root cause and reduce troubleshooting time.

View instance or cluster monitoring metrics

Warning

You can view metric data for the last 30 days.

Procedure

  1. Go to the Security Audit page of the Cloud Hardware Security Module console. In the top navigation bar, select a region.

  2. Click an instance ID. On the details page, view the Instance Monitoring Information and Cluster Monitoring Information.

  3. Select a time range. The statistical period (data granularity) of the metric data depends on the selected time range.

    Time range                         | Statistical period
    1 hour, 3 hours, 6 hours, 12 hours | 5 minutes
    1 day                              | 10 minutes
    3 days                             | 30 minutes
    7 days                             | 60 minutes
    More than 7 days to 30 days        | 120 minutes

  4. (Optional) In the upper-right corner, turn on the Auto Refresh switch. HSM automatically retrieves the latest metric data every minute.
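
The relationship between time range and statistical period can be expressed as a small lookup, which is useful when estimating how many data points a chart will contain. The following Python sketch is illustrative only: the names `statistical_period` and `expected_points` are hypothetical and are not part of any HSM or CloudMonitor API, and ranges that the table does not list (for example, 2 days) are rejected rather than guessed.

```python
from datetime import timedelta

# Preset time ranges and their statistical periods, copied from the table.
_PRESET_PERIODS = {
    timedelta(hours=1): timedelta(minutes=5),
    timedelta(hours=3): timedelta(minutes=5),
    timedelta(hours=6): timedelta(minutes=5),
    timedelta(hours=12): timedelta(minutes=5),
    timedelta(days=1): timedelta(minutes=10),
    timedelta(days=3): timedelta(minutes=30),
    timedelta(days=7): timedelta(minutes=60),
}


def statistical_period(time_range: timedelta) -> timedelta:
    """Return the statistical period for a selected time range."""
    if time_range in _PRESET_PERIODS:
        return _PRESET_PERIODS[time_range]
    if timedelta(days=7) < time_range <= timedelta(days=30):
        return timedelta(minutes=120)
    # The table does not define other ranges, so this sketch refuses to guess.
    raise ValueError(f"no statistical period documented for {time_range}")


def expected_points(time_range: timedelta) -> int:
    """Approximate number of data points shown for the selected range."""
    return time_range // statistical_period(time_range)
```

For example, `expected_points(timedelta(days=1))` evaluates to 144, because one day divided into 10-minute periods yields 144 points.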

Monitoring information details

Instance Monitoring Information

  • Basic Monitoring

    Note
    • Basic Monitoring supports alerting for all of its metrics, through both out-of-the-box Proactive Alerting rules and custom alert rules.

    • The statistical period for the Proactive Alerting policy is five minutes by default.

    Metric          | Description                                                            | Proactive Alerting rule (default alert policy)
    CPU usage       | The CPU utilization percentage of the HSM instance.                    | Alert level: Warning (WARN). Trigger rule: The CPU usage is greater than 85% for five consecutive statistical periods.
    Memory usage    | The memory utilization percentage of the HSM instance.                 | Alert level: Warning (WARN). Trigger rule: The memory usage is greater than 85% for five consecutive statistical periods.
    TCP connections | The total number of established TCP connections for the HSM instance.  | Alert level: Info. Trigger rule: The number of TCP connections is greater than 200 for five consecutive statistical periods.
    HSM health      | Indicates the running status of the HSM instance.                      | Alert level: Warning (WARN). Trigger rule: The health status is 0 for five consecutive statistical periods.

  • TPS monitoring

    Important
    • This feature is available only for HSMs in the Chinese mainland.

    • Proactive Alerting is not supported. You must log on to the CloudMonitor console to set custom alert rules.

    Metric               | Description
    Symmetric algorithms | The performance data of symmetric algorithm operations performed by the instance. This includes AES, SM1, and SM4 operations.
    SM2                  | The performance data of SM2 algorithm operations performed by the instance. This includes key generation, encryption/decryption, and signing/signature verification.
    RSA                  | The performance data of RSA algorithm operations performed by the instance. This includes key pair generation, public key operations, and private key operations.
    ECC                  | The performance data of ECC algorithm operations performed by the instance. This includes key pair generation and signing/signature verification.
    Hash algorithm       | The performance data of hash operations performed by the instance.
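
To make the default trigger rules for instance Basic Monitoring concrete, the following Python sketch checks a series of per-period values against the documented thresholds. It is a local illustration under stated assumptions, not the console's actual implementation: the metric keys (such as `cpu_usage`) and the `should_alert` helper are hypothetical names, and each list element is assumed to represent one five-minute statistical period.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class DefaultRule:
    level: str
    breached: Callable[[float], bool]  # True when one period's value violates the rule
    periods: int = 5                   # consecutive statistical periods required


# Thresholds and alert levels copied from the default rules for instance
# Basic Monitoring. The keys are illustrative labels, not official metric IDs.
DEFAULT_RULES = {
    "cpu_usage": DefaultRule("WARN", lambda v: v > 85),
    "memory_usage": DefaultRule("WARN", lambda v: v > 85),
    "tcp_connections": DefaultRule("INFO", lambda v: v > 200),
    "hsm_health": DefaultRule("WARN", lambda v: v == 0),
}


def should_alert(metric: str, values: Sequence[float]) -> bool:
    """Return True if the most recent `periods` values all breach the rule."""
    rule = DEFAULT_RULES[metric]
    recent = values[-rule.periods:]
    return len(recent) == rule.periods and all(rule.breached(v) for v in recent)
```

In this sketch, `should_alert("cpu_usage", [90, 91, 92, 88, 85])` is false because the last value is not greater than 85, which interrupts the run of five consecutive breaches.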

Cluster Monitoring Information

  • Basic Monitoring

    Note
    • Basic Monitoring supports alerting for all of its metrics, through both out-of-the-box Proactive Alerting rules and custom alert rules in CloudMonitor.

    • The statistical period for the Proactive Alerting policy is five minutes by default.

    Metric                 | Description | Proactive Alerting rule (default alert policy)
    Synchronization Status | Indicates whether the cluster is synchronized. Valid values: 1 (the cluster is normal; the digests of the primary and secondary HSMs are consistent) and 0 (the cluster is not synchronized; this includes inconsistencies in digests or configurations between primary and secondary HSMs, and cluster synchronization failures). | Alert level: Info. Trigger rule: The value is 0 for five consecutive statistical periods.

  • TPS monitoring

    Important
    • This feature is available only when all instances in the cluster support TPS monitoring (that is, all instances are HSMs in the Chinese mainland).

    • Proactive Alerting is not supported. You must set custom alert rules in CloudMonitor.

    Metric               | Description
    Symmetric algorithms | The sum of Transactions Per Second (TPS) for symmetric algorithm operations performed by all instances in the cluster. This includes AES, SM1, and SM4 operations.
    SM2                  | The sum of TPS for SM2 algorithm operations performed by all instances in the cluster. This includes key generation, encryption/decryption, and signing/signature verification.
    RSA                  | The sum of TPS for RSA algorithm operations performed by all instances in the cluster. This includes key pair generation, public key operations, and private key operations.
    ECC                  | The sum of TPS for ECC algorithm operations performed by all instances in the cluster. This includes key pair generation and signing/signature verification.
    Hash algorithm       | The sum of TPS for hash (digest) operations performed by all instances in the cluster.
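
Because each cluster TPS metric is the sum of the corresponding instance metrics, the aggregation can be sketched in a few lines of Python. This is only an illustration: the category keys, the instance names in the usage example, and the convention of using `None` for an instance without TPS monitoring are assumptions made for this sketch, not identifiers used by HSM or CloudMonitor.

```python
from typing import Dict, Mapping, Optional

# Illustrative labels for the five metric categories listed for TPS monitoring.
CATEGORIES = ("symmetric", "sm2", "rsa", "ecc", "hash")


def cluster_tps(
    instances: Mapping[str, Optional[Mapping[str, float]]],
) -> Optional[Dict[str, float]]:
    """Sum per-instance TPS by category.

    Returns None when the cluster has no instances or when any instance has
    no TPS data, mirroring the rule that cluster TPS monitoring is available
    only when every instance supports it.
    """
    if not instances or any(tps is None for tps in instances.values()):
        return None
    return {
        category: sum(tps.get(category, 0.0) for tps in instances.values())
        for category in CATEGORIES
    }
```

For example, two instances that report 100 and 50 symmetric-algorithm TPS produce a cluster value of 150 for that category, while a cluster that contains an instance mapped to `None` produces no cluster TPS data at all.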

Set alerts for monitoring metrics

Method 1: Enable one-click alerting (Proactive Alerting) in HSM

HSM has built-in default alert rules for Basic Monitoring. For more information about the alert rules, see Monitoring information details.

Important
  • After you enable one-click alerting, the alert rules apply to all HSM instances under your Alibaba Cloud account.

  • If you previously enabled one-click alerting and modified the alert rules, enabling it again restores the rules to the system defaults.

  1. Go to the Security Audit page of the Cloud Hardware Security Module console. In the top navigation bar, select a region.

  2. Click an instance ID. In the upper-right corner of the Instance Monitoring Information or Cluster Monitoring Information tab, click Proactive Alerting.

  3. Configure alert rules:

    1. Turn on the Proactive Alerting switch.

    2. (Optional) Modify the rules. To alert on only specific metrics, or to use more fine-grained rules, disable or modify the individual alert rules.

      Note

      The default alert recipient for One-click Alerting rules is the system-created Default Alert Contacts. To modify the member information, go to the CloudMonitor console. For more information, see Modify an alert contact or a contact group.

  4. (Optional) Enable Send alert notifications

    1. Click Configure Alert Rules to go to the CloudMonitor console and find the target default alert rule.

    2. In the Actions column, click Modify.

    3. Set NoDataPolicy to Send alert notifications.

Method 2: Set alerts in CloudMonitor

  1. In the upper-right corner of the Instance Monitoring Information or Cluster Monitoring Information tab, click Configure Alert Rules to go to the CloudMonitor console.

  2. On the Alert Rule page, configure the rule as described in Create an alert rule. Note the following configurations:

    • Product: HSM Instance or HSM Cluster.

    • NoDataPolicy: We recommend that you select Send alert notifications. This option prevents threshold evaluation from being affected when metric data is empty, which ensures the timeliness and accuracy of alerts.
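
The reason for the NoDataPolicy recommendation can be illustrated with a simplified local model. The Python sketch below is not CloudMonitor's implementation, and the exact semantics of NoDataPolicy are not specified here: the `evaluate` function, the `notify_on_no_data` flag, and the returned labels are all hypothetical. It assumes that `None` marks a statistical period without metric data and that a rule fires only when every recent period exceeds the threshold.

```python
from typing import Optional, Sequence


def evaluate(
    values: Sequence[Optional[float]],
    threshold: float,
    periods: int = 5,
    notify_on_no_data: bool = True,
) -> str:
    """Return "ALERT", "NO_DATA_ALERT", or "OK" for the latest periods.

    A value of None marks a statistical period with no metric data.
    Too little history is treated the same way as missing data.
    """
    recent = list(values[-periods:])
    if len(recent) < periods or any(v is None for v in recent):
        # With a gap in the data, the threshold cannot be evaluated reliably.
        return "NO_DATA_ALERT" if notify_on_no_data else "OK"
    return "ALERT" if all(v > threshold for v in recent) else "OK"
```

In this model, an instance that stops reporting data looks healthy when `notify_on_no_data` is false, which is the kind of silent gap that the recommendation is meant to avoid.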

Handle alert notifications

The following are common methods for handling alerts:

  • HSM health is 0 (HSM instance is abnormal)

    • Common causes:

      • Hardware failure: Internal physical components, such as processors, memory, or cryptographic cards, are damaged or malfunctioning.

        Note

        In this scenario, the system automatically isolates the faulty instance to ensure service continuity and security.

      • Software or firmware bugs: Errors in the device firmware, driver, or management software cause functional abnormalities or unresponsiveness.

      • Network connectivity issues: The connection to the application server or network device is interrupted, unstable, or has high latency.

      • Power supply issues: Power interruptions, unstable voltage, or faulty power supply equipment cause the device to fail to start or shut down unexpectedly.

      • Abnormal operating environment: The device is operating at an excessively high temperature, improper humidity, or with poor ventilation, affecting its performance and stability.

    • Solutions:

      1. Initial diagnosis: Immediately check whether the instance status is "Running" in the Cloud HSM console. Also, check the Alibaba Cloud status page or internal messages to determine whether any service failures or scheduled maintenance are occurring in the current region.

      2. Network troubleshooting: Check the security group and network ACL rules of the VPC where the application server and HSM instance reside. Ensure that network access to the service port is allowed.

  • High CPU/memory usage

    • Association analysis: On the monitoring page, compare the CPU usage curve with the TPS monitoring curve for the time period when the issue occurred.

      • If CPU usage and TPS increase at the same time, this is usually caused by a peak in service traffic and is considered normal.

      • If CPU usage is high but TPS is not, the application may be performing many complex key generation or asymmetric encryption/decryption operations.

    • Categorized handling:

      • Short-term response: If the issue is caused by a sudden increase in service traffic, evaluate whether to temporarily add nodes to the cluster to distribute the load.

      • Long-term optimization: If the issue is due to application call logic, perform code optimization. If you have a long-term capacity shortage, scale out your resources promptly.

  • Abnormal cluster synchronization status

    • Solution: For manually synchronized HSM clusters in the Chinese mainland, go to the instance list page and click Synchronize Cluster to perform a manual synchronization.
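
The association analysis for high CPU usage can be sketched as a comparison of TPS during the incident window against a baseline window. The Python example below is a rough heuristic under stated assumptions, not an official diagnostic: the function name, the returned labels, and the `tps_rise_ratio` of 1.5 are hypothetical, and only the 85% CPU threshold is taken from the default alert rule.

```python
from statistics import mean
from typing import Sequence


def diagnose_high_cpu(
    cpu_incident: Sequence[float],
    tps_baseline: Sequence[float],
    tps_incident: Sequence[float],
    cpu_threshold: float = 85.0,   # from the default CPU usage alert rule
    tps_rise_ratio: float = 1.5,   # hypothetical cutoff for "TPS increased"
) -> str:
    """Classify a high-CPU incident by comparing TPS with a baseline window."""
    if mean(cpu_incident) <= cpu_threshold:
        return "cpu-not-high"
    if mean(tps_incident) > tps_rise_ratio * mean(tps_baseline):
        # CPU and TPS rose together: usually a peak in service traffic.
        return "traffic-peak"
    # CPU is high without a matching TPS rise: the application may be running
    # many complex key generation or asymmetric operations.
    return "expensive-operations"
```

A result of `traffic-peak` points to the short-term response (temporarily adding nodes to the cluster), whereas `expensive-operations` suggests reviewing the application's call logic first.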

Quotas and limits

  • Data retention period: Metric data can be viewed and stored for a maximum of 30 days.

  • Regional and feature limitations:

    • TPS monitoring: Available only for HSM instances in the Chinese mainland.

    • Cluster TPS monitoring: Available only when all instances in the cluster are HSMs in the Chinese mainland.

  • Alert configuration: TPS monitoring metrics do not support one-click alerting. You must set custom alert rules in CloudMonitor.