Simple Log Service: Monitor data transformation jobs (new version)

Last Updated: Aug 28, 2024

This topic describes the metrics of data transformation jobs and how to view the data transformation dashboard and configure job monitoring.

Metric data

To view metric data of data transformation jobs, you must first enable the service log feature and select job operational logs as the log type. For more information, see Use the service log feature.
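
After you enable the service log feature, job metrics are written to the Logstore that you selected as logs with the topic etl_metrics, which is the topic that the alert rule samples in this topic query. As a quick check that metric data is arriving, you can run a query similar to the following minimal sketch against that Logstore:

__topic__: etl_metrics
| select
  count(*) as metric_entries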

Dashboard

After you create a data transformation job, Simple Log Service creates a dashboard for the job on the job details page. You can view metrics of the job on the dashboard.

Procedure

  1. Log on to the Simple Log Service console.

  2. In the Projects section, click the project that you want to manage.

  3. In the left-side navigation pane, choose Job Management > Data Transformation.

  4. Click the data transformation job that you want to manage. Then, view the dashboard in the Execution Status section.

Overall metrics

The overall metrics are as follows:

  • Transforming Speed (events/s): the transformation rate in events per second. The default statistical window is one hour.

    • ingest: the number of data records that were read from each shard of the source Logstore.

    • deliver: the number of data records that were successfully written to the destination Logstore.

    • failed: the number of data records that were read from each shard of the source Logstore but failed to be transformed.

  • Total Events Read: the total number of data records that were read from each shard of the source Logstore. The default statistical window is one day.

  • Total Events Delivered: the total number of data records written to all destination Logstores. The default statistical window is one day.

  • Total Events Failed: the total number of data records that were read from each shard of the source Logstore but failed to be transformed. The default statistical window is one day.

  • Event Delivered Ratio: the ratio of the number of data records that were successfully delivered to the destination Logstore to the number of data records that were read from the source Logstore. The default statistical window is one day.
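
The overall metrics above are derived from the job operational logs. For reference, the following query is a simplified sketch, not the exact query behind the dashboard, that computes the total read and delivered event counts and the delivered ratio for a job. Replace {job_name} with the name of the data transformation job:

__topic__: etl_metrics and 
job_name: {job_name}
| select
  ingest as total_read,
  deliver as total_delivered,
  round(deliver / (ingest + 0.0001), 2) as delivered_ratio
from(
    select
      sum(
        if(
          "_etl_:connector_meta.action" = 'ingest',
          "_etl_:connector_metrics.events",
          0
        )
      ) as ingest,
      sum(
        if(
          "_etl_:connector_meta.action" = 'deliver',
          "_etl_:connector_metrics.events",
          0
        )
      ) as deliver
    from log
  )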

Shard details analysis

This section displays the per-minute metrics of each shard from which the transformation job reads data in the source Logstore.

  • Shard Consuming Latency (s): the difference between the receive time of the most recent data record written to a shard and the receive time of the data record that the job is currently processing from that shard. Unit: seconds.

  • Shard Transforming Stats (events): the statistics about active shards. The default statistical window is one hour.

    • shard: the sequence number of the shard.

    • ingest: the number of raw data records that were read from the shard.

    • failed: the number of raw data records that were read from the shard but failed to be transformed.
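
The per-shard statistics above can also be queried directly from the job operational logs. The following sketch reuses the shard and event fields that appear in the alert rule samples later in this topic and aggregates the number of events read from each shard. Replace {job_name} with the name of the data transformation job:

__topic__: etl_metrics and 
job_name: {job_name} and
"_etl_:connector_meta.action": ingest
| select
  split_part(
    "_etl_:connector_meta.task_name",
    '#',
    2
  ) as shard,
  sum("_etl_:connector_metrics.events") as ingest
group by
  shard
having
  shard is not null
limit
  all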

Transformation errors

You can view the details of an error based on the message field.
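
If you prefer to inspect errors with a query instead of the dashboard, the following sketch, based on the error field that is used in the alert rule samples later in this topic, counts the occurrences of each distinct error for a job. Replace {job_name} with the name of the data transformation job:

__topic__: etl_metrics and 
job_name: {job_name} and
"_etl_:connector_metrics.error": *
| select
  "_etl_:connector_metrics.error" as error,
  count(*) as occurrences
group by
  "_etl_:connector_metrics.error"
order by
  occurrences desc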

Alert monitoring rules

Monitoring of data transformation (new version) jobs relies on job metrics. For more information, see Metric data. You can use the alerting feature of Simple Log Service to monitor jobs. For more information, see Alerting. This section describes the following alert rules for data transformation (new version) jobs: shard consumption latencies, data transformation exceptions, and same-period comparisons of events delivered ratios, events read, and events delivered. For more information about how to create an alert rule, see Create an alert rule.

Important

When you create an alert rule, you must specify the project and Logstore that store the job operational logs as the statistical source. For more information about how to save job operational logs, see Use the service log feature.

Shard consumption latencies

Purpose

A rule of this type monitors the latency that occurs when data is consumed from shards in data transformation jobs. If the latency during data transformation exceeds the threshold in the rule, an alert is triggered.

Associated dashboard

For more information, see Shard details analysis.

Sample SQL

In the following template, replace {job_name} with the name of the data transformation job that you want to monitor.

__topic__: etl_metrics and 
job_name: {job_name} and
"_etl_:connector_meta.action": ingest
| select
  split_part(
    "_etl_:connector_meta.task_name",
    '#',
    2
  ) as shard,
  max_by("_etl_:connector_metrics.lags", __time__) as lags
group by
  shard
having
  shard is not null
limit
  all

Monitoring rule

  • For Trigger Condition, set When to the query result contains and set the expression to lags > 120, which triggers an alert when the consumption latency exceeds 120 seconds.

  • Set the query interval to 5 minutes.

  • Set the query frequency to 5 minutes.

Note

We recommend these settings because job metrics are updated at a one-minute granularity. Evaluating the rule over a 5-minute window, instead of a single one-minute window, prevents false alerts caused by transient latency spikes in the data.
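
Alternatively, if you want the query itself to return only the shards whose latency exceeds the threshold, the following variant of the sample SQL, which is not part of the original template and assumes that the lags field is indexed as a numeric type, moves the 120-second threshold into the having clause. In this case, you can set When to data is returned instead of evaluating the lags > 120 expression:

__topic__: etl_metrics and 
job_name: {job_name} and
"_etl_:connector_meta.action": ingest
| select
  split_part(
    "_etl_:connector_meta.task_name",
    '#',
    2
  ) as shard,
  max_by("_etl_:connector_metrics.lags", __time__) as lags
group by
  shard
having
  shard is not null
  and max_by("_etl_:connector_metrics.lags", __time__) > 120
limit
  all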

Handling method

You can clear triggered alerts based on the following rules:

  1. If the data transformation job is newly created, it requires some time to catch up on historical data. Observe the job for about one hour and check whether the metric value drops below the threshold. If the job is not newly created, proceed to the next step.

  2. If the data volume in the source Logstore significantly increases, perform the following operations based on your business requirements:

    • If Transforming Speed (events/s) increases and Shard Consuming Latency (s) decreases, the job is automatically scaling out resources to respond to the increasing data volume in the source Logstore. In this case, wait five minutes and then check whether the latency drops below the threshold. If the latency does not, proceed to the next step.

    • If Transforming Speed (events/s) does not increase or Shard Consuming Latency (s) still increases, the job fails to scale out resources due to shard shortage in the source Logstore. In this case, you must manually split shards in the source Logstore. For more information, see Split a shard. After you split the shards, wait five minutes and then check whether the latency drops below the threshold. If the latency does not, proceed to the next step.

  3. If data transformation exceptions exist, troubleshoot the exceptions first. After troubleshooting, wait five minutes and then check whether the latency drops below the threshold. If the latency does not, proceed to the next step.

  4. If you are unable to clear the alert, submit a ticket with project, Logstore, and job information for technical support.

Data transformation exceptions

Purpose

A rule of this type monitors exceptions in data transformation jobs. If an exception occurs during data transformation, an alert is triggered.

Associated dashboard

For more information, see Transformation errors.

Sample SQL

In the following template, replace {job_name} with the name of the data transformation job that you want to monitor.

__topic__: etl_metrics and 
job_name: {job_name} and 
"_etl_:connector_metrics.error": *
| select
  distinct "_etl_:connector_metrics.error" as errors

Monitoring rule

  • For Trigger Condition, set When to data is returned.

  • Set the query interval to 10 minutes.

  • Set the query frequency to 10 minutes.

Handling method

Fix exceptions based on the error message.

  • If the error message contains Invalid SPL query, the SPL (Simple Log Service Processing Language) rule configured for the job has a syntax error and must be corrected based on the error message. For more information, see SPL syntax.

  • If the error message contains Unauthorized, InvalidAccessKeyId, or SignatureNotMatch, the data transformation job does not have the required permissions to read data from the source Logstore or write data to the destination Logstore. For more information, see Authorization overview.

  • If the error message contains ProjectNotExist or LogStoreNotExist, the related project or Logstore of the data transformation job does not exist. In this case, log on to the Simple Log Service console to troubleshoot the issue.

  • If you are unable to clear the alert, submit a ticket with project, Logstore, and job information for technical support.
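
To locate the tasks that recently reported a specific error, you can narrow the error query down with a keyword. The following sketch is an assumption based on the fields used in the samples in this topic and uses Unauthorized as an example keyword to list the latest matching metric logs. Replace {job_name} with the name of the data transformation job:

__topic__: etl_metrics and 
job_name: {job_name} and
"_etl_:connector_metrics.error": Unauthorized
| select
  __time__,
  "_etl_:connector_meta.task_name" as task,
  "_etl_:connector_metrics.error" as error
order by
  __time__ desc
limit
  10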

Events delivered ratios (same-period comparisons)

Purpose

A rule of this type compares the ratio of data records written to the destination Logstore with the ratio in the same period of the previous day or the previous week. An alert is triggered if the value exceeds the specified threshold for increases or falls below the specified threshold for decreases.

Associated dashboard

Event Delivered Ratio: the ratio of the number of data records that were successfully delivered to the destination Logstore to the number of data records that were read from the source Logstore. The default statistical window is one day.

Sample SQL

In the dialog box for creating the alert rule, enter the following SQL statement for queries:

In the following template, replace {job_name} with the name of the data transformation job that you want to monitor.

__topic__: etl_metrics and 
job_name: {job_name}
| select
  round(diff [1], 1) as percent,
  round(coalesce(diff [3], 0), 1) as ratio_1d,
  round(coalesce(diff [5], 0), 1) as ratio_1w
from(
    select
      compare(percent, 86400, 604800) as diff
    FROM      (
        select
          deliver /(ingest + 0.0001) as percent
        from(
            select
              sum(
                if(
                  "_etl_:connector_meta.action" = 'ingest',
                  "_etl_:connector_metrics.native_bytes",
                  0
                )
              ) as ingest,
              sum(
                if(
                  "_etl_:connector_meta.action" = 'deliver',
                  "_etl_:connector_metrics.native_bytes",
                  0
                )
              ) as deliver
            FROM              log
          )
      )
  )

Monitoring rule

  • For Trigger Condition, set When to the query result contains and set the expression to (ratio_1d>120 || ratio_1d<80) && (ratio_1w>120 || ratio_1w<80), which means that the threshold is 20% for both increases and decreases.

  • Set the query interval to 1 hour.

  • Set the query frequency to 1 hour.

Note

We recommend that you set a threshold of 20% or higher for daily or weekly comparisons or adjust the comparison cycle to match the raw traffic cycle. This helps prevent false alerts caused by periodic fluctuations in the raw data traffic.

Handling method

You can clear triggered alerts based on the following rules:

  1. If the data volume in the source Logstore changes, check whether a new type of data was added or an existing type of data stopped being written. If so, and the resulting volume change matches the metric change, the metric change is caused by the change in data patterns. If no such change exists, proceed to the next step.

  2. If transformation latencies or exceptions exist, troubleshoot these issues first. After you troubleshoot these issues, wait 15 minutes and check whether the latency is less than 1 minute. If so, check whether the changes in processed data volume match the data volume changes in the source Logstore. If they do not match, proceed to the next step.

  3. If you are unable to clear the alert, submit a ticket with project, Logstore, and job information for technical support.

Events read ratios (same-period comparisons)

Purpose

A rule of this type compares the number of data records read from the source Logstore with the number in the same period of the previous day or the previous week. An alert is triggered if the value exceeds the specified threshold for increases or falls below the specified threshold for decreases.

Associated dashboard

Total Events Read: the total number of data records that were read from each shard of the source Logstore. The default statistical window is one day.

Sample SQL

In the dialog box for creating the alert rule, enter the following SQL statement for queries:

In the following template, replace {job_name} with the name of the data transformation job that you want to monitor.

__topic__: etl_metrics and 
job_name: {job_name} and
"_etl_:connector_meta.action": ingest
| select
  diff [1] as events,
  round(coalesce(diff [3], 0),  1) as ratio_1d,
  round(coalesce(diff [5], 0),  1) as ratio_1w
from(
    select
      compare(events, 86400, 604800) as diff
    FROM      (
        select
          sum("_etl_:connector_metrics.events") as events
        FROM          log
      )
  )

Monitoring rule

  • For Trigger Condition, set When to the query result contains and set the expression to (ratio_1d>120 || ratio_1d<80) && (ratio_1w>120 || ratio_1w<80), which means that the threshold is 20% for both increases and decreases.

  • Set the query interval to 1 hour.

  • Set the query frequency to 1 hour.

Note

We recommend that you set a threshold of 20% or higher for daily or weekly comparisons or adjust the comparison cycle to match the raw traffic cycle. This helps prevent false alerts caused by periodic fluctuations in the raw data traffic.

Handling method

You can clear triggered alerts based on the following rules:

  1. If the value changes are consistent with the changes in the data volume in the source Logstore, the value changes are caused by the data volume changes in the source Logstore. If they are not, proceed to the next step.

  2. If transformation latencies or exceptions exist, troubleshoot these issues first. After you troubleshoot these issues, wait 15 minutes and check whether the latency is less than 1 minute. If so, check whether the changes in processed data volume match the data volume changes in the source Logstore. If they do not match, proceed to the next step.

  3. If you are unable to clear the alert, submit a ticket with project, Logstore, and job information for technical support.

Events delivered (same-period comparisons)

Purpose

A rule of this type compares the number of data records written to the destination Logstore with the number in the same period of the previous day or the previous week. An alert is triggered if the value exceeds the specified threshold for increases or falls below the specified threshold for decreases.

Associated dashboard

Dashboard > Overall metrics > Total Events Delivered

Sample SQL

In the dialog box for creating the alert rule, enter the following SQL statement for queries:

In the following template, replace {job_name} with the name of the data transformation job that you want to monitor.

__topic__: etl_metrics and 
job_name: {job_name} and
"_etl_:connector_meta.action": deliver
| select
  diff [1] as events,
  round(coalesce(diff [3], 0),  1) as ratio_1d,
  round(coalesce(diff [5], 0),  1) as ratio_1w
from(
    select
      compare(events, 86400, 604800) as diff
    FROM      (
        select
          sum("_etl_:connector_metrics.events") as events
        FROM          log
      )
  )

Monitoring rule

  • For Trigger Condition, set When to the query result contains and set the expression to (ratio_1d>120 || ratio_1d<80) && (ratio_1w>120 || ratio_1w<80), which means that the threshold is 20% for both increases and decreases.

  • Set the query interval to 1 hour.

  • Set the query frequency to 1 hour.

Note

We recommend that you set a threshold of 20% or higher for daily or weekly comparisons or adjust the comparison cycle to match the raw traffic cycle. This helps prevent false alerts caused by periodic fluctuations in the raw data traffic.

Handling method

You can clear triggered alerts based on the following rules:

  1. If the value changes are consistent with the changes in the data volume in the source Logstore, the value changes are caused by the data volume changes in the source Logstore. If they are not, proceed to the next step.

  2. If transformation latencies or exceptions exist, troubleshoot these issues first. After you troubleshoot these issues, wait 15 minutes and check whether the latency is less than 1 minute. If so, check whether the changes in processed data volume match the data volume changes in the source Logstore. If they do not match, proceed to the next step.

  3. If you are unable to clear the alert, submit a ticket with project, Logstore, and job information for technical support.