You can create a notification rule for a Platform for AI (PAI) workspace to track and monitor the status of Deep Learning Containers (DLC) jobs. This topic describes how to configure a notification rule.
Configure a notification rule
On the Workspace Details page, choose . Then, click Create Event Rule.
In the Create Event Rule panel, configure the following parameters, and click Submit.
Parameter
Description
Rule Name
Follow the on-screen instructions to specify a rule name.
Event Type
Select DLC Jobs for Event Type. Then, select one of the following:
Job Process
Enter Queue: The job enters the queued status.
Start Bidding: The job enters the bidding status.
Start Environment Preparation: The job enters the preparing environment status.
Start Run: The job enters the running status.
Job Failure: The job execution failed.
Job Completed (Succeeded or Failed): The job succeeded or failed.
Automatic Fault Tolerance: When a DLC job encounters an exception or error and performs automatic fault tolerance processing, a notification is sent.
Job Timeout: If you select this option, you must first set the timeout rule on the scheduling settings page of the workspace. For more information, see Configure a timeout rule.
Queue Timeout: The queue duration exceeds the specified maximum queue duration.
Environment Preparation Timeout: The environment preparation duration exceeds the specified maximum preparation duration.
Wait Timeout: The waiting duration from job creation to running exceeds the specified maximum waiting duration.
Run Timeout: The job running duration exceeds the specified maximum running duration, which triggers an automatic stop.
Other Events
Job Preempted: When an idle job or bidding job is preempted, a notification is sent.
Job Manually Stopped
Job Priority Modified
Event Scope
Valid values:
Created by Me: Only the DLC jobs you created.
In the current workspace: All DLC jobs in the current workspace.
Event Target
Notifications can be sent through DingTalk notification, voice call, text message, and email.
After you create a notification rule, the system automatically sends an alert to the designated contact whenever a job triggers the rule. When you receive an alert, we recommend that you go to the Deep Learning Containers (DLC) page to check whether your jobs are performing as expected. For further troubleshooting, refer to the monitoring status and logs of the jobs. For more information, see View training jobs.
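The four Job Timeout conditions in the table above each compare a measured duration against a configured maximum. The following Python snippet is a minimal sketch of that comparison, not PAI's implementation: the class names, field names, and the use of seconds are all hypothetical, and it assumes that an unset maximum never produces a timeout event.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TimeoutLimits:
    """Hypothetical container for the configured maximums, in seconds.

    None means that no timeout rule is set for that duration.
    """
    max_queue: Optional[float] = None
    max_env_prep: Optional[float] = None
    max_wait: Optional[float] = None
    max_run: Optional[float] = None


@dataclass
class JobDurations:
    """Hypothetical record of how long a job has spent in each phase."""
    queue: float = 0.0
    env_prep: float = 0.0
    run: float = 0.0

    @property
    def wait(self) -> float:
        # The waiting duration spans job creation to running. The timeout
        # rule settings describe it as queue duration + environment
        # preparation duration.
        return self.queue + self.env_prep


def timeout_events(d: JobDurations, limits: TimeoutLimits) -> List[str]:
    """Return the names of the timeout events whose maximum is exceeded."""
    checks = [
        ("Queue Timeout", d.queue, limits.max_queue),
        ("Environment Preparation Timeout", d.env_prep, limits.max_env_prep),
        ("Wait Timeout", d.wait, limits.max_wait),
        ("Run Timeout", d.run, limits.max_run),
    ]
    # Strict comparison: a duration equal to the maximum is not a timeout.
    return [
        name
        for name, actual, limit in checks
        if limit is not None and actual > limit
    ]
```

For example, a job that queues for 400 seconds and prepares its environment for 300 seconds stays under a 600-second queue maximum but exceeds a 600-second waiting maximum, so only the Wait Timeout event applies.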
Configure a timeout rule
To configure a timeout rule for specific event types, follow these steps:
Go to the Configure Workspace page and select the DataWorks Scheduling Settings tab. Then, configure the maximum running duration and maximum job wait time in the DLC section.
Policy
Description
Resource Quota
Configure the maximum waiting duration for jobs using specified resources. Valid values:
Public Resource Group
Resource Quota: Select a resource quota associated with this workspace.
Timeout Rule Configuration
Set the timeout duration for specified event types. Valid values:
Job Waiting Duration (Queue Duration + Environment Preparation Duration)
Queue Duration
Environment Preparation Duration
To add multiple timeout rules, click Add.
Click Save.
Then, go to the Configure Event Notification tab and create a notification rule for the corresponding timeout events. Otherwise, no alerts are sent. For more information, see Configure a notification rule.
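A timeout alert depends on two separate settings: a timeout rule for the event and a notification rule that subscribes to it. The following Python snippet is a minimal sketch of that dependency together with the Event Scope filter. It is an illustration under stated assumptions, not PAI's logic: `NotificationRule`, `rule_fires`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Set

# Event names taken from the Job Timeout group in this topic.
TIMEOUT_EVENTS = {
    "Queue Timeout",
    "Environment Preparation Timeout",
    "Wait Timeout",
    "Run Timeout",
}


@dataclass
class NotificationRule:
    """Hypothetical model of a notification rule."""
    name: str
    event_types: Set[str]
    scope: str  # "Created by Me" or "In the current workspace"
    owner: str  # the user who created the rule


def rule_fires(
    rule: NotificationRule,
    event_type: str,
    job_creator: str,
    configured_timeouts: Set[str],
) -> bool:
    """Return True if the event on this job should produce an alert."""
    # The rule must subscribe to the event type.
    if event_type not in rule.event_types:
        return False
    # "Created by Me" limits the rule to jobs created by the rule owner.
    if rule.scope == "Created by Me" and job_creator != rule.owner:
        return False
    # Assumption: a timeout event can occur only if a matching timeout
    # rule was configured in the scheduling settings.
    if event_type in TIMEOUT_EVENTS and event_type not in configured_timeouts:
        return False
    return True
```

In this sketch, a rule that subscribes to Run Timeout stays silent until a timeout rule for the running duration also exists, which mirrors the requirement that both settings be configured.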