If you are short on computing power, you can use the preemptible job feature of Platform for AI (PAI), which allocates computing resources through a bidding system. Preemptible resources are usually cheaper than public pay-as-you-go resources, which gives you cost-effective access to AI computing power and reduces the overall cost of jobs. This topic describes how to use preemptible resources when you create a job in Deep Learning Containers (DLC) with Lingjun AI Computing Service resources.
Limits
Only users in a whitelist can use preemptible jobs. Contact your account manager before using the feature.
The preemptible job feature is only available in the China (Ulanqab) and Singapore regions.
The preemptible job feature only supports Lingjun AI Computing Service resources.
Preemptible jobs are subject to the following constraints:
Cannot be converted into subscription instances.
Instance and bandwidth specifications cannot be modified.
Does not support ICP Filing services.
No discounts for major customers.
Usage notes
The price of preemptible resources fluctuates with current supply and demand, and can be up to 90% lower than the price of public pay-as-you-go instances.
Because preemptible resources can be preempted by all users of Alibaba Cloud, the availability of preemptible resources is not guaranteed. If you need to run a DLC job with preemptible resources, pay attention to the following points:
Resource request: After you submit a DLC job with preemptible resources, the system begins to preempt resources. If the resource inventory is insufficient, the job enters a pending state until resources are available.
Resource revocation: Preemptible resources may be revoked based on market price, inventory, and the maximum bid price and duration of the instance. Even while your DLC job is running, resources may be revoked without notice if the maximum bid price falls below the average market price or inventory is insufficient, which causes the job to fail. To improve job stability, you can:
Enable Automatic Fault Tolerance when submitting preemptible jobs. This allows your task to re-enter the bidding process and potentially run again. For more information, see AIMaster: Elastic fault tolerance engine.
Use the EasyCkpt framework for PyTorch model training, which supports frequent checkpoint saving and resuming training after interruptions. For more details, see Use EasyCkpt to save and resume foundation model trainings.
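The checkpoint-and-resume pattern behind the second suggestion can be illustrated with a minimal sketch. It uses only the Python standard library and is not the EasyCkpt API; the function names, the JSON file format, and the `stop_at` parameter (which simulates a revocation) are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def train(path, num_steps, save_every=10, stop_at=None):
    """Toy training loop that resumes from the last checkpoint.

    stop_at (hypothetical) simulates a resource revocation at that step.
    """
    state = load_checkpoint(path)
    while state["step"] < num_steps:
        if stop_at is not None and state["step"] == stop_at:
            return state  # simulated revocation: progress since last save is lost
        state["total"] += state["step"]  # stand-in for one real training step
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

In this sketch, a job interrupted at step 35 with `save_every=10` restarts from step 30, so a smaller save interval reduces the work lost on each revocation at the cost of more frequent writes.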
Billing
Price description:
To use preemptible resources, you need to set a maximum bid price (preemptibleWithPriceLimit). The market price for preemptible resources fluctuates with supply and demand in real time, and multiple jobs using the same resources may incur identical costs for a given period. The maximum bid price ranges from 10% to 90% of the market price in 10% increments. The actual market price and maximum bid price are displayed in the console. The following table describes resource specifications and price ranges for preemptible resources:
Resource specification | Market price range (USD/hour) | Maximum bid price range (USD/hour) | Region |
ml.gu7ef.8xlarge-gu100 | 5.700~57.000 | 5.700~51.300 | China (Ulanqab) |
ml.gu7xf.8xlarge-gu108 | 5.040~50.400 | 5.040~45.360 | China (Ulanqab) |
ml.gu8xf.8xlarge-gu108 | 12.240~122.400 | 12.240~110.160 | China (Ulanqab) |
ml.gu8ef.8xlarge-gu100 | 23.220~232.200 | 23.220~208.980 | Singapore |
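The bid tiers can be worked out with a short sketch. It assumes that the percentages apply to the upper bound of the market price range, an interpretation inferred from the table rather than stated explicitly; the function name `bid_tiers` is hypothetical and is not part of any PAI SDK.

```python
from decimal import Decimal


def bid_tiers(reference_price):
    """Return the nine candidate bids: 10%, 20%, ..., 90% of reference_price.

    Assumption: reference_price is the upper bound of the market price range.
    """
    price = Decimal(str(reference_price))
    return [
        (price * pct / 100).quantize(Decimal("0.001"))
        for pct in range(10, 100, 10)
    ]


# Example with the upper market price of ml.gu7ef.8xlarge-gu100 (USD/hour).
tiers = bid_tiers("57.000")
lowest, highest = tiers[0], tiers[-1]
```

Under this assumption, `lowest` and `highest` are 5.700 and 51.300, which match the maximum bid price range listed for that specification. Always confirm the actual values in the console, because market prices change in real time.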
View billing details:
On the day after a job runs, you can go to the Expense and Costs page to review the costs incurred by preemptible resources. As with pay-as-you-go resources in DLC, the billing details of preemptible resource orders are displayed on the page. For more information, see View billing details.
Scenarios
Applicable scenarios:
We recommend that you use preemptible resources to reduce costs in the following scenarios:
Jobs with short runtimes.
Jobs during debugging.
Jobs that allow interruptions.
Jobs that support resumption from interruptions, such as jobs using the EasyCkpt framework for PyTorch model training, which supports frequent checkpoint saving and resuming training after interruptions. For more information, see Use EasyCkpt to save and resume foundation model trainings.
Inapplicable scenarios:
Do not use preemptible resources for services that require high stability.
Procedure
To use preemptible resources for DLC jobs with Lingjun AI Computing Service, perform the following steps:
Go to the Create Job page. For more information, see Step 1: Go to the Create Job page.
Configure the following key parameters. For more information, see Submit training jobs.
Section | Parameter | Description |
Resource Information | Resource Type | Select Lingjun AI Computing Service. |
Resource Information | Source | Select Preemptible Resources. |
Resource Information | Job Resource | In the Resource Type column, click to select an instance type and set the Maximum Bid Price. The maximum bid price ranges from 10% to 90% of the market price in 10% increments. You can get the preemptible resources if your bid meets or exceeds the market price and inventory is available. |
VPC | VPC(ID), Security Group, vSwitch | From the dropdown lists, select your virtual private cloud (VPC), security group, and vSwitch. |
Fault Tolerance and Diagnosis | Automatic Fault Tolerance | We recommend that you enable Automatic Fault Tolerance, which allows preemptible jobs to re-enter the bidding queue after resource revocation. The job can resume when the average market price falls below your maximum bid price. For more information, see AIMaster: Elastic fault tolerance engine. |
After you configure the parameters, click Confirm to submit the job.
DLC then starts to request preemptible resources. If no resources are available, the job enters a pending state.
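The request and revocation behavior described in this topic can be summarized as a simplified state model. The following sketch is illustrative only: the state names, the `next_state` function, and its parameters are hypothetical, and the real service also considers factors such as instance duration that are not modeled here.

```python
def next_state(state, max_bid, market_price, inventory_available,
               fault_tolerance=False):
    """Return the next job state under a simplified bidding model.

    Assumption: a job can hold resources only while inventory is available
    and its maximum bid is at least the current market price.
    """
    can_run = inventory_available and max_bid >= market_price
    if state == "pending":
        # The job waits until its bid wins and inventory exists.
        return "running" if can_run else "pending"
    if state == "running":
        if can_run:
            return "running"
        # Resources are revoked: with Automatic Fault Tolerance the job
        # re-enters the bidding queue, otherwise it fails.
        return "pending" if fault_tolerance else "failed"
    return state  # "failed" is terminal in this model
```

For example, a running job whose bid drops below the market price moves to "failed" by default, but returns to "pending" when `fault_tolerance=True`, which mirrors the recommendation to enable Automatic Fault Tolerance.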