Use a preemptible job

If you do not have sufficient computing power, you can use the preemptible job feature of Platform for AI (PAI), which allocates computing resources by using a bidding system. In most cases, preemptible resources offer a price advantage over public pay-as-you-go resources. This allows cost-effective access to AI computing power and reduces the total cost of jobs. This topic describes how to use preemptible resources when you create a Deep Learning Containers (DLC) job.

Limits

Preemptible resources have the following limits:

Type	Lingjun resources	General-purpose computing resources
Supported regions	China (Ulanqab), Singapore	China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Ulanqab), China (Guangzhou)
Framework type	PyTorch, MPIJob	PyTorch
AIMaster-based automatic fault tolerance	Supported	Not supported
Limits on features	Preemptible instances cannot be converted into subscription instances. Instance and bandwidth specifications cannot be modified. The ICP filing service is not supported. No discounts are provided for major customers.

Features

Using preemptible resources
You can use general-purpose computing resources or Lingjun resources to create DLC jobs. The market prices of preemptible resources change based on supply and demand. Preemptible instances can cost up to 90% less than pay-as-you-go instances. Preemptible resources can be preempted by all Alibaba Cloud users and are released when their protection periods end. Take note of the following considerations when you use preemptible resources to submit DLC jobs:
- When DLC fails to obtain preemptible instances due to insufficient instance resources, the preemptible jobs enter the waiting state and DLC continues to apply for preemptible resources.
- After the system applies for preemptible resources, DLC jobs are created and run.
- After preemptible resources are released, DLC jobs fail and stop running.
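The lifecycle above (waiting, running, then failed if the resources are released) can be observed by polling the job status. The following is a minimal sketch, not part of the DLC SDK: `wait_for_job` is a hypothetical helper, `get_status` stands for whatever job-query call you wire in, and the status names in `TERMINAL_STATUSES` are assumptions that you should check against the values your SDK version returns.

```python
import time
from typing import Callable

# Assumed terminal status names; verify them against your SDK version.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_job(get_status: Callable[[], str],
                 poll_seconds: float = 30.0,
                 max_polls: int = 120,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the job reaches a terminal status or the poll budget runs out.

    get_status is any zero-argument callable that returns the current status
    string, for example a small wrapper around a job-query request.
    """
    status = get_status()
    for _ in range(max_polls - 1):
        if status in TERMINAL_STATUSES:
            break
        # A waiting job stays non-terminal while DLC keeps applying for resources.
        sleep(poll_seconds)
        status = get_status()
    return status
```

A job that ends in a failed status after its resources are released can then be resubmitted or resumed from a checkpoint by the caller.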
Applying for preemptible resources
When you use preemptible resources to create DLC jobs, DLC starts to preempt instance resources after you submit the jobs. If you want to create preemptible instances, the following requirements must be met:
- The maximum bidding price that you configured for preemptible resources must be greater than or equal to the market price.
- The inventory of preemptible resources is sufficient.
Releasing preemptible resources
Preemptible resources can be interrupted and released based on the market price, resource inventory, maximum bidding price configured for an instance during job creation, and usage duration. In the following scenarios, the preemptible resources may be released without a notification:
- Lingjun resources: If the maximum bidding price of a preemptible resource is lower than the average price or the inventory of the resource is insufficient, the resource is released.
- General-purpose computing resources: If the maximum bidding price of a preemptible resource is lower than the current market price or the inventory of the preemptible resource is insufficient, the preemptible resource is released.
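The acquisition and release conditions described in this section can be sketched as two predicates. This is an illustration of the rules as stated here, not the actual DLC scheduler: `SpotMarket`, `can_acquire`, and `may_be_released` are hypothetical names, and the sketch ignores protection periods and usage duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotMarket:
    # For Lingjun resources this stands for the average price; for
    # general-purpose computing resources, the current market price.
    reference_price: float
    inventory_available: bool


def can_acquire(max_bid: float, market: SpotMarket) -> bool:
    """Both requirements must hold: bid >= price and sufficient inventory."""
    return max_bid >= market.reference_price and market.inventory_available


def may_be_released(max_bid: float, market: SpotMarket) -> bool:
    """Either condition is enough: bid below price or insufficient inventory."""
    return max_bid < market.reference_price or not market.inventory_available
```

Under these two rules, `may_be_released` is simply the negation of `can_acquire`, which is why a job that was running can lose its resources as soon as the price rises above its bid.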
To ensure that your preemptible jobs can continuously and stably run, you can perform the following operations:
- Turn on Automatic Fault Tolerance when you create a job by using Lingjun resources. After you turn on Automatic Fault Tolerance, your job automatically enters the queue to bid for preemptible resources. For more information, see AIMaster: Elastic fault tolerance engine.
- When you use general-purpose computing or Lingjun resources to create a job, you can use the EasyCkpt framework to train a PyTorch large language model (LLM). The job can frequently perform and save checkpoints, and allows interruption. For more information, see Use EasyCkpt to save and resume foundation model trainings.
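The checkpoint-and-resume pattern mentioned above can be illustrated with the standard library alone. The sketch below is not EasyCkpt and does not use its API; `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical helpers, and the "training step" is a stand-in that only accumulates a number. It shows the idea: save state periodically, and on restart continue from the last saved step.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temporary file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, save_every: int = 10,
          stop_at: Optional[int] = None) -> dict:
    """Run from the last checkpoint; stop_at simulates a resource release."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] >= stop_at:
            break  # the preemptible resource was released mid-run
        state["total"] += state["step"]  # stand-in for one training step
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    return state
```

Work done after the last checkpoint is lost on interruption, so a shorter `save_every` trades extra I/O for less repeated computation.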

Billing

Pricing
Preemptible jobs use the SpotWithPriceLimit bidding mode, in which you configure a maximum bidding price. When DLC jobs are created by using preemptible resources, the market prices of the resources fluctuate based on supply and demand. If you use the same preemptible resources to submit multiple DLC jobs, the fees for the jobs in a fixed period of time may be the same. The following bidding types are supported.
Note
Lingjun resources support only the spot discount bidding type.
- Bidding price based on the spot discount bidding type: The maximum bidding price is based on the market price of the instance type and ranges from 10% to 90% of the market price in 10% increments.
- Bidding price based on the spot price bidding type: The maximum bidding price is in the market price range.
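As a worked example of the spot discount bidding type, the sketch below enumerates the allowed discount levels and derives a maximum bid from a reference price. The names `ALLOWED_DISCOUNTS` and `max_bid_from_discount` are hypothetical, the reference price of 0.025 is an arbitrary illustrative number, and the exact price that DLC uses as the base is an assumption that you should confirm in the console.

```python
from decimal import Decimal

# 10% to 90% in 10% increments, kept as exact decimals to avoid float drift.
ALLOWED_DISCOUNTS = [Decimal(n) / 10 for n in range(1, 10)]


def max_bid_from_discount(reference_price: Decimal, discount: Decimal) -> Decimal:
    """Return the maximum bid for a discount level such as Decimal('0.4')."""
    if discount not in ALLOWED_DISCOUNTS:
        raise ValueError(f"discount must be one of {ALLOWED_DISCOUNTS}")
    return reference_price * discount


# Example: a 40% level applied to an illustrative reference price.
example_bid = max_bid_from_discount(Decimal("0.025"), Decimal("0.4"))
```

`Decimal` is used instead of `float` so that a level such as 0.3 compares equal to its entry in the list.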
In the Resource Information section of the Create Job page in DLC, set Source to Preemptible Resources and view the preemptible resources and the market price ranges in the Job Resource section.
Note
For the specific pricing of resource specifications, please refer to the console.
Billing mode
You are charged for preemptible resources based on the pay-as-you-go billing method. You are charged based on the market price.
Viewing bills
After you run a job, go to the Billing Details page in Expenses and Costs on the next day to view the billing details of the job for which preemptible resources are used. The pay-as-you-go fees are generated by DLC. The instance tag is key:acs:pai:dlc:payType value:spot. For more information about how to view billing details, see View billing details.

Scenarios

Supported scenarios
To reduce costs, we recommend that you use preemptible resources in the following scenarios:
- Short-term computing jobs.
- Computing jobs in the debug state.
- Computing jobs that are fault-tolerant.
- Computing jobs that allow interruption. For example, in scenarios in which the EasyCkpt framework is used to train a PyTorch LLM, you can frequently save checkpoints and restore data from the checkpoints. For more information, see Use EasyCkpt to save and resume foundation model trainings.
Unsupported scenarios
- Services that require high stability.

Procedure

You can submit a preemptible job by using one of the following methods:

Use the PAI console

Go to the Create Job page.
1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).
2. On the Deep Learning Containers (DLC) page, click Create Job.

On the Create Job page, configure the parameters described in the following table. For information about other parameters, see Submit training jobs.

Parameter		Description
Resource Information	Resource Type	Select Lingjun AI Computing Service or General Computing. Note: This parameter is available only if the workspace allows you to use Lingjun resources and general-purpose computing resources.
	Source	Select Preemptible Resources.
	Job Resource	In the Resource Type column, click the icon to select a preemptible resource and configure the Bid Price parameter. The bidding price is your maximum bid. It is based on the original price of the instance type and ranges from 10% to 90% of the market price in 10% increments. You can obtain the preemptible resource if your bid meets or exceeds the market price and the inventory is sufficient.
VPC	VPC, Security Group, vSwitch	Configure a virtual private cloud (VPC) if you use Lingjun resources to submit DLC jobs. Select a VPC, a vSwitch, and a security group from the drop-down lists.
Fault Tolerance and Diagnosis	Automatic Fault Tolerance	If you use Lingjun resources to submit DLC jobs, we recommend that you enable Automatic Fault Tolerance. This feature allows preemptible jobs to automatically re-enter the bidding queue after their resources are revoked. The jobs can resume when the average market price falls below your maximum bidding price. For more information about AIMaster, see AIMaster: Elastic fault tolerance engine.

After you configure the parameters, click Confirm.
After you submit the job, DLC applies for preemptible resources to create and run the job. If no preemptible resources can be obtained, the job enters the waiting state.

Use the SDK

Step 1: Install SDK for Python.

Install the workspace SDK.

pip install alibabacloud_aiworkspace20210204==3.0.1

Install the DLC SDK.

pip install alibabacloud_pai_dlc20201203==1.4.17

Step 2: Submit a preemptible job

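The following two samples submit the same job. The first sample bids with the SpotDiscountLimit parameter, and the second sample bids with the SpotPriceLimit parameter.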

#!/usr/bin/env python3

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient

from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou. 
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs. 

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotDiscountLimit": 0.4,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')

#!/usr/bin/env python3

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient

from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

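region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.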

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotPriceLimit": 0.011,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')

The following table describes the key parameters. For information about other parameters, see Use SDK for Python.

Parameter	Description
SpotStrategy	The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit.
SpotDiscountLimit	The spot discount bidding type. Note: You cannot specify the SpotDiscountLimit and SpotPriceLimit parameters at the same time. The SpotDiscountLimit parameter is valid only for Lingjun resources.
SpotPriceLimit	The spot price bidding type.
UserVpc	This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides.
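The constraints in the preceding table can be checked locally before a request is sent. The following sketch is a hypothetical pre-flight check, not part of the DLC SDK: `validate_spot_job` and the `uses_lingjun` flag are names introduced here for illustration, and the function only encodes the rules stated in this topic.

```python
from typing import Any, Dict, List


def validate_spot_job(job: Dict[str, Any], uses_lingjun: bool) -> List[str]:
    """Return a list of problems found in a CreateJob-style request map."""
    problems: List[str] = []
    for spec in job.get("JobSpecs", []):
        spot = spec.get("SpotSpec")
        if spot is None:
            continue
        has_discount = "SpotDiscountLimit" in spot
        has_price = "SpotPriceLimit" in spot
        if has_discount and has_price:
            problems.append("SpotDiscountLimit and SpotPriceLimit cannot both be set")
        if (has_discount or has_price) and spot.get("SpotStrategy") != "SpotWithPriceLimit":
            problems.append("bidding limits take effect only with SpotWithPriceLimit")
        if has_discount and not uses_lingjun:
            problems.append("SpotDiscountLimit is valid only for Lingjun resources")
        if has_price and uses_lingjun:
            problems.append("Lingjun resources support only the spot discount bidding type")
    if uses_lingjun and not job.get("UserVpc"):
        problems.append("UserVpc is required for Lingjun resources")
    return problems
```

An empty list means that none of these rules is violated. The service may still reject the request for other reasons, such as an unavailable instance type.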
