Use a preemptible job

Updated at: 2025-01-23 08:27

If you do not have sufficient computing power, you can use the preemptible job feature of Platform for AI (PAI), which allocates computing resources by using a bidding system. In most cases, preemptible resources offer a price advantage over public pay-as-you-go resources. This allows cost-effective access to AI computing power and reduces the total cost of jobs. This topic describes how to use preemptible resources when you create a Deep Learning Containers (DLC) job.

Limits

Preemptible resources have the following limits:

Supported regions

  • Lingjun resources: China (Ulanqab) and Singapore

  • General-purpose computing resources: China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Ulanqab), and China (Guangzhou)

Framework type

  • Lingjun resources: PyTorch and MPIJob

  • General-purpose computing resources: PyTorch

AIMaster-based automatic fault tolerance

  • Lingjun resources: Supported

  • General-purpose computing resources: Not supported

Limits on features

  • Preemptible instances cannot be converted into subscription instances.

  • Instance and bandwidth specifications cannot be modified.

  • The ICP filing service is not supported.

  • No discounts are provided for major customers.

Features

  • Using preemptible resources

    You can use general-purpose computing resources or Lingjun resources to create DLC jobs. The market prices of preemptible resources change based on supply and demand. Preemptible instances can cost up to 90% less than pay-as-you-go instances. Preemptible resources are open to bidding by all Alibaba Cloud users and are released when their protection periods end. Take note of the following behavior when you use preemptible resources to submit DLC jobs:

    • If DLC cannot obtain preemptible instances because instance inventory is insufficient, the preemptible jobs enter the waiting state and DLC continues to apply for preemptible resources.

    • After the system obtains preemptible resources, DLC jobs are created and run.

    • After preemptible resources are released, DLC jobs fail and stop running.

  • Applying for preemptible resources

    When you use preemptible resources to create DLC jobs, DLC starts to bid for instance resources after you submit the jobs. A preemptible instance is created only if the following requirements are met:

    • The maximum bidding price that you configured for preemptible resources must be greater than or equal to the market price.

    • The inventory of preemptible resources is sufficient.

  • Releasing preemptible resources

    Preemptible resources can be interrupted and released based on the market price, resource inventory, maximum bidding price configured for an instance during job creation, and usage duration. In the following scenarios, the preemptible resources may be released without a notification:

    • Lingjun resources: If the maximum bidding price of a preemptible resource is lower than the average price or the inventory of the resource is insufficient, the resource is released.

    • General-purpose computing resources: If the maximum bidding price of a preemptible resource is lower than the current market price or the inventory of the preemptible resource is insufficient, the preemptible resource is released.

    To ensure that your preemptible jobs can continuously and stably run, you can perform the following operations:

    • Turn on Automatic Fault Tolerance when you create a job by using Lingjun resources. After you turn on Automatic Fault Tolerance, your job automatically re-enters the queue to bid for preemptible resources after its resources are released. For more information, see AIMaster: Elastic fault tolerance engine.

    • When you use general-purpose computing resources or Lingjun resources to create a job, you can use the EasyCkpt framework to train a PyTorch large language model (LLM). The job can then save checkpoints frequently and tolerate interruption. For more information, see Use EasyCkpt to save and resume foundation model trainings.
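
The acquisition and release rules described in this section can be summarized in a small decision sketch. This is an illustrative model only, not PAI code: the function names, the reference_price argument (the current market price for general-purpose computing resources, or the average price for Lingjun resources), and the inventory_sufficient flag are assumptions introduced for this example.

```python
def can_acquire(max_bid: float, market_price: float, inventory_sufficient: bool) -> bool:
    """A preemptible instance can be created only if the maximum bid is
    greater than or equal to the market price and inventory is sufficient."""
    return max_bid >= market_price and inventory_sufficient


def may_be_released(max_bid: float, reference_price: float, inventory_sufficient: bool) -> bool:
    """A running preemptible resource may be released if the maximum bid is
    lower than the reference price or inventory becomes insufficient."""
    return max_bid < reference_price or not inventory_sufficient


if __name__ == "__main__":
    # Hypothetical prices, used only to exercise the two rules.
    print(can_acquire(max_bid=0.5, market_price=0.4, inventory_sufficient=True))
    print(may_be_released(max_bid=0.5, reference_price=0.6, inventory_sufficient=True))
```

Because either condition alone can trigger a release, a high bid does not protect a job when inventory runs short, which is why the fault tolerance and checkpoint options above still matter.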

Billing

  • Pricing

    Preemptible jobs use the SpotWithPriceLimit bidding mode, in which you configure a maximum bidding price. When DLC jobs are created by using preemptible resources, the market prices of the resources fluctuate based on supply and demand. If you use the same preemptible resources to submit multiple DLC jobs, the fees for the jobs in a fixed period of time may be the same. The following bidding types are supported.

    Note

    Lingjun resources support only the spot discount bidding type.

    • Spot discount bidding type: The maximum bidding price is a percentage of the market price of the instance type, from 10% to 90% in 10% increments.

    • Spot price bidding type: The maximum bidding price is a value within the market price range.

    In the Resource Information section of the Create Job page in DLC, set Source to Preemptible Resources and view the preemptible resources and the market price ranges in the Job Resource section.

    Note

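    For the pricing of specific resource specifications, see the console.

    • General-purpose computing resources

      [Image: general-purpose computing resources in the Job Resource section]

    • Lingjun resources

      [Image: Lingjun resources in the Job Resource section]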

  • Billing mode

    Preemptible resources use the pay-as-you-go billing method, and you are charged at the market price.

  • Viewing bills

    After you run a job, go to the Billing Details page in Expenses and Costs on the next day to view the billing details of the job that used preemptible resources. The pay-as-you-go fees are generated by DLC. The instance tag has the key acs:pai:dlc:payType and the value spot. For more information about how to view billing details, see View billing details.
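
As a worked example of the spot discount bidding type described under Pricing, the following sketch converts a discount level into a maximum bid. It is a minimal illustration, not part of the PAI SDK: the function name, the ALLOWED_DISCOUNTS constant, and the sample market price are hypothetical, and the sketch assumes that the discount level is applied as a simple multiplier on the market price.

```python
from decimal import Decimal

# Discount levels allowed by the spot discount bidding type:
# 10% to 90% of the market price in 10% increments.
ALLOWED_DISCOUNTS = tuple(Decimal(n) / 10 for n in range(1, 10))


def max_bid_from_discount(market_price: str, discount: str) -> Decimal:
    """Return the maximum bid for a market price and a discount level.

    Both arguments are passed as strings so that Decimal can represent
    them exactly, which avoids binary floating-point rounding.
    """
    level = Decimal(discount)
    if level not in ALLOWED_DISCOUNTS:
        allowed = ", ".join(str(d) for d in ALLOWED_DISCOUNTS)
        raise ValueError(f"discount must be one of: {allowed}")
    return Decimal(market_price) * level


if __name__ == "__main__":
    # Hypothetical market price of 2.50 per hour and a 40% discount level.
    print(max_bid_from_discount("2.50", "0.4"))
```

A value such as 0.45 is rejected because it does not fall on a 10% increment.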

Scenarios

  • Supported scenarios

    To reduce costs, we recommend that you use preemptible resources in the following scenarios:

    • Short-term computing jobs.

    • Computing jobs that are being debugged.

    • Computing jobs that are fault-tolerant.

    • Computing jobs that allow interruption. For example, in scenarios in which the EasyCkpt framework is used to train a PyTorch LLM, you can frequently save checkpoints and restore data from the checkpoints. For more information, see Use EasyCkpt to save and resume foundation model trainings.

  • Unsupported scenarios

    Services that require high stability
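
To show what "allows interruption" means in practice, the following sketch implements a generic save-and-resume loop. It does not use the EasyCkpt API and is not how EasyCkpt works internally. The train function, the JSON checkpoint format, and the interrupt_at argument (which simulates the release of preemptible resources) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def train(total_steps: int, ckpt_path: str, interrupt_at=None) -> int:
    """Run a toy training loop that resumes from ckpt_path if it exists.

    Returns the last completed step. interrupt_at stops the loop after
    that step to simulate preemptible resources being released.
    """
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["step"]

    step = start
    for step in range(start + 1, total_steps + 1):
        # One unit of real training work would run here.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"step": step}, f)
        # Rename atomically so an interruption never leaves a partial checkpoint.
        os.replace(tmp_path, ckpt_path)
        if interrupt_at is not None and step == interrupt_at:
            break
    return step


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "ckpt.json")
        print(train(10, path, interrupt_at=4))  # first run is interrupted
        print(train(10, path))                  # second run resumes from the checkpoint
```

A job written this way loses at most the work done since the last checkpoint when its resources are released, so the checkpoint interval determines how much work is repeated after a restart.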

Procedure

You can submit a preemptible job by using one of the following methods:

Use the PAI console

  1. Go to the Create Job page.

    1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. On the Create Job page, configure the parameters described in the following table. For information about other parameters, see Submit training jobs.

    Resource Information

    • Resource Type: Select Lingjun AI Computing Service or General Computing.

      Note

      This parameter is available only if the workspace allows you to use Lingjun resources and general-purpose computing resources.

    • Source: Select Preemptible Resources.

    • Job Resource: In the Resource Type column, click the icon to select a preemptible resource and configure the Bid Price parameter. The bid price is the maximum bidding price. It is based on the original price of the instance type and ranges from 10% to 90% of the market price in 10% increments. You can obtain the preemptible resource if your bid meets or exceeds the market price and the inventory is sufficient.

      Note

      For the pricing of specific resource specifications of Lingjun resources and general-purpose computing resources, see the console.

    VPC

    • VPC, Security Group, and vSwitch: Configure a virtual private cloud (VPC) if you use Lingjun resources to submit DLC jobs. Select a VPC, a vSwitch, and a security group from the drop-down lists.

    Fault Tolerance and Diagnosis

    • Automatic Fault Tolerance: If you use Lingjun resources to submit DLC jobs, we recommend that you enable Automatic Fault Tolerance. This feature allows preemptible jobs to automatically re-enter the bidding queue after their resources are revoked. The jobs can resume when the average market price falls below your maximum bidding price. For more information about AIMaster, see AIMaster: Elastic fault tolerance engine.

  3. After you configure the parameters, click Confirm.

    After you submit the job, DLC applies for preemptible resources to create and run the job. If no preemptible resources can be obtained, the job enters the waiting state.
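
If you track a submitted job from a script rather than in the console, the waiting behavior can be handled with a simple polling loop. The following sketch is deliberately independent of the DLC SDK: fetch_status is a hypothetical callable that you supply to return the current job status as a string, and the status names "Queuing", "Running", "Succeeded", "Failed", and "Stopped" are assumptions, not values confirmed by this topic.

```python
import time
from typing import Callable, Iterable


def wait_until_finished(
    fetch_status: Callable[[], str],
    terminal_states: Iterable[str] = ("Succeeded", "Failed", "Stopped"),
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until a terminal state is seen or max_polls is reached.

    Returns the last observed status. The sleep function is injectable so
    that the loop can be tested without real delays.
    """
    terminal = set(terminal_states)
    status = fetch_status()
    polls = 1
    while status not in terminal and polls < max_polls:
        sleep(poll_seconds)
        status = fetch_status()
        polls += 1
    return status


if __name__ == "__main__":
    # Simulated status sequence: the job waits, runs, then fails on release.
    simulated = iter(["Queuing", "Queuing", "Running", "Failed"])
    print(wait_until_finished(lambda: next(simulated), sleep=lambda _: None))
```

Because a preemptible job can fail when its resources are released, a caller would typically treat a failed result as a signal to resubmit or to resume from a checkpoint rather than as a permanent error.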

Use the SDK

Step 1: Install SDK for Python.

  • Install the workspace SDK.

    pip install alibabacloud_aiworkspace20210204==3.0.1
  • Install the DLC SDK.

    pip install alibabacloud_pai_dlc20201203==1.4.17

Step 2: Submit a preemptible job

The following sample code submits a job that uses the SpotDiscountLimit parameter:

#!/usr/bin/env python3

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient

from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou. 
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs. 

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotDiscountLimit": 0.4,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')

The following sample code submits a job that uses the SpotPriceLimit parameter:

#!/usr/bin/env python3

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient

from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

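region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.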

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotPriceLimit": 0.011,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')

The following table describes the key parameters. For information about other parameters, see Use SDK for Python.

  • SpotStrategy: The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit.

  • SpotDiscountLimit: The bid limit for the spot discount bidding type.

    Note

    • You cannot specify the SpotDiscountLimit and SpotPriceLimit parameters at the same time.

    • Lingjun resources support only the SpotDiscountLimit parameter.

  • SpotPriceLimit: The bid limit for the spot price bidding type.

  • UserVpc: This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides.
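
Because SpotDiscountLimit and SpotPriceLimit cannot be specified together, it can help to build the SpotSpec dictionary through a small helper that enforces the rule before the request is sent. The following sketch is not part of the DLC SDK: build_spot_spec is a hypothetical helper, and requiring exactly one of the two limits is an assumption (this topic only states that both cannot be set at the same time).

```python
from typing import Optional


def build_spot_spec(discount_limit: Optional[float] = None,
                    price_limit: Optional[float] = None) -> dict:
    """Build a SpotSpec dictionary with exactly one bid limit.

    Raises ValueError if both limits or neither limit is provided.
    """
    if (discount_limit is None) == (price_limit is None):
        raise ValueError("Specify exactly one of SpotDiscountLimit or SpotPriceLimit")

    # The bid limit parameters take effect only with this strategy.
    spec = {"SpotStrategy": "SpotWithPriceLimit"}
    if discount_limit is not None:
        spec["SpotDiscountLimit"] = discount_limit
    else:
        spec["SpotPriceLimit"] = price_limit
    return spec


if __name__ == "__main__":
    print(build_spot_spec(discount_limit=0.4))
    print(build_spot_spec(price_limit=0.011))
```

The returned dictionary has the same shape as the SpotSpec entry in the sample code above, so it can be placed directly in a JobSpecs item.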
