You can use the Deep Learning Containers (DLC) client to view the logs, list, and details of DLC jobs. This topic describes the syntax and parameters of the commands that query logs and jobs, and provides examples.
The logs command
Feature description
The command is used to query the logs of a training job.
Syntax
./dlc logs <yourJobId> <yourPodId> [--max_events_num <yourMaxNum>] [--start_time <yourStartTime>] [--end_time <yourEndTime>]
Parameters
| Parameter | Required | Description | Type |
| --- | --- | --- | --- |
| <yourJobId> | Yes | The ID of the training job that you want to query. | STRING |
| <yourPodId> | Yes | The ID of the pod whose logs you want to view. A pod ID is required because a distributed job runs multiple pods. | STRING |
| --max_events_num <yourMaxNum> | No | The maximum number of log entries to return. Default value: 2000. | INT |
| --start_time <yourStartTime> | No | The start time of the query. The default value is 7 days before the current time. Example: --start_time 2020-11-08T16:00:00Z. | STRING |
| --end_time <yourEndTime> | No | The end time of the query. The default value is the current time. Example: --end_time 2020-11-08T17:00:00Z. | STRING |
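The time parameters take UTC timestamps in the form shown in the examples, such as 2020-11-08T16:00:00Z. The following Python sketch shows one way to produce that format and to assemble the arguments for the logs command from a script. The helper names to_dlc_time and build_logs_argv are hypothetical and are not part of the DLC client. The ./dlc path is taken from the syntax in this topic.

```python
from datetime import datetime, timedelta, timezone


def to_dlc_time(moment: datetime) -> str:
    """Format a timezone-aware datetime as UTC, for example 2020-11-08T16:00:00Z."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_logs_argv(job_id, pod_id, max_events_num=None, start=None, end=None):
    """Assemble the argument list for ./dlc logs (hypothetical helper)."""
    argv = ["./dlc", "logs", job_id, pod_id]
    if max_events_num is not None:
        argv += ["--max_events_num", str(max_events_num)]
    if start is not None:
        argv += ["--start_time", to_dlc_time(start)]
    if end is not None:
        argv += ["--end_time", to_dlc_time(end)]
    return argv


# Example: the one-hour window used in the parameter descriptions.
start = datetime(2020, 11, 8, 16, 0, 0, tzinfo=timezone.utc)
argv = build_logs_argv(
    "dlcdys3r9jlu****",
    "dlcdys3r********-worker-0",
    max_events_num=10,
    start=start,
    end=start + timedelta(hours=1),
)
print(" ".join(argv))
```

The resulting list can be passed to subprocess.run if you want to run the command from Python. Omitted options fall back to the default values listed in the table.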
Examples
Obtain up to 10 log entries for worker node 0 of a distributed training job.
./dlc logs dlcdys3r9jlu**** dlcdys3r********-worker-0 --max_events_num 10
The system returns information similar to the following output:
WARN: ./requirements.txt not found, skip installing requirements.
================================================
| PAI Tensorflow powered by Aliyun PAI Team. |
================================================
Network is under initialization...
Network successfully initialized.
[2021-04-16 12:27:56.368026] [INFO] [7#7] [tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
[2021-04-16 12:27:56.375586] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:80] ====================CPU Architecture=====================
[2021-04-16 12:27:56.375600] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:84] Disable AVX512.
[2021-04-16 12:27:56.375605] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:87] CPU Vendor ID: GenuineIntel
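If you post-process the returned logs, the bracketed prefix in the sample output can be split into fields. The following Python sketch is based only on the sample lines in this topic. The pattern and the field names (time, level, context, source, message) are assumptions, and the meaning of the third field (7#7 in the sample) is not described here. Logs produced by other frameworks or by your own code may not match.

```python
import re

# Pattern inferred from the sample output only:
# [timestamp] [LEVEL] [context] [source file:line] message
LOG_LINE = re.compile(
    r"^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<context>[^\]]+)\] "
    r"\[(?P<source>[^\]]+)\] (?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of fields, or None for lines without the bracketed prefix."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


sample = (
    "[2021-04-16 12:27:56.375600] [INFO] [7#7] "
    "[tensorflow/core/distributed_runtime/master.cc:84] Disable AVX512."
)
fields = parse_log_line(sample)
print(fields["level"], fields["message"])
```

Lines such as the startup banner or "Network successfully initialized." do not carry the prefix, so the function returns None for them.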
The get job command
Feature description
The command is used to obtain information about a training job. If you do not specify a job ID, all jobs are queried. If you specify a job ID, only the specified job is queried.
Syntax
./dlc get job [JOB_ID] [--workspace_id <yourWorkspaceId>] [--display_name <yourJobName>] [--job_type <yourJobType>] [--status <yourJobStatus>] [--start_time <yourStartTime>] [--end_time <yourEndTime>] [--page_num <yourPageNum>] [--page_size <yourPageSize>] [--max_events_num <yourMaxNum>] [--events] [--events_only]
Parameters
| Parameter | Required | Description | Type |
| --- | --- | --- | --- |
| JOB_ID | No | The ID of the training job that you want to query. | STRING |
| --workspace_id <yourWorkspaceId> | No | The workspace ID. | STRING |
| --display_name <yourJobName> | No | The name of the job. Fuzzy match is supported and is case-insensitive. Wildcards are not supported. | STRING |
| --job_type <yourJobType> | No | The type of the job. This parameter is empty by default, which indicates all types. | STRING |
| --status <yourJobStatus> | No | The status of the job. This parameter is empty by default, which indicates all states. | STRING |
| --start_time <yourStartTime> | No | The start time of the query. Example: --start_time 2022-08-04T02:09:32Z. | STRING |
| --end_time <yourEndTime> | No | The end time of the query. Example: --end_time 2022-08-04T02:09:32Z. | STRING |
| --page_num <yourPageNum> | No | The number of the page to return for the current query. Page numbers start from 1. Default value: 1. | INT |
| --page_size <yourPageSize> | No | The number of entries to return on each page. Default value: 10. | INT |
| --max_events_num <yourMaxNum> | No | The maximum number of rows of system events to return. Default value: 2000. | INT |
| --events | No | Specifies whether to query the system events of a job. This parameter takes effect only when a single job is queried. Default value: false. | BOOL |
| --events_only | No | Specifies whether to query only the system events of a job. This parameter takes effect only when a single job is queried. Default value: false. | BOOL |
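The filters and paging parameters can be combined in one query. The following Python sketch assembles the arguments for the get job command and rejects names that are not in the syntax. The helper name build_get_job_argv is hypothetical. The sketch also assumes that events and events_only are passed as bare switches, as the syntax line suggests.

```python
def build_get_job_argv(job_id=None, events=False, events_only=False, **filters):
    """Assemble the argument list for ./dlc get job (hypothetical helper)."""
    # Option names copied from the syntax of the get job command.
    allowed = {
        "workspace_id", "display_name", "job_type", "status",
        "start_time", "end_time", "page_num", "page_size", "max_events_num",
    }
    unknown = set(filters) - allowed
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")

    argv = ["./dlc", "get", "job"]
    if job_id is not None:
        argv.append(job_id)
    for name, value in filters.items():
        argv += [f"--{name}", str(value)]
    # Assumption: these two options are bare switches without a value.
    if events:
        argv.append("--events")
    if events_only:
        argv.append("--events_only")
    return argv


# Example: request the first three pages of jobs whose name contains "epl".
for page in range(1, 4):
    print(" ".join(build_get_job_argv(display_name="epl", page_num=page, page_size=10)))
```

Because events and events_only take effect only for a single job, pass them together with a job ID.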
Examples
Query training jobs by name based on fuzzy match.
./dlc get job --display_name epl
The system returns information similar to the following output:
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
| Name               | JobId            | WorkspaceId | WorkspaceName    | ResourceId | ResourceName   | JobType | Priority | JobStatus | UserId           | CreateTime           | SubmittedTime        | RunningTime          | SuccessedTime        | StoppedTime | FailedTime | FinishTime           | Duration(seconds) |
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
| test_epl_test-**** | dlc02xipvt5z**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T06:41:05Z | 2022-08-01T06:45:08Z | 2022-08-01T06:48:57Z | 2022-08-01T06:53:21Z |             |            | 2022-08-01T06:53:21Z | 736               |
| test_epl_****      | dlc1iyv3szl2**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T03:23:51Z | 2022-08-01T03:27:22Z | 2022-08-01T03:27:50Z | 2022-08-01T03:33:48Z |             |            | 2022-08-01T03:33:48Z | 597               |
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
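To use the job list in a script, the bordered table can be converted into records. The following Python sketch assumes that the client prints one table row per line, that each row starts and ends with a vertical bar, and that cell values do not contain a vertical bar. The function name parse_table is hypothetical, and the sample keeps only four of the columns from the output in this topic.

```python
def parse_table(text):
    """Parse a +---+ bordered table into a list of dicts keyed by column header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip border lines and blank lines
        # Drop the outer bars, then split the remaining text into cells.
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


SAMPLE = """\
+--------------------+------------------+-----------+-------------------+
| Name               | JobId            | JobStatus | Duration(seconds) |
+--------------------+------------------+-----------+-------------------+
| test_epl_test-**** | dlc02xipvt5z**** | Succeeded | 736               |
| test_epl_****      | dlc1iyv3szl2**** | Succeeded | 597               |
+--------------------+------------------+-----------+-------------------+
"""

jobs = parse_table(SAMPLE)
for job in jobs:
    print(job["JobId"], job["JobStatus"], job["Duration(seconds)"])
```

All values are returned as strings, and empty cells such as StoppedTime become empty strings.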
Query a specified training job.
./dlc get job dlc02xipvt5z****
The system returns information similar to the following output:
{
    "ClusterId": "",
    "CodeSource": {
        "Branch": "main",
        "CodeSourceId": "code-29****c****c4****ae0c9ec75a5****",
        "MountPath": ""
    },
    "DataSources": [
        {
            "DataSourceId": "d-ya7gc2p2iqq240****",
            "MountPath": ""
        }
    ],
    "DisplayName": "test_epl_test-****",
    "Duration": 736,
    "ElasticSpec": {
        "AIMasterType": "",
        "EnableElasticTraining": false,
        "MaxParallelism": 0,
        "MinParallelism": 0
    },
    "EnabledDebugger": false,
    "GmtCreateTime": "2022-08-01T06:41:05Z",
    "GmtFinishTime": "2022-08-01T06:53:21Z",
    "GmtRunningTime": "2022-08-01T06:48:57Z",
    "GmtSubmittedTime": "2022-08-01T06:45:08Z",
    "GmtSuccessedTime": "2022-08-01T06:53:21Z",
    "JobId": "dlc02xipvt5z****",
    "JobSpecs": [
        {
            "AssignNodeSpec": {
                "EnableAssignNode": false,
                "NodeNames": ""
            },
            "EcsSpec": "ecs.gn6v-c8g1.2xlarge",
            "Image": "registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-training:1.15-gpu-py36-cu100-ubuntu1****",
            "PodCount": 2,
            "ResourceConfig": {
                "CPU": "",
                "GPU": "",
                "GPUType": "",
                "Memory": "",
                "SharedMemory": ""
            },
            "Type": "Worker",
            "UseSpotInstance": false
        }
    ],
    "JobType": "TFJob",
    "Pods": [
        {
            "GmtCreateTime": "2022-08-01T06:45:08Z",
            "GmtFinishTime": "2022-08-01T06:53:20Z",
            "GmtStartTime": "2022-08-01T06:52:06Z",
            "Ip": "10.224.xx.xx",
            "PodId": "dlc02xipvt5z****-worker-0",
            "PodUid": "",
            "Status": "Succeeded",
            "Type": "worker"
        },
        {
            "GmtCreateTime": "2022-08-01T06:45:08Z",
            "GmtFinishTime": "2022-08-01T06:53:20Z",
            "GmtStartTime": "2022-08-01T06:48:57Z",
            "Ip": "10.224.xx.xx",
            "PodId": "dlc02xipvt5z****-worker-1",
            "PodUid": "",
            "Status": "Succeeded",
            "Type": "worker"
        }
    ],
    "ReasonCode": "JobSucceeded",
    "ReasonMessage": "TFJob dlc02xipvt5z**** successfully completed.",
    "RequestId": "76FC3500-xxxx-533F-B24A-AC9B2A72****",
    "ResourceId": "",
    "Priority": 1,
    "ResourceLevel": "",
    "Settings": {
        "BusinessUserId": "",
        "Caller": "",
        "EnableErrorMonitoringInAIMaster": false,
        "EnableTideResource": false,
        "ErrorMonitoringArgs": "",
        "PipelineId": ""
    },
    "Status": "Succeeded",
    "ThirdpartyLibDir": "",
    "UserCommand": "cd /root/xxxx/xxxx/\npip install .\ncd examples/resnet\nbash scripts/xxxx_dp.sh",
    "UserId": "144963168668****",
    "WorkspaceId": "23****",
    "WorkspaceName": "doc_test_****"
}
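The job details include the ID of each pod under Pods, which is the value that the logs command expects as the pod ID. The following Python sketch reads the pod IDs from a job description and prints one logs command per pod. The function name pod_ids is hypothetical, and the sample is a trimmed copy of the response in this topic that keeps only the fields used here.

```python
import json


def pod_ids(job_json):
    """Return the PodId of every entry under "Pods" in a job description."""
    job = json.loads(job_json)
    return [pod["PodId"] for pod in job.get("Pods", [])]


# Trimmed copy of the sample response; only the fields used here are kept.
SAMPLE = """
{
    "JobId": "dlc02xipvt5z****",
    "Pods": [
        {"PodId": "dlc02xipvt5z****-worker-0", "Status": "Succeeded", "Type": "worker"},
        {"PodId": "dlc02xipvt5z****-worker-1", "Status": "Succeeded", "Type": "worker"}
    ],
    "Status": "Succeeded"
}
"""

job_id = json.loads(SAMPLE)["JobId"]
for pod_id in pod_ids(SAMPLE):
    # One logs query per pod, following the syntax of the logs command.
    print(" ".join(["./dlc", "logs", job_id, pod_id, "--max_events_num", "10"]))
```

In practice, the JSON text would come from the output of ./dlc get job with a job ID instead of the embedded sample.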
References
You can view job details in the console. For more information, see View training details.