Container Service for Kubernetes: Containerize and deploy Slurm on an ACK cluster

Last Updated: Feb 28, 2026

Container Service for Kubernetes (ACK) provides the Slurm on Kubernetes solution and the ack-slurm-operator component. Together, they allow you to deploy and manage the Simple Linux Utility for Resource Management (Slurm) scheduling system in ACK clusters for high performance computing (HPC) and large-scale AI and machine learning (ML) workloads.

Introduction to Slurm

Slurm is a powerful open source platform for cluster resource management and job scheduling. It is designed to optimize the performance and efficiency of supercomputers and large compute clusters. The following figure shows how its key components work together.

[Figure: Slurm architecture]
  • slurmctld: The Slurm control daemon. As the central management component of Slurm, slurmctld monitors system resources, schedules jobs, and manages the cluster status. You can configure a secondary slurmctld for failover to ensure high availability.

  • slurmd: The Slurm node daemon. Deployed on each compute node, slurmd receives instructions from slurmctld and manages the job lifecycle on that node, including starting and executing jobs, reporting job status, and preparing for new job assignments.

  • slurmdbd: The Slurm database daemon. This optional component maintains a centralized database for job history and accounting information. It is essential for long-term management and auditing of large clusters. slurmdbd can aggregate data across multiple Slurm-managed clusters to simplify data management.

  • Slurm CLI: Slurm provides the following command-line tools for job management and system monitoring:

    • scontrol: Views and modifies the cluster configuration and state.

    • squeue: Queries the status of jobs in the queue.

    • srun: Runs jobs interactively or launches parallel job steps within a job allocation.

    • sbatch: Submits a batch script to the queue; the job runs asynchronously when resources become available.

    • sinfo: Queries the overall status of a cluster, including node availability.
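
A job is typically handed to sbatch as a batch script. The following minimal sketch shows the shape of such a script (the partition name debug matches the sample slurm.conf later in this topic; the #SBATCH lines are shell comments that only Slurm parses):

```shell
#!/bin/bash
#SBATCH --job-name=hello          # name shown by squeue
#SBATCH --partition=debug         # partition defined in slurm.conf
#SBATCH --nodes=1                 # number of nodes to allocate
#SBATCH --output=hello-%j.out     # %j expands to the job ID
# The script body runs on the first allocated node.
MSG="Hello from $(hostname)"
echo "$MSG"
```

Submit it with `sbatch hello.sh`, then track it with `squeue` and inspect the output file after the job finishes.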

Introduction to Slurm on ACK

The Slurm Operator uses the SlurmCluster CustomResource (CR) to define the configuration files required for managing Slurm clusters. This simplifies the deployment and maintenance of Slurm-managed clusters and resolves control plane management issues. The following figure shows the architecture of Slurm on ACK.

[Figure: Slurm on ACK architecture]

A cluster administrator deploys and manages a Slurm-managed cluster by defining a SlurmCluster CR. The Slurm Operator then creates the Slurm control components in the cluster based on this CR. A Slurm configuration file can be mounted to a control component by using a shared volume or a ConfigMap.

Prerequisites

An ACK cluster that runs Kubernetes 1.22 or later is created, and the cluster contains one GPU-accelerated node. For more information, see Create an ACK cluster with GPU-accelerated nodes and Update clusters.

Step 1: Install the ack-slurm-operator component

  1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.

  2. On the Marketplace page, search for ack-slurm-operator and click the component. On the details page, click Deploy in the upper-right corner. In the Deploy panel, configure the parameters. You need to specify only the Cluster parameter. Use the default settings for all other parameters.

  3. After you configure the parameters, click OK.

Step 2: Create a Slurm-managed cluster

You can create a Slurm-managed cluster either manually or by using Helm. Choose the method that best fits your needs.

Manually create a Slurm-managed cluster

Create a MUNGE authentication Secret

MUNGE (MUNGE Uid 'N' Gid Emporium) provides authentication between Slurm components. You must create a Kubernetes Secret to store the MUNGE key.

  1. Run the following command to generate a key by using OpenSSL:

       openssl rand -base64 512 | tr -d '\r\n'
  2. Run the following command to create a Secret that stores the generated key:

    • Replace <$MungeKeyName> with a custom name for your key, such as mungekey.

    • Replace <$MungeKey> with the key string generated in the previous step.

       kubectl create secret generic <$MungeKeyName> --from-literal=munge.key=<$MungeKey>

After you create the Secret, you can configure or associate it with the Slurm-managed cluster for MUNGE-based authentication.
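
The two steps above can be combined into one short script. A sketch (the Secret name mungekey is just an example; the kubectl line is commented out because it requires cluster access):

```shell
#!/bin/bash
set -euo pipefail
# Generate 512 random bytes and base64-encode them into a single line.
MUNGE_KEY=$(openssl rand -base64 512 | tr -d '\r\n')
# 512 random bytes always base64-encode to 684 characters.
echo "Key length: ${#MUNGE_KEY}"
# Store the key in a Kubernetes Secret (requires cluster access):
# kubectl create secret generic mungekey --from-literal=munge.key="$MUNGE_KEY"
```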

Create a ConfigMap for the Slurm-managed cluster

In this example, a ConfigMap is mounted to a pod by specifying the slurmConfPath parameter in the CR. This ensures that the pod configuration is automatically restored to the expected state even if the pod is recreated.

The data parameter in the following sample code specifies a sample ConfigMap. To generate a ConfigMap, we recommend that you use the Easy Configurator or Full Configurator tool.

kubectl create -f - << EOF
apiVersion: v1
data:
  slurm.conf: |
    ProctrackType=proctrack/linuxproc
    ReturnToService=1
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/var/run/slurmd.pid
    SlurmdPort=6818
    SlurmdSpoolDir=/var/spool/slurmd
    SlurmUser=root
    StateSaveLocation=/var/spool/slurmctld
    TaskPlugin=task/none
    InactiveLimit=0
    KillWait=30
    MinJobAge=300
    SlurmctldTimeout=120
    SlurmdTimeout=300
    Waittime=0
    SchedulerType=sched/builtin
    SelectType=select/cons_tres
    JobCompType=jobcomp/none
    JobAcctGatherFrequency=30
    SlurmctldDebug=info
    SlurmctldLogFile=/var/log/slurmctld.log
    SlurmdDebug=info
    SlurmdLogFile=/var/log/slurmd.log
    TreeWidth=65533
    MaxNodeCount=10000
    PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

    ClusterName=slurm-job-demo
    # Set SlurmctldHost to the name of the Slurm-managed cluster with the -0 suffix. For high-availability deployment,
    # you can use the following configuration. The suffix depends on the number of slurmctld replicas:
    # SlurmctldHost=slurm-job-demo-0
    # SlurmctldHost=slurm-job-demo-1
    SlurmctldHost=slurm-job-demo-0
kind: ConfigMap
metadata:
  name: slurm-test
  namespace: default
EOF

Expected output:

configmap/slurm-test created

This output indicates that the ConfigMap is created.
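
Note the SlurmctldHost convention in the ConfigMap above: the head pod hostname is the cluster name with a replica index suffix. A sketch of how the value is derived:

```shell
#!/bin/bash
# The slurmctld pods are named <cluster name>-<replica index>,
# so SlurmctldHost for the first (and here only) replica is:
CLUSTER_NAME=slurm-job-demo
SLURMCTLD_HOST="${CLUSTER_NAME}-0"
echo "$SLURMCTLD_HOST"
```

With two slurmctld replicas for high availability, you would list both slurm-job-demo-0 and slurm-job-demo-1, as the comments in the ConfigMap show.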

Submit the SlurmCluster CR

Note In this example, an Ubuntu image that contains Compute Unified Device Architecture (CUDA) 11.4 and Slurm 23.06 is used. The image includes a component developed by Alibaba Cloud for auto scaling of on-cloud nodes. If you want to use a custom image, you can create and upload one.
  1. Create a file named slurmcluster.yaml and copy the following content to the file. This SlurmCluster CR creates a Slurm-managed cluster with one head node and four worker nodes. The cluster runs as a pod in the ACK cluster. The values of the mungeConfPath and slurmConfPath parameters in the SlurmCluster CR must match the mount targets specified in the slurmctld and workerGroupSpecs sections.

       # This Kubernetes resource deploys a Slurm-managed cluster on ACK by using a kai.alibabacloud.com CustomResource (CR).
       apiVersion: kai.alibabacloud.com/v1
       kind: SlurmCluster
       metadata:
         name: slurm-job-demo # The name of the cluster.
         namespace: default # The namespace in which the cluster is deployed.
       spec:
         mungeConfPath: /var/munge # The path of the MUNGE configuration file.
         slurmConfPath: /var/slurm # The path of the Slurm configuration file.
         slurmctld: # The specifications of the head node. The controller creates a StatefulSet to manage the head node.
           template:
             metadata: {}
             spec:
               containers:
               - image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm-cuda:23.06-aliyun-cuda-11.4
                 imagePullPolicy: Always
                 name: slurmctld
                 ports:
                 - containerPort: 8080
                   protocol: TCP
                 resources:
                   requests:
                     cpu: "1"
                     memory: 1Gi
                 volumeMounts:
                 - mountPath: /var/slurm # The mount target of the Slurm configuration file.
                   name: config-slurm-test
                 - mountPath: /var/munge # The mount target of the MUNGE authentication key.
                   name: secret-slurm-test
               volumes:
               - configMap:
                   name: slurm-test
                 name: config-slurm-test
               - name: secret-slurm-test
                 secret:
                   secretName: slurm-test
         workerGroupSpecs: # The specifications of the worker nodes. In this example, two node groups are defined: cpu and cpu1.
         - groupName: cpu
           replicas: 2
           template:
             metadata: {}
             spec:
               containers:
               - env:
                 - name: NVIDIA_REQUIRE_CUDA
                 image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm-cuda:23.06-aliyun-cuda-11.4
                 imagePullPolicy: Always
                 name: slurmd
                 resources:
                   requests:
                     cpu: "1"
                     memory: 1Gi
                 volumeMounts:
                 - mountPath: /var/slurm
                   name: config-slurm-test
                 - mountPath: /var/munge
                   name: secret-slurm-test
               volumes:
               - configMap:
                   name: slurm-test
                 name: config-slurm-test
               - name: secret-slurm-test
                 secret:
                   secretName: slurm-test
         - groupName: cpu1 # The cpu1 node group definition, similar to cpu. Modify resources or configurations based on your business requirements.
           replicas: 2
           template:
             metadata: {}
             spec:
               containers:
               - env:
                 - name: NVIDIA_REQUIRE_CUDA
                 image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm-cuda:23.06-aliyun-cuda-11.4
                 imagePullPolicy: Always
                 name: slurmd
                 resources:
                   requests:
                     cpu: "1"
                     memory: 1Gi
                 securityContext: # The security context configuration that allows the pod to run in privileged mode.
                   privileged: true
                 volumeMounts:
                 - mountPath: /var/slurm
                   name: config-slurm-test
                 - mountPath: /var/munge
                   name: secret-slurm-test
               volumes:
               - configMap:
                   name: slurm-test
                 name: config-slurm-test
               - name: secret-slurm-test
                 secret:
                   secretName: slurm-test
  2. Run the following command to deploy the slurmcluster.yaml file to the cluster:

       kubectl apply -f slurmcluster.yaml

     Expected output:

       slurmcluster.kai.alibabacloud.com/slurm-job-demo created
  3. Run the following command to verify that the Slurm-managed cluster runs as expected:

       kubectl get slurmcluster

     Expected output:

       NAME             AVAILABLE WORKERS   STATUS   AGE
       slurm-job-demo   5                   ready    14m

     This output indicates that the Slurm-managed cluster is deployed and its five nodes are ready.
  4. Run the following command to verify that all pods in the Slurm-managed cluster named slurm-job-demo are in the Running state:

       kubectl get pod

     Expected output:

       NAME                                          READY   STATUS      RESTARTS     AGE
       slurm-job-demo-head-x9sgs                     1/1     Running     0            14m
       slurm-job-demo-worker-cpu-0                   1/1     Running     0            14m
       slurm-job-demo-worker-cpu-1                   1/1     Running     0            14m
       slurm-job-demo-worker-cpu1-0                  1/1     Running     0            14m
       slurm-job-demo-worker-cpu1-1                  1/1     Running     0            14m

     This output confirms that the head node and four worker nodes are running as expected.
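
For scripted health checks you can count the Running pods instead of reading the list by eye. A sketch that parses the sample output above (in practice, pipe `kubectl get pod` directly into the pipeline):

```shell
#!/bin/bash
# Sample `kubectl get pod` output captured from the step above.
pods='slurm-job-demo-head-x9sgs                     1/1     Running     0            14m
slurm-job-demo-worker-cpu-0                   1/1     Running     0            14m
slurm-job-demo-worker-cpu-1                   1/1     Running     0            14m
slurm-job-demo-worker-cpu1-0                  1/1     Running     0            14m
slurm-job-demo-worker-cpu1-1                  1/1     Running     0            14m'
# Count lines whose STATUS column is Running; expect 1 head pod + 4 workers = 5.
running=$(printf '%s\n' "$pods" | grep -c 'Running')
echo "Running pods: $running"
```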

Create a Slurm-managed cluster by using Helm

To quickly install and manage a Slurm-managed cluster with flexible configuration, you can use Helm to install the SlurmCluster chart provided by Alibaba Cloud. Download the Helm chart from charts-incubator (the Alibaba Cloud chart repository). After you configure the parameters, Helm creates resources such as role-based access control (RBAC), ConfigMap, Secret, and the Slurm-managed cluster.

Resources created by the Helm chart

  • ConfigMap ({{ .Values.slurmConfigs.configMapName }}): When .Values.slurmConfigs.createConfigsByConfigMap is set to True, this ConfigMap stores user-defined Slurm configurations. It is mounted to the path specified by .Values.slurmConfigs.slurmConfigPathInPod, which is also rendered as .Spec.SlurmConfPath of the Slurm-managed cluster. When the pod starts, the ConfigMap is copied to /etc/slurm/ and access is restricted.

  • ServiceAccount ({{ .Release.Namespace }}/{{ .Values.clusterName }}): Allows the slurmctld pod to modify SlurmCluster CR configurations, enabling auto scaling of on-cloud nodes.

  • Role ({{ .Release.Namespace }}/{{ .Values.clusterName }}): Grants the slurmctld pod permissions to modify SlurmCluster CR configurations for auto scaling.

  • RoleBinding ({{ .Release.Namespace }}/{{ .Values.clusterName }}): Binds the Role to the ServiceAccount for auto scaling permissions.

  • Role ({{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }}): Allows the slurmctld pod to modify Secrets in the SlurmOperator namespace. When Slurm and Kubernetes are deployed on the same batch of physical servers, the Slurm-managed cluster can use this resource to renew tokens.

  • RoleBinding ({{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }}): Binds the operator namespace Role to the ServiceAccount for token renewal.

  • Secret ({{ .Values.mungeConfigs.secretName }}): Stores the MUNGE authentication key for Slurm component communication. When .Values.mungeConfigs.createConfigsBySecret is set to True, this Secret is created automatically with "munge.key"={{ .Values.mungeConfigs.content }}. The mount path is rendered from .Spec.MungeConfPath, and the pod startup commands initialize /etc/munge/munge.key from this path.

  • SlurmCluster: The rendered Slurm-managed cluster.

Helm chart parameters

  • clusterName (example: ""): The cluster name. Used to generate Secrets and roles. The value must match the cluster name in your Slurm configuration files.

  • headNodeConfig: Required. Configures the slurmctld pod.

  • workerNodesConfig: Configures the slurmd pods.

  • workerNodesConfig.deleteSelfBeforeSuspend (example: true): When set to true, a preStop hook is automatically added to the worker pod. This triggers automatic node draining before the node is removed and marks the node as unschedulable.

  • slurmdbdConfigs: Configures the slurmdbd pod. If left empty, no slurmdbd pod is created.

  • slurmrestdConfigs: Configures the slurmrestd pod. If left empty, no slurmrestd pod is created.

  • headNodeConfig.hostNetwork / slurmdbdConfigs.hostNetwork / slurmrestdConfigs.hostNetwork / workerNodesConfig.workerGroups[].hostNetwork (example: false): Rendered as the hostNetwork parameter of the respective pod.

  • headNodeConfig.setHostnameAsFQDN / slurmdbdConfigs.setHostnameAsFQDN / slurmrestdConfigs.setHostnameAsFQDN / workerNodesConfig.workerGroups[].setHostnameAsFQDN (example: false): Rendered as the setHostnameAsFQDN parameter of the respective pod.

  • headNodeConfig.nodeSelector / slurmdbdConfigs.nodeSelector / slurmrestdConfigs.nodeSelector / workerNodesConfig.workerGroups[].nodeSelector: Rendered as the nodeSelector parameter of the respective pod. Example:

        nodeSelector:
          example: example

  • headNodeConfig.tolerations / slurmdbdConfigs.tolerations / slurmrestdConfigs.tolerations / workerNodesConfig.workerGroups[].tolerations: Rendered as the tolerations of the respective pod. Example:

        tolerations:
        - key:
          value:
          operator:

  • headNodeConfig.affinity / slurmdbdConfigs.affinity / slurmrestdConfigs.affinity / workerNodesConfig.workerGroups[].affinity: Rendered as the affinity rules of the respective pod. Example:

        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: topology.kubernetes.io/zone
                  operator: In
                  values:
                  - zone-a
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                - key: another-node-label-key
                  operator: In
                  values:
                  - another-node-label-value

  • headNodeConfig.resources / slurmdbdConfigs.resources / slurmrestdConfigs.resources / workerNodesConfig.workerGroups[].resources: Rendered as the resources of the primary container in the respective pod. The resource limit of the worker pod primary container is rendered as the Slurm node resource limit. Example:

        resources:
          requests:
            cpu: 1
          limits:
            cpu: 1

  • headNodeConfig.image / slurmdbdConfigs.image / slurmrestdConfigs.image / workerNodesConfig.workerGroups[].image (example: "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"): Rendered as the container image. You can also build a custom image from ai-models-on-ack/framework/slurm/building-slurm-image.

  • headNodeConfig.imagePullSecrets / slurmdbdConfigs.imagePullSecrets / slurmrestdConfigs.imagePullSecrets / workerNodesConfig.workerGroups[].imagePullSecrets: Rendered as the Secret used to pull the container image. Example:

        imagePullSecrets:
        - name: example

  • headNodeConfig.podSecurityContext / slurmdbdConfigs.podSecurityContext / slurmrestdConfigs.podSecurityContext / workerNodesConfig.workerGroups[].podSecurityContext: Rendered as the pod-level security context. Example:

        podSecurityContext:
          runAsUser: 1000
          runAsGroup: 3000
          fsGroup: 2000
          supplementalGroups: [4000]

  • headNodeConfig.securityContext / slurmdbdConfigs.securityContext / slurmrestdConfigs.securityContext / workerNodesConfig.workerGroups[].securityContext: Rendered as the security context of the primary container. Example:

        securityContext:
          allowPrivilegeEscalation: false

  • headNodeConfig.volumeMounts / slurmdbdConfigs.volumeMounts / slurmrestdConfigs.volumeMounts / workerNodesConfig.workerGroups[].volumeMounts: Rendered as the volume mounting configurations of the primary container.

  • headNodeConfig.volumes / slurmdbdConfigs.volumes / slurmrestdConfigs.volumes / workerNodesConfig.workerGroups[].volumes: Rendered as the volumes mounted to the pod.

  • slurmConfigs.slurmConfigPathInPod (example: ""): The mount path of the Slurm configuration file in the pod. If the configuration file is mounted as a volume, set this value to the path where slurm.conf is mounted. The pod startup commands copy the file to /etc/slurm/ and restrict access.

  • slurmConfigs.createConfigsByConfigMap (example: true): Specifies whether to automatically create a ConfigMap for Slurm configurations.

  • slurmConfigs.configMapName (example: ""): The name of the ConfigMap that stores the Slurm configurations.

  • slurmConfigs.filesInConfigMap (example: ""): The content in the automatically created ConfigMap.

  • mungeConfigs.mungeConfigPathInPod: The mount path of the MUNGE configuration file in the pod. If the configuration file is mounted as a volume, set this value to the path where munge.key is mounted. The pod startup commands copy the file to /etc/munge/ and restrict access.

  • mungeConfigs.createConfigsBySecret: Specifies whether to automatically create a Secret for MUNGE configurations.

  • mungeConfigs.secretName: The name of the Secret that stores the MUNGE configurations.

  • mungeConfigs.content: The content in the automatically created Secret.

For more information about slurmConfigs.filesInConfigMap, see Slurm System Configuration Tool (schedmd.com).

Important

If you modify slurmConfigs.filesInConfigMap after the pod is created, you must recreate the pod for the change to take effect. We recommend that you verify the modification before recreating the pod.
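
Putting the parameters together, a minimal values.yaml could look like the following sketch. The nesting is inferred from the parameter paths in the table above, and the shape of filesInConfigMap is an assumption; compare with the values.yaml shipped in the chart before installing.

```yaml
clusterName: slurm-job-demo            # must match ClusterName in slurm.conf
slurmConfigs:
  createConfigsByConfigMap: true
  configMapName: slurm-test
  slurmConfigPathInPod: /var/slurm     # rendered as .Spec.SlurmConfPath
  filesInConfigMap:                    # assumed shape; verify against the chart
    slurm.conf: |
      ClusterName=slurm-job-demo
      SlurmctldHost=slurm-job-demo-0
mungeConfigs:
  createConfigsBySecret: true
  secretName: slurm-test
  mungeConfigPathInPod: /var/munge     # rendered as .Spec.MungeConfPath
```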

Install the Helm chart

  1. Run the following command to add the Alibaba Cloud Helm repository to your local Helm client. This allows you to access various charts provided by Alibaba Cloud, including the Slurm-managed cluster chart.

       helm repo add aliyun https://aliacs-app-catalog.oss-cn-hangzhou.aliyuncs.com/charts-incubator/

  2. Run the following command to pull and decompress the Helm chart. This creates a directory named ack-slurm-cluster in the current directory. The directory contains all chart files and templates.

       helm pull aliyun/ack-slurm-cluster --untar=true
  3. Modify the chart parameters in the values.yaml file. The values.yaml file contains the default chart configurations. Modify this file to customize parameter settings such as Slurm configurations, resource requests and limits, and storage based on your requirements.

       cd ack-slurm-cluster
       vi values.yaml
  4. Use Helm to install the chart. This deploys the Slurm-managed cluster.

       cd ..
       helm install my-slurm-cluster ack-slurm-cluster # Replace my-slurm-cluster with your desired release name.
  5. Verify that the Slurm-managed cluster is deployed. After the deployment is complete, use kubectl to check the deployment status and confirm that the Slurm-managed cluster runs as expected:

       kubectl get pods -l app.kubernetes.io/name=slurm-cluster

Step 3: Log on to the Slurm-managed cluster

Log on as a Kubernetes cluster administrator

A Kubernetes cluster administrator has the permissions to manage the entire Kubernetes cluster. Because a Slurm-managed cluster runs as a pod in the Kubernetes cluster, the administrator can use kubectl to log on to any pod of any Slurm-managed cluster and has root permissions by default.

Run the following command to log on to a pod of the Slurm-managed cluster:

# Replace slurm-job-demo-head-x9sgs with the name of the head pod in your cluster.
kubectl exec -it slurm-job-demo-head-x9sgs -- bash

Log on as a regular user of the Slurm-managed cluster

Administrators or regular users of a Slurm-managed cluster may not have permissions to run the kubectl exec command. In this case, you can log on to the Slurm-managed cluster by using SSH. Two methods are available:

  • LoadBalancer Service: Use an external IP address of a Service to access the head pod. This method is suitable for long-term, stable connections. You access the Slurm-managed cluster from anywhere within the internal network by using a Classic Load Balancer (CLB) instance and its external IP address.

  • Port forwarding: Use the kubectl port-forward command for temporary access. This method is suitable for short-term operations and maintenance (O&M) or debugging because it requires continuous execution of the port-forward command.

Log on to the head pod by using a LoadBalancer Service

  1. Create a LoadBalancer Service to expose internal services in the cluster to external access. For more information, see Use an existing SLB instance to expose an application or Use an automatically created SLB instance to expose an application.

    • The LoadBalancer Service must use an internal-facing Classic Load Balancer (CLB) instance.

    • Configure the following labels as the Service's selector so that it routes incoming requests to the head pod:

      • kai.alibabacloud.com/slurm-cluster: ack-slurm-cluster-1

      • kai.alibabacloud.com/slurm-node-type: head

  2. Run the following command to obtain the external IP address of the LoadBalancer Service:

       kubectl get svc
  3. Run the following command to log on to the head pod by using SSH:

       # Replace $YOURUSER with your username and $EXTERNAL_IP with the external IP address of the Service.
       ssh $YOURUSER@$EXTERNAL_IP
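
The Service from step 1 can be sketched as follows. The Service name slurm-ssh is a placeholder, the internal CLB annotation is an assumption to verify against the load balancer documentation for your cluster version, and the selector labels come from step 1:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: slurm-ssh                      # placeholder name
  annotations:
    # Request an internal-facing CLB instance (verify the annotation
    # against the CLB documentation for your ACK version).
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: "intranet"
spec:
  type: LoadBalancer
  selector:
    kai.alibabacloud.com/slurm-cluster: ack-slurm-cluster-1
    kai.alibabacloud.com/slurm-node-type: head
  ports:
  - name: ssh
    port: 22
    targetPort: 22
```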

Forward requests by using the port-forward command

Warning

To use the port-forward command, you must save the kubeconfig file of the Kubernetes cluster to your local host. This may cause security risks. We recommend that you do not use this method in production environments.

  1. Run the following command to enable a local port for request forwarding and map it to port 22 of the head pod running slurmctld. SSH uses port 22 by default.

       # Replace $NAMESPACE, $CLUSTERNAME, and $LOCALPORT with the actual values.
       kubectl port-forward -n $NAMESPACE svc/$CLUSTERNAME $LOCALPORT:22
  2. While the port-forward command is running, run the following command to log on. All users on the current host can log on to the cluster and submit jobs.

       # Replace $YOURUSER with the username you want to use to log on to the head pod.
       ssh -p $LOCALPORT $YOURUSER@localhost

Step 4: Use the Slurm-managed cluster

The following sections describe how to synchronize users across nodes, share logs across nodes, and perform auto scaling for the Slurm-managed cluster.

Synchronize users across nodes

Slurm does not provide a centralized user authentication service. When you use the sbatch command to submit jobs, the jobs may fail if the submitting user's account does not exist on the node selected to execute the jobs. To resolve this issue, you can configure Lightweight Directory Access Protocol (LDAP) for the Slurm-managed cluster. LDAP serves as a centralized backend service for authentication, allowing Slurm to authenticate user identities.

Deploy the LDAP backend

  1. Create a file named ldap.yaml and copy the following content to the file. This creates a basic LDAP instance that stores and manages user information. The ldap.yaml file defines an LDAP backend pod and its associated Service. The pod contains an LDAP container, and the Service exposes the LDAP service within the network.

       ---
       apiVersion: apps/v1
       kind: Deployment
       metadata:
         namespace: default
         name: ldap
         labels:
           app: ldap
       spec:
         selector:
           matchLabels:
             app: ldap
         revisionHistoryLimit: 10
         template:
           metadata:
             labels:
               app: ldap
           spec:
             securityContext:
               seLinuxOptions: {}
             imagePullSecrets: []
             restartPolicy: Always
             initContainers: []
             containers:
               - image: 'osixia/openldap:1.4.0'
                 imagePullPolicy: IfNotPresent
                 name: ldap
                 volumeMounts:
                   - name: openldap-data
                     mountPath: /var/lib/ldap
                     subPath: data
                   - name: openldap-data
                     mountPath: /etc/ldap/slapd.d
                     subPath: config
                   - name: openldap-data
                     mountPath: /container/service/slapd/assets/certs
                     subPath: certs
                   - name: secret-volume
                     mountPath: /container/environment/01-custom
                   - name: container-run
                     mountPath: /container/run
                 args:
                   - '--copy-service'
                  resources: {}
                 env: []
                 readinessProbe:
                   tcpSocket:
                     port: openldap
                   initialDelaySeconds: 20
                   timeoutSeconds: 1
                   periodSeconds: 10
                   successThreshold: 1
                   failureThreshold: 10
                 livenessProbe:
                   tcpSocket:
                     port: openldap
                   initialDelaySeconds: 20
                   timeoutSeconds: 1
                   periodSeconds: 10
                   successThreshold: 1
                   failureThreshold: 10
                 lifecycle: {}
                 ports:
                   - name: openldap
                     containerPort: 389
                     protocol: TCP
                   - name: ssl-ldap-port
                     containerPort: 636
                     protocol: TCP
             volumes:
               - name: openldap-data
                 emptyDir: {}
               - name: secret-volume
                 secret:
                   secretName: ldap-secret
                   defaultMode: 420
                   items: []
               - name: container-run
                 emptyDir: {}
             dnsPolicy: ClusterFirst
             dnsConfig: {}
             terminationGracePeriodSeconds: 30
         progressDeadlineSeconds: 600
         strategy:
           type: RollingUpdate
           rollingUpdate:
             maxUnavailable: 25%
             maxSurge: 25%
         replicas: 1
       ---
       apiVersion: v1
       kind: Service
       metadata:
         annotations: {}
         labels:
           app: ldap
         name: ldap-service
         namespace: default
       spec:
         ports:
           - name: openldap
             port: 389
             protocol: TCP
             targetPort: openldap
           - name: ssl-ldap-port
             port: 636
             protocol: TCP
             targetPort: ssl-ldap-port
         selector:
           app: ldap
         sessionAffinity: None
         type: ClusterIP
       ---
       metadata:
         name: ldap-secret
         namespace: default
         annotations: {}
       data:
         env.startup.yaml: >-
           IyBUaGlzIGlzIHRoZSBkZWZhdWx0IGltYWdlIHN0YXJ0dXAgY29uZmlndXJhdGlvbiBmaWxlCiMgdGhpcyBmaWxlIGRlZmluZSBlbnZpcm9ubWVudCB2YXJpYWJsZXMgdXNlZCBkdXJpbmcgdGhlIGNvbnRhaW5lciAqKmZpcnN0IHN0YXJ0KiogaW4gKipzdGFydHVwIGZpbGVzKiouCgojIFRoaXMgZmlsZSBpcyBkZWxldGVkIHJpZ2h0IGFmdGVyIHN0YXJ0dXAgZmlsZXMgYXJlIHByb2Nlc3NlZCBmb3IgdGhlIGZpcnN0IHRpbWUsCiMgYWZ0ZXIgdGhhdCBhbGwgdGhlc2UgdmFsdWVzIHdpbGwgbm90IGJlIGF2YWlsYWJsZSBpbiB0aGUgY29udGFpbmVyIGVudmlyb25tZW50LgojIFRoaXMgaGVscHMgdG8ga2VlcCB5b3VyIGNvbnRhaW5lciBjb25maWd1cmF0aW9uIHNlY3JldC4KIyBtb3JlIGluZm9ybWF0aW9uIDogaHR0cHM6Ly9naXRodWIuY29tL29zaXhpYS9kb2NrZXItbGlnaHQtYmFzZWltYWdlCgojIFJlcXVpcmVkIGFuZCB1c2VkIGZvciBuZXcgbGRhcCBzZXJ2ZXIgb25seQpMREFQX09SR0FOSVNBVElPTjogRXhhbXBsZSBJbmMuCkxEQVBfRE9NQUlOOiBleGFtcGxlLm9yZwpMREFQX0JBU0VfRE46ICNpZiBlbXB0eSBhdXRvbWF0aWNhbGx5IHNldCBmcm9tIExEQVBfRE9NQUlOCgpMREFQX0FETUlOX1BBU1NXT1JEOiBhZG1pbgpMREFQX0NPTkZJR19QQVNTV09SRDogY29uZmlnCgpMREFQX1JFQURPTkxZX1VTRVI6IGZhbHNlCkxEQVBfUkVBRE9OTFlfVVNFUl9VU0VSTkFNRTogcmVhZG9ubHkKTERBUF9SRUFET05MWV9VU0VSX1BBU1NXT1JEOiByZWFkb25seQoKIyBCYWNrZW5kCkxEQVBfQkFDS0VORDogaGRiCgojIFRscwpMREFQX1RMUzogdHJ1ZQpMREFQX1RMU19DUlRfRklMRU5BTUU6IGxkYXAuY3J0CkxEQVBfVExTX0tFWV9GSUxFTkFNRTogbGRhcC5rZXkKTERBUF9UTFNfQ0FfQ1JUX0ZJTEVOQU1FOiBjYS5jcnQKCkxEQVBfVExTX0VORk9SQ0U6IGZhbHNlCkxEQVBfVExTX0NJUEhFUl9TVUlURTogU0VDVVJFMjU2Oi1WRVJTLVNTTDMuMApMREFQX1RMU19QUk9UT0NPTF9NSU46IDMuMQpMREFQX1RMU19WRVJJRllfQ0xJRU5UOiBkZW1hbmQKCiMgUmVwbGljYXRpb24KTERBUF9SRVBMSUNBVElPTjogZmFsc2UKIyB2YXJpYWJsZXMgJExEQVBfQkFTRV9ETiwgJExEQVBfQURNSU5fUEFTU1dPUkQsICRMREFQX0NPTkZJR19QQVNTV09SRAojIGFyZSBhdXRvbWF0aWNhbHkgcmVwbGFjZWQgYXQgcnVuIHRpbWUKCiMgaWYgeW91IHdhbnQgdG8gYWRkIHJlcGxpY2F0aW9uIHRvIGFuIGV4aXN0aW5nIGxkYXAKIyBhZGFwdCBMREFQX1JFUExJQ0FUSU9OX0NPTkZJR19TWU5DUFJPViBhbmQgTERBUF9SRVBMSUNBVElPTl9EQl9TWU5DUFJPViB0byB5b3VyIGNvbmZpZ3VyYXRpb24KIyBhdm9pZCB1c2luZyAkTERBUF9CQVNFX0ROLCAkTERBUF9BRE1JTl9QQVNTV09SRCBhbmQgJExEQVBfQ09ORklHX1BBU1NXT1JEIHZhcmlhYmxlcwpMREFQX1JFUExJQ0FUSU9OX0NPTkZJR19TWU5DUFJPVjogYmluZGRuPSJjbj1hZG1pbixjb
j1jb25maWciIGJpbmRtZXRob2Q9c2ltcGxlIGNyZWRlbnRpYWxzPSRMREFQX0NPTkZJR19QQVNTV09SRCBzZWFyY2hiYXNlPSJjbj1jb25maWciIHR5cGU9cmVmcmVzaEFuZFBlcnNpc3QgcmV0cnk9IjYwICsiIHRpbWVvdXQ9MSBzdGFydHRscz1jcml0aWNhbApMREFQX1JFUExJQ0FUSU9OX0RCX1NZTkNQUk9WOiBiaW5kZG49ImNuPWFkbWluLCRMREFQX0JBU0VfRE4iIGJpbmRtZXRob2Q9c2ltcGxlIGNyZWRlbnRpYWxzPSRMREFQX0FETUlOX1BBU1NXT1JEIHNlYXJjaGJhc2U9IiRMREFQX0JBU0VfRE4iIHR5cGU9cmVmcmVzaEFuZFBlcnNpc3QgaW50ZXJ2YWw9MDA6MDA6MDA6MTAgcmV0cnk9IjYwICsiIHRpbWVvdXQ9MSBzdGFydHRscz1jcml0aWNhbApMREFQX1JFUExJQ0FUSU9OX0hPU1RTOgogIC0gbGRhcDovL2xkYXAuZXhhbXBsZS5vcmcgIyBUaGUgb3JkZXIgbXVzdCBiZSB0aGUgc2FtZSBvbiBhbGwgbGRhcCBzZXJ2ZXJzCiAgLSBsZGFwOi8vbGRhcDIuZXhhbXBsZS5vcmcKCgojIFJlbW92ZSBjb25maWcgYWZ0ZXIgc2V0dXAKTERBUF9SRU1PVkVfQ09ORklHX0FGVEVSX1NFVFVQOiB0cnVlCgojIGNmc3NsIGVudmlyb25tZW50IHZhcmlhYmxlcyBwcmVmaXgKTERBUF9DRlNTTF9QUkVGSVg6IGxkYXAgIyBjZnNzbC1oZWxwZXIgZmlyc3Qgc2VhcmNoIGNvbmZpZyBmcm9tIExEQVBfQ0ZTU0xfKiB2YXJpYWJsZXMsIGJlZm9yZSBDRlNTTF8qIHZhcmlhYmxlcy4K
         env.yaml: >-
           IyBUaGlzIGlzIHRoZSBkZWZhdWx0IGltYWdlIGNvbmZpZ3VyYXRpb24gZmlsZQojIFRoZXNlIHZhbHVlcyB3aWxsIHBlcnNpc3RzIGluIGNvbnRhaW5lciBlbnZpcm9ubWVudC4KCiPCoEFsbCBlbnZpcm9ubWVudCB2YXJpYWJsZXMgdXNlZCBhZnRlciB0aGUgY29udGFpbmVyIGZpcnN0IHN0YXJ0CiMgbXVzdCBiZSBkZWZpbmVkIGhlcmUuCiMgbW9yZSBpbmZvcm1hdGlvbiA6IGh0dHBzOi8vZ2l0aHViLmNvbS9vc2l4aWEvZG9ja2VyLWxpZ2h0LWJhc2VpbWFnZQoKIyBHZW5lcmFsIGNvbnRhaW5lciBjb25maWd1cmF0aW9uCiMgc2VlIHRhYmxlIDUuMSBpbiBodHRwOi8vd3d3Lm9wZW5sZGFwLm9yZy9kb2MvYWRtaW4yNC9zbGFwZGNvbmYyLmh0bWwgZm9yIHRoZSBhdmFpbGFibGUgbG9nIGxldmVscy4KTERBUF9MT0dfTEVWRUw6IDI1Ngo=
       type: Opaque
       kind: Secret
       apiVersion: v1
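
     The values under data in a Secret are base64-encoded, as the data field requires. If you need to change a setting, decode the value, edit it, and re-encode it. A minimal round trip, using one of the settings that the env.yaml value contains (LDAP_LOG_LEVEL) as an example:

```shell
# Encode a configuration line for use in a Secret's data field...
printf 'LDAP_LOG_LEVEL: 256\n' | base64
# ...and decode it back to verify the round trip.
printf 'LDAP_LOG_LEVEL: 256\n' | base64 | base64 -d
```

     To decode a value taken from the manifest itself, pipe it to `base64 -d` in the same way.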
  2. Run the following command to deploy the LDAP backend Service:

       kubectl apply -f ldap.yaml

     Expected output:

       deployment.apps/ldap created
       service/ldap-service created
       secret/ldap-secret created

(Optional) Deploy the LDAP frontend

  1. Create a file named phpldapadmin.yaml and copy the following content to the file. This deploys an LDAP frontend pod and its associated Service so that you can manage the LDAP server through a web interface.

    View the LDAP frontend pod and its associated Service

       ---
       apiVersion: apps/v1
       kind: Deployment
       metadata:
         namespace: default
         name: phpldapadmin
         labels:
           io.kompose.service: phpldapadmin
       spec:
         selector:
           matchLabels:
             io.kompose.service: phpldapadmin
         revisionHistoryLimit: 10
         template:
           metadata:
             labels:
               io.kompose.service: phpldapadmin
           spec:
             securityContext:
               seLinuxOptions: {}
             imagePullSecrets: []
             restartPolicy: Always
             initContainers: []
             containers:
               - image: 'osixia/phpldapadmin:0.9.0'
                 imagePullPolicy: Always
                 name: phpldapadmin
                 volumeMounts: []
                  resources: {}
                 env:
                   - name: PHPLDAPADMIN_HTTPS
                     value: 'false'
                   - name: PHPLDAPADMIN_LDAP_HOSTS
                     value: ldap-service
                 lifecycle: {}
                 ports:
                   - containerPort: 80
                     protocol: TCP
             volumes: []
             dnsPolicy: ClusterFirst
             dnsConfig: {}
             terminationGracePeriodSeconds: 30
         progressDeadlineSeconds: 600
         strategy:
           type: RollingUpdate
           rollingUpdate:
             maxUnavailable: 25%
             maxSurge: 25%
         replicas: 1
       ---
       apiVersion: v1
       kind: Service
       metadata:
         namespace: default
         name: phpldapadmin
         annotations:
           k8s.kuboard.cn/workload: phpldapadmin
         labels:
           io.kompose.service: phpldapadmin
       spec:
         selector:
           io.kompose.service: phpldapadmin
         type: ClusterIP
         ports:
           - port: 8080
             targetPort: 80
             protocol: TCP
             name: '8080'
             nodePort: 0
         sessionAffinity: None
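
     The Service above is of type ClusterIP, so phpLDAPadmin is reachable only from inside the cluster (for example, through kubectl port-forward svc/phpldapadmin 8080:8080). If you need external access, one option is to change the Service type. The following is a hypothetical variant, not part of this guide's procedure; confirm that a LoadBalancer implementation is available in your cluster, and note that exposing an admin UI publicly is a security risk:

```yaml
# Hypothetical variant: expose phpLDAPadmin through a load balancer.
apiVersion: v1
kind: Service
metadata:
  namespace: default
  name: phpldapadmin
  labels:
    io.kompose.service: phpldapadmin
spec:
  selector:
    io.kompose.service: phpldapadmin
  type: LoadBalancer   # changed from ClusterIP
  ports:
    - port: 8080
      targetPort: 80
      protocol: TCP
```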

  2. Run the following command to deploy the LDAP frontend Service:

       kubectl apply -f phpldapadmin.yaml

Configure the LDAP client

  1. Log on to a pod in the Slurm-managed cluster as described in Step 3, and run the following commands to install the LDAP client package:

       apt update
       apt install libnss-ldapd
  2. After the libnss-ldapd package is installed, configure the network authentication service for the Slurm-managed cluster in the pod.

     1. Run the following commands to install a text editor if one is not already available:

        apt update
        apt install vim
     2. Modify the following parameters in the LDAP client configuration file (typically /etc/ldap/ldap.conf) to define the connection to the LDAP server:

        ...
        BASE	dc=example,dc=org # Replace the value with the distinguished name of the root node in the LDAP directory structure.
        URI	ldap://ldap-service # Replace the value with the uniform resource identifier (URI) of your LDAP server.
        ...
     3. Modify the following parameters in the /etc/nslcd.conf file:

        ...
        uri ldap://ldap-service # Replace the value with the URI of your LDAP server.
        base dc=example,dc=org # Specify this parameter based on your LDAP directory structure.
        ...
        tls_cacertfile /etc/ssl/certs/ca-certificates.crt # Specify the path to the certificate authority (CA) certificate file used to verify the LDAP server certificate.
        ...
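
     Note that nslcd only answers lookups that are routed to it: the relevant databases in /etc/nsswitch.conf must list the ldap source (the libnss-ldapd package normally offers to update this file during installation). The relevant lines typically look like the following sketch:

```
passwd: files ldap
group:  files ldap
shadow: files ldap
```

     After changing /etc/nslcd.conf or /etc/nsswitch.conf, restart the nslcd service (for example, service nslcd restart) for the changes to take effect.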

Share and access logs

By default, the job logs generated by the sbatch command are stored on the node that executes the jobs. This can make it difficult to view logs centrally. To simplify log management, you can create a File Storage NAS (NAS) file system to store all job logs in accessible directories. This allows logs to be centrally collected and accessed regardless of which node executes the computing jobs.

  1. Create a NAS file system to store and share the logs of each node. For more information, see Create a file system.

  2. Log on to the ACK console, and create a persistent volume (PV) and a persistent volume claim (PVC) for the NAS file system. For more information, see Mount a statically provisioned NAS volume.

  3. Modify the SlurmCluster CR. Configure the volumeMounts and volumes parameters in the slurmctld and workerGroupSpecs sections to reference the created PVC and mount it to the /home directory. Example:

       slurmctld:
       ...
       # Specify /home as the mount target.
         volumeMounts:
         - mountPath: /home
           name: test  # The name of the volume that references the PVC.
         volumes:
       # Add the PVC definition.
         - name: test  # Must match the name in volumeMounts.
           persistentVolumeClaim:
             claimName: test  # Replace with the name of your PVC.
       ...
       workerGroupSpecs:
         # ... Repeat the volume and volumeMounts configuration for each worker group.
  4. Run the following command to deploy the SlurmCluster CR. After the SlurmCluster CR is deployed, worker nodes can share the NAS file system.

    Important

    If the SlurmCluster CR fails to deploy, run the kubectl delete slurmcluster slurm-job-demo command to delete the CR and then redeploy it.

       kubectl apply -f slurmcluster.yaml
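
The statically provisioned NAS volume referenced in step 2 might look like the following sketch. The server address is a placeholder and the names are assumptions chosen to match the claimName in the SlurmCluster example above; replace them with your own values.

```yaml
# Sketch of a statically provisioned NAS PV and PVC.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: test                 # Must match the PV name.
    volumeAttributes:
      server: "xxxx.cn-hangzhou.nas.aliyuncs.com"  # Placeholder NAS mount target.
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test                           # Matches claimName in the SlurmCluster CR example.
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  volumeName: test
```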

Perform auto scaling for the Slurm-managed cluster

The root path of the default Slurm image contains executable files and scripts such as slurm-resume.sh, slurm-suspend.sh, and slurmctld-copilot. These interact with slurmctld to scale the Slurm-managed cluster.

Auto scaling for Slurm clusters based on on-cloud nodes

Slurm on ACK supports two types of nodes:

  • Local nodes: Physical compute nodes that are directly connected to slurmctld.

  • On-cloud nodes: Logical nodes backed by VM instances that can be created and destroyed on demand by cloud service providers.

image

Auto scaling for Slurm on ACK

image

Procedure

  1. Configure auto scaling permissions.

     If Helm is installed, auto scaling permissions are automatically configured for the slurmctld pod and you can skip this step. The head pod requires permissions to access and update the SlurmCluster CR for auto scaling. We recommend that you use RBAC to grant the required permissions. Follow these steps:

     1. Create the ServiceAccount, Role, and RoleBinding for the slurmctld pod. In the following example, the Slurm-managed cluster name is slurm-job-demo and the namespace is default. Create a file named rbac.yaml, copy the ServiceAccount, Role, and RoleBinding manifests shown below to the file, and then run kubectl apply -f rbac.yaml to submit the resources.

     2. Grant permissions to the slurmctld pod. Run kubectl edit slurmcluster slurm-job-demo to modify the Slurm-managed cluster, and set spec.slurmctld.template.spec.serviceAccountName to the ServiceAccount that you created, as shown in the SlurmCluster snippet below.

     3. To apply the changes, rebuild the StatefulSet that manages slurmctld. Run kubectl get sts to find the StatefulSet, and then run kubectl delete sts slurm-job-demo to delete it. The Slurm Operator rebuilds the StatefulSet and applies the new configuration.

       apiVersion: v1
       kind: ServiceAccount
       metadata:
         name: slurm-job-demo
       ---
       apiVersion: rbac.authorization.k8s.io/v1
       kind: Role
       metadata:
         name: slurm-job-demo
       rules:
       - apiGroups: ["kai.alibabacloud.com"]
         resources: ["slurmclusters"]
         verbs: ["get", "watch", "list", "update", "patch"]
         resourceNames: ["slurm-job-demo"]
       ---
       apiVersion: rbac.authorization.k8s.io/v1
       kind: RoleBinding
       metadata:
         name: slurm-job-demo
       subjects:
       - kind: ServiceAccount
         name: slurm-job-demo
       roleRef:
         kind: Role
         name: slurm-job-demo
          apiGroup: rbac.authorization.k8s.io

     The following snippet shows the serviceAccountName setting in the SlurmCluster CR:

        apiVersion: kai.alibabacloud.com/v1
       kind: SlurmCluster
       ...
       spec:
         slurmctld:
           template:
             spec:
               serviceAccountName: slurm-job-demo
       ...
  2. Configure the auto scaling parameters in /etc/slurm/slurm.conf.

    Method A: Manage ConfigMaps by using a shared volume

       # The following parameters are required if you use on-cloud nodes.
       # The SuspendProgram and ResumeProgram features are developed by Alibaba Cloud.
       SuspendTimeout=600
       ResumeTimeout=600
        # The idle time, in seconds, after which a node with no running jobs is automatically suspended.
        SuspendTime=600
        # The maximum number of nodes that can be resumed or suspended per minute.
        ResumeRate=1
        SuspendRate=1
        # Set the NodeName value in the ${cluster_name}-worker-${group_name}- format, and specify the node's resources in this line.
        # Otherwise, the slurmctld pod assumes that the node has only one vCPU. Make sure that the resources specified for the
        # on-cloud nodes match those declared in the workerGroupSpecs parameter. Otherwise, resources may be wasted.
       NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
       # The following configurations are fixed. Keep them unchanged.
       CommunicationParameters=NoAddrCache
       ReconfigFlags=KeepPowerSaveSettings
       SuspendProgram="/slurm-suspend.sh"
       ResumeProgram="/slurm-resume.sh"
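
     The [0-10] suffix in the NodeName line above is Slurm hostlist range syntax: it covers 11 node names, slurm-job-demo-worker-cpu-0 through slurm-job-demo-worker-cpu-10. Slurm parses the bracket form itself; purely as an illustration, the expansion can be sketched in the shell:

```shell
# List the node names covered by slurm-job-demo-worker-cpu-[0-10].
seq -f 'slurm-job-demo-worker-cpu-%g' 0 10
```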

    Method B: Manually manage ConfigMaps

     If slurm.conf is stored in the ConfigMap named slurm-config, run kubectl edit configmap slurm-config to add the following configurations:

       slurm.conf:
       ...
         # The following parameters are required if you use on-cloud nodes.
         # The SuspendProgram and ResumeProgram features are developed by Alibaba Cloud.
         SuspendTimeout=600
         ResumeTimeout=600
          # The idle time, in seconds, after which a node with no running jobs is automatically suspended.
          SuspendTime=600
          # The maximum number of nodes that can be resumed or suspended per minute.
          ResumeRate=1
          SuspendRate=1
          # Set the NodeName value in the ${cluster_name}-worker-${group_name}- format, and specify the node's resources in this line.
          # Otherwise, the slurmctld pod assumes that the node has only one vCPU. Make sure that the resources specified for the
          # on-cloud nodes match those declared in the workerGroupSpecs parameter. Otherwise, resources may be wasted.
         NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
         # The following configurations are fixed. Keep them unchanged.
         CommunicationParameters=NoAddrCache
         ReconfigFlags=KeepPowerSaveSettings
         SuspendProgram="/slurm-suspend.sh"
         ResumeProgram="/slurm-resume.sh"

    Method C: Use Helm to manage ConfigMaps

     1. Add the following configurations to the slurm.conf section of the values.yaml file, and then run the helm upgrade command to update the Slurm configuration.

       slurm.conf:
       ...
         # The following parameters are required if you use on-cloud nodes.
         # The SuspendProgram and ResumeProgram features are developed by Alibaba Cloud.
         SuspendTimeout=600
         ResumeTimeout=600
          # The idle time, in seconds, after which a node with no running jobs is automatically suspended.
          SuspendTime=600
          # The maximum number of nodes that can be resumed or suspended per minute.
          ResumeRate=1
          SuspendRate=1
          # Set the NodeName value in the ${cluster_name}-worker-${group_name}- format, and specify the node's resources in this line.
          # Otherwise, the slurmctld pod assumes that the node has only one vCPU. Make sure that the resources specified for the
          # on-cloud nodes match those declared in the workerGroupSpecs parameter. Otherwise, resources may be wasted.
         NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
         # The following configurations are fixed. Keep them unchanged.
         CommunicationParameters=NoAddrCache
         ReconfigFlags=KeepPowerSaveSettings
         SuspendProgram="/slurm-suspend.sh"
         ResumeProgram="/slurm-resume.sh"
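
     As a rough sanity check on these values: ResumeRate and SuspendRate are expressed in nodes per minute, so with ResumeRate=1, powering up all 11 nodes in the [0-10] range takes at least about 11 minutes. A small sketch of that estimate:

```shell
# Estimate the minimum time to resume N cloud nodes at a given ResumeRate (nodes per minute).
nodes=11
resume_rate=1
echo "at least $(( nodes / resume_rate )) minutes to resume ${nodes} nodes"
```

     Raise ResumeRate if your workloads need large groups of on-cloud nodes to come up quickly.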
  3. Apply the new configuration. If the name of the Slurm-managed cluster is slurm-job-demo, run kubectl delete sts slurm-job-demo to delete the StatefulSet. The Slurm Operator rebuilds it, which applies the new configuration to the slurmctld pod.

  4. Set the number of worker node replicas to 0 in the slurmcluster.yaml file so that you can observe node scaling activities in subsequent steps.

    Manual management

     Run kubectl edit slurmcluster slurm-job-demo and change the value of workerCount to 0 in the Slurm-managed cluster. This sets the number of worker node replicas to 0.

    Manage by using Helm

    In the values.yaml file, change .Values.workerGroup[].workerCount to 0. Then run helm upgrade slurm-job-demo . to update the current Helm chart. This sets the number of worker node replicas to 0.

  5. Submit a job by using the sbatch command.

     1. Run the following commands to create the job script:

        cat << EOF > cloudnodedemo.sh
        > #!/bin/bash
        > srun hostname
        > EOF
        cat cloudnodedemo.sh

        Expected output:

        #!/bin/bash
        srun hostname

        This output confirms that the script content is correct.

     2. Run the following command to submit the script to the Slurm-managed cluster for processing:

        sbatch cloudnodedemo.sh

        Expected output:

        Submitted batch job 1

        This output indicates that the job is submitted and assigned a job ID.
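
     Instead of passing options on the command line, sbatch options can also be embedded as #SBATCH directives in the script header, which sbatch reads at submission time. A minimal sketch (the option names are standard Slurm options; the values are illustrative, not part of this guide's procedure):

```shell
# Create a batch script with embedded #SBATCH directives (illustrative values).
cat << 'EOF' > cloudnodedemo2.sh
#!/bin/bash
#SBATCH --job-name=clouddemo
#SBATCH --nodes=1
#SBATCH --output=slurm-%j.out
srun hostname
EOF
cat cloudnodedemo2.sh
```

     Submit it the same way with sbatch cloudnodedemo2.sh; %j in the --output path is replaced with the job ID.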
  6. View the cluster scaling results.

     1. Run the following command to view the scale-out log:

        cat /var/log/slurm-resume.log

        Expected output:

        namespace: default cluster: slurm-demo
          resume called, args [slurm-demo-worker-cpu-0]
          slurm cluster metadata: default slurm-demo
          get SlurmCluster CR slurm-demo succeed
          hostlists: [slurm-demo-worker-cpu-0]
          resume node slurm-demo-worker-cpu-0
          resume worker -cpu-0
          resume node -cpu-0 end

        This output indicates that the Slurm-managed cluster automatically added one compute node to execute the submitted job.

     2. Run the following command to view the pods in the cluster:

        kubectl get pod

        Expected output:

        NAME                                          READY   STATUS    RESTARTS        AGE
        slurm-demo-head-9hn67                         1/1     Running   0               21m
        slurm-demo-worker-cpu-0                       1/1     Running   0               43s

        This output shows that the slurm-demo-worker-cpu-0 pod was added to the cluster, confirming that the cluster scaled out when the job was submitted.

     3. Run the following command to view the node status:

        sinfo

        Expected output:

        PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
        debug*       up   infinite      10  idle~ slurm-job-demo-worker-cpu-[2-10]
        debug*       up   infinite      1   idle slurm-job-demo-worker-cpu-[0-1]

        This output shows that slurm-demo-worker-cpu-0 is the newly started node and another 10 on-cloud nodes are available for scale-out.

     4. Run the following command to view the job details:

        scontrol show job 1

        Expected output:

        JobId=1 JobName=cloudnodedemo.sh
           UserId=root(0) GroupId=root(0) MCS_label=N/A
           Priority=4294901757 Nice=0 Account=(null) QOS=(null)
           JobState=COMPLETED Reason=None Dependency=(null)
           Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
           RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
           SubmitTime=2024-05-28T11:37:36 EligibleTime=2024-05-28T11:37:36
           AccrueTime=2024-05-28T11:37:36
           StartTime=2024-05-28T11:37:36 EndTime=2024-05-28T11:37:36 Deadline=N/A
           SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-05-28T11:37:36 Scheduler=Main
           Partition=debug AllocNode:Sid=slurm-job-demo:93
           ReqNodeList=(null) ExcNodeList=(null)
           NodeList=slurm-job-demo-worker-cpu-0
           BatchHost=slurm-job-demo-worker-cpu-0
           NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
           ReqTRES=cpu=1,mem=1M,node=1,billing=1
           AllocTRES=cpu=1,mem=1M,node=1,billing=1
           Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
           MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
           Features=(null) DelayBoot=00:00:00
           OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
           Command=//cloudnodedemo.sh
           WorkDir=/
           StdErr=//slurm-1.out
           StdIn=/dev/null
           StdOut=//slurm-1.out
           Power=

        In the output, NodeList=slurm-job-demo-worker-cpu-0 indicates that the job was executed on the newly added node.

     5. After a period of time, run the following command to view the scale-in results:

        sinfo

        Expected output:

        PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
        debug*       up   infinite     11  idle~ slurm-demo-worker-cpu-[0-10]

        This output shows that the number of nodes available for scale-out has increased to 11, confirming that the automatic scale-in is complete.