
Platform for AI: Manage Lingjun clusters and nodes

Last Updated: Jan 23, 2024

A Lingjun cluster is a collection of high-performance Lingjun compute nodes equipped with Lingjun optimization components. Each Lingjun node corresponds to a GPU server, which can be used to deploy heterogeneous computing services. This topic describes how to manage Lingjun clusters and Lingjun nodes. For example, you can view the information about a Lingjun cluster or node and scale out a Lingjun cluster.

Manage Lingjun clusters


A Lingjun cluster can be in one of the following states:

  • Initialization Failed: The cluster failed to be initialized. For information about how to view the details of the failure, see O&M Task Center.

  • Initializing: The network of the cluster is being configured, and the Lingjun compute nodes of the cluster are being initialized.

  • Running: The cluster is running. You can scale out or scale in a cluster, or reinstall or restart a node only when the cluster is in the Running state.

    Important

    If the cluster scale-out, cluster scale-in, node reinstall, and node restart tasks involve different Lingjun compute nodes, you can submit these tasks at the same time to run them in parallel.
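
The two rules above (tasks are accepted only in the Running state, and tasks can be submitted together only if they involve different nodes) can be illustrated with a small local model. This is a minimal Python sketch, not a Lingjun API: the names ClusterState, Task, and can_submit_together are hypothetical, and the Scaling Up and Scaling Down states are taken from the scale-out and scale-in procedures in this topic.

```python
from dataclasses import dataclass
from enum import Enum


class ClusterState(Enum):
    INITIALIZATION_FAILED = "Initialization Failed"
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    SCALING_UP = "Scaling Up"      # shown while a scale-out is in progress
    SCALING_DOWN = "Scaling Down"  # shown while a scale-in is in progress


@dataclass(frozen=True)
class Task:
    """Hypothetical description of one task; not an official data structure."""
    kind: str             # "scale_out", "scale_in", "reinstall", or "restart"
    node_ids: frozenset   # IDs of the nodes that the task involves


def can_submit_together(state, tasks):
    """Return True if every task can be submitted at the same time."""
    # Scale-out, scale-in, reinstall, and restart require the Running state.
    if state is not ClusterState.RUNNING:
        return False
    # Tasks can run in parallel only if no node appears in two tasks.
    seen = set()
    for task in tasks:
        if seen & task.node_ids:
            return False
        seen |= task.node_ids
    return True
```

For example, a restart of node A and a reinstall of node B pass the check, whereas adding a scale-in task that also involves node B causes the check to fail.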

View the information about a cluster

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources and Nodes > Cluster Management.

  3. Find the cluster that you want to manage and click Details in the Operations column. The Cluster Details page appears.

    1. View the basic information about the cluster, such as the cluster name, number of node groups, and creation information.

    2. View more information about the cluster on the Node Group, Monitoring and Alerting, Basic Metrics, RDMA, and GPU tabs.

Scale out a cluster

Note

When you scale out a cluster, you must install a Cloud Parallel File Storage (CPFS) client on each GPU node that you add, and then add those nodes to the associated CPFS cluster.

You must also add tags to the added nodes.

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources and Nodes > Cluster Management.

  3. Find the cluster that you want to manage and click Expand in the Operations column.

    1. In the Original Group Details section, find a node group and click Scale Up in the Actions column.

    2. In the dialog box that appears, configure the Node Name Prefix, Logon Password, and Confirm Password parameters.

    3. On the Unused tab, select one or more unused nodes or click Purchase Node to purchase nodes. Then, click Yes.

  4. In the section titled "The following information displays the detailed configurations for scale-up", click Confirm Submission.

  5. Go back to the Cluster Management page. The state of the cluster is Scaling Up. Wait until the scale-out is complete.

Scale in a cluster

Warning
  • If you scale in a cluster, the removed nodes are reinstalled and all data is cleared from the removed nodes. Make sure that the node data has been backed up before you remove a node.

  • You must also remove related nodes from the associated CPFS cluster.

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources and Nodes > Cluster Management.

  3. Find the cluster that you want to manage and click Shrink in the Operations column.

    1. In the Original Group Details section, select one or more nodes that you want to remove from the cluster and click Batch Remove from Cluster.

    2. In the section titled "The following information displays the detailed configurations for scale-down", click Confirm Submission.

  4. On the Confirm Scale-down Configurations page, enter DELETE in the field and click OK.

  5. Go back to the Cluster Management page. The state of the cluster is Scaling Down. Wait until the scale-in is complete.
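
Because a scale-in clears all data from the removed nodes, it can help to check the warning conditions before you open the console. The following Python sketch is a local checklist helper under stated assumptions: the function name and both arguments are hypothetical, the backup status is something you track yourself, and nothing here calls a Lingjun or CPFS API.

```python
def scale_in_checklist(nodes_to_remove, backed_up_nodes):
    """Return the CPFS follow-up actions for a planned scale-in.

    Raises ValueError if any node that is to be removed has no confirmed backup,
    because removed nodes are reinstalled and their data is cleared.
    """
    to_remove = set(nodes_to_remove)
    missing_backup = sorted(to_remove - set(backed_up_nodes))
    if missing_backup:
        raise ValueError("Back up data first on: " + ", ".join(missing_backup))
    # Each removed node must also be removed from the associated CPFS cluster.
    return [f"Remove {node} from the associated CPFS cluster"
            for node in sorted(to_remove)]
```

Calling the helper with a node that is not in backed_up_nodes raises an error instead of returning a checklist, which mirrors the requirement to back up data before you remove a node.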

Delete a cluster

Important
  • Before you delete a cluster, you must remove all nodes from the cluster.

  • The associated CPFS cluster is not deleted when a cluster is deleted.

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources and Nodes > Cluster Management.

  3. Click the ID of the cluster that you want to delete. On the Cluster Details page, click Delete in the upper-right corner.

  4. In the message that appears, click OK.
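
The precondition for deletion (all nodes must be removed from the cluster first) can be expressed as a simple check. The sketch below is a local Python model only: the function name and the dictionary layout, which maps a node group name to a list of node IDs, are hypothetical and do not correspond to any Lingjun API.

```python
def nodes_blocking_deletion(node_groups):
    """Return the IDs of nodes that must be removed before the cluster can be deleted.

    node_groups is a hypothetical mapping of node group name to a list of node IDs.
    An empty result means that the cluster contains no nodes.
    """
    remaining = []
    for group_name in sorted(node_groups):
        remaining.extend(node_groups[group_name])
    return remaining
```

If the returned list is not empty, scale in the cluster to remove those nodes before you delete the cluster.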

Create a node group for a cluster

You can create a node group for a Lingjun cluster in one of the following ways:

  • Create a node group for a cluster when you create the cluster. For more information, see Configure clusters and node groups.

  • Create a node group for an existing cluster.

    1. Log on to the Intelligent Computing Lingjun console.

    2. In the left-side navigation pane, choose Resources and Nodes > Cluster Management.

    3. Click the ID of the cluster for which you want to create a node group.

    4. On the Cluster Details page, click the Node Group tab.

    5. On the Node Group tab, click Create Group. Configure the information about a node group, such as the name of the node group and the default model.

    6. Optional. After you create a node group, you can modify the name of the node group or delete the node group.

Manage Lingjun nodes

Important

You can perform only one operation on a Lingjun compute node at a time. The operations include adding the node to a cluster, removing the node from a cluster, reinstalling the node, and restarting the node.

Purchase a node

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources and Nodes > Node Management.

  3. On the Node Management page, click Purchase Node.

  4. Follow the instructions to purchase a node.

View the details of a node

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources and Nodes > Node Management. The Node Management page appears.

  3. Click the All tab to view all nodes.

    • You can view the basic information about a node, such as the node ID, node name, image name, and zone.

    • You can search for nodes based on keywords. First, select a category from the drop-down list, such as Image Name, Zone, or IP Address. Then, enter a keyword in the search box and click the search icon.

  4. Click the Unused tab to view the unused nodes. You can view the basic information about an unused node, such as the node type and GPU.

Log on to a node

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources and Nodes > Node Management.

  3. Find the node that you want to manage, click the More icon in the Actions column, and then select Remote Logon.

    • Use root as the logon username.

    • Use the logon password of the cluster. For more information, see the Configure clusters and node groups section of the "Create a basic Lingjun cluster" topic.

Reinstall a node

Important
  • If you reinstall a node, the node data is deleted. Exercise caution when you reinstall a node.

  • You can reinstall a node only when the cluster is in the Running state.

  • Before you reinstall a node, remove the node from the associated CPFS cluster. After the reinstallation is complete, add the node back to the CPFS cluster.

You may need to reinstall a node in the following situations:

  • You want to redeploy your workloads.

  • You want to change the OS version.

  • You need to meet O&M requirements.

Procedure

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources and Nodes > Node Management.

  3. On the Node Management page, find the node that you want to manage and click Reinstall in the Actions column. In the dialog box that appears, select an image version, modify the node name, enter and confirm the root password of the node, and then click Reinstall.
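
The ordering constraints for a reinstall (the cluster is in the Running state, the node leaves the CPFS cluster first, and the node rejoins the CPFS cluster afterward) can be summarized as an ordered plan. The following Python sketch only builds a list of steps to follow manually. The function name and its parameters are hypothetical, and the code does not call any Lingjun or CPFS API.

```python
def reinstall_plan(node_id, cluster_state, in_cpfs_cluster):
    """Return the ordered manual steps for reinstalling one node.

    cluster_state is the state shown in the console, for example "Running".
    in_cpfs_cluster indicates whether the node belongs to the associated CPFS cluster.
    """
    # A node can be reinstalled only when the cluster is in the Running state.
    if cluster_state != "Running":
        raise ValueError(f"Cluster state is {cluster_state}; wait until it is Running")
    steps = []
    if in_cpfs_cluster:
        steps.append(f"Remove {node_id} from the associated CPFS cluster")
    steps.append(f"Reinstall {node_id} in the console (all node data is deleted)")
    if in_cpfs_cluster:
        steps.append(f"Add {node_id} back to the CPFS cluster")
    return steps
```

Returning the steps as data rather than executing them keeps the sketch independent of any specific tooling; each step is still performed in the console or with the CPFS tools that you already use.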

Restart a node

Important
  • Restarting a node may affect business continuity.

  • You can restart a node only when the cluster is in the Running state.

You may need to restart a node in the following situations:

  • You want to deploy a new application or service.

  • You modified system settings that take effect only after a restart.

  • You need to meet O&M requirements.

Procedure

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources and Nodes > Node Management.

  3. On the Node Management page, find the node that you want to manage and click Restart in the Actions column.