An Elastic High Performance Computing (E-HPC) cluster is a group of Elastic Compute Service (ECS) instances that deliver high-performance computing capabilities. Compared with typical ECS instances, E-HPC clusters offer higher performance, scalability, reliability, and availability. This topic describes the terms and features of an E-HPC cluster.
Nodes
Each node in an E-HPC cluster is an ECS instance. The nodes are classified into logon nodes, management nodes, and compute nodes. The following table describes each type of node and its role in an E-HPC cluster.
Node | Description |
Logon node | The logon node is used to log on to an E-HPC cluster. You can also debug, compile, and install software, and submit jobs through the logon node. |
Management node | The management node is used to manage the cluster. The scheduling service and domain account service are deployed on it. Important: The management node schedules jobs and resolves domain accounts. To ensure business continuity, do not use management nodes to compile software or to upload or download compressed data. |
Compute node | The compute node is used to run high-performance computing jobs. |
We recommend that you choose the instance specifications of the management nodes, and the number of jobs that you schedule, based on the number of compute nodes. The following table lists the recommended instance specifications and job quantities.
Number of compute nodes | Specifications of management nodes | Job quantity |
100 or fewer compute nodes | | |
500 or fewer compute nodes | | |
More than 500 compute nodes | | |
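The three size tiers in the table can be expressed as a small lookup. The following is a minimal sketch in Python. The function name `management_tier` and the labels `small`, `medium`, and `large` are hypothetical and are not part of any E-HPC API. The sketch returns only the tier, not instance specifications or job quantities.

```python
def management_tier(compute_nodes: int) -> str:
    """Map a compute node count to one of the three sizing tiers.

    The thresholds (100 and 500) come from the table above.
    The tier labels are illustrative only.
    """
    if compute_nodes < 0:
        raise ValueError("compute_nodes must be non-negative")
    if compute_nodes <= 100:
        return "small"   # 100 or fewer compute nodes
    if compute_nodes <= 500:
        return "medium"  # 500 or fewer compute nodes
    return "large"       # more than 500 compute nodes
```

A capacity-planning script could call `management_tier(planned_nodes)` and then look up the recommended specifications for that tier in the table.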
Images
An image includes the operating system and configuration data for your business. It is used to create the ECS instances that make up an E-HPC cluster. E-HPC supports the following types of images:
Public images: images provided by Alibaba Cloud.
Custom images: images created from ECS instances or snapshots, or images imported from your computer.
Shared images: images shared by other Alibaba Cloud accounts.
Alibaba Cloud Marketplace images: images provided by independent software vendors (ISVs) that are licensed by Alibaba Cloud Marketplace.
Community images: images that are released on the image platform of Alibaba Cloud Community.
The image types that you can select vary based on the specified region, specified instance type for the node, and whether the current Alibaba Cloud account has available image resources. All available image types are displayed on the console.
The schedulers, domain account services, and supported shared storage and software vary based on images.
For more information, see Image overview.
Scheduler
Schedulers are used to schedule jobs on a cluster. The following table describes the schedulers that are supported by E-HPC:
Type | Scheduler | Displayed in the console |
PBS | PBS Pro19 | pbs19 |
PBS | PBS Pro18 | pbs |
PBS | OpenPBS 20 | pbs |
PBS | OpenPBS 22 | pbs |
Slurm | Slurm 22 | slurm22 |
Slurm | Slurm 20 | slurm20 |
Slurm | Slurm 19 | slurm19 |
Slurm | Slurm 17 | slurm |
GridEngine | Open Grid Scheduler (SGE) | opengridscheduler |
Others | Deadline | deadline |
Note: For pbs, the version of the scheduler software that is installed depends on the image that you use.
The supported schedulers vary based on images. For more information, see the "Schedulers, domain account services, and shared storage supported by images" section in this topic.
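The scheduler table can be kept as a small lookup structure when you need to translate between scheduler versions and the identifiers shown in the console. The following is a minimal sketch. The names `SCHEDULERS`, `schedulers_for`, and `scheduler_type` are hypothetical, and the sketch assumes that the `pbs` identifier covers PBS Pro18, OpenPBS 20, and OpenPBS 22, as the table layout suggests.

```python
# (type, scheduler, console identifier), transcribed from the table above.
# Assumption: PBS Pro18, OpenPBS 20, and OpenPBS 22 share the "pbs" identifier.
SCHEDULERS = [
    ("PBS", "PBS Pro19", "pbs19"),
    ("PBS", "PBS Pro18", "pbs"),
    ("PBS", "OpenPBS 20", "pbs"),
    ("PBS", "OpenPBS 22", "pbs"),
    ("Slurm", "Slurm 22", "slurm22"),
    ("Slurm", "Slurm 20", "slurm20"),
    ("Slurm", "Slurm 19", "slurm19"),
    ("Slurm", "Slurm 17", "slurm"),
    ("GridEngine", "Open Grid Scheduler (SGE)", "opengridscheduler"),
    ("Others", "Deadline", "deadline"),
]


def schedulers_for(console_name: str) -> list[str]:
    """Return every scheduler version that a console identifier can refer to."""
    return [name for _, name, ident in SCHEDULERS if ident == console_name]


def scheduler_type(console_name: str) -> str:
    """Return the scheduler type (PBS, Slurm, ...) for a console identifier."""
    for kind, _, ident in SCHEDULERS:
        if ident == console_name:
            return kind
    raise KeyError(f"unknown console identifier: {console_name}")
```

Because `pbs` maps to more than one version, `schedulers_for("pbs")` returns a list. Which version is actually installed depends on the image.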
Domain account services
The domain account service is used to manage cluster users. E-HPC supports the following domain account services:
Network Information Service (NIS) provides centralized identity management. You create a user once on the NIS server. After a new node is added to NIS, you can log on to the node as that user without creating the user on each node.
Lightweight Directory Access Protocol (LDAP) is used to authenticate E-HPC users. You can authorize and group users by using LDAP to simplify permission management within your organization.
The supported domain account services vary based on images. For more information, see the "Schedulers, domain account services, and shared storage supported by images" section in this topic.
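One practical consequence of a domain account service is that a cluster user resolves on every node even though no local account exists there. The following minimal sketch checks this from Python by using the standard `pwd` module, which goes through the operating system's account lookup. It assumes that the node is a Linux host whose name service switch is already configured for NIS or LDAP. The function name `user_resolves` is hypothetical.

```python
import pwd


def user_resolves(username: str) -> bool:
    """Return True if the operating system can resolve the given user name.

    On a node that is joined to NIS or LDAP, this also succeeds for
    domain users that have no entry in the local /etc/passwd file.
    """
    try:
        pwd.getpwnam(username)
    except KeyError:
        # The user is unknown to every configured account source.
        return False
    return True
```

Running `user_resolves("some_cluster_user")` on a newly added compute node is a quick way to confirm that the node can see domain accounts before jobs are submitted as that user.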
Shared storage
The user data, scheduler information, and shared job data of E-HPC clusters are stored in the file system for shared access by all nodes in the cluster. E-HPC supports the following types of file systems:
Apsara File Storage NAS: includes General-purpose NAS and Extreme NAS.
Cloud Parallel File Storage (CPFS) file system: supports CPFS-NFS and CPFS-POSIX mounting methods.
Others: file storage that is not hosted by Alibaba Cloud, such as your self-managed NAS file system.
The supported storage varies based on images. For more information, see the "Schedulers, domain account services, and shared storage supported by images" section in this topic.
Schedulers, domain account services, and shared storage supported by images
The following table describes the schedulers, domain account services, and shared storage that each image supports.
If you create an E-HPC cluster in the E-HPC console, the supported image types, schedulers, and domain account services are displayed in the console.
For images that are labeled with a custom scheduler, custom domain account service, or custom shared storage in the table, the scheduler, domain account service, and shared storage are not provided with the image. You need to install them yourself.
CentOS 6 and CentOS 8 have reached end of life (EOL), which means that the Linux community no longer maintains these operating system versions. For security and reliability reasons, we recommend that you switch to other operating systems. For more information, see How do I change CentOS 6 repository addresses? and Change CentOS 8 repository addresses.
Public image | Scheduler | Domain account service | Shared storage |
| | | |
CentOS 8.0 64-bit | OpenPBS 20 | NIS | |
CentOS 6.9 64-bit | | | |
CentOS 6.10 64-bit | Custom | Custom | |
Alibaba Cloud Linux 2.1903 LTS 64-bit | PBS Pro18 | | |
Alibaba Cloud Linux 3.2104 LTS 64-bit | Open Grid Scheduler (SGE) | NIS | |
Alibaba Cloud Linux 3.2104 LTS 64-bit for ARM | Open Grid Scheduler (SGE) | NIS | |
Ubuntu 20.04 64-bit | Slurm 22 | NIS | |
Ubuntu 20.04 64-bit for ARM | Slurm 22 | NIS | |
| Custom | Custom | Custom |
E-HPC cluster users
You must create a user to submit, debug, and run jobs on an E-HPC cluster. When you create a user, you can grant the user one of two types of permissions:
Ordinary permissions: suitable for ordinary users who only need to submit and debug jobs.
Sudo permissions: suitable for administrative users who need to manage the E-HPC cluster. In addition to ordinary permissions, sudo permissions allow users to install software and restart nodes by running sudo commands.
Important: You can create a root user only when you create an E-HPC cluster. We recommend that you do not use the root user for day-to-day operations. This minimizes the risk of damage to cluster data due to improper or accidental operations.
For more information, see Manage users.
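The relationship between the two permission types can be modeled as a set and its superset: sudo permissions include everything that ordinary permissions allow. The following is an illustrative sketch only. The enum, the action names, and the `allowed_actions` function are hypothetical and do not correspond to any E-HPC API.

```python
from enum import Enum


class Permission(Enum):
    ORDINARY = "ordinary"
    SUDO = "sudo"


# Hypothetical action names that mirror the descriptions above.
ORDINARY_ACTIONS = frozenset({"submit_job", "debug_job"})
SUDO_ONLY_ACTIONS = frozenset({"install_software", "restart_node"})


def allowed_actions(permission: Permission) -> frozenset:
    """Return the actions a user with the given permission type can perform."""
    if permission is Permission.SUDO:
        # Sudo permissions are ordinary permissions plus administrative actions.
        return ORDINARY_ACTIONS | SUDO_ONLY_ACTIONS
    return ORDINARY_ACTIONS
```

Modeling sudo as a superset keeps the rule "in addition to ordinary permissions" explicit: any check that passes for an ordinary user also passes for a sudo user.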
Software
E-HPC provides access to major computing applications, runtime libraries, and Message Passing Interface (MPI) libraries. You can install the software based on your business requirements. For more information, see Software overview.
E-HPC cluster status
Creating: The cluster is being created. The ECS instances that make up the cluster are created in this stage.
Uninitialized: The image is being installed on the instances in the cluster.
Initializing: The cluster is being initialized. The root user is initialized in this stage.
Running: The cluster is up and running.
Exception: A cluster enters the Exception state when management nodes are deleted or stopped, or the scheduler is logged off. You can try to restore the cluster. If the cluster fails to be restored, submit a ticket.
Releasing: The cluster is being shut down and will be released.
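Automation that creates a cluster usually has to wait until the cluster reaches the Running state and stop early if the cluster enters a state that needs attention. The following is a minimal sketch of such a wait loop. It assumes that you supply your own `get_status` callable that returns one of the state names listed above. How that callable obtains the state is outside the scope of this sketch, and all names in the code are hypothetical.

```python
import time
from enum import Enum
from typing import Callable


class ClusterStatus(Enum):
    CREATING = "Creating"
    UNINITIALIZED = "Uninitialized"
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    EXCEPTION = "Exception"
    RELEASING = "Releasing"


# States in which the cluster is still being set up.
STARTUP_STATES = frozenset({
    ClusterStatus.CREATING,
    ClusterStatus.UNINITIALIZED,
    ClusterStatus.INITIALIZING,
})


def wait_until_running(
    get_status: Callable[[], str],
    timeout_s: float = 1800.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status until the cluster is Running.

    Returns True when the cluster is Running and False on timeout.
    Raises RuntimeError if the cluster is in the Exception or Releasing state.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = ClusterStatus(get_status())
        if status is ClusterStatus.RUNNING:
            return True
        if status not in STARTUP_STATES:
            raise RuntimeError(
                f"cluster is in the {status.value} state and needs attention"
            )
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` parameter exists so that the loop can be exercised in tests without real delays. The default timeout and polling interval are arbitrary placeholders, not recommended values.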