High-Performance Computing Practices Based on the Kunpeng Processor
BINGO S  2024-08-16 10:08  Published in China


 

For more information, see "Practice of Domestic High-Performance Computing Clusters Based on the Kunpeng Processor".

China has made remarkable progress in the field of high-performance computing and now occupies a leading position. Tianhe-2 and Sunway TaihuLight successively topped the Top500 international supercomputing rankings, and Chinese teams won the Gordon Bell Prize for HPC applications in 2016 and 2017. Under the 14th Five-Year Plan, China will continue to pursue a strategy of domestic substitution in high-performance computing, and the new generation of exascale supercomputers will adopt domestic processors.

However, domestic universities have so far not adopted domestic processors in their university-level computing platforms, for three main reasons: ① computing platforms based on domestic processors differ considerably in usage from today's X86 CPU clusters, and users are not accustomed to them; ② mainstream computing software is currently developed for X86 processors and must be recompiled and adapted for domestic processors; ③ much computing software has not yet been optimized for domestic processors, so running speed cannot be guaranteed.

To address these three challenges of user habits, application deployment, and running speed, we carried out the following work:

1) Different computing devices are integrated through a unified parallel file system and job scheduling system, providing users with a consistent experience;

2) Containers are used to quickly deploy high-performance computing applications on the ARM cluster, providing users with pre-compiled software in the form of modules and images;

3) The correctness of the pre-compiled application software is verified and its performance is tuned.

This article introduces the university-level computing platform that Shanghai Jiao Tong University built with the Huawei Kunpeng 920 processor, the first computing cluster based on a domestic ARM processor (hereinafter, the ARM cluster) built by a domestic university. The ARM cluster shares a parallel file system with the existing X86 CPU cluster and GPU cluster, interconnected at high speed over an InfiniBand network, realizing the concept of a unified data base.

There are two innovations in our work:

1) For computing clusters with multiple heterogeneous processors and heterogeneous interconnects, a new network topology scheme is proposed that allows these clusters to share the same parallel file system;

2) We are the first to complete correctness verification and performance optimization of several common high-performance computing applications on a Huawei ARM cluster, greatly advancing the software ecosystem of domestic high-performance computing platforms.

The university-level high-performance computing platform of Shanghai Jiao Tong University was established in 2013 with its first phase of construction. It adopted a hybrid computing architecture of Intel Xeon processors, Xeon Phi coprocessors, and NVIDIA GPU accelerators, and its computing power ranked 138th on the Top500 list of November 2013. In 2019, the university began the second phase of construction, building homogeneous clusters based on Intel Xeon processors for high-performance computing and heterogeneous clusters based on NVIDIA GPU accelerators for artificial intelligence computing. In 2020, the university built its third computing cluster, which uses the Huawei Kunpeng 920 processor.

1. Background introduction

1.1 Compute nodes

The ARM cluster is configured with a total of 100 compute nodes. Each node is equipped with two 64-core Kunpeng 920 processors (128 cores per node) and 192 GB of DDR4-2933 memory. The Huawei Kunpeng 920 is fabricated on a 7 nm process and based on the ARMv8 microarchitecture. See Table 1 for detailed specifications.

 

[Table 1: Kunpeng 920 parameter specifications]

Compared with the mainstream Intel Xeon 6248 processor, the Huawei Kunpeng 920 has more cores and memory channels, providing higher concurrency and memory access bandwidth. However, its vector width is 1/4 that of Intel's mainstream processors. Given these characteristics, the Kunpeng 920 is better suited to memory-intensive applications.

1.2 High-speed interconnect

The Intel CPU cluster in Shanghai Jiao Tong University's university-level computing platform uses an Intel 100 Gbps Omni-Path high-speed interconnect, while the GPU cluster and ARM cluster use Mellanox 100 Gbps InfiniBand EDR. As the two mainstream high-speed interconnects, both communication protocols provide a switched fabric composed of point-to-point bidirectional serial links between processor nodes, and between processor nodes and storage nodes.

1.3 File system

The university-level computing platform adopts the Lustre parallel file system, an object-based parallel file system with high availability, high performance, and high scalability. It provides a POSIX-compliant unified file system interface for large-scale computing clusters. Lustre runs on the Linux operating system with a client-server network architecture: the server side consists of a group of servers providing metadata services and object storage services, while the client provides the access interface through which the Lustre file system is mounted. All Lustre nodes are interconnected via the LNet high-speed network protocol.
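As an illustration of what a correctly mounted client looks like, the minimal sketch below lists Lustre mounts on a node by reading /proc/mounts and reports per-target usage with the standard lfs utility. It is a generic example, not part of the platform's actual tooling.

```python
import subprocess

def lustre_mounts():
    """Return (mount_point, device) pairs for Lustre mounts on this client.

    Reads /proc/mounts, where Lustre client mounts appear with
    filesystem type "lustre" (e.g. "10.0.2.1@o2ib:/lustre /lustre lustre ...").
    """
    mounts = []
    with open("/proc/mounts") as f:
        for line in f:
            device, mount_point, fstype = line.split()[:3]
            if fstype == "lustre":
                mounts.append((mount_point, device))
    return mounts

if __name__ == "__main__":
    for mount_point, device in lustre_mounts():
        print(f"Lustre mount {mount_point} served by {device}")
        # "lfs df" is the standard Lustre utility for per-OST/MDT usage.
        subprocess.run(["lfs", "df", "-h", mount_point], check=False)
```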

1.4 Job scheduling system

The university-level computing platform runs the CentOS 7.6 operating system, on which the SLURM job scheduling system is deployed. SLURM is an open-source, fault-tolerant, highly scalable cluster management and job scheduling system. As a cluster workload manager, it has three key functions: ① it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work; ② it provides a framework for starting, executing, and monitoring work (typically parallel jobs) on the allocated set of nodes; ③ it arbitrates contention for resources by managing a queue of pending work.
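As a minimal sketch of how a user might drive SLURM from Python, the example below writes a batch script and submits it with sbatch. The partition name "arm", the module name, and the binary name "lmp" are hypothetical placeholders, not this platform's actual configuration.

```python
import subprocess
import tempfile

# A minimal SLURM batch script; "arm" and the module/binary names are
# hypothetical placeholders. 128 tasks per node matches the 128 cores
# per Kunpeng node described above.
JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=lammps-test
#SBATCH --partition=arm
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
module load lammps           # pre-compiled module provided by the platform
srun lmp -in in.lj           # run LAMMPS across all allocated tasks
"""

def submit(script_text):
    """Write the batch script to a file and submit it with sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".slurm", delete=False) as f:
        f.write(script_text)
        path = f.name
    result = subprocess.run(["sbatch", path], capture_output=True,
                            text=True, check=True)
    print(result.stdout.strip())   # e.g. "Submitted batch job 12345"

if __name__ == "__main__":
    submit(JOB_SCRIPT)
```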

2. System design

2.1 Network topology design

The overall approach to connecting the ARM cluster to the network is similar to that of the CPU + GPU heterogeneous clusters: all ARM nodes are connected to InfiniBand switches for node-to-node interconnection, and are then bridged to the Omni-Path network through LNet routers, achieving interconnection between the heterogeneous InfiniBand and Omni-Path networks.

[Figure: ARM cluster network topology]

The IB network of the ARM cluster contains five 40-port switches and three routing nodes. Three switches serve as access-layer switches, with half of their ports directly connected to nodes; the remaining two serve as core-layer switches, fully meshed with the access-layer switches. The three access-layer switches connect to the storage cluster through the corresponding routing nodes. Each physical link, between nodes and switches as well as between switches, supports a bandwidth of 200 Gbps. The total communication bandwidth between the access layer and the compute nodes is 10,000 Gbps, while the total bandwidth between the access layer and the core layer is 11,000 Gbps. Because IB switches have built-in routing, data traffic at the access and switching layers is evenly distributed across equivalent links. Under this fat-tree topology, any two nodes therefore always have 100 Gbps of communication bandwidth available.
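A quick sanity check of this provisioning, using only the aggregate figures from the text: if the access layer's uplink capacity is at least the total injection bandwidth of the nodes, no equivalent link becomes a bottleneck and every node pair retains its full 100 Gbps.

```python
# Back-of-the-envelope check of the fat-tree provisioning described above.
# All figures come from the text; per-node injection is EDR (100 Gbps).

NODES = 100
NODE_LINK_GBPS = 100                                 # EDR link per compute node
ACCESS_TO_NODE_GBPS = NODES * NODE_LINK_GBPS         # 10,000 Gbps
ACCESS_TO_CORE_GBPS = 11_000                         # aggregate uplink bandwidth

# Oversubscription ratio at the access layer: uplink / downlink capacity.
# A ratio >= 1 means the core can absorb full injection from every node,
# so any pair of nodes can always obtain the full 100 Gbps.
ratio = ACCESS_TO_CORE_GBPS / ACCESS_TO_NODE_GBPS

print(f"node-to-access aggregate: {ACCESS_TO_NODE_GBPS} Gbps")
print(f"access-to-core aggregate: {ACCESS_TO_CORE_GBPS} Gbps")
print(f"uplink/downlink ratio   : {ratio:.2f}  (>= 1.0 -> non-blocking)")
```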

2.2 Mounting the shared file system

Mounting the Lustre file system on the ARM cluster takes two steps. Step 1: compile and install the Lustre client. The installed Lustre client version must match the server, so an appropriate operating system version must be chosen and the kernel and IB driver specified when compiling the client. In practice, we adopted a CentOS 7.6 system customized for the ARM architecture and compiled and installed Lustre client version 2.12.4. Step 2: configure LNet routing. The three groups of ARM cluster nodes must be assigned different LNet labels (similar to subnets), distinct from those of other clusters such as the storage cluster and the X86 supercomputing cluster. Corresponding LNet routes are then configured on the storage servers, ARM nodes, and routing nodes to bridge the OPA and IB networks. After these two steps, the ARM cluster can successfully mount the Lustre file system, forming a unified data base.
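A minimal sketch of Step 2 on a single node, using the standard lnetctl tool from the Lustre distribution. The network labels and gateway NID below are hypothetical placeholders, not the platform's real addresses.

```python
import subprocess

# Hypothetical LNet identifiers for illustration only: "o2ib0" for the ARM
# cluster's InfiniBand side and "o2ib1" for the storage-facing network; the
# gateway NID is the routing node's address on the local net.
LOCAL_NET = "o2ib0"
REMOTE_NET = "o2ib1"
GATEWAY_NID = "10.0.2.254@o2ib0"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Declare the local LNet network on interface ib0, then add a route that
# reaches the remote net through the LNet router. Both lnetctl subcommands
# are part of the standard Lustre tooling.
run(["lnetctl", "net", "add", "--net", LOCAL_NET, "--if", "ib0"])
run(["lnetctl", "route", "add", "--net", REMOTE_NET, "--gateway", GATEWAY_NID])
```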

3. Performance tuning and verification

To address the problem of slow running speed on the ARM cluster, we chose LAMMPS and GATK as examples for application tuning and verification. In 2020, these two applications accounted for 35% of the total usage time on our university's X86 CPU cluster.

[Figure: LAMMPS benchmark performance on ARM and X86]

Using two standard LAMMPS benchmarks, EAM and LJ, we tested three configurations, ARM, X86, and X86 with the USER-INTEL acceleration package, comparing running speeds (timesteps/s) on 1, 2, 4, 8, and 16 nodes. Both the EAM and LJ cases are 864,000-atom systems run for 5,000 steps in the NVE ensemble. ARM's single-node computing speed is twice that of Intel's mainstream processors (without the USER-INTEL package), and it still maintains a 1.5x advantage when scaled to 16 nodes. When X86 is compiled with the USER-INTEL package, LAMMPS performance on the ARM cluster is about 60% of that on mainstream Intel platforms.
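For readers reproducing such comparisons, the small helper below converts raw throughput (timesteps/s) into the two metrics quoted here, the ARM/X86 speed ratio and parallel efficiency. The input numbers are placeholders, not measurements from this article.

```python
# Helpers for comparing benchmark results like those above. The timesteps/s
# values below are placeholders, not measured data from the article.

def relative_speed(arm_rate, x86_rate):
    """Ratio of ARM to X86 throughput (timesteps/s); >1 means ARM is faster."""
    return arm_rate / x86_rate

def parallel_efficiency(rate_1node, rate_n, nodes):
    """Fraction of ideal linear scaling retained at `nodes` nodes."""
    return rate_n / (rate_1node * nodes)

arm = {1: 10.0, 16: 120.0}    # placeholder throughputs
x86 = {1: 5.0, 16: 80.0}

print(f"1-node ARM/X86 ratio : {relative_speed(arm[1], x86[1]):.2f}")
print(f"16-node ARM/X86 ratio: {relative_speed(arm[16], x86[16]):.2f}")
print(f"ARM efficiency @16   : {parallel_efficiency(arm[1], arm[16], 16):.0%}")
```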

[Figure: GATK 4.2 performance on ARM and X86]

Based on the analysis pipeline and corresponding test data provided by the Broad Institute, we tested the performance of GATK 4.2 on X86 and ARM. Because the HaplotypeCaller module of GATK lacks, on the ARM cluster, the GKL acceleration package (Intel GKL Utils) that Intel developed for X86, its speed decreases significantly. MarkDuplicates and the BQSR-related tools, by contrast, involve no such low-level optimization, and their performance on the ARM cluster is about 70% and 50%, respectively, of that on the X86 cluster.

To address the three challenges users face with the ARM cluster (difficult operation, difficult deployment, and slow running speed), we proposed a new network topology scheme that enables the ARM cluster to share the same parallel file system with the existing X86 clusters, so that users can access their data transparently. In addition, more than 30 commonly used high-performance computing applications were rapidly deployed on the ARM cluster using Singularity, and the two most heavily used applications, LAMMPS and GATK, were optimized and evaluated, reaching 60%-70% of the performance of mainstream X86 clusters. The ARM cluster was put into trial operation in the summer of 2021, during which the average monthly utilization of the whole machine exceeded 70%.
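A minimal sketch of the container-based deployment workflow mentioned above: pull a pre-built image once, then execute the packaged application inside it. The image URI and command are hypothetical examples; the article does not specify its registry layout.

```python
import subprocess

# Illustrative only: the image URI and tool name are hypothetical examples of
# how a pre-built ARM application image might be fetched and run.
IMAGE_URI = "library://hpc/arm64/lammps:latest"   # hypothetical registry path
SIF_FILE = "lammps.sif"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Pull the container image once, then execute the packaged application inside
# it. "singularity pull" and "singularity exec" are standard subcommands.
run(["singularity", "pull", SIF_FILE, IMAGE_URI])
run(["singularity", "exec", SIF_FILE, "lmp", "-in", "in.lj"])
```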

 

Authors: Wang Yichao, Zhang Zhanbing, Hu Chenhao, Zhang Tianyang, Hu Guangchao, Su Xiaoming, Zhang Yifang, Wei Jianwen, Wen Min, Hua Lin Xinhua

Shanghai Jiao Tong University Network Information Center

 

Source: Architect Technology Alliance

 
