High-performance computing: how to choose among RoCE v2, TCP/IP, and InfiniBand networks?
北冥有鱼  2024-08-06 18:40  Published in China

Source: Architect Technology Alliance


A high-performance computing network platform addresses a common problem in geophysical HPC: GPU-based programs must call the InfiniBand (IB) stack, while applications written against the traditional TCP/IP stack cannot support high-performance network communication.

The RoCE v2 architecture is gradually being accepted by customers (see: Detailed Explanation of RoCE Network Technology and RoCE Network Technology and Implementation); its ecosystem and applications continue to mature, and its transmission efficiency and reliability keep improving. Running on RoCE v2 also reduces CPU consumption on the host.


HPC refers to the use of aggregated computing power to handle data-intensive computing tasks that standard workstations cannot complete, such as the simulation, modeling, and rendering required in exploration work. When tackling computational problems we often run into two situations: the volume of computation is so large that a general-purpose computer cannot finish the work in a reasonable time, or the required data volume is so large and the available resources so limited that the computation cannot be performed at all.

HPC overcomes these limitations by using specialized or high-end hardware, or by aggregating the computing power of multiple units. Data and operations are then distributed across those units, which requires introducing parallelism.

Different types of modeling problems exhibit different levels of parallelism. Take parametric sweeps as an example: the goal is to solve many similar models whose geometry, boundary conditions, or material properties are independent, so the models can be computed almost entirely in parallel. A typical implementation assigns each model configuration to its own computing unit. Problems of this type are so well suited to parallel computing that they are usually called "embarrassingly parallel" problems, and they are not particularly sensitive to network speed and latency within a cluster. (In other cases, a network that is not fast enough to handle the communication efficiently is likely to slow the computation down.) Ordinary commodity hardware can therefore be connected together to speed up such problems, as the sketch below illustrates.
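As a rough illustration of what an embarrassingly parallel parametric sweep looks like, the sketch below distributes parameter cases across MPI ranks that compute independently and only communicate in a final reduction. The model function, parameter range, and case count are hypothetical placeholders, not taken from any real exploration workload.

```c
/* Illustrative sketch of an embarrassingly parallel parametric sweep
 * with MPI: each rank evaluates its own subset of parameter values
 * independently, so almost no inter-node communication is needed. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical per-parameter simulation; stands in for a real solver. */
static double run_model(double parameter) {
    return parameter * parameter;   /* placeholder computation */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int total_cases = 1000;   /* total parameter settings (example) */
    double local_sum = 0.0;

    /* Static round-robin distribution: rank r handles cases r, r+size, ... */
    for (int i = rank; i < total_cases; i += size)
        local_sum += run_model((double)i / total_cases);

    /* The only communication is a final reduction of the results. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("aggregate result: %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

Because each case is independent, the network only matters for startup and the final reduction, which is why such workloads tolerate commodity interconnects well.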

In traditional networks, as access bandwidth grows, the protocol stack consumes more and more CPU. HPC networks therefore usually adopt RDMA technology to cut the CPU consumed by the protocol stack on compute nodes and to reduce network transmission latency.

RDMA allows data to be transferred directly between the memory of two servers (see: Explaining RDMA Architecture and Technical Principles in Detail, Advantages and Practices of High-Performance RDMA Networks, and A Comprehensive Analysis of RDMA) without involving the CPU of either server (hence "zero-copy networking"), achieving more efficient communication. The transfer is carried out by an RDMA-capable Network Interface Card (NIC) and bypasses the protocol stack, which accelerates data movement. Data lands directly in the remote memory of the target server, freeing the server's CPU and I/O for other processing, as the sketch below illustrates.
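To make this concrete, here is a minimal sketch of a one-sided RDMA WRITE using the libibverbs API: the application registers a local buffer and posts a work request, and the NIC moves the data into the peer's memory without the remote CPU participating. It assumes a queue pair that has already been created and connected (for example via rdma_cm) and that the remote address and rkey were exchanged out of band; the function name and arguments are illustrative.

```c
/* Minimal sketch of a one-sided RDMA WRITE with libibverbs.
 * Assumes `qp` is an already-connected queue pair and that the peer's
 * buffer address and rkey were exchanged out of band (e.g. over TCP). */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_pd *pd, struct ibv_qp *qp,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    /* Register the local buffer so the NIC can DMA from it directly
     * (this is what enables zero-copy transfers). */
    struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided operation */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* peer's registered buffer */
    wr.wr.rdma.rkey        = rkey;               /* peer's remote access key */

    /* The NIC performs the transfer; the remote host's CPU is not involved. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The caller would later poll the completion queue to learn when the write has finished; error handling and memory-region cleanup are omitted for brevity.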

The traditional IB switching architecture (see: InfiniBand Architecture and Technical Practice, Research on the Design of InfiniBand High-Speed Interconnection Networks, and What Is the Difference with 200G HDR InfiniBand?) uses RDMA to give HPC a high-performance, low-latency network platform with the lowest forwarding latency in the industry. However, InfiniBand switches have their own independent architecture and protocols (the IB protocols and specifications):

1. They must interconnect with devices that support the IB protocol.
2. The InfiniBand ecosystem is relatively closed and difficult to replace.
3. Connecting an InfiniBand fabric to a traditional network requires a separate gateway.

For the many applications in an overall HPC platform that are not absolutely latency-sensitive, carrying them on expensive IB switch ports needlessly raises the enterprise's computing, maintenance, and management costs and restricts expansion of the overall HPC system. Judging from the industry's Ethernet trajectory of 10G/25G/40G/100G bandwidth growth, as computing scale keeps expanding, many networks originally built on IB need to grow in bandwidth, media, port density, and more. For HPC applications that do not demand absolute low latency, Ethernet is the preferred replacement for the original IB switches to reduce costs.

The RoCE specification implements RDMA over Ethernet, and RoCE requires a lossless network. Its main advantage is low latency, which improves network utilization; at the same time it avoids extra data copies and uses hardware offload, so CPU utilization stays low.
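One host-side piece of building that lossless path is marking RoCE traffic so that switches can map it to the PFC-protected priority. The sketch below, assuming the connection is set up through librdmacm, sets the IP ToS field on an rdma_cm identifier before connecting; the DSCP value 26 is only an example and must match the switch-side queue mapping.

```c
/* Sketch: mark RoCE v2 traffic with a ToS/DSCP value so the fabric can
 * steer it into the lossless (PFC-enabled) traffic class.
 * Assumes `id` is an rdma_cm identifier created with rdma_create_id()
 * and not yet connected; DSCP 26 is an example value only. */
#include <rdma/rdma_cma.h>
#include <stdint.h>

int mark_roce_traffic(struct rdma_cm_id *id)
{
    uint8_t tos = 26 << 2;   /* DSCP 26 shifted into the ToS byte */

    /* Subsequent packets on this connection carry the chosen ToS,
     * which the switches can map to a PFC-protected queue. */
    return rdma_set_option(id, RDMA_OPTION_ID, RDMA_OPTION_ID_TOS,
                           &tos, sizeof(tos));
}
```

The switch side still needs PFC (and typically ECN) configured on the matching priority; this call only ensures the traffic is classifiable.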


The newer RoCE v2 standard allows RDMA traffic to be routed across Layer 3 Ethernet networks. The RoCE v2 specification replaces the InfiniBand network layer with an IP header and a UDP header on top of the Ethernet link layer, so RoCE packets can be routed by conventional IP routers.

RoCE v1 protocol: carries RDMA directly over Ethernet and can only be deployed in a Layer 2 network. Its packet format prepends a Layer 2 Ethernet header to the original IB packet and identifies RoCE packets with Ethertype 0x8915.

RoCE v2 protocol: carries RDMA over UDP/IP and can be deployed in a Layer 3 network. Its packet format wraps the original IB payload in a UDP header, an IP header, and a Layer 2 Ethernet header, and identifies RoCE packets by UDP destination port 4791. RoCE v2 also supports hashing on the UDP source port together with ECMP for load sharing, which improves network utilization. The header sketch below makes this encapsulation concrete.
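The following is a rough C sketch of the on-wire header stack of a RoCE v2 packet. The struct names and field groupings are schematic illustrations, not taken from any official header file.

```c
/* Illustrative layout of a RoCE v2 packet on the wire:
 * Ethernet / IPv4 / UDP (dst port 4791) / IB Base Transport Header / payload.
 * Field names and groupings are schematic, not from an official header. */
#include <stdint.h>

#define ROCEV2_UDP_DPORT 4791      /* identifies RoCE v2 traffic */
#define ROCEV1_ETHERTYPE 0x8915    /* RoCE v1 is identified at Layer 2 */

#pragma pack(push, 1)
struct eth_hdr  { uint8_t dst[6], src[6]; uint16_t ethertype; };
struct ipv4_hdr { uint8_t ver_ihl, tos; uint16_t tot_len, id, frag;
                  uint8_t ttl, proto; uint16_t csum;
                  uint32_t saddr, daddr; };
struct udp_hdr  { uint16_t sport;   /* varied per flow -> feeds ECMP hashing */
                  uint16_t dport;   /* 4791 for RoCE v2 */
                  uint16_t len, csum; };
struct ib_bth   { uint8_t opcode, flags; uint16_t pkey;
                  uint32_t dest_qp;      /* high byte reserved */
                  uint32_t psn; };       /* high bit is the ack-request flag */

struct rocev2_packet {
    struct eth_hdr  eth;   /* Layer 2 link header */
    struct ipv4_hdr ip;    /* replaces the InfiniBand network layer */
    struct udp_hdr  udp;   /* dport 4791; sport used for ECMP load sharing */
    struct ib_bth   bth;   /* InfiniBand transport layer, unchanged */
    /* RDMA extended headers and payload follow */
};
#pragma pack(pop)
```

Because only the outer Ethernet/IP/UDP headers differ from native IB, routers treat RoCE v2 as ordinary UDP traffic while the IB transport semantics are preserved end to end.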

With this innovation, the industry can meet the growing demand within enterprises for high performance and scale-out architectures. RoCE v2 preserves continuity along the convergence path and supports high-density data centers. It also provides a fast migration path for IB-based applications and reduces development effort, improving the efficiency of deploying and migrating applications.

 


Mainstream domestic network vendors such as Huawei, Inspur, and H3C all support RoCE network solutions. Taking Inspur as an example, a typical solution uses the CN12000 as the core switch and builds three networks: a computing network, a management network, and a storage network. This delivers high density and high forwarding performance, works with the hosts to implement the key RDMA technologies, and smoothly migrates high-performance applications developed on the IB protocol to a lower-cost Ethernet switching network.

Support from high-performance network products greatly simplifies the network architecture and reduces the latency introduced by multiple architectural tiers, providing strong support for smoothly upgrading the access bandwidth of key compute nodes. With RoCE v2 as the core standard, support for RoCE v2 and DCE/DCB on the compute nodes eliminates the complexity and extra work of program migration and reduces the CPU that hosts spend in the protocol stack.

With PFC, RoCE, and related technologies supported in the core network, the high-performance computing network becomes more open, and the construction cost of the entire high-performance cluster platform is reduced without sacrificing computing efficiency.

 
