NVIDIA: LLM cluster storage I/O behavior (measured data)
三道杠No.1  2024-10-11 17:51   published in China

Summary: This article presents measured data from an NVIDIA H100 training cluster. It examines the storage system's I/O and bandwidth characteristics during the three main phases of actual training: the initial file read, the compute (training) phase, and checkpoint writing. It also discusses how checkpoint write bandwidth can be scaled with asynchronous writes.

Introduction

NVIDIA: LLM cluster storage I/O behavior - Fig. 1

  • Since the ChatGPT chatbot appeared at the end of 2022, generative AI has become widely known.

  • Large language model (LLM) applications are at the core of these capabilities.

  • Models have become larger and more complex over the past few years.

  • LLM training has compute, storage, and I/O patterns closer to HPC than to single-node ML/AI or inference workloads.

  • It requires a large-scale platform with high-performance storage.

  • It must be integrated into a secure environment while maintaining scalability.

The figure on the right shows the growth trend of AI model training computing power (PFLOPs) from 2012 to 2020.

NVIDIA: LLM cluster storage I/O behavior - Fig. 2

Training compute platform (a reduced version of the Eos DGX AI supercomputer)

  • H100 SuperPOD deployment

  • Ranked 9th on the November 2023 TOP500 list (121.40 PFLOP/s)

  • 576 NVIDIA DGX systems, each with eight H100 GPUs (the tests below used only part of this compute; see the quick tally after this list)

  • Quantum-2 NDR InfiniBand network with separate compute (8-rail) and storage (2-rail) fabrics
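As a quick tally of scale (plain arithmetic on the figures above, not additional data from the presentation), a minimal sketch:

```python
# Total GPU count of the test platform, derived from the numbers above.
dgx_systems = 576
gpus_per_system = 8
print(f"total GPUs = {dgx_systems * gpus_per_system}")  # 4608 H100 GPUs
```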

Storage system

NVIDIA: LLM cluster storage I/O behavior - Fig. 3

The main features of the DDN EXAScaler Lustre storage platform include:

  1. Hardware configuration:

    • 12 DDN AI400X2 storage appliances, each with 24 NVMe drives for high-speed data access.

    • Each appliance is connected via eight 200 Gb/s InfiniBand links to ensure high-bandwidth network connectivity.

  2. Storage architecture:

    • A customized and optimized Lustre file system (version 2.14.0).

    • Distributed Namespace (DNE) with automatic round-robin placement is used to manage the metadata targets (MDTs).

    • Object storage targets (OSTs) use Progressive File Layout (PFL) striping by default to optimize data distribution.

  3. Performance characteristics:

    • Peak bandwidth of roughly 1 TB/s, i.e. very high data transfer capability.

    • System performance exceeds the requirements of these experiments and is not a bottleneck.

  4. Advanced features:

    • Supports PCC-RO (read-only client-side caching), which can be enabled per Slurm job using the user ID.

    • Provides project quotas for resource allocation and control.

    • Supports sub-directory mounts for better dataset management and access control.

    • Kerberos authentication is integrated to improve overall security.

NVIDIA: LLM cluster storage I/O behavior - Fig. 4

Large-scale language model (LLM) training uses the Megatron-LM framework. The main points include:

  1. Open-source framework: the Megatron-LM open-source framework developed by NVIDIA is used to train and run the LLM models.

  2. Model size growth:

    • The previous model (LUG 2021) had 13 billion parameters.

    • The new model is greatly expanded to 340 billion parameters, with a dataset size of 8T.

  3. Research focus: training workloads, with particular attention to how I/O performance changes in large-scale model training.

  4. Parallelization strategy:

    • A combination of tensor parallelism, pipeline parallelism, and data parallelism is used.

    • This multi-dimensional parallelization allows model training to scale to 10,000 GPUs (see the sketch after this list).

  5. Performance and scalability: by combining multiple parallelism techniques, the system can handle ultra-large models and demonstrates training capability at extreme scale.

  6. I/O evolution: by comparing the new and old model configurations, the study examines how I/O patterns and performance change as model scale grows.
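As a rough illustration of how these parallelism dimensions multiply out to a GPU count of that order, here is a minimal sketch; the TP/PP/DP factors below are hypothetical and are not taken from the presentation:

```python
# Hypothetical decomposition: world size = tensor-parallel (TP) x
# pipeline-parallel (PP) x data-parallel (DP) degrees.
tp, pp, dp = 8, 16, 96        # illustrative values only
world_size = tp * pp * dp
print(f"TP x PP x DP = {tp} x {pp} x {dp} = {world_size} GPUs")  # 12288 GPUs, i.e. ~10k scale
# Note: as discussed later in the article, checkpoint writes are concentrated
# on the model-parallel ranks, so the number of checkpoint writers scales with
# model parallelism (TP/PP) rather than with DP in this model version.
```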

Load analysis

NVIDIA: LLM cluster storage I/O behavior - Fig. 5

Load characteristics:

The upper plot shows the GPU compute load (including training and checkpoint writing); the lower plot shows the storage read/write bandwidth during training, which is analyzed in detail below.

NVIDIA: LLM cluster storage I/O behavior - Fig. 6

I/O behavior characteristics (three phases)

  • Initial read phase - occurs only once

  • Compute phase - iterative GPU processing

  • Checkpoint write phase - every N compute iterations

The three key phases and their characteristics:

  1. Initial read phase:

    • Executed only once, at the start of training.

    • The chart shows a short but significant read peak of about 100 GB/s.

  2. Compute phase:

    • Iterative GPU processing.

    • The chart shows a long, stable period with minimal I/O activity.

  3. Checkpoint write phase:

    • Executed after every N compute iterations.

    • The lower plot shows periodic write activity with a peak of about 75 GB/s.

    • Buffered I/O is used to manage these writes.

NVIDIA: LLM cluster storage I/O behavior - Fig. 7

A closer look at read I/O behavior after the initial read phase:

  • The read rate during the compute phase is very low: about 3 MB/s.

  • I/O sizes are small: < 4 KB.

  • As the number of nodes increases, the total read volume increases linearly.

NVIDIA: LLM cluster storage I/O behavior - Fig. 8

A closer look at write I/O behavior during the checkpoint write phase.

During the write phase, GPU utilization is extremely low; the GPUs are idle, waiting for the checkpoint to complete.

NVIDIA: LLM cluster storage I/O behavior - Fig. 9

  • The checkpoint size is 4.3 TiB.

  • Checkpoints are written by only a few nodes (the model-parallel ranks); peak of 75 GB/s (approximately 12 nodes).

  • Each client writes at about 6 GB/s (against an available write capacity of 93 GB/s).

  • The checkpoint takes about 90 seconds.

In this model version, peak I/O increases with model parallelism rather than with the number of nodes.
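A quick sanity check on these figures (plain arithmetic on the numbers reported above, not additional measurement data):

```python
# Baseline checkpoint: 4.3 TiB written in ~90 s by ~12 nodes at ~6 GB/s each.
ckpt_bytes = 4.3 * 2**40                      # 4.3 TiB
avg_bw = ckpt_bytes / 90 / 1e9                # average bandwidth over 90 s
print(f"average write bandwidth ~ {avg_bw:.0f} GB/s")        # ~53 GB/s
print(f"aggregate of 12 writers at 6 GB/s = {12 * 6} GB/s")   # ~72 GB/s, near the 75 GB/s peak
```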

    Asynchronous checkpoint write

NVIDIA: LLM cluster storage I/O behavior - Fig. 10

Parallel checkpoint writing improves efficiency.

  • By simulating checkpoints manually, a fully parallel version of the workload's checkpoint write I/O can be replayed.

  • The example below shows 10 checkpoints on 48 nodes.

  • Each checkpoint lasts 16 seconds, with a peak of 275 GB/s (approximately a 4x speedup).

With this approach, peak I/O increases with both model size and node count; further optimization potential lies in asynchronous checkpointing.
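A similar check for the parallel replay; the last line assumes each replayed checkpoint carries the same 4.3 TiB as the baseline, which the slides do not state explicitly:

```python
# Parallel replay: peak 275 GB/s vs. the 75 GB/s baseline peak.
print(f"peak bandwidth ratio ~ {275 / 75:.1f}x")  # ~3.7x, matching the 'approximately 4x' above
# Assumption: each replayed checkpoint is the same 4.3 TiB as the baseline.
print(f"implied average bandwidth ~ {4.3 * 2**40 / 16 / 1e9:.0f} GB/s over 16 s")  # ~296 GB/s
```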

This parallel checkpoint approach has several important implications for LLM training and storage system design:

  1. Checkpoint time is significantly reduced, which can greatly improve overall training efficiency.

  2. The storage system must support higher instantaneous I/O bandwidth to meet the demands of parallel writes.

  3. The parallelization strategy spreads the I/O load from a few nodes across many nodes.

  4. The potential of asynchronous checkpoints implies further optimization is possible, by reducing the interference of checkpointing with computation.

  5. Storage system design must account for more frequent, shorter, but higher-intensity I/O peaks.

How to implement parallel/asynchronous checkpointing?

Implementing parallel/asynchronous checkpoint writing in AI training is an important performance optimization. It can significantly reduce I/O wait time during training and improve overall training efficiency. The key steps and considerations are listed below (see the sketch after this list):

  1. Asynchronous write mechanism:

    • Use asynchronous I/O operations to write checkpoint data to the storage system.

    • Use a separate thread or process to handle checkpoint writes, so the main training loop is not blocked.

  2. Data buffering:

    • Keep an in-memory buffer that temporarily holds the checkpoint data.

    • Trigger the asynchronous write when the buffer reaches a certain size or after a given time interval.

  3. Parallel writes:

    • Split a large checkpoint file into multiple smaller blocks.

    • Use multithreading or a distributed approach to write these blocks in parallel.

  4. Compression:

    • Compress checkpoint data before writing to reduce the I/O load.

    • Choose an algorithm that balances compression ratio and CPU overhead, such as LZ4 or Snappy.
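To make the steps above concrete, here is a minimal, dependency-free Python sketch of the pattern, not the implementation used in the presentation. The function name save_checkpoint_async and the checkpoints/step_N/*.bin layout are hypothetical, zlib stands in for LZ4/Snappy, and a real training framework would first copy GPU tensors to host memory before flushing:

```python
import os
import pickle
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor


def _write_shard(path, shard, compress=True):
    """Serialize one shard and write it to `path` with buffered I/O."""
    blob = pickle.dumps(shard, protocol=pickle.HIGHEST_PROTOCOL)
    if compress:
        # zlib keeps the example dependency-free; LZ4 or Snappy would trade
        # a lower compression ratio for much lower CPU overhead.
        blob = zlib.compress(blob, level=1)
    with open(path, "wb") as f:
        f.write(blob)


def save_checkpoint_async(state_shards, step, out_dir="checkpoints", workers=8):
    """Snapshot `state_shards` (dict: shard name -> bytes-like object) in host
    memory, then write all shards in parallel on a background thread so the
    training loop is not blocked."""
    ckpt_dir = os.path.join(out_dir, f"step_{step}")
    os.makedirs(ckpt_dir, exist_ok=True)

    # 1) Buffer: take an in-memory copy so training can immediately overwrite
    #    the live state (with GPU tensors this would be a device-to-host copy).
    snapshot = {name: bytes(data) for name, data in state_shards.items()}

    def _flush():
        # 2) Parallel write: one file per shard, written by a thread pool.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(_write_shard, os.path.join(ckpt_dir, f"{name}.bin"), shard)
                for name, shard in snapshot.items()
            ]
            for fut in futures:
                fut.result()  # surface any I/O error instead of losing it

    # 3) Asynchronous write: hand the flush off to a background thread.
    writer = threading.Thread(target=_flush, daemon=True)
    writer.start()
    return writer  # caller can join() before the next checkpoint or at exit


if __name__ == "__main__":
    # Toy usage: four fake 1 MiB shards.
    shards = {f"shard_{r}": os.urandom(1 << 20) for r in range(4)}
    writer = save_checkpoint_async(shards, step=100)
    # ... training would continue here ...
    writer.join()  # ensure the checkpoint is on disk before reusing the buffers
```

The point of this structure is that the training loop only pays for the in-memory snapshot; serialization, compression, and the parallel file writes all happen off the critical path, which targets exactly the roughly 90-second GPU wait observed in the baseline checkpoint measurements above.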

Summary: I/O behavior during LLM training

NVIDIA: LLM cluster storage I/O behavior - Fig. 11

  1. Read bandwidth during the compute phase is very low:

    • File system (FS) capability should focus on small I/O (or on other workloads).

  2. Checkpoint scaling can produce large write peaks:

    • This places higher demands on the file system.

  3. Future checkpoint improvements may change the I/O pattern again:

    • This depends on the effective concurrency of the CPUs and the network fabric.

--- [End of article] ---



Public account: Wang Zhiyu, focusing on data storage, cloud computing trends, and product solutions.

The slides are taken from the LUG24 presentation materials of Aurelien Beaumont and Nathan Dauchy, senior engineers at NVIDIA.


