Title: Storage Requirements for AI
Speaker: John Cardente, Member of Technical Staff, Dell Technologies
Meeting: SNIA Compute, Memory, and Storage Summit
Date: May 21, 2024
Deep Learning I/O (DLIO) Benchmark
https://github.com/argonne-lcf/dlio_benchmark
The boom in AI has created enormous demand for GPUs, making it essential to extract as much performance from them as possible.
GPUs are essential to AI
Modern deep learning models require millions of vector operations. To make AI computationally feasible, these operations must be parallelized. GPUs are designed to perform parallel vector operations quickly and cost-effectively; they are what make AI economically viable.
GPUs are expensive and scarce
Major companies are racing to build AI data centers that house hundreds or even thousands of GPUs. Demand for GPUs continues to outstrip supply, making them expensive and hard to obtain.
Maximize GPU utilization
-----
With the rapid growth of AI, many companies are building AI data centers to deliver more advanced product capabilities and simplify operations. Modern deep learning models require millions of vector operations, and completing these calculations in a reasonable time demands parallel processing. GPUs are designed to perform parallel vector operations extremely quickly and cost-effectively; it is this technology that makes advanced AI both computationally and economically feasible.
Amid the race to build large-scale AI data centers, GPU supply has fallen short of demand, resulting in scarcity and high procurement costs. Operators must therefore make the best possible use of their GPU resources, and maximizing GPU utilization has become a primary goal of AI data center design.
Maximizing GPU utilization requires a balanced end-to-end data center architecture that avoids performance bottlenecks. While attention often goes to high-performance GPU servers and the east-west network needed for large-scale GPU communication, storage also plays a critical role.
Data Preparation
Key tasks:
Provide scalable, high-performance storage for converting data into AI-ready formats
Protect valuable raw and derived training datasets
Key features:
Store large structured and unstructured datasets in a variety of formats
Scale under the load of MapReduce-style distributed processing commonly used to convert AI data
Support file and object access protocols to simplify integration
Training and fine-tuning
Key tasks:
Deliver training data fast enough to fully utilize expensive GPUs
Save and restore model checkpoints to protect the training investment
Key features:
Sustain the read bandwidth needed to keep training GPUs busy
Minimize the time needed to save checkpoint data in order to reduce training pauses
Scale out to meet the needs of data-parallel training on large clusters
Inference
Key tasks:
Store model files securely and deliver them quickly to inference services
Supply data for batch inference
Key features:
Reliably store costly model files
Minimize model-file read latency for rapid inference deployment
Sustain the read bandwidth needed to keep inference GPUs busy
-----
Just as data is the "fuel" of AI, storage runs through the entire AI lifecycle. It begins in the data preparation phase, where raw data is converted, typically with a distributed processing framework, into formats suitable for AI model training and batch inference. Storing and efficiently accessing both unstructured and structured data in open data formats is essential. Supporting file and object access to the same data avoids extra copies, which matters especially when mixing analytics and AI software stacks that support different storage protocols.
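To make this conversion step concrete, here is a minimal sketch that packs raw image files into sharded TFRecord files, one common AI-friendly format. The directory paths, feature names, and shard size are illustrative assumptions, not details from the talk.

```python
# Illustrative only: convert raw JPEG files into sharded TFRecord files.
# Paths, feature names, and shard size are hypothetical.
import glob
import os
import tensorflow as tf

RAW_DIR = "/data/raw/images"         # assumed location of raw images
OUT_DIR = "/data/prepared/tfrecord"  # assumed output location
SAMPLES_PER_SHARD = 1024

def to_example(image_bytes, label):
    # Wrap one sample (encoded image + integer label) as a tf.train.Example.
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

def convert(files, labels):
    os.makedirs(OUT_DIR, exist_ok=True)
    writer, shard = None, 0
    for i, (path, label) in enumerate(zip(files, labels)):
        if i % SAMPLES_PER_SHARD == 0:      # start a new shard file
            if writer:
                writer.close()
            writer = tf.io.TFRecordWriter(
                os.path.join(OUT_DIR, f"train-{shard:05d}.tfrecord"))
            shard += 1
        with open(path, "rb") as f:
            writer.write(to_example(f.read(), label).SerializeToString())
    if writer:
        writer.close()

if __name__ == "__main__":
    files = sorted(glob.glob(os.path.join(RAW_DIR, "*.jpg")))
    convert(files, labels=[0] * len(files))  # placeholder labels
```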
Once properly converted, the data is used to train or fine-tune AI models. Training can demand substantial read bandwidth to keep GPUs supplied with training data, and over runs lasting days or weeks, high-performance storage and write bandwidth may also be needed to save model checkpoints quickly. Understanding these performance requirements is the focus of today's discussion.
After training, the AI model is used to generate inferences from data it has never seen before. Deploying an inference service requires reading model files from storage quickly, and batch inference may require significant read bandwidth to keep GPUs supplied with data.
To interpret these requirements, it helps to understand the basics of AI model training. A training dataset is a collection of samples, each containing a set of model inputs called features and the associated expected, ground-truth inference output. The dataset may be stored as a single file or a group of files. Training data is read from storage and packed into randomized batches containing a relatively small number of samples. These batches are run through the model to produce inference outputs, which are compared with the ground truth to compute a loss score reflecting how well the model is performing. The model's weights (also called parameters) are then updated using the loss score to improve accuracy. This process repeats until the model converges to a stable, accurate state, which may require multiple passes over the training data; each pass is called an epoch. For smaller training datasets, the data may be cached in server memory after the first epoch. During training, the AI model's state is periodically saved, or checkpointed, to guard against failures.
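As a reference point, here is a minimal sketch of that loop in PyTorch; the model, dataset, and hyperparameters are placeholders chosen only to illustrate the sequence of steps described above.

```python
# Minimal sketch of the training loop described above (PyTorch; all names illustrative).
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 10,000 samples with 512 features and a scalar target.
features = torch.randn(10_000, 512)
targets = torch.randn(10_000, 1)
loader = DataLoader(TensorDataset(features, targets), batch_size=256, shuffle=True)

model = torch.nn.Linear(512, 1)                      # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for epoch in range(5):                               # each pass over the data is one epoch
    for batch_features, batch_targets in loader:     # random batches read from storage (or cache)
        outputs = model(batch_features)              # forward pass produces inference outputs
        loss = loss_fn(outputs, batch_targets)       # compare with ground truth -> loss score
        optimizer.zero_grad()
        loss.backward()                              # compute gradients from the loss
        optimizer.step()                             # update the weights (parameters)
    # Periodically checkpoint model and optimizer state to guard against failures.
    torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()},
               f"checkpoint_epoch{epoch}.pt")
```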
(1) Run an AI training benchmark designed to saturate GPU utilization. (2) Extract the measured performance, expressed as training samples processed per second. MLCommons's MLPerf Training benchmark suite is ideal for this purpose: it covers a variety of AI models that saturate GPU utilization, and results from multiple submitters are publicly available.
-----
However, there is very little published information about the storage read performance required for training. If the goal is to keep GPUs busy, a reasonable approach is to use GPU benchmarks to determine the peak performance of various models and then work backward to the storage read performance needed to sustain those workloads. The MLPerf Training benchmark is an ideal choice: it is designed to keep GPU utilization high and achieve optimal performance while training a range of popular AI models representing different use cases, and results from multiple submitters are publicly available for analysis.
The table below shows MLPerf Training 3.0 results submitted for the NVIDIA H100 80GB GPU, the highest-performing GPU at the time. The first three columns list the models, their sizes in parameters, and details of the associated training datasets. The fourth column gives the size in bytes of each individual training sample fed to the model. The fifth and sixth columns specify the number of GPUs used and the achieved throughput in training samples per second. Results for 8 GPUs are shown to represent a typical high-end AI server configuration; for GPT-3, results for 32 GPUs are shown because a single model instance requires that many GPUs. The last column estimates the storage read bandwidth required to sustain the corresponding GPU throughput. This reflects the storage read performance needed whenever training data is read from storage: during the first epoch and, if the dataset is too large to fit in memory, during subsequent epochs as well.
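The last column follows directly from the two before it: the read bandwidth needed to keep the GPUs fed is roughly the training throughput multiplied by the sample size. A minimal sketch of the arithmetic, using illustrative numbers rather than the exact figures from the table:

```python
# Rough estimate: storage read bandwidth needed to sustain a given training throughput.
def required_read_bandwidth(samples_per_sec: float, sample_size_bytes: float) -> float:
    """Bytes per second of training data the storage must deliver."""
    return samples_per_sec * sample_size_bytes

# Illustrative values in the spirit of the ResNet-50 discussion below (not exact MLPerf numbers):
bw = required_read_bandwidth(samples_per_sec=40_700, sample_size_bytes=150e3)
print(f"~{bw / 1e9:.1f} GB/s")  # roughly 6.1 GB/s
```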
The estimated storage read bandwidth varies enormously across these models. GPT-3, for example, needs only about 150 MB/s to keep 32 H100 GPUs highly utilized: its computation per sample is very large, while its tokenized text samples are relatively small, only about 8 KB each. Conversely, training 3D U-Net on 8 H100 GPUs may require more than 40 GB/s to keep them highly utilized: its computation per sample is comparatively modest, but its training images are very large, roughly 92 MB each, so far more read bandwidth is needed to keep pace with the GPUs. ResNet-50 falls in between, requiring about 6.1 GB/s to fully utilize 8 H100 GPUs. These examples illustrate the trade-off between a model's computational complexity and its training sample size when estimating requirements; note that large models do not necessarily require large storage read bandwidth.
Storage system performance often depends on how data is accessed. As an example, we used the DLIO benchmark to emulate the I/O access patterns generated while training the ResNet-50 image classification model. DLIO is an AI storage benchmark that simulates GPU computation while preserving every other key aspect of AI model training, including the use of a real deep learning framework data loader and real training data file formats. It is designed to generate intense AI workloads for measuring storage performance without requiring actual GPUs. In this case, the training dataset consists of a set of TFRecord files, each containing 124 image tensors of roughly 150 KB, and the training data is read through TensorFlow's data loader.
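For reference, here is a minimal sketch of a TensorFlow tf.data pipeline reading TFRecord training data, similar in spirit to the access pattern DLIO emulates in this case. This is not DLIO's own code; the file paths and the record schema are assumptions.

```python
# Illustrative TensorFlow input pipeline over TFRecord files (not DLIO itself).
import tensorflow as tf

files = tf.data.Dataset.list_files("/data/train/*.tfrecord", shuffle=True)  # assumed path

def parse(record):
    # Assumed schema: one serialized image tensor plus a label per record.
    spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(record, spec)
    return example["image"], example["label"]

dataset = (
    tf.data.TFRecordDataset(files)        # sequential reads within each file
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=10_000)          # randomize samples into batches
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)           # overlap storage reads with GPU compute
)

for images, labels in dataset.take(1):
    print(images.shape, labels.shape)
```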
We collected I/O traces while reading the training data from an NFS share. The upper-left chart shows a continuous stream of 64 KB to 256 KB reads of training data during training. The upper-right chart shows that the I/O size distribution does not change significantly as the number of training samples per batch changes. The lower-left chart shows that the training files are read sequentially, one after another. Based on the earlier estimate for this workload, an AI storage system may therefore need to deliver 6.1 GB/s of read bandwidth as streams of sequential 64 KB to 256 KB I/Os to fully utilize the GPUs when training ResNet-50. Other models, however, may use different I/O sizes and exhibit more random access patterns, so AI storage systems need to perform well across a variety of access patterns.
Reading the same ResNet-50 training data over S3 produces very different I/O characteristics. The upper-left chart shows that much larger 20 MB to 50 MB I/Os are used to read the training data. The upper-right chart again shows that the I/O size distribution does not change with batch size. The lower-left chart shows that files are still accessed sequentially. How this markedly different access pattern affects performance depends on the specific object store and S3 client library. Although both the NFS and S3 cases are throughput-oriented, latency may be an additional consideration with S3, because training cannot start or continue until these large I/Os complete.
Another consideration when choosing between NFS and S3 is the benefit of the operating system page cache, which matters especially when multiple models are trained on a single server and each model instance reads the same training data. When training data is accessed over NFS, the OS page cache can satisfy repeated I/O requests from different models, avoiding additional reads from storage. Because S3 bypasses the OS page cache and has no server-side cache, repeated reads may be sent to storage, increasing the read performance it must deliver. This example shows that the storage protocol used to retrieve training data is another important consideration when selecting an AI storage solution.
Checkpoints contain the learned model weights and optimizer state.
Checkpoints may be saved as one or more files, depending on the model's parallelism and checkpoint implementation.
Each checkpoint file is written sequentially by a single writer.
With data parallelism, only one model instance needs to be saved, not the memory of every GPU.
-----
Now let's turn to checkpoints. Training a large AI model can take days or weeks. During that time the model's weights change as training data is processed, and because of the compute invested in producing them, their value grows over time. Checkpointing guards against losing this valuable state by periodically saving the current model weights and other state to persistent storage. Checkpoints are usually saved as one or more files, each written sequentially by a single writer; the number of files depends on the model's parallelism and on how the specific model implements checkpointing. With data parallelism, only one model instance needs to be saved rather than the memory contents of every GPU in the system. Because training is typically paused while a checkpoint is taken, saving checkpoints quickly is a key storage performance requirement.
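As a rough way to see the storage cost of that pause, the sketch below times a PyTorch checkpoint save and reports the effective write bandwidth; the model and the checkpoint path are placeholders.

```python
# Time a checkpoint save and report effective write bandwidth (illustrative only).
import os
import time
import torch

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.Linear(4096, 4096))
optimizer = torch.optim.Adam(model.parameters())

state = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
path = "/mnt/checkpoints/ckpt.pt"  # assumed checkpoint location on shared storage

start = time.time()
torch.save(state, path)            # each checkpoint file is written sequentially by one writer
elapsed = time.time() - start

size_bytes = os.path.getsize(path)
print(f"{size_bytes / 1e9:.2f} GB written in {elapsed:.1f} s "
      f"-> {size_bytes / elapsed / 1e9:.2f} GB/s effective write bandwidth")
```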
The table below estimates the total write bandwidth required to complete checkpoints of different sizes within different time limits. The first column lists model sizes in number of parameters. The second column estimates the total checkpoint size, including model weights and optimizer state, using a rule of thumb of 14 bytes per parameter. The remaining columns give the estimated total write bandwidth required to finish the checkpoint within various time limits, expressed as percentages of an assumed 2-hour checkpoint interval. For example, a 175-billion-parameter model produces about 2.4 TB of checkpoint data; saving it within 360 seconds, or 5% of the 2-hour checkpoint interval, requires roughly 6.8 GB/s of total write bandwidth. Achieving that leaves 95% of each two-hour window available for model training.
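The arithmetic behind that 175-billion-parameter row is straightforward; a minimal sketch, using the 14-bytes-per-parameter rule of thumb quoted above:

```python
# Reproduce the 175B-parameter example above with the 14 bytes/parameter rule of thumb.
params = 175e9
bytes_per_param = 14                                    # weights + optimizer state (rule of thumb)
checkpoint_bytes = params * bytes_per_param             # ~2.45e12 bytes, roughly 2.4 TB

checkpoint_interval_s = 2 * 3600                        # assumed 2-hour checkpoint interval
time_budget_s = 0.05 * checkpoint_interval_s            # 5% of the interval = 360 s

required_write_bw = checkpoint_bytes / time_budget_s    # bytes per second
print(f"{required_write_bw / 1e9:.1f} GB/s")            # ~6.8 GB/s
```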
These estimates make clear that storage write bandwidth requirements vary significantly with model size and time limit. Large models have substantial write bandwidth needs under tight checkpoint time limits, while the requirements drop considerably for smaller models or looser limits. Understanding these factors is essential when weighing the cost of storage solutions against the cost of idle GPUs.
Reinitialize the weights and optimizer state in each GPU's memory from the corresponding checkpoint file.
Checkpoint files are usually read sequentially.
With model parallelism, a single checkpoint file may be used to restore multiple GPUs.
The number of readers depends on the degree of data parallelism, i.e., on how many data-parallel model instances must be restored.
-----
Checkpointing allows model training to resume after planned or unplanned interruptions. Doing so requires reloading checkpoint data onto every GPU involved in training. Each checkpoint file is read sequentially by one or more readers, depending on the number of model instances being restored. For example, the diagram on the left shows one checkpoint file being loaded onto three GPUs, each holding the same part of the model; the other checkpoint files are likewise each read three times to restore the remaining GPU state. Checkpoint files are typically read in parallel, which adds a requirement for concurrent sequential read streams across multiple files. Because training cannot resume until the checkpoint has been fully restored to all GPUs, we want this process to complete as quickly as possible.
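For completeness, here is a minimal PyTorch-style sketch of restoring a saved checkpoint into a model and optimizer before training resumes; the path and the structure of the checkpoint dictionary are the same illustrative placeholders used earlier.

```python
# Illustrative checkpoint restore before resuming training (path and names are placeholders).
import torch

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.Linear(4096, 4096))
optimizer = torch.optim.Adam(model.parameters())

# Each model instance reads the checkpoint file sequentially from shared storage.
checkpoint = torch.load("/mnt/checkpoints/ckpt.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
# Training resumes only after every participating GPU has been restored.
```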
The table below estimates the total storage read bandwidth required to restore checkpoints of different sizes to different numbers of model instances within 5 minutes. Note that the 5-minute limit is only an example; the actual recovery time limit depends on how often restores are expected and on the expectations of the AI engineers. As before, the first two columns show the model sizes and their associated checkpoint sizes, and the remaining columns show the estimated total read bandwidth. For example, restoring a 175-billion-parameter model to 16 model instances within 5 minutes, which means reading the 2.4 TB of checkpoint data 16 times, requires a read bandwidth of 2.18 GB/s. Once again, the requirements vary widely with model size, training parallelism, and the allowed checkpoint recovery time.
Modern GPU clusters may contain thousands of servers and tens of thousands of GPUs.
MLOps platforms assign and run jobs across the cluster through distributed scheduling.
Jobs must be able to access training data, checkpoint data, and other data regardless of which servers they are assigned to.
In such environments, multiple storage-intensive workloads, such as data preparation, training, and checkpointing, typically run simultaneously.
-----
So far, our discussion has focused on individual workloads. In practice, modern GPU clusters may host a large number of AI workloads at different stages of the AI lifecycle, with a distributed scheduler assigning jobs to servers across the cluster. To simplify job deployment and avoid unnecessary data copies, data is generally assumed to be equally accessible no matter where a job lands. This means the AI storage system must meet the performance requirements of many concurrent workloads with different access patterns, all served from a single namespace, and it must scale as the business needs of the GPU cluster grow.
Read training data
Accommodate the wide variation in read bandwidth requirements and I/O access patterns across AI models.
Deliver substantial read bandwidth to a single GPU server for the most demanding models.
Use high-performance all-flash storage to meet these needs.
For the most demanding requirements, prefer storage protocols that support RDMA.
Save checkpoints
Provide high-bandwidth sequential writes to save checkpoints quickly.
Handle multiple large sequential write streams to individual files, possibly within the same directory.
Understand the checkpoint implementation details and behavior of the expected AI workloads.
Determine the time limits within which checkpoints must complete.
Restore checkpoints
Provide high-bandwidth sequential reads to restore checkpoints quickly.
Handle multiple large sequential read streams from the same checkpoint file.
Understand how frequently checkpoint recovery is expected.
Determine acceptable recovery time limits.
Serve GPU clusters
Meet the performance requirements of mixed storage workloads from many simultaneous AI jobs.
Scale storage capacity and performance as business requirements grow.
-----
In short, storage is involved in every stage of the AI lifecycle, and requirements vary greatly across lifecycle stages, AI model types, and the expectations of AI infrastructure users. Concurrent sequential and random read performance is critical for feeding training data to GPUs and for restoring checkpoints, while single-writer sequential write performance is critical for saving model checkpoints quickly. Storage for modern AI GPU clusters must handle mixed workloads from a single namespace and scale capacity and performance as needed. Choosing the right AI storage solution requires a deep understanding of the expected workloads and service expectations, so that requirements can be met while controlling acquisition cost and total cost of ownership.
Data protection: guard data against corruption or loss.
High availability: keep the system running and serving data even when faults occur.
Compression and deduplication: reduce storage space usage and eliminate redundant data copies to improve storage efficiency.
Encryption at rest: encrypt stored data to keep it secure.
Multi-protocol data access: support multiple access protocols for flexible data sharing.
Remote and hybrid cloud replication: copy data to remote or hybrid cloud environments for disaster recovery.
Security and controls: establish and enforce security policies and controls to protect data.
Long-term archival storage: retain data long-term for future analysis or compliance requirements.
-----
Finally, this talk has focused mainly on performance, which is where many AI storage conversations concentrate. But AI also needs traditional enterprise storage capabilities, such as data protection, high availability, encryption, data security, and data lifecycle management; once performance requirements are met, these features become especially important. As I said at the beginning, data is the "fuel" of AI, and protecting and managing that data is essential. AI storage systems must deliver not only high performance but also comprehensive data management and protection.
Source: Andy730 public account