What kind of storage architecture is the best choice in the AI big model era?
Anonymous published 2024-06-28 17:45

At the release conference for new AI storage products for the big model era, one critical sentence stood out: "a single set of storage covers the entire AI process." What is the AI process, and why does a single set of storage need to cover all of it?

Generally, the AI process includes four stages: data acquisition, data preprocessing, model training and evaluation, and model deployment and application. Each stage involves storing and accessing massive amounts of data. At present, most customers build the IT systems for these stages in a siloed, chimney-style fashion: the data collection and preprocessing stage, the model training stage, and the inference stage each have their own independent storage clusters, and data must be shuttled between them at every hand-off. We believe this construction mode will encounter unprecedented challenges in the AI big model era.

Development trends of AI models

As artificial intelligence technology develops, AI models are expected to gain stronger emergence and generalization abilities, along with more accurate language understanding and reasoning, moving toward cognitive intelligence. Three development trends stand out:

First, the number of model parameters continues to grow exponentially, from 100-billion-parameter "large models" to trillion-parameter "large models".

Second, large models have moved from single-modal to multimodal, and will move toward full-modal in the future. Training data sets will grow from roughly 3 TB for an NLP model to around 40 TB for a multimodal model, and full-modal training will reach the PB level.

Third, the demand for computing power is growing far faster than the computing power of a single GPU card, so large model training clusters will keep getting larger.

Future challenges for AI development platforms

From the perspective of the full AI process, the above trends will bring the following challenges:

First, as AI big model training data sets grow, today's mainstream storage architecture of shared storage plus local SSDs can no longer meet the development requirements of large models.

Second, under the chimney construction mode of separate clusters for raw data storage, data preprocessing, and AI model training, frequent PB-level data migration will become the biggest drag on large model production efficiency.

Third, larger AI clusters further shorten the mean time between system failures, and the resulting higher-frequency checkpoints put enormous write bandwidth pressure on storage.

The total amount and quality of data determine the ceiling of AI models. Data preparation efficiency and data flow efficiency across the entire process will become the core factors in the end-to-end production cost of AI models.

Key technical requirements of AI big model business

Choosing a storage system that can keep pace with the rapid development of AI models is crucial to improving large model production efficiency and reducing large model TCO.

What kind of storage architecture is the best choice in the AI big model era? I believe it needs all five of the following key features at the same time.

The first key feature: the storage system has both a high-performance tier and a large-capacity tier, presents a single unified namespace, and provides full data lifecycle management. First, a placement policy can be specified when data is first written: in the data acquisition phase, newly acquired data that must be processed soon can be placed directly in the performance tier, while data that is not needed soon, or is collected for long-term archiving, can be written directly to the capacity tier. Second, a variety of tiering policies can be configured, for example policies that combine access frequency with time, or policies triggered by capacity watermarks. Third, according to the configured policy, data migrates automatically between the performance tier and the capacity tier, and this migration is completely transparent to business applications. Finally, data that has already been tiered down to the capacity tier can be warmed up for a specified dataset via commands or APIs, accelerating the cold start of scheduled tasks.
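To make these lifecycle rules concrete, here is a minimal sketch of such policies in Python. It models the behavior only; the tier names, fields, and thresholds are assumptions for illustration, not a real product API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical policy model combining access recency and data age;
# a real system would expose this as management-plane configuration.
@dataclass
class TieringPolicy:
    hot_window: timedelta    # data accessed within this window stays hot
    min_cold_age: timedelta  # demote to the capacity tier only after this age

def initial_tier(needed_soon: bool) -> str:
    """Placement on first write: performance tier if the data will be
    processed shortly, capacity tier for archives."""
    return "performance" if needed_soon else "capacity"

def current_tier(last_access: datetime, created: datetime,
                 policy: TieringPolicy, now: Optional[datetime] = None) -> str:
    """Tier chosen by a policy combining access frequency and time."""
    now = now or datetime.now()
    if now - last_access <= policy.hot_window:
        return "performance"   # recently used: keep (or warm back) hot
    if now - created >= policy.min_cold_age:
        return "capacity"      # old and cold: demote, transparently to apps
    return "performance"

policy = TieringPolicy(hot_window=timedelta(days=7),
                       min_cold_age=timedelta(days=30))
print(initial_tier(needed_soon=False))                                   # capacity
print(current_tier(datetime(2024, 1, 1), datetime(2023, 1, 1), policy))  # capacity
```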

The second key feature: one set of storage carries the entire AI process and supports the protocols required by the full AI tool chain, including NAS, big data (HDFS), object, and parallel client access. The semantics of each protocol must be lossless, matching the ecosystem compatibility of the native protocol. In addition, all these protocols share the same underlying storage space, and each uses a thin-provisioning allocation mechanism, so storage space can be allocated dynamically and quickly at every AI stage.
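For illustration, this is what lossless multi-protocol access looks like from the client side. The endpoint, bucket, and mount point below are hypothetical, and we assume the system exposes one namespace over both S3 and NFS:

```python
import boto3

# Hypothetical S3 endpoint; the same namespace is assumed to be
# NFS-mounted at /mnt/ai-data on this host.
s3 = boto3.client("s3", endpoint_url="https://storage.example.com")

# Write once through the object interface...
s3.put_object(Bucket="ai-data", Key="raw/sample.jsonl",
              Body=b'{"text": "hello"}\n')

# ...and read the very same bytes through the POSIX/NFS interface,
# with no copy or format conversion in between.
with open("/mnt/ai-data/raw/sample.jsonl", "rb") as f:
    assert f.read() == b'{"text": "hello"}\n'
```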

The third key feature: efficient data flow for collaboration across the AI process. Within each stage, tool chains from different protocol ecosystems see the same data and metadata; across stages, collaboration requires zero data copies and zero format conversion, so the output of one stage can be consumed directly as the input of the next, achieving zero-wait hand-offs between AI stages.
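Under a unified namespace, the hand-off between stages can literally be just a path. A toy Python sketch, with the mount point and the "cleaning" step both invented for the example:

```python
from pathlib import Path

SHARED = Path("/mnt/ai-data")  # hypothetical unified-namespace mount

def preprocess(raw_dir: Path) -> Path:
    """Write cleaned samples into the shared namespace; return the output path."""
    out = SHARED / "preprocessed"
    out.mkdir(exist_ok=True)
    for src in raw_dir.glob("*.jsonl"):
        (out / src.name).write_text(src.read_text().lower())  # toy "cleaning"
    return out

def train(dataset_dir: Path) -> None:
    # The trainer consumes the preprocessing output in place: the hand-off
    # is just a path in the shared namespace, not a PB-scale copy job.
    for sample in dataset_dir.glob("*.jsonl"):
        _ = sample.read_bytes()  # stand-in for the real data loader

train(preprocess(SHARED / "raw"))
```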

The fourth key feature: horizontal scaling to thousands of nodes. The system should adopt a fully symmetric architecture without dedicated metadata service nodes, so that as storage nodes are added, both system bandwidth and metadata performance grow linearly. The shuffle phase of AI training requires efficient listing of hundreds of millions of files; the system must hold hundreds of millions of training set files, and training set version management is implemented by frequently creating new hard links for every file.
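As an illustration of hard-link-based version management, here is a minimal Python sketch; the directory layout and naming scheme are invented for the example:

```python
import os
from pathlib import Path

def snapshot_dataset(dataset_dir: Path, version: str) -> Path:
    """Create a copy-free 'version' of a training set by hard-linking every
    file into a new directory: both names point at the same inode, so the
    snapshot costs metadata operations only, no data movement."""
    snap = dataset_dir.parent / f"{dataset_dir.name}@{version}"
    snap.mkdir()
    for f in dataset_dir.iterdir():
        if f.is_file():
            os.link(f, snap / f.name)   # hard link, not a data copy
    return snap

# e.g. snapshot_dataset(Path("/mnt/ai-data/trainset"), "v2")
# With hundreds of millions of files, one snapshot becomes hundreds of
# millions of metadata operations, which the storage must absorb quickly.
```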

The fifth key feature: one system with one set of parameters that can carry a high-performance, dynamically mixed workload. In the data import phase, large and small files are written concurrently. In the data preprocessing phase, large and small files are read and processed in batches, generating huge numbers of small files. In the model training phase, large and small files are read randomly in batches, and when a checkpoint is generated, the system must sustain high-bandwidth writes. In the model deployment phase, even when the same model file is read concurrently, aggregate cluster throughput should still scale linearly as the number of serving devices grows.
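To see why checkpoints dominate the write path, a quick back-of-envelope calculation helps. All numbers below are illustrative assumptions, not measurements of any particular system:

```python
# Rough estimate of the write bandwidth a checkpoint demands.
params = 1e12            # a 1-trillion-parameter model
bytes_per_param = 16     # assumed: fp16 weights + fp32 master copy
                         # + optimizer moments, as in common mixed-precision setups
checkpoint_bytes = params * bytes_per_param   # ~16 TB per checkpoint

stall_budget_s = 300     # suppose the checkpoint must land within 5 minutes
                         # so the GPU cluster is not left idle
required_bw = checkpoint_bytes / stall_budget_s

print(f"checkpoint size:   {checkpoint_bytes / 1e12:.0f} TB")
print(f"required write bw: {required_bw / 1e9:.0f} GB/s")   # ~53 GB/s
```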

Summary

Based on a storage system with all of the above features, we can build an AI-native data lake storage platform for AI big models: all data that needs fast processing stays in the high-performance tier, and no stage of the AI process requires data migration any more. This greatly improves the preparation efficiency of AI training data and the GPU utilization of AI computing clusters, significantly reduces GPU investment and data preprocessing labor costs, shortens the development cycle of AI big models, and cuts electricity costs. Based on data storage with an AI-native architecture, we preliminarily estimate that the end-to-end TCO of a 100-billion-parameter-class large model can be reduced by more than 10%.

For a storage system, performing well under one or a few scenario I/O models is not difficult; performing well under all the I/O models generated by the full AI big model tool chain is rare. As for a storage system that also has all five key features above, Huawei OceanStor Pacific is currently the only choice in the world, reflecting the hard strength accumulated through more than ten years of continuous investment in distributed file systems.
