Insights into AI Storage from Oak Ridge National Laboratory
伏羲Dai  2024-08-26 16:09   published in China

Supercomputing is one of the computing industry's most important tools for scientific exploration. Its development not only indicates a country's or region's technological competitiveness, but also guides the development of other digital systems worldwide.

Supercomputing and AI computing are converging. The integration of AI models and AI computing into supercomputing is giving rise to a new wave of industry transformations. This raises the question: Do we need to build new, independent storage systems for AI foundation models?

The renowned Oak Ridge National Laboratory (ORNL) in Oak Ridge, Tennessee, United States thinks that we do.

ORNL has released its plans to build next-generation data centers by 2027. In these plans, the laboratory highlights the importance of having independent AI-optimized storage (AOS), in addition to parallel file systems (PFSs) used in conventional high-performance computing (HPC) scenarios, when introducing foundation models with billions and even tens of billions of parameters. It also provides detailed definitions of and specifications for relevant concepts.

Why does this information matter and how will it affect the sustainable development of the computing and storage industries?


Supercomputing at ORNL, the Acme of Science

The blockbuster Oppenheimer directed by Christopher Nolan offered an insight into the Manhattan Project. In fact, the Manhattan Project had much greater influence than what we saw in the film.

As part of the Manhattan Project, ORNL was founded in 1943 and is now sponsored by the United States Department of Energy. It is the largest multi-program science and technology laboratory in the United States. It aims to tackle the most important scientific challenges and develop cross-generation technologies.

It has been part of many major scientific discoveries, including the development of nuclear reactors in the 1940s, pioneering neutron scattering research, and contributions to the information and technology behind the semiconductor industry.

ORNL is one of the foremost authorities on supercomputing and is renowned for its supercomputing program. Its supercomputer, Frontier, topped the TOP500 list in 2022 with a High-Performance Linpack (HPL) benchmark score of 1.102 exaFLOPS, or 1.102 quintillion floating-point operations per second, making it the world's first exascale supercomputer. It provides more computing power than the next 468 best supercomputers combined. In addition, Frontier has the strongest AI computing capabilities in the world, and as a result, it has been used in fields such as smart transportation and healthcare.

ORNL is continuing to develop supercomputing systems and push the boundaries of AI computing and storage, and its work will undoubtedly inspire and guide the future development and construction of other supercomputing and digital systems around the world.


Defining an AI Storage Foundation

It has long been clear that dedicated computing power for AI is essential, but is it necessary to build dedicated storage systems for AI as well? The jury is still out, but ORNL's position carries considerable weight. As it prepares to build next-generation data centers by 2027 and introduce foundation models, the laboratory has highlighted the importance of constructing AOS in addition to storage systems for conventional computing, which reveals where it stands on this question. Its plans suggest that two I/O storage systems, PFS and AOS, need to be built for conventional supercomputing services and AI services, respectively.

This is likely because, as the number of AI processing tasks increases, we will not only need more powerful computing systems, but upgraded storage systems as well. Therefore, it is essential to develop new storage subsystems for AI service loads.

The two I/O storage systems have notable differences.

A PFS mainly supports a single POSIX file namespace. It is built for large I/O requests and large files, and it focuses on aggregate cluster bandwidth, with comparatively low requirements for small-file creation and read performance.

By contrast, the AI application loads processed by AOS are more complex, and their I/O requests vary in size. AOS performs a significant amount of data-intensive analysis, which requires large amounts of data and metadata to be randomly read and written. As a result, AOS needs to deliver tens of millions of IOPS and OPS, as well as bandwidth of 10 TB/s or higher for high-speed sequential reads and writes.
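To see why operation rate matters as much as bandwidth, consider a rough bottleneck estimate. The sketch below is only an illustration under stated assumptions: the function name, the file counts, and the "one operation per file" model are hypothetical and are not taken from ORNL's specification.

```python
def bottleneck_read_seconds(num_files, file_size_bytes, iops, bandwidth_bytes_per_s):
    """Rough lower bound on the time to read a set of equally sized files.

    Assumption: each file costs exactly one I/O operation, and operations
    overlap perfectly with data transfer, so the slower of the two limits
    determines the total time.
    """
    op_seconds = num_files / iops
    transfer_seconds = num_files * file_size_bytes / bandwidth_bytes_per_s
    return max(op_seconds, transfer_seconds)


TEN_TB_PER_S = 10e12  # the 10 TB/s bandwidth figure quoted in the text

# Hypothetical small-file workload: one billion 4 KiB samples.
small_slow = bottleneck_read_seconds(1_000_000_000, 4096, 1_000_000, TEN_TB_PER_S)
small_fast = bottleneck_read_seconds(1_000_000_000, 4096, 10_000_000, TEN_TB_PER_S)

# Hypothetical large-file workload: one thousand 1 TB files.
large = bottleneck_read_seconds(1_000, 10**12, 1_000_000, TEN_TB_PER_S)

print(f"small files at 1M IOPS:  {small_slow:.1f} s")
print(f"small files at 10M IOPS: {small_fast:.1f} s")
print(f"large files at 1M IOPS:  {large:.1f} s")
```

Under these assumptions, the small-file case is limited by operations per second rather than by bandwidth, while the large-file case is limited by bandwidth. That is the distinction the text draws between PFS-style and AOS-style workloads.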

AI service loads will increase the requirements for storage performance and PFSs will not be able to meet their needs alone. Therefore, storage performance needs to be significantly enhanced in order to improve the utilization of AI computing power and boost the efficiency of model training.

In addition, compute nodes may experience faults every few hours or days on average when working on AI tasks, so it is paramount to be able to frequently pause and resume model training. Model data and window data from each stage need to be periodically saved. Therefore, AI tasks require a larger storage capacity and higher efficiency than regular supercomputing tasks. A PFS cannot manage this on its own.
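The storage cost of frequent pause-and-resume can be estimated with a back-of-the-envelope calculation. The sketch below is illustrative only: the 14 bytes per parameter (2-byte weights plus 12 bytes of fp32 optimizer state for an Adam-style optimizer), the 100 GB/s write bandwidth, and the 4-hour mean time between failures are all assumptions, not figures from ORNL. The interval formula is Young's first-order approximation, a common rule of thumb rather than anything the source prescribes.

```python
import math


def checkpoint_size_bytes(num_params, bytes_per_param=14):
    """Estimate checkpoint size. bytes_per_param=14 is an assumption:
    2-byte weights + fp32 master weights and two Adam moments (12 bytes)."""
    return num_params * bytes_per_param


def checkpoint_write_seconds(size_bytes, write_bandwidth_bytes_per_s):
    """Time to write one checkpoint at a sustained write bandwidth."""
    return size_bytes / write_bandwidth_bytes_per_s


def young_interval_seconds(checkpoint_seconds, mtbf_seconds):
    """Young's approximation of the checkpoint interval that minimizes
    lost work: sqrt(2 * checkpoint_cost * mean_time_between_failures)."""
    return math.sqrt(2.0 * checkpoint_seconds * mtbf_seconds)


# Hypothetical scenario: 10-billion-parameter model, 100 GB/s storage,
# one failure every 4 hours on average.
size = checkpoint_size_bytes(10_000_000_000)
write_s = checkpoint_write_seconds(size, 100e9)
interval_s = young_interval_seconds(write_s, 4 * 3600)

print(f"checkpoint size: {size / 1e9:.0f} GB")
print(f"write time:      {write_s:.2f} s")
print(f"suggested interval between checkpoints: {interval_s:.0f} s")
```

The faster the storage can absorb a checkpoint, the shorter the interval can be, and the less training progress is lost when a node fails.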

It is also important to have access to any file on any node during AI tasks to ensure consistent performance. Therefore, shared storage is essential.

Adding an AOS is particularly useful because it supports efficient parallel data transmission between itself and underlying file systems and ensures cross-layer file scheduling.

AOS requires much higher storage reliability to protect valuable data assets. Because AI training is widely deployed in distributed mode, data availability and task continuity must be ensured even when a single point of failure (SPOF) occurs. AOS is most useful in this case because it supports cross-node erasure coding (EC). By contrast, a conventional PFS only supports intra-node EC, which carries a risk of data loss or damage to data integrity when a node breaks down. ORNL has also specified the data reconstruction speed after a fault in its definition of AOS. Therefore, a PFS is insufficient on its own and an AOS needs to be developed as well.
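The idea behind cross-node EC can be shown with the simplest possible code: a single XOR parity shard stored on a separate node. This is a deliberately simplified sketch, not the scheme any particular AOS uses. Production systems typically use Reed-Solomon-style codes with several parity shards, and the node names and shard contents here are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equally sized shards."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(data_shards):
    """Compute one parity shard over equally sized data shards."""
    return reduce(xor_bytes, data_shards)


def recover_missing(surviving_shards):
    """Rebuild the single missing shard by XOR-ing every surviving shard
    (remaining data shards plus the parity shard)."""
    return reduce(xor_bytes, surviving_shards)


# Each shard lives on a different (hypothetical) node.
nodes = {"node0": b"AAAA", "node1": b"BBBB", "node2": b"CCCC"}
nodes["node3"] = encode_parity(list(nodes.values()))  # parity node

# Simulate losing an entire node, then rebuild its shard from the others.
lost = nodes.pop("node1")
rebuilt = recover_missing(list(nodes.values()))
assert rebuilt == lost
```

Because the shards and parity sit on different nodes, losing any one node leaves enough information to reconstruct its data. If all shards and parity sat inside a single node, that node's failure would take the data with it.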

AOS is also capable of cleaning and processing data locally, including removing sensitive information, filtering private information, transcoding, and deduplication, which simplifies data preparation before training and improves the overall efficiency of AI tasks.
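As a minimal sketch of what such storage-side cleaning could look like, the example below redacts email addresses and drops exact duplicates by content hash. It is an illustration under stated assumptions: the function names and the simple email pattern are hypothetical, and real pipelines would need far broader privacy rules and possibly near-duplicate detection.

```python
import hashlib
import re

# Simplified email pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[REDACTED]", text)


def clean_records(records):
    """Redact each record, then keep only the first copy of each distinct
    redacted record, using a SHA-256 digest as the content fingerprint."""
    seen = set()
    cleaned = []
    for record in records:
        redacted = redact_emails(record)
        digest = hashlib.sha256(redacted.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier record
        seen.add(digest)
        cleaned.append(redacted)
    return cleaned


samples = ["contact a@b.com", "contact a@b.com", "hello"]
print(clean_records(samples))
```

Running this kind of filtering close to the data means compute nodes receive smaller, already-sanitized inputs.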

To sum up, AI foundation models require both dedicated computing power and storage systems. As PFSs will not be able to meet new AI service needs alone, it is essential to develop new storage systems. ORNL's ongoing work is fine-tuning the definition of AI storage and making new breakthroughs that will advance the concept of AI storage across the entire industry.


Historic Breakthroughs in Storage Development

ORNL's work represents a significant breakthrough that affects a wide range of fields and sets the direction of development for the entire storage industry.

The whole industry may come to agree that AI requires both dedicated computing and dedicated storage. We believe that AI storage will become the backbone of the storage industry in the foundation model era.

Supercomputing will be the first to benefit from AI storage. In the previous section we explored why AI storage is important for supercomputing. To improve the utilization of AI computing power, storage performance needs to be significantly improved by building dedicated external storage. For example, preprocessing some data at the storage layer before the intensive computing of AI foundation models can reduce the overhead ratio between computing and communications and save AI computing power. Ultimately, storage is a tool that can make supercomputing systems more advanced and autonomous.

Fields other than supercomputing will also adopt AI storage. The pervasive rise of AI foundation models requires all industries to review whether their storage can adapt to AI models and computing systems, because storage upgrades that create a virtuous cycle with computing and AI are key to intelligent development.

These insights are of great significance to the development of China's storage industry.


Storage, Key to Success in the AI Era

Storage is a prerequisite for and the pillar of the development of AI foundation models in many industries, where the sustainable growth of real economies is increasingly data-dependent. AI presents the most cost-effective opportunity to upgrade the storage industry, and these upgrades will in turn promote further multi-faceted AI development.

High-throughput, reliable, shareable, and large-capacity storage systems are the key to intelligent economic and industrial development.

There are three ways to upgrade storage.

1. Expanding storage capacity and increasing advanced storage

As AI foundation models expand to supercomputing and large-scale digitalization, more enterprises are turning to local AI training and data storage. The overall storage capacity and the proportion of all-flash storage solutions need to be increased to meet the requirements for intelligent development.

2. Innovating storage technologies to cope with more complex data in the AI era

AI brings a series of challenges, such as complex data and diverse application processes. For example, during the construction of a data lake, data collection from multiple data centers and service systems is slow and complex, and it may involve inefficient cross-service data switchover. Both problems degrade storage performance. Protocol interoperability, cross-domain data scheduling, cross-system visualized data management, and other innovations can improve storage performance to overcome these challenges.

3. Improving storage security and O&M capabilities to ensure smooth AI development

In addition to complex data, AI foundation models come with new security risks and increasingly complex O&M for storage. This highlights the urgency of proactive security and automatic O&M for data storage to ensure the healthy development of AI systems.

With these upgrades, AI storage will gain great momentum. AI computing power means productivity, and AI storage will become the key to unlocking that productivity and driving industry intelligence.

To sum up, to advance the industry and encourage technology development, it is important to first seek breakthroughs and understand trends. Though there is not yet one set definition for or consensus on AI storage, ORNL's future plans confirm that AI storage will play a critical role in the future, especially with the rise of AI foundation models.

Detailed requirements regarding the definition, thresholds, and development specifications of AI storage will also be fleshed out over time, signaling that storage upgrades are inevitable in the AI foundation model era.

Other signs of the importance of AI storage can be seen in top-level labs' exploration of this topic, the storage industry's years of development towards autonomy, and the feedback from AI industry professionals.

There is no doubt that significant value will be seen if we seize the opportunity and proactively prepare for a brighter future with AI storage.
