AI Large Models: Storage Opportunities and Challenges
Anonymous published at 2024-06-28 17:44

I. Background introduction:

With breakthroughs in AI-related models and technologies such as NLP, the Transformer, GPT, and reinforcement learning, a new round of large-model AI technology revolution has begun. These models have far exceeded the level of ordinary humans in dialogue and knowledge feedback, and may even transform the Internet, industrial manufacturing, government and enterprise services, intelligent customer service, media, and other industries. The first wave of AI large models has arrived, the direction of AI technology is clear, and an even larger wave is about to rush in.

II. AI trends in large models:

In the development of large models, several overall trends have emerged:

The first trend is that data scale keeps growing, from "tens of TB" to "PB";

The second trend is that model parameter counts keep growing, from "million-parameter" traditional models to "hundred-billion-parameter" large models;

The third trend is that model applications are becoming more specialized, from "general-purpose foundation large models" to "large models for thousands of industries".

In the first trend, data scale keeps growing. AI models have developed from single-modal to multi-modal, involving text, audio, video, and other data types, and data volumes have grown to the PB level. A complete AI pipeline is generally divided into four stages: data collection, preprocessing, training, and inference. Data access in each stage requires the following capabilities:

First, multi-protocol convergence and scalability. In the AI large-model pipeline, different types of data call for different access methods: in the data collection stage, text data suits NFS file access while audio and video data suit object access; in the preprocessing stage, data suits HDFS access; and in the training and inference stages, data suits NFS access. Multi-protocol access therefore provides flexible, diverse data access for AI models. In addition, multi-modal massive data requires storage systems with good scalability to keep up with the rapid growth of PB-scale data.
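The stage-to-protocol mapping described above can be sketched as a simple routing table. This is purely illustrative; the stage names and the `protocol_for` helper are hypothetical, not part of any real storage product's API.

```python
# Minimal sketch of the stage-to-protocol routing described in the text.
# Stage names and protocol labels are illustrative assumptions.

STAGE_PROTOCOLS = {
    "collection/text": "nfs",          # text data suits NFS file access
    "collection/audio_video": "object",  # audio/video suit object access
    "preprocessing": "hdfs",           # preprocessing suits HDFS access
    "training": "nfs",
    "inference": "nfs",
}

def protocol_for(stage: str) -> str:
    """Return the access protocol suited to a pipeline stage."""
    try:
        return STAGE_PROTOCOLS[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage}") from None

print(protocol_for("preprocessing"))  # hdfs
```

A converged storage system exposes all of these protocols over the same data, so no copy between stages is needed.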

Second, global data management. In AI large-model training, the raw data comes from different edge sites, different data centers, and online sources at the initial collection stage. Data collection takes a long time (often several weeks) and is difficult to manage. A GFS-like global file system is required to manage massive online and offline data across domains and data centers, allowing different types of data to be tagged, cataloged, and visualized and reducing the difficulty of data management.
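The tagging and cataloging idea can be illustrated with a toy in-memory catalog. The `Catalog` class, dataset names, and site labels below are invented for illustration; a real global file system implements this across data centers at PB scale.

```python
# Toy global catalog: register datasets from different sites with tags,
# then look them up by tag. Illustrative only; names are assumptions.
from collections import defaultdict

class Catalog:
    def __init__(self):
        self._tags = defaultdict(set)   # tag -> set of dataset ids
        self._meta = {}                 # dataset id -> metadata

    def register(self, dataset_id, site, tags):
        """Record a dataset's home site and searchable tags."""
        self._meta[dataset_id] = {"site": site, "tags": set(tags)}
        for t in tags:
            self._tags[t].add(dataset_id)

    def find(self, tag):
        """All dataset ids carrying a given tag, sorted for stable output."""
        return sorted(self._tags.get(tag, ()))

cat = Catalog()
cat.register("speech-2024-06", site="edge-dc-1", tags=["audio", "raw"])
cat.register("web-text-v3", site="central-dc", tags=["text", "cleaned"])
print(cat.find("audio"))  # ['speech-2024-06']
```

Tag-based lookup is what lets training jobs pull the right subsets of cross-domain data without scanning every site.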

In the second trend, model parameter counts keep growing. This raises the requirements for training efficiency. There are two ways to improve training efficiency:

First, expand the cluster scale. Larger clusters increase training concurrency. By one estimate, a typical distributed-training configuration for a hundred-billion-parameter large model requires 1024 GPUs, and a typical configuration for a trillion-parameter large model requires 8192 GPUs. As the cluster grows, two problems arise:
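A back-of-envelope memory calculation shows why these GPU counts are floors rather than luxuries. The constants below are assumptions, not from the article: roughly 16 bytes per parameter for fp16 weights plus gradients and Adam optimizer state, and 80 GB of memory per GPU. Real configurations (the 1024 and 8192 figures above) use far more GPUs than this floor to gain throughput.

```python
# Back-of-envelope memory floor for distributed training.
# Assumptions (not from the article): ~16 bytes/parameter for fp16
# weights + gradients + Adam states, 80 GB of memory per GPU.
BYTES_PER_PARAM = 16
GPU_MEM_BYTES = 80 * 1024**3

def min_gpus(n_params: int) -> int:
    """Smallest GPU count whose combined memory holds the training state."""
    total = n_params * BYTES_PER_PARAM
    return -(-total // GPU_MEM_BYTES)  # ceiling division

print(min_gpus(100_000_000_000))    # → 19  (100B-parameter model)
print(min_gpus(1_000_000_000_000))  # → 187 (1T-parameter model)
```

The gap between the memory floor (tens of GPUs) and real configurations (thousands) is spent on data and pipeline parallelism to finish training in acceptable wall-clock time.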

Second, improve GPU utilization. Analysis shows two key factors for improving GPU utilization:

In the third trend, model applications are becoming more specialized. Industry large models serve the production and operation of enterprises and require higher model precision. At the same time, in segmented multi-industry scenarios, customers have varying levels of AI knowledge and hope for turnkey delivery that reduces risks in delivery, AI model application, and operation. Industry AI should therefore meet the following requirements:

First, efficient retrieval of vector data. General-purpose foundation models usually cannot meet industry needs, so large models must be adapted and fine-tuned for each industry. A vectorized industry knowledge base provides efficient retrieval, supports incremental and fine-tuning training, and improves the accuracy of associated problem sets during inference, raising the precision of industry models so that AI large models can be applied widely across thousands of industries.
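The retrieval step can be sketched with brute-force cosine similarity over a toy embedding table. This is a stand-in for the ANN retrieval the text describes, under assumed shapes (1000 documents, 64-dimensional embeddings); production knowledge bases use approximate indexes (graph- or quantization-based) rather than a full scan.

```python
# Brute-force cosine-similarity retrieval over a toy "knowledge base".
# Sketch only: production systems use ANN indexes instead of a full scan.
import numpy as np

rng = np.random.default_rng(0)
kb = rng.normal(size=(1000, 64)).astype(np.float32)  # 1000 doc embeddings
kb /= np.linalg.norm(kb, axis=1, keepdims=True)      # normalize rows

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k knowledge-base entries most similar to the query."""
    q = query / np.linalg.norm(query)
    scores = kb @ q                 # cosine similarity via dot product
    return np.argsort(-scores)[:k]

hits = top_k(kb[42])  # querying with a stored vector returns itself first
print(int(hits[0]))   # 42
```

During inference, the top-k documents retrieved this way are fed to the model as associated knowledge, which is what lifts answer accuracy for industry questions.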

Second, one-stop integration of training and inference. Industry users are usually not AI experts, and their skills often fall short of what AI training and inference applications require. Throughout data collection, preprocessing, training, and inference deployment, customers should not need to handle time-consuming work such as AI deployment, application, and maintenance, and can instead focus on using AI to maximize enterprise operating efficiency.

III. Storage opportunities and challenges:

To sum up, after analyzing the three trends of current hundred-billion-parameter large models, AI storage needs the following capabilities:

First, high-performance file storage that meets the requirements of the AI pipeline: support for PB-scale collection of massive multi-type data, fast reading of large numbers of small files during training, and high-bandwidth writing of checkpoint data. The multi-protocol converged access, scalability, and extreme mixed-load performance of storage are therefore major challenges for file storage;
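Simple arithmetic shows why checkpoint writes stress storage bandwidth. The numbers below are assumptions, not from the article: roughly 14 bytes per parameter for an fp16 weight copy plus fp32 optimizer state, and a chosen aggregate write bandwidth. During the write, GPUs typically sit idle, so this time is a direct utilization loss.

```python
# Rough checkpoint-size and write-time arithmetic.
# Assumed (not from the article): ~14 bytes/parameter checkpointed
# (fp16 weights + fp32 optimizer state), decimal GB throughout.
def ckpt_write_seconds(n_params: int, bandwidth_gbps: float,
                       bytes_per_param: int = 14) -> float:
    """Seconds to flush one checkpoint at the given aggregate bandwidth."""
    size_gb = n_params * bytes_per_param / 1e9
    return size_gb / bandwidth_gbps

# A 100B-parameter model (~1400 GB checkpoint) at 50 GB/s aggregate:
print(ckpt_write_seconds(100_000_000_000, 50.0))  # 28.0 seconds
```

Checkpointing every few hours at this size is tolerable; checkpointing frequently enough to bound failure loss on an 8192-GPU cluster is what pushes storage toward very high write bandwidth.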

Second, a global data view: in the data collection phase, the AI pipeline needs cross-domain, cross-type visual management and cataloging of massive data to reduce collection time and improve extraction efficiency. Classifying and tagging massive data to meet training needs is therefore the key to improving end-to-end training efficiency;

Third, vector storage: to meet large models' requirements for fine-tuning training, incremental training, and efficient supply of high-quality associated knowledge during inference, and to achieve high-precision, high-efficiency model applications across thousands of industries, efficient and accurate vector retrieval must be built. High-performance, high-precision ANN retrieval algorithms and near-data retrieval acceleration are therefore key challenges;

Fourth, converged AI training-and-inference capability: one-stop deployment, one-stop training, and one-stop O&M to meet industry requirements for timeliness and ease of use.
