Building the Future: AIGC drives the intelligent transformation of operators
Anonymous published at 2024-06-28 17:47

ChatGPT has triggered a wave of enthusiasm for generative AI. For the first time, a model has demonstrated strong semantic understanding, fluent text generation, and sustained multi-turn dialogue, suggesting that general artificial intelligence may be approaching its singularity. The industry has intensified research on large models and launched new products and applications, and the traditional information industry ecosystem is being reshaped. As the main force behind ICT infrastructure construction, operators face new opportunities in the development of intelligent computing.

AI innovation accelerates the digital transformation of operators

In terms of network infrastructure operation, AIGC can support product innovation and intelligence, intelligent IT/network operations and maintenance (O&M), and even services such as intelligent auditing of network charges and self-service robots for handling network complaints. Intelligent operation not only improves O&M efficiency but also optimizes service quality and customer experience. For example, China Mobile's "Jiutian" (Nine Heavens) AI platform has produced many innovative practices in intelligent network operation.

In terms of innovative businesses such as video networking, VR, and live streaming, AIGC can not only improve the accuracy of machine-vision algorithms but also rapidly generate digital content, delivering personalized, immersive experiences. These innovative services bring new momentum to operators. For example, Tianyi Cloud's "Bright Kitchen" service improves food safety through a variety of recognition algorithms, and KT's VR service in South Korea provides rich interactive content.

In terms of edge business, "5G/leased line + industrial internet" has broad application prospects in scenarios such as coal mines and smart campuses. Facing operators' massive edge business scenarios, AIGC can not only help the edge IT platform achieve business autonomy and reduce failures, but also continuously improve the service level of edge businesses through autonomous training and inference and better adaptation to different scenarios. For example, China Telecom's iStack (smart edition) edge all-in-one machine is widely used in a variety of edge AI scenarios.

With the advent of AIGC, what challenges does the data infrastructure face?

To seize the opportunity of AI development, we need to strengthen technological innovation and infrastructure construction. Discussions of AI industry infrastructure usually focus on AI chips, deep learning frameworks, and pre-trained models, yet another key issue is often overlooked: large models place enormous pressure on data, and data storage is also a pillar of AI development. From data collection and data preprocessing to model training, what challenges does data infrastructure face in the AIGC large-model pipeline?

Challenge 1: Slow data collection

Data collection requires copying PB-scale raw data from multiple cross-domain data sources, which involves both data migration and data aggregation. For example, migrating data by physically shipping hard drives can take several weeks, and transmitting data remotely from local sites to the target data center can take several days. Scattered, heterogeneous data sources effectively create information silos, preventing data from being collected efficiently and quickly. How to break down these data silos and shorten collection time is therefore the first challenge we face.
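As a rough illustration of why remote transfer alone can take days (the link speeds and efficiency factor below are assumptions, not figures from this article), the following sketch estimates how long it takes to move 1 PB over wide-area links of different speeds:

```python
# Back-of-the-envelope estimate of bulk data transfer time.
# The 1 PB volume matches the PB-scale collection described above;
# the link speeds and the 70% efficiency factor are illustrative assumptions.
PB = 10**15  # bytes (decimal petabyte)

def transfer_days(volume_bytes: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Days needed to move `volume_bytes` over a `link_gbps` link,
    assuming only `efficiency` of the raw bandwidth is usable."""
    usable_bytes_per_s = link_gbps * 1e9 / 8 * efficiency
    return volume_bytes / usable_bytes_per_s / 86400

for gbps in (1, 10, 100):
    print(f"1 PB over a {gbps:>3} Gbps link: ~{transfer_days(PB, gbps):.1f} days")
# Approximate output: ~132.3 days, ~13.2 days, ~1.3 days
```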

Challenge 2: Long data preprocessing cycle

Raw data collected or crawled from the web cannot be used directly for AI model training. Diverse, multi-format data must first be cleaned, deduplicated, filtered, and otherwise processed, a stage we call "data preprocessing". Compared with training a traditional single-modal small model, a multimodal large model requires more than 1,000 times as much training data. For a typical 100 TB large-model dataset, preprocessing takes more than 10 days, accounting for about 30% of the AI data pipeline. At the same time, preprocessing involves frequent, highly concurrent processing that consumes a large amount of expensive CPU resources. How to shorten data preprocessing time in the most economical way is an urgent problem to solve.
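As a minimal sketch of what such a cleaning, deduplication, and filtering stage can look like (the JSONL input format, file names, and filter thresholds are illustrative assumptions, not details from this article), the CPU-bound steps are typically parallelized across worker processes:

```python
import hashlib
import json
import re
from multiprocessing import Pool

def clean(text: str) -> str:
    """Basic cleaning: strip HTML tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def keep(text: str) -> bool:
    """Simple quality filter: drop very short or mostly non-alphabetic records."""
    if len(text) < 200:
        return False
    return sum(c.isalpha() for c in text) / len(text) > 0.6

def preprocess_line(line: str):
    """Clean one JSONL record; return (dedup_key, cleaned_record) or None."""
    text = clean(json.loads(line).get("text", ""))
    if not keep(text):
        return None
    return hashlib.md5(text.encode("utf-8")).hexdigest(), {"text": text}

if __name__ == "__main__":
    seen, kept = set(), []
    with open("raw_corpus.jsonl", encoding="utf-8") as f, Pool() as pool:
        # CPU-bound cleaning and filtering run in parallel worker processes.
        for result in pool.imap_unordered(preprocess_line, f, chunksize=1024):
            if result is None:
                continue
            key, record = result
            if key not in seen:          # exact deduplication on content hash
                seen.add(key)
                kept.append(record)
    with open("clean_corpus.jsonl", "w", encoding="utf-8") as out:
        out.writelines(json.dumps(r, ensure_ascii=False) + "\n" for r in kept)
```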

Challenge 3: Training is easily interrupted and recovery takes a long time

Compared with traditional deep learning models, large-model training involves an exponential increase in both parameter counts and training datasets. Mainstream pre-trained models already have hundreds of billions of parameters and will grow toward the trillion scale. Frequent parameter tuning, network instability, server failures, and other factors make the training process unstable and prone to interruption and rework. A checkpoint mechanism is therefore needed so that training can roll back to a recent point rather than restart from the beginning. Today, the time spent on checkpoint recovery sharply lengthens the overall training cycle of large models; with a single checkpoint reaching around 10 TB and checkpoint frequency expected to move to the hour level, how to reduce checkpoint recovery time must be considered.
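The checkpoint mechanism itself is conceptually simple; a minimal PyTorch-style sketch (the file path, save interval, and placeholder loss are assumptions for illustration, not the training setup described here) looks like this:

```python
import os
import torch

CKPT_PATH = "checkpoints/latest.pt"   # illustrative path
SAVE_EVERY = 1000                     # save every N steps (assumption)

def save_checkpoint(step, model, optimizer):
    """Persist everything needed to resume: step, model and optimizer state."""
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    """Resume from the last saved step instead of restarting from step 0."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

def train(model, optimizer, batches, total_steps):
    step = load_checkpoint(model, optimizer)   # roll back to the last checkpoint
    for batch in batches:
        if step >= total_steps:
            break
        loss = model(batch).mean()             # placeholder loss for the sketch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % SAVE_EVERY == 0:
            save_checkpoint(step, model, optimizer)
        step += 1
```

The larger the model, the larger each saved state becomes, which is why saving and restoring these files turns into a storage bandwidth problem rather than a software one.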

Overall, continued innovation and breakthroughs in AIGC models require optimizing the entire collection, preprocessing, and training pipeline from the perspective of data storage.

Facing AIGC, how should operators build their intelligent computing centers?

To strengthen the construction of AI intelligent computing centers, operators need to pay close attention to data storage capabilities so that storage and computing develop in a balanced way. We suggest considering the following aspects:

First, eliminate data silos and shorten collection time.

Facing the difficulty of sharing data across multiple sources, operators need to build intelligent data-weaving capabilities that provide a globally unified data view and data scheduling across systems, regions, and clouds. With GFS (Global File System), Huawei helps customers break down data silos, improving data scheduling efficiency threefold and reducing data collection time to the hour level, so that upper-layer applications can better mine the value of data.
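To make the idea of a "globally unified data view" concrete, here is a deliberately simplified, hypothetical sketch (it is not Huawei's GFS interface; all names are invented for illustration): a catalog maps one logical dataset name to physical replicas in different regions and clouds, so applications address data by name rather than by location.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Replica:
    site: str       # e.g. "beijing-dc", "public-cloud-a"  (invented names)
    protocol: str   # e.g. "nfs", "s3", "hdfs"
    uri: str        # physical location of this copy

@dataclass
class GlobalCatalog:
    """Toy global namespace: one logical dataset name, many physical replicas."""
    _catalog: dict[str, list[Replica]] = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        self._catalog.setdefault(logical_name, []).append(replica)

    def locate(self, logical_name: str, preferred_site: str | None = None) -> Replica:
        """Return a replica, preferring one at the caller's site to avoid WAN reads."""
        replicas = self._catalog[logical_name]
        for rep in replicas:
            if rep.site == preferred_site:
                return rep
        return replicas[0]

# Usage: the application only knows the logical name "corpus/v1".
catalog = GlobalCatalog()
catalog.register("corpus/v1", Replica("beijing-dc", "nfs", "/mnt/gfs/corpus_v1"))
catalog.register("corpus/v1", Replica("public-cloud-a", "s3", "s3://bucket/corpus_v1"))
print(catalog.locate("corpus/v1", preferred_site="public-cloud-a").uri)
```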

Second, shorten data preprocessing time through near-data acceleration.

For preprocessing large volumes of raw data, Huawei data storage provides an efficient data foundation: it supports multi-protocol interworking without data format conversion and efficiently handles diverse data formats, and with 3.4 million IOPS per node it meets the performance requirements of massively parallel preprocessing. In the future, it will support pushing preprocessing tasks down to the storage layer, so that storage takes on part of the data preprocessing work, greatly reducing CPU overhead, cutting costs, and improving efficiency.
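To illustrate what "pushing preprocessing down to storage" means in principle (the article does not describe any specific interface, so the `scan` call below is a hypothetical stand-in), compare a conventional host-side filter with a storage-side filter that ships back only the matching records:

```python
from typing import Callable, Iterable

RECORDS = [b"cat photo", b"dog photo", b"cat video"]   # stands in for data on the array

def host_side_filter(records: Iterable[bytes],
                     predicate: Callable[[bytes], bool]) -> list[bytes]:
    """Conventional path: every record crosses the network and the host CPU filters it."""
    return [rec for rec in records if predicate(rec)]

class PushdownStorage:
    """Hypothetical storage client: the filter runs near the data, so only
    matching records are shipped back and host CPU cycles are spared."""
    def __init__(self, records: list[bytes]):
        self._records = records

    def scan(self, contains: bytes) -> list[bytes]:
        return [rec for rec in self._records if contains in rec]

# Both paths return the same result; the push-down path moves less data
# and offloads predicate evaluation from the expensive host CPUs.
print(host_side_filter(RECORDS, lambda r: b"cat" in r))
print(PushdownStorage(RECORDS).scan(contains=b"cat"))
```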

Finally, use an innovative AI storage solution to reduce checkpoint recovery time.

Facing the fact that large-model training is easily interrupted and must recover from checkpoints, Huawei uses innovative AI storage solutions built on high-bandwidth, large-capacity storage devices to meet the requirements of PB-scale data and hourly checkpoints, improving training and preprocessing efficiency and supporting the training of trillion-parameter large models.
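As a rough, hedged estimate of why storage bandwidth matters here (the 10 TB checkpoint size comes from the challenge described above, while the bandwidth figures are assumptions), the sketch below shows how long restoring a single checkpoint takes at different aggregate storage bandwidths:

```python
# Rough checkpoint restore time at different aggregate storage bandwidths.
# 10 TB per checkpoint follows the article; the bandwidths are assumptions.
TB = 10**12  # bytes

def restore_minutes(ckpt_bytes: float, bandwidth_gb_per_s: float) -> float:
    return ckpt_bytes / (bandwidth_gb_per_s * 1e9) / 60

for gb_per_s in (5, 50, 500):
    print(f"10 TB checkpoint at {gb_per_s:>3} GB/s: ~{restore_minutes(10 * TB, gb_per_s):.1f} min")
# Approximate output: ~33.3 min, ~3.3 min, ~0.3 min
```

With hourly checkpoints, the low-bandwidth case would spend a significant share of wall-clock time on checkpoint I/O alone, which is why high-bandwidth storage is emphasized.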

 

As AIGC surges forward, the intelligent computing base will become an important cornerstone of innovation and transformation across thousands of industries. Generative AI has already entered operator scenarios such as network operation, innovative business, and edge computing. Through innovative storage technologies, Huawei data storage has built leading all-flash architecture products and solutions and, together with AI industry ecosystem partners, is taking the lead in closely aligning storage innovation with AI development, helping operators build reliable AI data infrastructure.
