Exploration of large data storage O&M models
Anonymous published at 2024-06-28 17:28

Abstract: With shipments of data storage products now reaching XX0,000 units, how can we build a unified O&M platform that efficiently and accurately mitigates live-network risks, rectifies major issues, and proactively manages devices at scale, without increasing overall O&M investment? Will data storage O&M move toward intelligence, and what steps are needed to get there?

1. Knowledge base construction for O&M

This step is the foundation of the entire technical solution: solid data processing underpins every downstream application capability. The following difficulties must be addressed:

Mass data filtering and cleansing: When constructing a knowledge base, you need to filter valuable information out of a large volume of maintenance corpora. This involves complex data cleansing, such as removing irrelevant content, deduplicating records, and filtering out abnormal characters, as sketched below.
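As a rough illustration, a minimal cleansing pass might look like the sketch below. The specific filter rule and the hash-based deduplication are assumptions for illustration, not the platform's actual pipeline:

```python
import re
import hashlib

def clean_record(text: str) -> str:
    """Strip control characters and collapse whitespace in one corpus record."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)  # drop abnormal characters
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(records):
    """Keep the first occurrence of each record, keyed by a content hash."""
    seen = set()
    for text in records:
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield text

raw = ["Disk  fault\x07 on node A", "Disk fault on node A", "(ad) click here"]
cleaned = [t for t in deduplicate(clean_record(r) for r in raw)
           if "click here" not in t]  # crude stand-in for an irrelevant-content filter
print(cleaned)  # ['Disk fault on node A']
```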

Data augmentation and optimization: After cleansing, the data needs further optimization, for example improving its quality and applicability through techniques such as text slicing and filtering out invalid fragments.
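A sliding-window slicer is one common way to implement text slicing; the chunk size, overlap, and minimum-length threshold below are illustrative assumptions:

```python
def slice_text(text: str, max_len: int = 300, overlap: int = 50):
    """Split a document into overlapping chunks so each piece fits the
    embedding model's input window; near-empty slices are treated as invalid."""
    chunks = []
    start = 0
    while start < len(text):
        chunk = text[start:start + max_len].strip()
        if len(chunk) >= 20:  # filter out invalid (too short) fragments
            chunks.append(chunk)
        start += max_len - overlap
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.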

Vectorization: The recall accuracy for vertical-domain knowledge often falls short of expectations after vectorization, because the vectorization model is trained on public corpora and deviates from the vertical data space. To enhance the retrieval efficiency and accuracy of the knowledge base, advanced vectorization technologies such as LEDA are employed to quickly fit the model to the vertical data space, enabling efficient similarity searches.
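The LEDA details are not public here, so the sketch below substitutes a generic open-source embedding model (all-MiniLM-L6-v2) to show the retrieval mechanics: embed the corpus, embed the query, and rank by cosine similarity.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Generic model standing in for one fitted to the vertical data space.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Replace the faulty disk and rebuild the RAID group within 24 hours.",
    "Controller failover procedure for dual-controller storage arrays.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

query_vec = model.encode(["how to handle a disk failure"], normalize_embeddings=True)
scores = doc_vecs @ query_vec.T  # cosine similarity, since vectors are normalized
print(docs[int(np.argmax(scores))])
```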

Automatic Q&A generation: The emergency plan knowledge base leverages large AI model technologies and vectorization to generate questions and answers for numerous emergency plans based on seed questions. These questions and answers are then used for model training. By utilizing seed data and automatic prompt engineering, SFT datasets with high service fitting can be automatically created. You only need to provide a few simple seed samples to obtain the entire dataset. This simplifies large AI model training and reduces the cost of manual data annotation.
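A minimal sketch of seed-driven generation follows. The prompt wording is an assumption, and `llm` is a hypothetical placeholder for whatever completion interface is used:

```python
SEED_QUESTIONS = [
    "What is the first action when a storage pool turns read-only?",
    "How do I fail over a controller safely?",
]

PROMPT = (
    "You are a storage O&M expert. Based on the emergency plan below, write a "
    "new question in the style of the examples, then answer it using the plan "
    "only.\nExample questions:\n{seeds}\n\nEmergency plan:\n{plan}\n"
)

def generate_sft_pairs(plans, llm):
    """Build SFT samples from seed questions and emergency-plan documents.

    `llm` is any callable mapping a prompt string to a completion string
    (for example, a thin wrapper around a hosted large-model API)."""
    samples = []
    for plan in plans:
        prompt = PROMPT.format(seeds="\n".join(SEED_QUESTIONS), plan=plan)
        samples.append({"plan": plan, "generated_qa": llm(prompt)})
    return samples
```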

2. Large AI model training for O&M

The key to this step is effectively training a large AI model to meet specific maintenance requirements.

Proper base model: Select a model of an appropriate scale, such as Baichuan or LLaMA, based on your requirements and available resources.
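For instance, an open-weight base model can be loaded with Hugging Face Transformers. The checkpoint IDs below are public examples, not necessarily the ones used in practice:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required for Baichuan checkpoints.
model_name = "baichuan-inc/Baichuan2-7B-Base"  # or e.g. "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",  # requires the accelerate package
    trust_remote_code=True,
)
```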

Fine-tuning and optimization: Fine-tune the model for specific maintenance domains, including tuning model parameters, introducing domain-specific maintenance corpora, and performing SFT and RLHF training.
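The article does not specify the fine-tuning recipe; a parameter-efficient approach such as LoRA is one common choice for domain SFT. A minimal sketch with the peft library (rank and target modules are illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# LoRA trains small low-rank adapter matrices instead of all model weights,
# which makes domain SFT feasible on a modest GPU budget.
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total weights
```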

Evaluation and optimization: Adjust and optimize the model using various evaluation methods, such as perplexity (PPL) and benchmarks like C-EVAL and MMLU, to ensure its effectiveness and accuracy in the target domains.
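PPL, for example, can be computed directly from the model's loss on held-out maintenance text; the helper below is a minimal sketch:

```python
import math
import torch

def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood of the tokens)."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Hugging Face causal LMs return the mean cross-entropy as .loss
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())
```

Lower PPL on domain text is a quick sanity check that fine-tuning actually moved the model toward the maintenance corpus.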

3. Large AI model application for O&M

The last step is to apply the trained model to actual maintenance scenarios. The difficulties and details are as follows:

Inference service deployment: Deploy the trained model as an inference service, deciding whether to host it in the cloud or locally.
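As one example of the local option, the model can be wrapped in a lightweight HTTP service. The endpoint path and the `generate_answer` stub below are assumptions for illustration:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

def generate_answer(question: str) -> str:
    # Placeholder: call the trained model's generate() here.
    return "stub answer for: " + question

@app.post("/v1/answer")
def answer(q: Query):
    return {"answer": generate_answer(q.question)}

# Run with: uvicorn service:app --host 0.0.0.0 --port 8000
```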

Knowledge search and prompt engineering: Utilize the model to search for relevant knowledge and apply prompt engineering techniques to optimize query results. This ensures that maintenance questions are accurately answered.
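Concretely, retrieval-augmented prompting grounds the model's answer in the knowledge base; the template wording and top_k value below are assumptions:

```python
PROMPT_TEMPLATE = (
    "Answer the maintenance question using ONLY the context below. "
    "If the context is insufficient, say you cannot answer.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def build_prompt(question: str, retrieve, top_k: int = 3) -> str:
    """Assemble a grounded prompt. `retrieve(question, top_k)` is assumed to
    return the top-k most similar knowledge-base chunks (see the vector
    search sketch in Section 1)."""
    context = "\n---\n".join(retrieve(question, top_k))
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Instructing the model to answer only from retrieved context is what keeps maintenance answers traceable to the knowledge base rather than to the model's general training data.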

Inference optimization and hardware acceleration: Optimize the model inference process to improve the response speed and accuracy. You can also consider an inference hardware acceleration solution to meet high performance requirements.
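One widely used software-side optimization is weight quantization, which cuts GPU memory and speeds up decoding at a small accuracy cost. The sketch below loads a model in 4-bit via bitsandbytes; the model ID is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit weight quantization: much lower memory footprint, minor accuracy cost.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative model ID
    quantization_config=bnb_cfg,
    device_map="auto",
)
```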

To conclude this article, I would like to explore the following questions with you: How can we construct an O&M system for storage devices at massive scale? How can we build online O&M capabilities for ultra-large clusters, such as high-performance computing and intelligent computing clusters? And how can we quickly identify and handle emergencies? I look forward to your insights and comments.

 
