This article is reprinted from: Andy730 public account
Conference: FAST '24
Subject: Panel: Storage Systems in the LLM Era
Date: February 28, 2024
Moderator:
Keith A. Smith (MongoDB)
Panelists:
Glenn Lockwood (Microsoft)
Dean Hildebrand (Google)
Greg Ganger (Carnegie Mellon University)
Zhe Zhang (AnyScale)
Nisha Talagala (Pyxeda)
Key Contents
1. AI Storage Requirements and Challenges
(1) Special requirements on storage for AI: AI storage requires not only larger capacity and higher performance, but also efficient data processing and management to adapt to a large amount of dynamic data load in deep learning systems.
(2) Challenges in AI data management: Currently, large AI model training does not have significant data problems. However, with the development of multimodal AI, especially the emergence of Large World Model (LWM), there will be greater demands for data processing in the future.
(3) Importance of multimodal data: AI with multimodal capability can process data from different sources (such as images, text, and audio), which imposes higher requirements on storage and analysis capabilities. For example, the James Webb Space Telescope generates 235 GB of multimodal data each day. Comprehensively analyzing this data is a new challenge.
(4) Evolution of storage requirements: The access mode of cold data may change with greater demands for data processing, especially when large datasets (such as YouTube videos) are used for model training, where the access frequency of the data increases significantly. Therefore, storage systems need to be redesigned to adapt to the changes.
2. User Requirements and Support
(1) User self-service processing capability: Users expect a storage solution that reduces costs and increases capacity without requiring a deep understanding of how storage is implemented. They want automatic data processing through simplified parameter configuration rather than having to rely on details of the storage layer.
(2) Diversified customer requirements: Customers include those with technical capabilities and knowledge and those without. The latter prefer to use simple configuration files (such as YAML) to implement automatic data management processes rather than focus on complex storage details. In addition, the connection to supercomputers further complicates the storage systems, requiring storage solutions that can flexibly adapt to various application scenarios.
(3) User-friendly solution: AI-driven tools such as chatbots help users solve storage problems more quickly, reduce dependence on technical support, and make it easier for users to obtain the information they need when facing complex file systems.
(4) Automated training and user support: Automated tools help users better use storage systems, improve user experience, and reduce learning costs.
3. Data Management and Processing
(1) Streamlined data management process: Multiple storage systems are used to process data in different phases, including storage systems for raw data, model training, and feature data. This multi-tier storage architecture complicates data tracking and management.
(2) Real-time data processing: As the demand for real-time information increases, more efficient data management is required in the future. Users want to quickly load and process data, especially when processing PB-level datasets.
(3) Data traceability: AI-generated data blurs the line between what's real and synthetic. Therefore, data traceability is one of the key concerns for identifying and managing data of different types.
4. Application of AI in Storage Systems
(1) Application of AI in storage systems: AI can be used to identify performance bottlenecks, optimize data layout, and diagnose root causes of problems, making storage systems more efficient.
(2) I/O path optimization: The I/O path of a storage system is usually complex. AI has the potential to reduce manual fine-tuning workloads by optimizing the code and simplifying the system, thereby improving system efficiency.
(3) Applicable standards for AI: To measure the effectiveness of AI in addressing storage problems, two things must be available: relevant data and metrics for measuring AI performance. With them, researchers can determine whether AI improves the performance of a storage system.
5. Outlook for Natural Language Interfaces
(1) Prospects for conversational interfaces: Conversational interfaces for complex systems will become a trend. Users can communicate with storage systems in natural language to obtain real-time status and fault information. It's estimated that this capability will appear in commercial products in two to three years.
(2) Ambiguity in natural language: The ambiguity of natural language may lead to misunderstanding during interaction with computers. Therefore, natural language query systems must be designed in such a way that users can get what they want to know.
-----
Keith A. Smith (MongoDB)
The topic is storage systems in the large language model (LLM) era. I'm thinking more of storage systems and AI in general. There's a lot of potential where the two meet. I guess we should start by introducing the panel. Let's go from Glenn Lockwood to Nisha. Each of you, tell us who you are and something about your experience with AI and storage, since that's the intersection of things we're going to talk about.
Glenn Lockwood (Microsoft)
Hi, everyone. I'm Glenn Lockwood. I work at Microsoft. I am a fallen storage person: I spent most of my career doing storage architectures for high-performance computing (HPC). I came to Microsoft and did storage products for AI. As of two months ago, I am no longer in storage. Now, I am 100% in AI. So, I work with our biggest AI customers to solve their infrastructure problems.
Dean Hildebrand (Google)
Hi, I'm Dean Hildebrand. I spent a little over a decade in HPC working on scalable storage systems, primarily at IBM Research. Just as I was leaving there seven years ago, I had this idea that I wanted to trace I/O workloads on a GPU, but I couldn't get funding for one GPU. Since joining Google, I've been in the CTO office, working on our overall cloud storage strategy. If there's a file storage offering, I have probably touched it in one way or another. Now, as you can imagine, in the last year there's been a lot of hubbub across the entire portfolio around what we do to solve this AI problem, and whether we need something new. This gets really interesting when you bring a full platform view to it, not just iterate on individual storage systems. So, I'm glad to be here.
Greg Ganger (Carnegie Mellon University)
I'm Greg Ganger. I'm at Carnegie Mellon University. On the storage side, I've been involved in FAST since it started. I think I've come to every one of them. On the AI/ML side, 10 or 12 years ago before everybody was paying attention to it, some of the ML people at CMU, which actually has a machine learning department, came and started asking us: Can you help us figure out how to parallelize and how to create parallelization models that make sense for this weird type of computation that's statistical in nature rather than precise in nature, the way most parallelization focuses on it. We've kept working with them on further and further explorations in that sort of space. So, I guess I have an academic viewpoint on the systems for ML side of things.
Zhe Zhang (AnyScale)
Hello, my name is Zhe Zhang. I'm responsible for the Ray open-source project at AnyScale. Ray is a distributed computing framework that's commonly used for AI workflows. For example, OpenAI uses Ray as one layer in their stack to train large models. In the past three and a half years, I've mostly been working on AI. Before that, I spent 10-plus years mostly working on storage, starting at the Oak Ridge National Laboratory (ORNL) working on high-performance computing (HPC) and the Lustre file system, and then on the Hadoop Distributed File System (HDFS). So, I think I have some viewpoints from both angles.
Nisha Talagala (Pyxeda)
Hi, everyone. I'm Nisha. I started out in the storage world, mostly in distributed storage, and did a whole bunch of different things in distributed storage and file systems, working with Keith for a while. The last projects that I did in storage were in flash and persistent memory. About nine years ago, I switched over to AI. I built an AI company and built the first machine learning operations platform. One of the weirdest things is that when I wrote the first Wikipedia MLOps page, I could not get Wikipedia to accept it, because they said it's not a thing and I had just made it up. That company got acquired a while back. Now, I run a company called Pyxeda; we commonly go by the name AIClub. We do AI literacy and AI education. On that front, I do a lot of multidisciplinary work in the application of AI to everything from satellite imagery to genetics, and also on scalable platforms for launching AI education at mass scale. Right now, I'm trying to launch 100 schools in Asia, all running on the cloud. That's where my world of AI and my world of storage meet.
Storage for AI from bytes to insights: understanding storage workload dynamics in deep learning systems
The Fall and Renaissance of Storage in AI
Keith A. Smith (MongoDB)
Thank you. I've got some rough sessions in mind. One is what AI needs from storage, a session about storage for AI. I'm curious about these questions: Does AI need something from storage other than just larger capacity and higher performance? Are there special needs? Is this a workload that we need to do something new and different for?
Glenn Lockwood (Microsoft)
I'll start with the inflammatory statement that AI LLM training does not have storage problems. Just keep making it cheaper and bigger, and get it out of the way. That's what my customers have consistently told me about how they view storage.
Keith A. Smith (MongoDB)
How far out of the way do you get it? Is it just like, do you want a file system, do you want objects, do you just want raw bytes on the disk?
Glenn Lockwood (Microsoft)
That doesn't matter. They can figure out how to make best use of it. Just tell them what the parameters are and they will make it work. They don't want to think any more about storage or have to learn something proprietary. Just let them deal with it and they'll take care of it.
Dean Hildebrand (Google)
Maybe I'll add to that. I don't think I necessarily disagree. But I'll say that if I expand it out to a larger set, we have multiple different types of accelerators, and we have multiple different types of software environments like GKE, Kubernetes, and just regular Linux with Slurm and other schedulers. In addition to that, we also have a wide variety of customers. Some of them are on that side: they hired all the best people, and they know exactly what to do. Then there are those all the way on the other side, who say: I just want a YAML file. I want automatic data tiering, ingestion, training, checkpointing, and restart. I want all of that just to happen from a single YAML file. Why do I have to worry about where the storage is?
So, I think that we have a lot of the nuts and bolts. But when you deploy a supercomputer and then try to point it at the existing storage systems, things fall apart and there are challenges. So, I think that where we are at the moment is really trying to bring up both sides of the coin and adapt to a wide variety of application scenarios. It's not just a single supercomputer. There are many supercomputers in many different forms. We are trying to handle all of these.
Nisha Talagala (Pyxeda)
Yeah, I would generally agree with both of your comments. Perhaps the key is not in the data path. For the data path, the goal is to be faster, cheaper, and larger. But on the control path, the problem is that AI relies on data. There are so many laws coming out now, and yet we have absolutely no idea what data we're using or why the model did the thing it did. And there's a very unpleasant tracking problem that is going to emerge. One of the simplest ways I can describe it: AI generates data now, and unless you're really careful, you can't tell the real data from the stuff that AI made up. At the moment of creation, you know. But afterwards, you don't. So, the control path, and the ability to keep track of it all from a logical standpoint and how it links to everything, is where the problems are.
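The tracking problem described here (origin is known at the moment of creation and lost afterwards) can be sketched as a small catalog that records where each item came from when it is created. This is only an illustrative sketch: `ProvenanceRecord`, `register`, `lookup_origin`, and the labels "captured" and "generated" are hypothetical names, not anything the panelists described.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record written at the moment a data item is created."""
    content_sha256: str  # fingerprint of the bytes, used to match the item later
    origin: str          # "captured" for real-world data, "generated" for model output
    producer: str        # illustrative label for the sensor, pipeline, or model


def register(catalog: dict, payload: bytes, origin: str, producer: str) -> ProvenanceRecord:
    """Store a provenance record keyed by the content hash."""
    if origin not in ("captured", "generated"):
        raise ValueError("origin must be 'captured' or 'generated'")
    digest = hashlib.sha256(payload).hexdigest()
    record = ProvenanceRecord(digest, origin, producer)
    catalog[digest] = record
    return record


def lookup_origin(catalog: dict, payload: bytes) -> str:
    """Return the recorded origin, or 'unknown' if the item was never registered."""
    record = catalog.get(hashlib.sha256(payload).hexdigest())
    return record.origin if record else "unknown"


catalog = {}
register(catalog, b"photo taken by a camera", "captured", "camera-01")
register(catalog, b"image produced by a model", "generated", "image-model-v1")
print(lookup_origin(catalog, b"image produced by a model"))
print(lookup_origin(catalog, b"bytes nobody registered"))
```

Anything that was not registered at creation time comes back as "unknown", which is exactly the situation the speaker warns about.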
Data Management for AI
Best Practices for Lifecycle Management of Machine Learning Pipelines
Generalized High-Performance Storage Tiering and Lifecycle Management for All AI Frameworks and Accelerator Platforms
Keith A. Smith (MongoDB)
That's a great segue to the next session at this hypothetical conference, about data management for AI. Can you elaborate on it, because that was very high level? What's a research topic that somebody could take on?
Zhe Zhang (AnyScale)
Echoing the point that current large model training doesn't have a data problem, I just want to make it more concrete with a small interactive quiz. How large do you think the entire English-language Wikipedia content is? I have three options: over 100 GB, from 20 GB to 100 GB, and below 20 GB. I think the size is right around 20 GB. So basically, any single computer can load it, and even an entire Internet crawl is something like a couple of TB. But I think things will change a lot with multimodality. Recently, there's a new concept called Large World Models. I think we will be witnessing a very interesting change. It's hard to predict, but I'm personally very excited.
Nisha Talagala (Pyxeda)
I just want to echo the point about multimodality. One of the most interesting things that has happened with AI in the last year or two is this: previously, AI was very unimodal. For each modality, there was an AI for that. Now the AIs can intersperse and go from modality to modality without blinking. That's going to dramatically increase the amount of data they handle.
Glenn Lockwood (Microsoft)
Could you elaborate on which models are going to contribute most to the unknown future storage problems?
Nisha Talagala (Pyxeda)
One example is the James Webb Space Telescope, which is generating 235 GB of data a day. That data is inherently multimodal because it captures so many frequencies, not just the visual spectrum. And we truly don't know how to analyze it at the moment. There are going to be lots of AIs that work across the spectrum and create requirements for handling huge amounts of data.
Dean Hildebrand (Google)
I was just going to add to that. In a sense, you have one type of storage system where the data lands as the source of truth, you have a training storage system, and you have a feature store that analyzes that storage system. Then let's say you want to broadcast the results of whatever you built out into a CDN, and so on. I just wanted to amplify the data management aspect of what you're saying, because it is starting to get really significant. When you have, on average, four different storage systems being used throughout the process, it starts to get pretty tricky, especially as data management becomes increasingly complicated.
Nisha Talagala (Pyxeda)
Another example may be data about the human body. Think about the human body as multimodal data, including your genome, phenotype information, X-rays, and blood tests. There's so much data, and it's all in different modalities, but it all belongs to you. So you have an interesting data management problem, because your privacy has to be measured across all of them, but they also have to be merged into a single AI that's going to try to understand how to cure your cancer.
Keith A. Smith (MongoDB)
You were touching on the problem of keeping track of which models generated which data. About 10 to 12 years ago, we saw a bunch of papers at FAST about data provenance. Should we be revisiting or thinking more about this problem?
Greg Ganger (Carnegie Mellon University)
We should never have stopped. But going back to what you were saying, the bandwidths, the speeds and feeds, for LLM training are in the noise compared to the work of training the model. But if you start going to data sets that historically were viewed as cold data and trying to use them, things would change. For example, suppose we wanted to take a bunch of the YouTube videos stored at Google and start training models against all of them. That would bring access frequencies that cold data has never seen. To keep up with that change, we would have to redesign the storage systems to handle data 100 times more voluminous than the kind of data that's being used to feed training today.
Glenn Lockwood (Microsoft)
I feel like that's a bit of a myth that's perpetuated by the storage industry: that AI needs access to all your data all the time. In practice, that's not how LLMs are trained. You don't take raw text or images and just shove them in a GPU to train the model. There's an extensive pre-processing pipeline. If the amount of data that you pre-process is too large to manage in your hot path, you find a better way to represent it; that's what tokens are. No one takes ASCII files and puts them in a GPU, and the pre-processing step happens offline before training ever begins. From my perspective, that's just a standard big data analytics, Hadoop-style problem. What's new about that?
Dean Hildebrand (Google)
Yeah, but I am seeing people want more real-time access to real-time information. I think that by 2030, they're going to want to be able to train on smaller data sets and improve those models dynamically, which is going to change the situation. We won't have a week to fine-tune our data into the right format and get the right columns in the right spot. I think it's going to happen that way. In addition, when we started this, we were like, inference, there's no storage there, it's a command pump, until people started showing up with petabytes of satellite data and genomic data and so on. They say: I want to load my model onto a thousand nodes simultaneously, and then I want to process this amount of data, and how fast can you do it, because the answer is time sensitive to me.
We don't have time to take every piece of data and remerge it into a new form. So, we have to make it efficient from the beginning.
Glenn Lockwood (Microsoft)
It has to happen somewhere though. Right?
Dean Hildebrand (Google)
Or we can just make it efficient from day one, right?
Keith A. Smith (MongoDB)
So, are these actually new storage problems or is this Glenn hypothesizing?
Zhe Zhang (AnyScale)
About the new storage problem, I think another very interesting emerging type of storage is embeddings plus vector databases. For example, if it's not only a string-in, string-out query problem, but you need to complement the model by retrieving your own data, that's called retrieval-augmented generation (RAG). That industry is growing really fast and I think we're still not seeing the peak of it.
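A minimal sketch of the retrieval step in RAG follows, under loud assumptions: the "embedding" is just a word-count vector, and `TinyVectorStore` and `build_prompt` are hypothetical names. A real deployment would use a learned embedding model and a vector database; this only shows the shape of embed, search by similarity, and prepend the retrieved text to the query.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store holding (embedding, original text) pairs."""

    def __init__(self):
        self.items = []

    def add(self, text: str) -> None:
        self.items.append((embed(text), text))

    def search(self, query: str, k: int = 1) -> list:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(store: TinyVectorStore, question: str, k: int = 1) -> str:
    """Prepend the retrieved private data to the question before calling a model."""
    context = "\n".join(store.search(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


store = TinyVectorStore()
store.add("checkpoint files are written to the parallel file system every hour")
store.add("the cafeteria serves lunch at noon")
top = store.search("where are checkpoint files written", k=1)
print(build_prompt(store, "where are checkpoint files written"))
```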
Greg Ganger (Carnegie Mellon University)
Yeah. I don't want to use the word downsampling because it's wrong, but you can take data that would have been too voluminous and bake it down into something that you can work with. As the original data gets larger, the potential consequence of what you're losing when you do that gets larger, and it becomes a trade-off. Is it better to find a way to reduce the amount of reduction before we can make forward progress, or do we increase the amount of reduction in order to avoid having to do that? I don't know what the answer is there. There's going to be a lot of trial and error.
Keith A. Smith (MongoDB)
So, is that a storage problem? As Glenn said, storage has just become faster and cheaper. Is this what we have been caring about? I don't know if I'm misrepresenting you?
Glenn Lockwood (Microsoft)
I would argue that, for example, if you want to train on a bunch of 4K images, what's the point of training on 4K movies with 32-bit color channels when all that stuff is being reduced down to 4-bit integers for training or inference? I mean, you are training a model. You are not doing rocket science. You do not need FP64, so you are doing a bunch of really dumb approximations. Anyone who's used any kind of AI chatbot knows that these things are impressive until you actually start trying to use them for useful things; then you realize they are kind of dumb. That will undoubtedly change, but do we really need this massive full-fidelity data problem to be solved to get the next generation of generative cat videos?
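To make the "reduced down to 4-bit integers" point concrete, here is a minimal sketch of uniform quantization onto 16 levels. It assumes a simple min/max affine scale, which is one common choice and not necessarily what any production training stack does; `quantize_4bit` and `dequantize_4bit` are illustrative names.

```python
def quantize_4bit(values):
    """Map floats onto the 16 integer levels 0..15 with a uniform (affine) scale."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_4bit(codes, lo, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, lo, scale)

# The worst-case reconstruction error is about half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_error)
```

Each value now needs 4 bits instead of 32, an 8x reduction in volume, at the price of an error of up to roughly half a step.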
Zhe Zhang (AnyScale)
To these points, and to the earlier point you made that there's big data, Hadoop-style processing before training, I think one interesting question is: do we push these kinds of things down to the storage system, or do we consider that kind of pipeline separate? Concretely, one very common demand from our users is deduplication of web crawl data. There is also a strong need to retrieve data that matches certain conditions, not necessarily exactly.
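The deduplication demand mentioned here can be sketched, in its simplest exact-match form, as hashing a normalized copy of each document and keeping the first occurrence. The normalization rule (lowercasing and collapsing whitespace) is an assumption for illustration; real crawl pipelines typically add near-duplicate detection on top of this.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivially different copies hash the same."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Keep the first occurrence of each distinct normalized document."""
    seen = set()
    unique = []
    for doc in documents:
        fingerprint = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(doc)
    return unique


pages = ["Storage is fun.", "storage   is fun.", "AI needs data.", "Storage is fun."]
unique_pages = deduplicate(pages)
print(unique_pages)
```

Whether this loop runs in a separate pipeline or is pushed down into the storage system is exactly the open question raised in the discussion.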
Dean Hildebrand (Google)
Here, I'll end on one point: I think there's a weird amount of inefficiency in the system that we're very uncomfortable with today. We're copying data all over the place, and we're downsampling or otherwise munging it, taking 4K down to 128 KB images. We're doing a lot of work, but incredibly inefficiently. If you think about it as write or read amplification per byte of data that we're trying to work with, the amount is just amazing. I do think there's a big opportunity in that space over time to get a lot more efficient in terms of how we're handling this. It works today, but is that the bar? I think we have a higher bar.
Nisha Talagala (Pyxeda)
Another area to maybe think about: we always think bigger, right? There is a lot of value in the land of AI in being smaller. So consider whether you can build storage devices or storage software that can work in tiny devices, as a companion to the AIs that are learning how to work in tiny devices, because the energy footprint is insane and anything that reduces the energy footprint is a very big deal right now. So, that's another way to look at it. It does not have to be bigger. There's merit in being really small.
Using AI in Storage Systems
Predictive Analytics for Storage: Harnessing AI to Anticipate and Mitigate Performance Bottlenecks
AI-based I/O Path Code Optimization that Reduces Latency and Improves Efficiency
AI-Optimized Data Placement Strategies for Heterogeneous Storage Environments
Interactive Real-Time Root-Cause Anomaly and Bottleneck Detection for Distributed Storage Systems
Keith A. Smith (MongoDB)
I think the other side of the coin for a conference like FAST is this: we've talked about what AI needs from storage. The other side is, if you are building and designing storage systems, how can storage systems and storage systems researchers use AI in the future? ChatGPT seems to have taken arbitrary storage problems and put AI in the title someplace. So: find bottlenecks using AI, data placement using AI, and root cause problems using AI.
Dean Hildebrand (Google)
I think I've reviewed some of these papers. Wait a second, one of those is mine. I think it is also stealing what I was just saying.
Greg Ganger (Carnegie Mellon University)
I'm so glad I didn't have to be the one who played the crotchety old person, saying there's nothing new under the sun. I do not know how many people in the room remember the term autonomic computing, so you can count the gray hair. Self-tuning, self-healing, self-this, self-that, right? Everybody over a certain age worked on a project that had one of those adjectives in the name, because that was how you got attention in the period from about 2000 to 2006 or 2007. It all kind of went away because, frankly, the machine learning wasn't good enough. You cannot automate decisions based on a predictive model if the predictive model keeps getting it wrong at too high a frequency and with too large a gap between the prediction and the reality. I do not mind seeing titles that look similar, because maybe it is good enough now, or it is reaching the point of being good enough, for some of these things that were tried in their nascent stages. I am not saying that it would not be new when it is tried, because it will end up being done differently, and the problems we care about using these things for are different now, right? But it is exciting to see. Just as it is shocking how well it has done in some other domains, maybe it is going to be pleasantly shocking how well some of the things that we aspired to use these automation tools for a while ago can actually work out this time. I am sure that people in this room have submitted papers with titles that were one step away from these. But three years ago does not count; that's basically in the current era. The paper from 2004 with a title that's similar to these, that's the one where you look back and you go, wait a minute, how come?
Keith A. Smith (MongoDB)
Think about a storage system. There are policy decisions in it. What do I prefetch? What do I evict? How do I decide when to tier my data? In a distributed system, how do I distribute and place my data? And so on. Should researchers just do the random thing, take an idea, collide it with AI, and see if something wins? What would be the filter to say this would be a good problem in storage to try to apply an AI model to?
Nisha Talagala (Pyxeda)
I generally think that a problem is a good candidate if it has these two characteristics. First, do you have data about it, and do you know what good looks like? Second, do you know how to measure and figure out whether AI did better? You can always apply AI to a problem, but that does not mean it is going to do better, and sometimes you cannot even tell. If you have those two, give it a shot.
Dean Hildebrand (Google)
I would like to put one in there about the I/O path. I do not know how many people have looked at the critical-section I/O path of storage systems, but they tend to be a bit wonky: maybe a few thousand lines longer than they should be inside a single function, with a lot of optimizations around bits and caches and all these different aspects. I am really hoping that AI, if anything, can either take existing code and optimize it, or spare us from sitting there cache-line aligning a variety of different aspects of the system. I think there is some grunt work that it can help with, so that we can make better file systems or storage systems from day one, as opposed to having to tune them over the course of a decade.
Glenn Lockwood (Microsoft)
Or alternatively, keep making horribly complicated file systems but use AI to make them accessible to people. I used to work in the DOE lab system and I would get tickets saying my I/O is slow, what's going on? You do not need a highly paid PhD to answer that question. It is just basic troubleshooting steps, so why not throw an enlightened chatbot at it and help that person get better quicker use out of their complicated file system rather than have to wait for days for that one person on staff to know how it works.
Greg Ganger (Carnegie Mellon University)
That is to create automated tools to train people to use it.
Glenn Lockwood (Microsoft)
Yeah, just lower the barrier. You do not have to be able to read cryptic kernel code in order to understand why some weird behavior is manifesting when you run your application a certain way. It is not a satisfying thing from an I/O or storage research standpoint. Storage research should make the storage better but in the absence of better storage make it easier for people to break through that barrier.
Greg Ganger (Carnegie Mellon University)
No, better users.
Glenn Lockwood (Microsoft)
Yeah, it is a dream, right?
Keith A. Smith (MongoDB)
No, and there's a whole body of research and different tools and techniques to spot bugs or spot performance problems that are less AI-based but it is definitely a real problem.
Zhe Zhang (AnyScale)
I just have one more detailed point to add. I think it is important to think about which of these problems are more relevant to the recent AI improvements, versus which can be solved by previous, traditional machine learning. Basically, we need to think about in which problems the input data is text or unstructured. In some of the problems, the input data is structured, like events, and you can already handle it with deep learning; there we shouldn't expect a lot of improvement from the recent AI. But if it is logs, error messages, or source code, we can expect some big jumps.
Smart Storage Systems
Empowering Natural Language Queries: Leveraging Large Language Models for Data Exploration
From Words to Actions: Managing Storage Systems through Conversational Interfaces Powered by LLMs
Nisha Talagala (Pyxeda)
Another thing that I am starting to see more in the product world, which suggests it will happen even if it may not be a research topic, is conversational interfaces to complex systems. So, you can call up your storage system and say: "Hey, storage system, how are you doing today?" "You know, it wasn't too bad. I had a lot of I/O last night. I've got a bad disk over here, slightly overworked CPU over there." This is real and this will show up in products in the next two to three years.
Dean Hildebrand (Google)
I've got an itch on server three.
Keith A. Smith (MongoDB)
Right. ChatGPT helpfully provided some titles around using natural language for managing your AI system or your storage systems.
Greg Ganger (Carnegie Mellon University)
For all the students in the audience, I want to make sure that we amplify that: before you tackle a problem saying "I'm going to apply AI, ML, or whatever to this to make it better", start by figuring out what "good" looks like and what the gap is, to figure out whether or not after you did the work that you did, even if it was perfect, it was worth doing. It also gives you a thing to train for. If you're doing supervised learning, it gives you the thing that says what "good" looks like. So, it serves two purposes, but it serves the filter: the decision-making process on whether or not you're going to go down that path as well.
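One way to follow this advice for a cache-eviction policy is to measure the gap between a simple baseline and the best any policy could possibly do before training anything. The sketch below compares LRU with Belady's offline optimum on a made-up access trace; the trace and the function names are illustrative assumptions, not something from the panel.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count cache hits for least-recently-used eviction."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits


def optimal_hits(trace, capacity):
    """Belady's offline optimum: evict the block whose next use is farthest away."""
    cache = set()
    hits = 0
    for i, block in enumerate(trace):
        if block in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            def next_use(b):
                for j in range(i + 1, len(trace)):
                    if trace[j] == b:
                        return j
                return float("inf")        # never used again
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return hits


trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
baseline = lru_hits(trace, capacity=3)
best = optimal_hits(trace, capacity=3)
gap = best - baseline
print(baseline, best, gap)  # prints "4 5 1"
```

On this trace even a perfect predictor gains a single hit over LRU, which is the kind of result that tells you whether the learned policy is worth building at all.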
Dean Hildebrand (Google)
I do remember my advisor saying "RAID 0 is sometimes hard to beat". Before you come up with a new striping algorithm, maybe consider whether your current one is fine.
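For reference, the RAID 0 baseline being praised is just round-robin striping, which fits in a few lines. This is a sketch of the standard address mapping; the function name and the choice of four disks with a two-block stripe unit are illustrative.

```python
def raid0_locate(logical_block, num_disks, stripe_unit_blocks):
    """Map a logical block number to (disk index, block offset on that disk)."""
    stripe_unit = logical_block // stripe_unit_blocks  # which stripe unit overall
    within_unit = logical_block % stripe_unit_blocks   # position inside that unit
    disk = stripe_unit % num_disks                     # units rotate round-robin
    stripe_row = stripe_unit // num_disks              # full rows before this unit
    return disk, stripe_row * stripe_unit_blocks + within_unit


layout = [raid0_locate(b, num_disks=4, stripe_unit_blocks=2) for b in range(10)]
print(layout)
```

Blocks 0 and 1 land on disk 0, blocks 2 and 3 on disk 1, and so on, wrapping back to disk 0 at block 8. Any new placement algorithm has to beat this much simplicity.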
Nisha Talagala (Pyxeda)
There's a running joke in the AI world that first you get the simple machine learning decision tree to work, and then you spend six months getting your deep learning to work so that you can publish, because the second one is highly temperamental but the first one is not publishable.
Keith A. Smith (MongoDB)
I'm perhaps somewhat naive about how these natural language systems work. But when I think about using natural language to configure a system or to query my data, natural language is inherently ambiguous; that's part of the expressiveness and poetry of the way we talk to each other. And we've traditionally interacted with computers in very unambiguous and very clear languages. So is there a problem in having these two worlds meet? If I start using my own language to query my data, do I not know what I'm getting, because maybe the machine heard something different than I intended, just like I could say something ambiguous and you could misunderstand me?
Nisha Talagala (Pyxeda)
Yes.
Keith A. Smith (MongoDB)
So, there's not a silver bullet here.
Nisha Talagala (Pyxeda)
It's a very important point: one of the things it implies for the previous point of what "good" looks like is that it's really hard to measure. Previously, when we knew what the right answer was, we knew whether or not we got it and we could compute the error, right? The answer was 0.5, we got 0.538, and we know what the error is. Here we've got one sentence it could have said, one sentence we might like, and we have no idea what the right answer was, and everybody's got a different opinion. So, one thing to watch is that there's a body of research coming out on how to measure large language models (LLMs), and that's a very nascent field. But if you're going to build an LLM to solve a problem, you probably need to figure out what the state of the art is in measuring the thing so that you can use it.
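To make the contrast concrete, here is a small sketch. The numeric case uses the 0.5 and 0.538 figures from the remark above. The two sentences and the choice of exact match and token-overlap F1 as text metrics are assumptions made for illustration, not something the panel proposed.

```python
from collections import Counter


def absolute_error(expected: float, actual: float) -> float:
    """Numeric answers have one unambiguous notion of error."""
    return abs(actual - expected)


def exact_match(reference: str, candidate: str) -> float:
    """1.0 only if the two texts are identical after trimming and lowercasing."""
    return float(reference.strip().lower() == candidate.strip().lower())


def token_f1(reference: str, candidate: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


numeric_error = absolute_error(0.5, 0.538)

# Hypothetical reference and model answers that mean roughly the same thing.
reference = "move cold data to the archive tier"
candidate = "archive the cold data"
print(f"numeric error: {numeric_error:.3f}")
print(f"exact match:   {exact_match(reference, candidate)}")
print(f"token F1:      {token_f1(reference, candidate):.3f}")
```

The two text metrics disagree about the same answer (one scores it as a complete miss, the other as a partial hit), and neither captures whether a reader would accept it, which is the measurement difficulty described above.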
Keith A. Smith (MongoDB)
It's not clear that humans are great with programming languages either. I work at a database company and we do get customer issues on a routine basis, where "I ran this query and got this result, and it seems wrong to me". It's like, well, you don't understand some of the language and the query you wrote. This world may not be any worse.
Greg Ganger (Carnegie Mellon University)
The biggest challenge in the whole process of systems tuning themselves to goals that are specified is eliciting the goals. One of the things that we have to work on or think about if we're going to get back into this space is how we're going to do that sort of thing. Part of why automation tools didn't take off when we tried to do this stuff is that people were very uncomfortable with them and with not knowing what they were doing, but they were also not comfortable with only being able to say "I need this performance for whatever or I need this reliability" and then not knowing what the heck's going to happen based on it. When I say "people", I mean the people who are going to get blamed when it doesn't work, because those are the ones that are close to it and that are supposed to be able to specify this stuff. Coming up with analogies to how they do their work and figuring out how to incorporate that into the design of systems that are supposed to automate themselves becomes useful.
Operators of storage systems, or of systems in general, discover that it's not good enough not because somebody had a pre-specified level that they were supposed to hit, but because somebody comes and complains that it's not good enough, which gives you one end of the spectrum. What happens when you're overshooting is harder to elicit, because the obvious way to do that is to go slower until somebody complains. But when they complain, you get in trouble, and that's a place you don't want to be. So, figuring out how we're going to do that is an interesting thing.
Keith A. Smith (MongoDB)
Anyone who has worked with real-world customers knows that they can be extremely unclear about what their goals are for whatever it is that you're selling them.
Nisha Talagala (Pyxeda)
But they know when it hasn't been met.
Greg Ganger (Carnegie Mellon University)
One thing people are good at is complaining, so it has to be part of how we elicit things but it can't be the only answer.
Keith A. Smith (MongoDB)
Is there a way to observe what the customer is actually doing and figure out what their goals are if you're smart enough?
Nisha Talagala (Pyxeda)
One thing that might be worth looking at is: one of the reasons ChatGPT works the way that it does is that it employs something called RLHF, which is reinforcement learning from human feedback. There are armies of humans all over the world reviewing its answers and saying "I like that one". That's the cycle that the industry has developed to incorporate human knowledge. How well it works, time will tell. But that is what they're doing.
Greg Ganger (Carnegie Mellon University)
Complementing the language side of it, we need cameras, so you can tell whether the person is smiling or they're frowning.
Glenn Lockwood (Microsoft)
I'm sure the security people will love it.
Greg Ganger (Carnegie Mellon University)
Yes, as they love almost all of the AI stuff, right? Do data collection for it.
Dean Hildebrand (Google)
So, it can automatically turn on my laptop camera is what you're saying?
Glenn Lockwood (Microsoft)
Yeah, and send it to Google, to Microsoft.
Dean Hildebrand (Google)
Yeah.
Glenn Lockwood (Microsoft)
I think the key here is, you don't want to go full automation. I don't really see a future where complex tasks are fully automated by a blackbox LLM or AI. I mean, putting my Microsoft hat on, we do co-pilots and it's deliberately a co-pilot because you are not supposed to blindly trust it. It's not a pilot, it's going to screw things up, and this just helps you get a little closer. Language is imprecise and the results you get are going to be imprecise. And people are going to say it's not good enough and you can say "Well, that's ultimately on me as a human and I got help from this thing".
I think the issue of fully automating all of these …. Look at any automatic transparent attempt to make anything better, like caches for example. Caches are great in CPUs unless you're doing HPC in which case you need to actually know what your cache sizes are, so you can tile your data accesses so that it fits exactly in this hidden footprint that gets cache aligned and fits exactly in your L2 cache. So automation is great until you really need to get something perfectly right and I think we will always need a human to do that.
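The tiling idea mentioned above can be sketched as a blocked matrix multiply. This is only an illustration of the loop structure: the 256 KiB L2 size, the "three tiles resident" sizing rule, and the function names are assumptions, and pure Python lists do not give the memory-layout control that makes tiling pay off in C or Fortran HPC codes.

```python
import math


def tile_size_for_cache(cache_bytes: int, elem_bytes: int = 8,
                        tiles_resident: int = 3) -> int:
    """Largest square tile edge such that `tiles_resident` tiles fit in the cache."""
    return max(1, math.isqrt(cache_bytes // (elem_bytes * tiles_resident)))


def tiled_matmul(a, b, tile):
    """Multiply a (n x k) by b (k x m), visiting the data one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                # Work confined to one tile of a, one of b, and one of c.
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + tile, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ik * row_b[j]
    return c


# Hypothetical 256 KiB L2 cache holding three float64 tiles at once.
tile = tile_size_for_cache(256 * 1024)
product = tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], tile=1)
print(tile, product)
```

The point of the sketch is that the programmer has to supply the cache size: the hardware cache is transparent, but getting the access pattern to fit it is not.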
Dean Hildebrand (Google)
I was also going to just add: telemetry is hard. It can't monitor, say anything about what's going on in the system, or be smarter unless it has all the data. From what I'm seeing, it just takes years to write telemetry for very sophisticated distributed systems so that you know exactly what's happening on all the clients, in the storage servers, and on the back-ends. And by the time you get that right, we've swapped in a new storage system underneath or something else to that effect. On the cloud side, the amount of telemetry and control plane work that has to happen in order to get this thing running 24/7 is mind-boggling. So, it could be doing it, but does it have that data, or are we willing to invest the amount of time and energy it takes to provide the telemetry that it would need to be smart enough? I don't know, time will tell. People tend to stop somewhere and say "Good enough. You know, I don't want to pay anymore".
Nisha Talagala (Pyxeda)
Just to add to that, one of the things that might present an opportunity for research is: one of the things that AI has shown itself to be quite good at is something called "transfer learning" where it can learn from one environment and apply it to another. If that can be applied to, for example, learning from one storage system and applying it to another, that would be very cool and very practically interesting as well.
Keith A. Smith (MongoDB)
Learning from one set of customers and applying to another set of customers?
Nisha Talagala (Pyxeda)
Yeah. It's worked out well in all sorts of areas of medicine, climate, and everything.
AI and Education
Preparing Students for the AI Revolution: Redesigning the Computer Science Curriculum to Foster AI Literacy and Expertise
Keith A. Smith (MongoDB)
So, I guess the other sort of broad area I was considering is education. We all have academic affiliations and backgrounds. I don't know how any of this plays into education. Nisha is obviously doing a lot of work in education here, but what do we need to add to the curriculum if you're teaching a storage course at CMU? Does anything in AI need to show up there?
Greg Ganger (Carnegie Mellon University)
I'm hearing from all of the corporate people, "I don't need to do anything. It's all good." I'll say the biggest thing that comes to mind when you bring up the topic of AI and education is that we have to tell people (this is also for the students in the room) that there is still a path and a need for people doing systems and storage. All we hear about now is AI/ML everywhere, and I can tell you that at the university we're now seeing that reflected in what the students believe they have to do in order to parlay going to the university into having a career, whether it's master's students coming in after having CS degrees or undergraduates coming in after high school. I've had students who really love doing systems, and they have to ask: is it okay to do systems, or am I making a huge career mistake? You know, you can't have plumbing that works without plumbers. That's kind of the way I describe it to people, and it applies to the infrastructure for doing large-scale AI/ML. It's going to be a while before we have generative AI that can figure out how to build that for itself. Eventually it'll move us all out of the loop, and I'll be done by then. That's a problem for my kids. But my bigger concern is that there is a lot of effort to figure out what the AI curriculum looks like, and it's not the storage people they're going to ask about how to do it. It might be Nisha, but it's not the storage community.
Keith A. Smith (MongoDB)
For folks in the industry, if you're hiring a new undergrad or a new PhD student who's got a primarily systems background, is there AI-related stuff you'd like to see on their CV rather than just the traditional set of systems courses?
Zhe Zhang (AnyScale)
I think the phrase "AI literacy" is good. That's what I would look for. Coming from a systems background myself, I just feel there's a very big gap even between the terms and language that general software engineers use and those that AI people use. For myself, it took me a while to understand what a feature store is, what a feature is, and what a forward pass and a backward pass are. I imagine it is the same in the other direction. So I think bridging this gap will be important in the curriculum.
Greg Ganger (Carnegie Mellon University)
Just to be clear. I was not arguing against that. I think that's going to happen. Every student is going to want to have at least the basic understanding in their bag.
Nisha Talagala (Pyxeda)
So if it helps any, I get questions from students doing AI PhDs about whether it's worth doing them, because they don't know if they can create an AI innovation that can keep up with OpenAI, Microsoft, or Google. It's a fair concern. You're at your university doing a project where you have access to a microscopic fraction of the resources that these companies have. How are you going to write a thesis that's going to keep up with the next 300 things they're going to deliver next year? To be honest, I believe it's better to be in the applied fields than in the core AI field, applied to areas such as systems and medicine. Because AI is a tool, and if you can wrap your brain around the tool and how powerful it is, you can do amazing things in your chosen discipline.
Glenn Lockwood (Microsoft)
That's a great point. Reading this title, I initially thought I had nothing to offer because I have never taken a computer science class in my life. I have a PhD in materials science.
My undergraduate is in ceramic engineering. There's no reason I should be on this stage at a computing conference talking about computing. But I've relied on the ability to reason and question things and figure things out, which isn't unique to computer science. The fundamental skills will serve students in any discipline, so I mean AI literacy is critical from the standpoint of figuring out where you can use it to improve your critical thinking skills. But do you need to take a class in AI in college to understand AI at any point in your future? My answer is no. You can figure it out but you have to have the foundational skills to do that first.
Dean Hildebrand (Google)
Maybe this is like an evolution of what's already been true over the last few years. Graduating knowing C and object-oriented programming is fine, but having that mixed with materials science or with biology or some other science is where a lot of interesting papers end up. Maybe this is just sort of an evolution of that. Injecting AI as part of that multidisciplinary work is what we're going to do. You don't have to be an expert but you have to be able to speak the language so that you can communicate with others.
Nisha Talagala (Pyxeda)
You need to understand the science and you need to understand enough of the AI that you can see the connection. There are so many problems that are unsolved. If we thought they were all solved, we would have a problem. But we're not close to that.
Keith A. Smith (MongoDB)
It's good to know we're not going to run out of problems.
Nisha Talagala (Pyxeda)
By the way I am hearing that lots of college students are not able to get into the AI courses. They're full. They can't get in.
Greg Ganger (Carnegie Mellon University)
Because all of the students believe that it is the one true path now that has to be taken. By the way, the students aren't the only ones questioning whether or not they can be anywhere outside of a small handful of companies and do anything advanced in AI stuff. The faculty are freaking out.
Keith A. Smith (MongoDB)
So I want to leave some time for questions. I've tried to frame some questions, but you've all sort of spoken about a lot of things. Are there other points that any of you hoped to make but that I haven't had a chance to elicit?
Glenn Lockwood (Microsoft)
I feel like my parting thought is that I don't see a lot of problems that are unique to storage. There certainly are a lot of problems and open questions that AI brings, but I continually struggle to figure out which of those make good storage research problems. I think there are data research problems that are coming, but strictly in the context of storage, maybe the key is applying these broad AI problems.
Keith A. Smith (MongoDB)
I think the FAST call for papers (CFP) always has a couple of lines around data management and data lifecycle, so we're not traditionally restricted to actually building storage systems. There's lots of other stuff that falls into the broad domain. Maybe we should be leaning more into some of that.
Nisha Talagala (Pyxeda)
Maybe you can rename FAST as FAD, file and data.
Greg Ganger (Carnegie Mellon University)
And part of it is the shades of gray that Keith is bringing up, and part of it is maybe reframing it from what storage has to do for AI to what it could do that may help AI. Maybe the answer is nothing. Because the shades of gray are a thing, maybe some functionality can shift across where the boundaries are. That would give storage a role to play in reducing some of the trade-offs that have to happen in other parts of the system. I don't know what the answer to that is, but it's not infeasible.
Dean Hildebrand (Google)
I guess my final point is that there isn't a ton of new things, but there have been a lot of challenges in consuming and processing the data over the last year. That does not come out of the box. It always goes back to that fundamental question in HPC for the last 20 years: can you teach a physicist to optimize their code for storage? The answer is probably no, but we can't sit around and wait for the data scientists and everyone else to optimize their code for our existing storage services. How do we just bring it and make it easy and simple? We've been doing this with businesses for a long time, but now we need to do it in a more efficient way and accelerate it broadly for all of these new people coming online.
Greg Ganger (Carnegie Mellon University)
I think there's something significant in what you were talking about a long time ago in this panel: develop a tool to help the people who are trying to use the storage system in some intensive way to use it with a pattern that accommodates what they're trying to do but that will also perform well. There's only a relatively small number of patterns. Some of them are just always awful, but for the others, which one is the right one depends on what you're trying to do. Teaching people who don't live in that world how to make that decision for themselves is really hard. But imagine somebody sitting next to you who notices what you're doing and tells you to use this script instead of that script that somebody else wrote for you, so you don't have to actually work it out. That's a doable thing, and that's a storage research problem to figure out: something that automatically figures out which pattern they ought to be using and then allows them to hook that into their code, so it's minimal work.
Nisha Talagala (Pyxeda)
I think there's a lot of opportunity in the data path not in the control path. There are so many problems that are hard to measure. They're a little hard to get your brain around but they are absolutely critical and getting worse by the day.
Zhe Zhang (AnyScale)
So my final comment is, in particular, that the way AI uses storage is very inefficient and wasteful, even in the simple ChatGPT-query kind of use cases. Often you're loading the entire collected works of Shakespeare into your GPU memory just to answer how to say hello in Spanish, so I think there's a lot to do just to deduplicate and do all the kind of stuff we discussed.
---End---
Source video: https://www.usenix.org/conference/fast24/presentation/panel-storage-systems