Key points
- Scale-driven model performance: increasing compute and data directly improves model performance, following a clear log-log power law. Larger models and more data deliver better performance, which is the core reason OpenAI keeps investing in scale.
- Predictability of model performance: the performance of small models accurately predicts that of large models such as GPT-4, which reduces the risk of large-scale training runs and makes the investment more defensible.
- Emergent capabilities: as model scale grows, GPT models not only generate better language but also show new abilities in reasoning, programming, and multi-task handling. These capabilities are a direct result of scale.
- Infrastructure requirements: supporting the training and deployment of large models requires corresponding infrastructure, including massive compute and optimized data handling that match the models' scaling requirements.
- Complexity of power management: large AI clusters face large swings in power demand. Accurate power telemetry and dynamic power management are needed to optimize power usage and keep the system stable.
- Hardware-software co-optimization: large-scale deployment requires optimizing hardware and software together, improving overall reliability and efficiency through intelligent fault detection and resource scheduling.
- Challenges of fault handling: as AI clusters grow, hardware failures become more frequent. Efficient fault-handling mechanisms are needed to keep the system running through local failures.
Q&A highlights
Question: What new experiences might we see as language models are scaled further? What happens if compute becomes 2 to 4 orders of magnitude more efficient?
As language models scale, we can expect task reliability and long-horizon reasoning to improve significantly, especially stability on multi-step tasks. At the same time, far more efficient compute would make it possible to train larger models and to deploy capable models outside the data center, for example running real-time speech-to-speech conversion on a phone.
Question: Can the existing GPU architecture, combined with Moore's law and new algorithms, deliver a 10x performance improvement, or do we need a new paradigm? How do we resolve bottlenecks in compute and hardware?
Achieving a large jump in computing performance requires identifying the binding bottlenecks and removing them one by one. The key is to find the most pressing constraint first and then push performance forward through incremental optimization and technological breakthroughs.
Question: How does OpenAI handle a GPU failure? Does it roll back to the previous checkpoint and restart?
OpenAI uses techniques such as hot spares and removal of failed components to handle GPU faults. The specific measures depend on cluster size and failure characteristics. Handling a fault may involve rolling back to a previous checkpoint and restarting, depending on the training setup.
Question: How do you find the best balance between slightly lower performance and more consistent uptime?
To balance performance and reliability, evaluate long-run throughput: performance in a specific software environment multiplied by expected uptime. In general, it is better to improve system efficiency through software than to accept a significant performance loss.
Question: What is the utilization of computing resources? How do thermal limits affect theoretical specifications?
Compute utilization is difficult to quantify precisely when thermal limits apply. In general, inference utilization is usually lower than training utilization, and actual utilization is often below theoretical specifications.
Ralph (host)
Welcome to the Hot Chips 2024 keynote: Predictable Scaling and Infrastructure. I am pleased to introduce the first keynote of this year's conference. Over the next hour, Trevor Cai from OpenAI will share his perspective with us. The organizing committee has long wanted to hear from the large companies that build large AI models and scale them across massive amounts of computing hardware. Today, Trevor will explain model scaling and the infrastructure needed to support it.
Trevor is a member of OpenAI's technical staff, where he leads OpenAI's chip and infrastructure work. He was one of the core contributors to training GPT-4 on ultra-large hardware systems and also contributed to GPT-4's vision work. Before joining OpenAI, Trevor worked on scaling large language models at DeepMind, and before that he studied at the University of Southern California (USC), where he earned bachelor's and master's degrees.
The theme of this talk is scaling large language models. Trevor will cover how these models have scaled and what future scaling may look like. You have seen how much these models have grown, and Trevor will explain how long this scaling can continue. He will also discuss the infrastructure needed to drive it and the challenges OpenAI currently faces on these large-scale systems. The talk should give our engineers valuable insight into what to watch for and how to address these challenges.
Now let's welcome Trevor to the stage. Please give him a warm round of applause.
Trevor Cai
A member of OpenAI's technical staff, he joined the company in March 2022 and currently focuses on chip and infrastructure development. Before that, Cai led GPT-4 training-throughput optimization and the implementation of GPT-4's vision capabilities, and managed the scaling team. Prior to OpenAI, Cai was a software engineer at DeepMind, where he worked on training infrastructure for projects such as AlphaStar and Gopher. He holds bachelor's and master's degrees in computer engineering and computer science from the University of Southern California. His research interests include reinforcement learning and large-scale model scaling, and he has published a number of academic papers.
Reference: OpenAI talk "Predictable Scaling and Infrastructure" (slides)
Hello everyone, it's great to be here today. Thank you for the invitation, and thanks to the organizers for making everything run so smoothly. As Ralph mentioned, my background is mainly as a research engineer, so speaking at Hot Chips is a bit unusual for me. Over the past year, however, I have been deeply involved in OpenAI's compute strategy, especially in building up our chip and infrastructure teams.
Predictable scaling
Today I want to discuss scaling laws in deep learning. We will first cover the basic idea of these laws and then discuss their implications for infrastructure. I'll start from a macro perspective and then get into the practical impact these developments may have on your work over the next few years.
Let's start with a quick review. GPT-4 and ChatGPT are the result of a research program that OpenAI has pursued for more than five years, backed by billions of dollars of investment. We can call it the "GPT paradigm."
Over the past year and a half, especially since the release of ChatGPT, most people have developed a more intuitive sense of what this paradigm involves and how it is realized.
To build ChatGPT, we first collected a large dataset spanning text, code, images, audio, mathematics, and other sources. Next, we pre-trained a model to predict the next word given the preceding context. After pre-training, we post-trained the model to encourage desirable behavior and suppress undesirable behavior. Through this post-training, ChatGPT learns to follow instructions, hold conversations, use tools, and refuse inappropriate requests.
-----
Stochastic Gradient Descent (SGD) is an iterative method for optimizing an objective function and is widely used to train neural networks. It estimates gradients from random subsets (mini-batches) of the training data, reducing the computation per step. SGD has many variants and refinements, such as momentum, AdaGrad, RMSProp, and Adam, which improve convergence speed and stability. SGD-style optimizers remain the standard tool for training large language models; combined with parallel computation and communication-computation overlap, they can train efficiently on GPU clusters. In addition, methods such as low-rank adaptation (LoRA) make it possible to fine-tune large models efficiently under memory constraints.
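To make the update rule concrete, here is a minimal NumPy sketch of SGD with momentum on random mini-batches; the toy least-squares problem, learning rate, and momentum values are illustrative assumptions, not anything specific to OpenAI's training stack.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update: v <- m*v - lr*g, w <- w + v."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy problem: fit w to minimize mean squared error on random data.
rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(1024, 8)), rng.normal(size=8)
y = X @ true_w

w, v = np.zeros(8), np.zeros(8)
for step in range(200):
    idx = rng.choice(len(X), size=32, replace=False)   # random mini-batch
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / len(xb)           # gradient estimate on the batch
    w, v = sgd_momentum_step(w, grad, v)

print("parameter error:", np.linalg.norm(w - true_w))
```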
Although the recipe I just described sounds concise, implementing it is very complicated. Pre-training requires scaling out to large accelerator clusters and deploying various forms of data and model parallelism. Even after pre-training, post-training still requires collecting a large amount of human feedback, running Reinforcement Learning from Human Feedback (RLHF) on that feedback, and iterating on the model continually to improve its usefulness.
-----
Parallelism in deep learning
Data parallelism: the training data is split into multiple shards, each processed independently on a different GPU. Every GPU keeps a full copy of the model, and gradients are synchronized after each training step. This approach suits large datasets; for example, when training a 175-billion-parameter model such as GPT-3, processing data in parallel across many machines speeds up training substantially.
Model parallelism: when a model's parameters are too large (typically billions to hundreds of billions of parameters) to fit on a single GPU, different parts of the model are distributed across multiple GPUs. For example, the OPT-175B model needs roughly 350 GB of GPU memory just to store its parameters, so the model must be partitioned across devices to train at all.
Pipeline parallelism: the model is divided into stages, each running on a different GPU. When one stage finishes computing, the next batch of data enters it while the following stage works on the previous batch. This improves resource utilization and reduces idle time; with models built from repeated structures (such as Transformers), each GPU can be assigned the same number of layers, enabling efficient micro-batch execution.
In practice these parallelism techniques are usually combined to exploit their respective strengths. For example, training a very large language model can use data parallelism and model parallelism at the same time: the dataset is sharded (data parallelism) while the model parameters are distributed across GPUs (model parallelism). This both speeds up training and makes it possible to handle larger models, as the sketch below illustrates.
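A rough illustration of synchronous data parallelism as described above: several "workers" each compute a gradient on their own shard of the global batch, the gradients are averaged (the role an all-reduce plays on real hardware), and every replica applies the same update. The worker count, model, and batch sizes are invented for illustration; a real system would use a framework such as PyTorch DDP across multiple GPUs.

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_WORKERS = 4
X, true_w = rng.normal(size=(512, 16)), rng.normal(size=16)
y = X @ true_w

w = np.zeros(16)                                   # every worker holds an identical replica
for step in range(100):
    batch = rng.choice(len(X), size=NUM_WORKERS * 8, replace=False)
    shards = np.split(batch, NUM_WORKERS)          # split the global batch across workers
    local_grads = []
    for shard in shards:                           # in reality: one GPU per shard
        xb, yb = X[shard], y[shard]
        local_grads.append(2 * xb.T @ (xb @ w - yb) / len(xb))
    grad = np.mean(local_grads, axis=0)            # "all-reduce": average the local gradients
    w -= 0.02 * grad                               # every replica applies the same update

print("parameter error:", np.linalg.norm(w - true_w))
```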
-----
Both pre-training and post-training could each fill a long talk on their own. But before going deeper, I want to discuss the "why." Why do we spend years and billions of dollars learning to better predict the next word? Why is word prediction so important? Why invest so much compute in it?
Back in 2017, before any of the GPT models existed, Alec Radford trained a neural network to predict the next character in product reviews. When it was done, he made a surprising discovery: one internal feature of the network tracked the sentiment of the review. Using this "sentiment neuron," the model achieved state-of-the-art performance on sentiment classification at the time.
-----
Alec Radford is a research scientist at OpenAI working in artificial intelligence, focusing on natural language processing and computer vision. He received a bachelor's degree in computer science from the University of Virginia and a doctorate from the University of California, Berkeley. He has participated in the development of multiple AI models, including GPT-2 and CLIP. Before joining OpenAI, he was the research director at Indico.
-----
A "sentiment neuron" is a specific unit found inside a language model that effectively captures and represents the sentiment of a text. The concept comes from work by Radford et al., who found it in an LSTM trained without supervision. It expresses the sentiment of a text through a single scalar, can be used to steer the sentiment of generated text directly, and proved highly effective on sentiment-analysis tasks. Later studies questioned its importance, noting that removing the neuron had limited impact on classifier performance and that it may have shortcomings. Research attention has since shifted toward more complex model structures and multi-task learning in pursuit of more robust and interpretable sentiment representations.
In this short review you can see the sentiment neuron visualized. Green indicates the review has been positive so far, red indicates negative. The review starts out fairly neutral and then turns negative. Just by predicting the next character of these reviews, the model developed a high-level representation of sentiment. That is a phenomenon worth pondering.
Since 2017 we have built a higher-level intuition about this phenomenon: to predict the next word well, you need to model the underlying process that generated the sentence. Imagine the end of a detective novel, when the detective says "the killer is...". To accurately predict the next word, you need to remember the earlier plot, hold factual knowledge, understand human psychology, and carry out deductive reasoning.
Take a mathematical proof as another example: to accurately predict the next theorem or lemma, you need to understand the logic of the argument, understand the degrees of freedom available in the proof, and judge which of them moves the proof closer to the conclusion. You also need deep mathematical knowledge to know what those degrees of freedom might be.
Going further, if the data you train on comes from the internet, the underlying process that generated that content is the world itself. So training a language model is really about understanding the world that produced those words, web pages, and blogs.
Although this intuition sounds grand, it has theoretical support. Solomonoff's theory of inductive inference shows that the best predictor of a dataset of observations is the dataset's minimal executable compression: a short program that compresses the data well. The negative log-likelihood loss used to optimize these word-prediction models optimizes exactly this kind of compression.
-----
Solomonoff's theory of inductive inference is a mathematical theory proposed by Ray Solomonoff in the 1960s to formalize prediction from observed data. It combines concepts from probability theory, information theory, and computability theory. The core idea is to assign a probability to every computable theory according to its algorithmic complexity, favoring simple theories that explain the data well. The approach incorporates Occam's razor and the principle of multiple explanations, evaluating all possible hypotheses together. Although Solomonoff proved that his induction method is uncomputable, and therefore cannot be fully realized in practice, the theory is important because it provides a complete and universal account of inductive inference. It has had a deep influence on artificial intelligence and machine learning, especially on how algorithms learn from data and predict future observations.
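The link between next-word prediction and compression mentioned above can be written down directly: the average negative log-likelihood (in bits per token) that pre-training minimizes is the code length an arithmetic coder would need to compress the text under the model's distribution. A sketch of the correspondence:

```latex
% Average loss per token (bits/word) that pre-training minimizes:
\mathcal{L} \;=\; -\frac{1}{T}\sum_{t=1}^{T} \log_2 p_\theta\!\left(x_t \mid x_{<t}\right)

% An arithmetic coder driven by the same model p_\theta spends
% -\log_2 p_\theta(x_t \mid x_{<t}) bits on token x_t, so the total
% compressed size is T \cdot \mathcal{L} bits:
% lower loss  <=>  better compression of the dataset.
```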
So predicting the next word, like other great ideas in computer science, looks simple but hides great depth.
Since realizing this, we have steadily improved how we learn to predict the next word. We moved to the Transformer architecture, collected more and better data, improved the algorithms, and greatly increased the scale of compute. New behaviors appeared with each scale-up.
- GPT-1 showed that state-of-the-art language understanding can come from predicting the next word.
- GPT-2 showed that a single model can not only generate coherent paragraphs of text but also be applied to a variety of tasks.
- GPT-3 further demonstrated that a model can learn new tasks purely from the context it is given.
- GPT-4 showed that a single model can perform a wide range of genuinely useful real-world tasks.
All of this comes from nothing more than predicting the next word.
This is one of my favorite GPT-4 demonstrations, run before release. The model was asked to answer an exam question from the Paris Institute of Technology, a famous French engineering school. The question is posed in French and includes a diagram, but the model was asked to answer in English. It completed the task successfully.
This is quite striking, because it requires the model to understand French and English and how to use the relevant context in both languages. The model also has to interpret the diagram together with the text, and to work out the chain of physical reasoning most likely to solve the problem correctly.
Scale clearly works.
Both GPT-3 and GPT-4 were high-stakes training programs for OpenAI. How did we know they would succeed? What gave us the confidence to train these models? In short: scaling laws.
You can see the chart behind me. The horizontal axis is the log of the compute used to train each model, and the vertical axis is the loss on our internal codebase, code that mostly does not appear on the internet. The curve fits these points very closely. It is a log-log power-law curve.
-----
A log-log power law curve is a way of plotting a power-law relationship between variables. With logarithmic scales on both axes, the data points fall on a straight line, so the exponent of the power law can be read directly from the slope. If the relationship between y and x is y = k·x^n, taking logarithms gives log(y) = log(k) + n·log(x), so in log-log coordinates the data form a straight line with slope n. The method is widely used in economics, biology, physics, and other fields, but approximate linearity on a log-log plot alone does not prove a power-law distribution; more careful statistical analysis is needed to validate that conclusion.
This phenomenon was first described in the 2020 paper "Scaling Laws for Neural Language Models." Later I took part in a project called Chinchilla, which corrected a small error in the original analysis. Both studies observed that final performance on next-word prediction follows a log-log power-law relationship between compute and final loss.
-----
The paper "Scaling Laws for Neural Language Models" (https://arxiv.org/abs/2001.08361) studied empirical scaling laws relating language-model performance to model size, dataset size, and training compute. It shows that cross-entropy loss follows power laws in model size, dataset size, and compute, with some trends spanning more than seven orders of magnitude. The paper notes that within a wide range, network width or depth has little effect on performance. Simple equations describe the relationship between overfitting and model/dataset size, and between training speed and model size. These relationships allow optimal allocation of a fixed compute budget. The study also found that larger models are markedly more sample-efficient, so the most compute-efficient strategy is to train very large models on a relatively moderate amount of data and stop well before convergence.
In other words, every time we double the compute used to train a model, its ability to predict the next word moves a constant step (on a log scale) closer to the irreducible entropy of the underlying generation process. This is one of the key miracles of deep learning: simply training larger models on more data drives progress in artificial intelligence.
More importantly, scaling laws are predictive. We trained a series of models spanning from roughly a billion times less compute than GPT-4 up to ten thousand times less. Then, using only these smaller models, we fit the curve. When GPT-4 finished training, its performance fell almost exactly on that curve. The extrapolation spans four orders of magnitude of compute.
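-----
A toy version of that kind of extrapolation can be written in a few lines: fit a straight line to log-loss versus log-compute on "small-model" points, then evaluate it several orders of magnitude further out. The compute and loss numbers below are synthetic stand-ins, not OpenAI data, and this simple fit ignores the irreducible-loss term, which matters close to saturation (a saturating form such as L(C) = L_inf + (C0/C)^alpha can be fitted instead with scipy.optimize.curve_fit).

```python
import numpy as np

# Synthetic "small model" runs: training compute (FLOPs) and final loss.
# These values are invented for illustration only.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20, 3e20])
loss = 2.8 * (compute / 1e18) ** -0.05                       # a clean power law...
loss += np.random.default_rng(0).normal(scale=0.01, size=loss.size)  # ...plus noise

# Fit a straight line in log-log space: log(loss) = log(k) + n * log(compute).
n, log_k = np.polyfit(np.log10(compute), np.log10(loss), deg=1)
print(f"fitted exponent n = {n:.3f}")

# Extrapolate four orders of magnitude beyond the largest fitted run.
target = 3e24
predicted = 10 ** (log_k + n * np.log10(target))
print(f"predicted loss at {target:.0e} FLOPs: {predicted:.3f}")
```
-----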
This predictable extrapolation gives us the confidence to invest heavily in this research program: to build large clusters, and to devote more and more people to managing and understanding those clusters, their hardware, and the science of the deep learning we run on them.
Having said that, a reasonable objection is: "It is not obvious how bits-per-word translates into real improvements on tasks you and I actually care about, such as programming. Do those capabilities scale as expected?"
We have studied this question in the context of competitive programming, using problems in the style of LeetCode and Codeforces. As you can see, this chart looks almost the same as the previous one. The horizontal axis is still compute on a log scale, and the vertical axis is what we call the mean log pass rate.
The mean log pass rate is computed as follows. For each programming problem we make a large number of independent attempts, and each attempt is judged independently as a pass or a fail. From these we compute a pass rate per problem: one problem might be 20%, another 80%, another only 2%. Finally, we average the logarithms of these pass rates to get the mean log pass rate; what is plotted here is its negative.
We use this metric rather than the plain average pass rate because the average is dominated by easy problems. The mean log pass rate gives problems of different difficulty comparable weight, so we are not just measuring whether the model goes from solving easy problems 50% of the time to 90% of the time. A small worked example follows below.
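-----
A minimal sketch of the metric as described: per-problem pass rates (estimated from many independent attempts) are logged and averaged, and the negative of that average is what gets plotted. The sample pass rates are invented for illustration.

```python
import math

# Estimated per-problem pass rates (fraction of attempts that passed).
# Values are illustrative only.
pass_rates = [0.20, 0.80, 0.02, 0.55]

mean_pass_rate = sum(pass_rates) / len(pass_rates)                  # dominated by easy problems
mean_log_pass_rate = sum(math.log(p) for p in pass_rates) / len(pass_rates)

print(f"mean pass rate:      {mean_pass_rate:.3f}")
print(f"-mean log pass rate: {-mean_log_pass_rate:.3f}")            # the value plotted on the y-axis
```
-----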
Using a scaling law fit only on models 1,000 to 1,000,000 times smaller than GPT-4, we were able to predict GPT-4's final performance very accurately. Again this is an extrapolation across three orders of magnitude, that is, a 1,000-fold increase in compute.
Even so, a log-log plot of mean log pass rate is not very intuitive, so here it is shown another way: this is a different visualization, on the MMLU benchmark. In 2020 a group of PhD students at UC Berkeley created MMLU, believing it would be the definitive standard for evaluating AI systems. They genuinely believed this was the last benchmark they would ever need to create.
-----
MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating the capabilities of large language models, covering 57 subjects including mathematics, philosophy, law, and medicine, with roughly 16,000 multiple-choice questions in total. It is designed to measure the knowledge models acquire, under zero-shot and few-shot settings, emphasizing world knowledge and problem-solving ability. MMLU was designed to be more challenging than earlier benchmarks, helping researchers identify a model's blind spots in specific areas. Although recent models have made significant progress on MMLU, they have not reached expert-level human accuracy across the board, showing room for further improvement.
-----
A few years later, GPT models scored around 90% on this benchmark. Log-log scaling laws hide the fact that progress which looks slow becomes very fast on a scale humans find meaningful. Now, only a couple of years on, the benchmark is close to saturated, and many of the questions the model still gets wrong turn out to be mislabeled. Finding new benchmarks to track the progress of these systems has itself become a task.
To sum up what we have covered so far: looking back, optimizing next-word prediction is meaningful; scaling brings huge returns; and those returns are predictable and extrapolable, applying not only to the underlying loss but also to the model's capabilities.
Impact on infrastructure
What does this mean? The most obvious and important implication of scaling laws is that they reward ever-larger training runs.
You can see this in the FLOP counts. This chart shows the growth in FLOPs used for frontier AI training since Alex Krizhevsky set off the "Cambrian explosion" of deep learning in 2012. According to Epoch AI, FLOPs grew about 6.7x per year before 2018, and about 4x per year since.
-----
Alex Krizhevsky's 2012 paper "ImageNet Classification with Deep Convolutional Neural Networks" set off the "Cambrian explosion" of deep learning in computer vision. The paper describes a large deep convolutional neural network that achieved a top-5 error rate of about 17% across 1,000 categories on the ImageNet 2010 test set, far ahead of other methods. The network consists of five convolutional layers and three fully connected layers and uses techniques such as ReLU activations, overlapping pooling, and dropout regularization. Krizhevsky used GPUs to accelerate training, making it feasible to train a large deep network at all. The paper demonstrated the power of deep learning for image recognition and triggered a revolution in computer vision; in the following years deep learning achieved breakthroughs across visual tasks and became the de facto standard. Krizhevsky's work marks the moment deep learning moved from theory into practice and has had a profound influence on the development of artificial intelligence.
-----
Although Moore's law helped over this period, this astonishing growth rate far exceeds Moore's law. It has been driven mainly by two things: innovations in numerical precision and chip architecture, and a huge increase in the scale and wall-clock duration of frontier training runs.
-----
Training Compute of Frontier AI Models Grows by 4-5x per Year
https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year
From 2010 to May 2024, the compute used to train frontier AI models grew 4-5x per year. The trend holds across several ways of slicing the data: notable models, frontier models, large language models, and models from leading companies (such as OpenAI, Google DeepMind, and Meta AI). The article notes that although there were signs of slowing around 2018, growth for frontier models has recently stayed at roughly 4x per year. Language models overall grew faster (about 9x per year), but since reaching the overall AI frontier in mid-2020 their growth has settled to about 5x per year. The authors suggest using 4-5x annual growth as a baseline for forecasting AI development, while accounting for possible bottlenecks or accelerating factors.
-----
We have seen the same thing internally. GPT-1 finished after a few weeks of training on a GPU cluster. GPT-3 was trained on a cluster of 10,000 V100s. And Microsoft CTO Kevin Scott has commented that the next GPTs require much larger clusters still. As long as scaling laws keep holding, we will do everything we can to keep pushing down this path.
That is not to say there are no serious challenges. Look again at the inflection in the curve around 2018. What happened? Why did growth slow from 6.7x per year? Epoch AI's best guess is that the low-hanging fruit, such as lower-precision hardware math units and operator fusion, had been picked, so continued progress now comes at higher cost. There has also been some lag on the algorithmic side.
Looking forward, we face some hard problems, such as data availability.
And when we talk about growing cluster sizes further, the capital and energy costs of a cluster become significant even for some of the companies with the strongest free cash flow in history. Building the energy supply these clusters need, interconnecting with a growing US power grid, and working within the physical limits of siting such enormous projects in a single geographic region will all become increasingly important.
So far we have only discussed training models, not serving them for inference.
To me, inference is harder to discuss and predict than training. We can plan training infrastructure well in advance, but inference capacity is largely driven by market demand for products.
Still, there are a few observations that have guided our thinking so far.
First, most of OpenAI's inference GPU time goes to serving our most capable models, the various versions of GPT-4. The GPU compute spent serving smaller models such as GPT-4o mini and GPT-3.5 Turbo is small compared with the total compute we consume. I expect that to continue.
Second, as models get smarter, the ratio of GPUs used for inference to GPUs used for training has risen sharply. API demand for GPT-2 was essentially zero. API demand for GPT-3 was modest, far smaller than the cluster we used to train it. Since GPT-4's release, round-the-clock demand has far exceeded the size of the GPT-4 training cluster. I expect this trend to continue as well.
Finally, over the past 12 months we have cut GPT-4 pricing by a factor of 12 to 24. Those price cuts have moved us a long way along the price-volume curve. The cost per unit of intelligence has fallen dramatically, and I expect it to fall further.
Overall, this leads to three brief observations about AI compute demand.
- First, the predictable, extrapolable scaling of compute justifies high-confidence investment in training infrastructure.
- Second, intelligence drives inference demand.
- Third, in this decade the technical and economic conditions needed for this scale have matured and can be exploited.
Putting this together, I and OpenAI as a whole have reached the following conclusion: the world's demand for AI infrastructure, not just OpenAI's but the whole industry's, far exceeds what the current supply chain is planning for. We formed this view in February of this year, and I think it still holds.
I am well aware that the argument I am making in this talk amounts to believing that a straight line on a chart will keep going. It is true that most lines on charts with a logarithmic y-axis eventually flatten out. But these lines often run longer than anyone expects.
Here is a historical chart of installed solar generation capacity over time. Black is actual data; the colored lines are successive annual forecasts from experts. You can see the experts repeatedly refusing to believe the straight-line trend on a log-scale chart, and being wrong again and again.
There is another rising curve you know very well: Moore's law. It may flatten eventually, but it has kept climbing for 50 years, far beyond what anyone in the 1970s thought reasonable.
Recently, on a friend's recommendation, I read some obituaries of Bob Taylor. He founded and led Xerox PARC's Computer Science Laboratory, where the modern personal computer and the GUI were invented. He keenly foresaw the spread of personal computers and the rise of mobile devices, and predicted their timing to within a few years. He looked at the straight line calmly, as if he were already living in the future.
Like Moore's law, keeping the log-log linear relationship between compute and loss on track requires enormous human and technical investment. The line may flatten eventually, but it has already held across eight orders of magnitude, which is extraordinarily rare for any trend, and it would be a shame to step off this "time machine" now.
Design for large-scale deployment
So what does this mean for you? In short, design for large-scale deployment under its practical constraints, especially for data-center chips. I will discuss two examples: reliability, availability, and serviceability (RAS), and power management.
The next generation of very large clusters will put enormous pressure on RAS. Considering optics alone, mean time between failures can shrink to minutes at sufficient scale, and that is before counting more common failure sources such as ECC errors, GPU HBM failures, voltage-regulator failures, and loose PCIe connections. It is not just hard failures that become more common; soft failures grow too, and silent data corruption occurs at a disquieting rate. We can quarantine a GPU that produces bad results, but sometimes the problem cannot even be reproduced.
On top of that, as many people have pointed out, these faults have a huge blast radius. If we keep running synchronous stochastic gradient descent on these clusters, a single GPU failure can idle tens of billions of dollars' worth of equipment.
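-----
To see why the mean time between failures collapses at scale, assume (purely for illustration) independent components with exponential lifetimes: failure rates add, so the cluster-level MTBF is roughly the per-component MTBF divided by the component count. The component counts and per-unit MTBF figures below are invented, not measurements.

```python
# Illustrative-only reliability arithmetic for a synchronous training cluster.
# Assumes independent failures at constant (exponential) rates, so the
# cluster failure rate is the sum of the per-component rates.

components = {
    # name: (count, assumed per-unit MTBF in hours) -- invented numbers
    "gpu":          (100_000, 500_000.0),
    "optical_link": (400_000, 100_000.0),
    "nic":          (100_000, 1_000_000.0),
}

cluster_failure_rate = sum(count / mtbf for count, mtbf in components.values())
cluster_mtbf_hours = 1.0 / cluster_failure_rate

print(f"cluster-level MTBF: {cluster_mtbf_hours:.2f} hours "
      f"({cluster_mtbf_hours * 60:.0f} minutes)")
```
-----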
Addressing these challenges requires close cooperation between hardware and software. We have to evolve the training paradigm, and the hardware needs to expose the interfaces required to maintain high availability.
Here are some specific things to consider when making RAS decisions for your hardware.
First, minimize the cost of handling each failure. In the past, when something failed we could simply crash every process and restart; with thousands of accelerators that was still workable. At today's scale it no longer is. It used to be acceptable for every failure to crash the job, but now each failure must be handled as cheaply as possible. We would rather catch an exception on the CPU than have the process die; a process restart is better than needing a firmware reset; a firmware reset is more acceptable than rebooting the node; and an RMA is the last resort. For example, ideally a failed write on the scale-out network should surface as a catchable exception on the host, so we can reroute and decide what to do next, roughly in the spirit of the sketch below.
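-----
A highly simplified sketch of that preference ordering, trying the cheapest recovery first and escalating only when it fails. Every handler and fault label here is a hypothetical placeholder, not a real OpenAI or vendor API; a real system would call into drivers, firmware, and out-of-band management.

```python
# Hypothetical escalation ladder: attempt the cheapest recovery first and only
# escalate when it fails. Handler names and fault kinds are placeholders.

def handle_fault(fault, handlers):
    """handlers: list of (label, fn) ordered from least to most disruptive.
    Each fn returns True if it cleared the fault."""
    for label, recover in handlers:
        if recover(fault):
            return label                 # recovered at the cheapest workable level
    return "RMA component"               # nothing worked: pull the hardware

# Illustrative handlers only; each pretends to succeed for one fault kind.
ladder = [
    ("catch exception & reroute", lambda f: f.get("kind") == "net_write_error"),
    ("restart process",           lambda f: f.get("kind") == "sw_crash"),
    ("reset firmware",            lambda f: f.get("kind") == "stuck_link"),
    ("reboot node",               lambda f: f.get("kind") == "host_hang"),
]

print(handle_fault({"kind": "stuck_link"}, ladder))   # -> "reset firmware"
print(handle_fault({"kind": "dead_hbm"}, ladder))     # -> "RMA component"
```
-----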
Another thing to consider is minimizing the blast radius of a failure. For example, we have observed in real clusters that link flapping on one port sometimes affects adjacent ports and makes them flap too. Ideally that should not happen: the failure of one component should not cause the failure of others.
As another example, an uncorrectable memory error should ideally affect only the GPU where it occurred. In practice this is very hard to achieve with coherent memory fabrics, but it is a very important goal.
While thinking about blast radius and time-to-repair, we also need graceful degradation. Not all faults deserve the same support priority: an accelerator that is completely dead and unresponsive is clearly more worth a technician's time than one that has lost a secondary memory module.
When hardware problems occur, we want the system to degrade gracefully as the situation allows, until a technician has time to deal with it or is on site working on the host.
Finally, tying all of this together, we want automated, fast, and thorough in-field validation wherever possible. I understand these requirements can conflict, but take silent data corruption (SDC) as an example: once we suspect an accelerator of producing SDC, it is very important to be able to run a deep correctness check on the underlying hardware.
Next, power. When you deploy this many accelerators, power becomes a binding constraint, and we have to make the best possible use of what we have. Synchronous training on a large cluster behaves like an orchestra swelling from quiet to loud and back again: GPU power draw rises and falls in unison, and those swings are hard on the grid and on the facility equipment feeding the data center.
We have literally heard transformers hum because of this. We need low-latency power telemetry at both the cluster level and the individual-accelerator level, and we need out-of-band power management so that we can be good citizens on the grid we depend on.
Furthermore, to make full use of limited power, we would ideally set different service-level objectives (SLOs) for accelerators with different needs. That again requires out-of-band control at the accelerator, firmware, and driver levels, so we can squeeze the most out of the power we have, as in the toy control loop sketched below.
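-----
A toy sketch of the kind of control loop such telemetry enables: read per-accelerator power, compare the cluster total against a facility budget, and lower caps on lower-priority ("lower SLO") accelerators first. Every number and field name here is an invented assumption; a real implementation would act through out-of-band/BMC paths and vendor power-management interfaces rather than an in-memory list.

```python
# Hypothetical power-shaping loop; values stand in for telemetry readings.
FACILITY_BUDGET_W = 1_000_000            # assumed facility power budget (W)

accelerators = [
    {"id": i, "power_w": 950.0, "cap_w": 1000.0,
     "priority": 2 if i % 4 == 0 else 1}      # higher priority = protect longer
    for i in range(1200)
]

def shape_power(accels, budget_w, floor_w=600.0):
    """Lower power caps on low-priority accelerators until the estimated
    cluster draw fits the budget. Returns the new estimated total draw."""
    total = sum(a["power_w"] for a in accels)
    excess = total - budget_w
    for a in sorted(accels, key=lambda a: a["priority"]):
        if excess <= 0:
            break
        shed = min(excess, a["power_w"] - floor_w)   # how much this unit can give up
        a["cap_w"] = a["power_w"] - shed
        a["power_w"] -= shed
        excess -= shed
    return sum(a["power_w"] for a in accels)

print("estimated draw after shaping:", shape_power(accelerators, FACILITY_BUDGET_W), "W")
```
-----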
I could go on about the implications for security, network scaling, and more, but rather than drown you in detail, I hope you will remember four points:
- First, predictable scaling is the core driver behind companies like OpenAI investing heavily in AI training compute.
- Second, delivering AI to the world will require building infrastructure at enormous scale.
- Third, because of those two points, we must co-design hardware that is suited to large-scale deployment.
- Finally, performance is only one requirement. Power, security, networking, and reliability are equally central to our collaboration.
Thank you for your time and attention. There is a lot of work ahead of us and some hard problems to solve, but I look forward to tackling them together with you. Thank you.
-----
Host
The story of language-model scaling is fascinating, and each step up in scale has brought new experiences. Can you speculate about what new experiences we might see as the models scale further?
Trevor Cai
That is a hard question, because we asked ourselves similar questions before GPT-4 was released and found that our predictions were not very accurate. So I am a little wary of predicting what GPT-5 will bring. That said, a couple of things seem important:
First, I expect task reliability to improve substantially. We saw this going from GPT-3 to GPT-4: GPT-3 sometimes performed well but was not consistent enough.
Second, I expect models to get significantly better at long-horizon reasoning. Today many models gradually fall apart on multi-step tasks, and I hope that improves.
Host
Energy efficiency is still relatively low today; in terms of compute per unit of energy we are far from the human brain. What new experiences do you think two to four orders of magnitude more efficient compute might bring?
Trevor Cai
The most direct and obvious answer is that I would use all of that efficiency gain to train larger models. If energy efficiency improves significantly, I also see great potential for deployment outside the data center. Memory is sometimes the limit there, but energy efficiency is a major factor too. I am looking forward to real-time speech-to-speech conversion running on a phone, the way we have demonstrated it from the data center with GPT-4. I think that will be very exciting.
Audience question
We spend enormous amounts of money training these models. Is there any real data on the return on investment (ROI) of these models and on enterprise use cases?
Trevor Cai
Training is one side of it, but what reassures me is seeing so many people using models like ChatGPT and Claude. Real consumers, who have real budget constraints, are willing to pay for OpenAI's services out of their own pockets. I think that says a lot about the value these models provide.
Mario Shalabi
Is the log-log scaling you described just a property of this particular kind of network? Does the chart take that shape because the network is built on attention layers?
Trevor Cai
In the 2020 paper by Kaplan et al. that first described scaling laws for neural language models, they also ran the same analysis on LSTMs, which have no attention layers. You can see that compute versus final loss for LSTMs follows a similar log-log scaling relationship; the difference is that the Transformer curve sits lower. More broadly, a lot of work in machine learning has explored scaling laws beyond language models, including code modeling, multimodal tasks, and even games trained with reinforcement learning. You see similar log-log scaling laws in those domains as well.
These are all empirical observations. I am not sure there is a particularly strong theoretical foundation behind them yet, but we keep seeing the same pattern across domains, which I suspect points at some deep truth we do not yet fully understand.
Tan Bennett (SemiAnalysis)
My question follows on from that. You just mentioned log-log scaling laws for mixture-of-experts and other models. Do they all trend in parallel, or is there any sign they converge somewhere? Do they improve at roughly the same rate as they scale? And if we switch to a more promising model family, does that mean we have moved onto a fundamentally better trajectory?
Trevor Cai
Most public scaling studies show curves that improve in parallel. But I will point out that in 2021 I worked on a paper on scaling laws for mixture-of-experts models, and there the scaling curves had different slopes. So you can see both parallel lines and curves with different slopes.
-----
The paper "Scaling Language Models: Methods, Analysis & Insights from Training Gopher" (https://arxiv.org/abs/2112.11446), co-authored by Trevor Cai, examines the performance of Transformer-based language models across scales, from tens of millions of parameters up to the 280-billion-parameter Gopher model. The study evaluated 152 diverse tasks and found that larger models deliver significant gains in reading comprehension, fact checking, and identifying toxic language, while the gains in logical and mathematical reasoning are comparatively small. The paper also analyzes the relationship between training data and model behavior, discusses how model scale interacts with bias and toxicity, and considers the application of language models to AI safety and to reducing downstream harms. It offers important insight into the strengths and limitations of large-scale language models.
-----
Padja (Apple)
Given all this scaling, safety seems like a somewhat contradictory word here. I would like to hear about your efforts on safety and data accuracy, especially in critical areas such as healthcare. What are your thoughts?
Trevor Cai (OpenAI)
First, I should say honestly that I am not a security expert. We do, however, have a dedicated security team. Much of that work focuses on protecting model weights, protecting user data, and making sure one user's input cannot affect another user's output; that is, making sure data streams do not cross.
We have also published blog posts about future directions. For example, confidential computing is a direction we are interested in. It is not deployed in production yet, but we think it is an exciting direction.
Padja (Apple)
But in critical areas, such as the accuracy of health data: when people use your platform they need accurate answers, not answers that are only 50% correct. How do you ensure accuracy in these accuracy-critical domains?
Trevor Cai (OpenAI)
Do you mean how to verify the accuracy of the data?
Padja (Apple)
Yes, how to verify the accuracy of the data.
Trevor Cai (OpenAI)
I don't have a good answer to that. I suggest asking our policy team; I think they would tell you that, going forward, human oversight will be key to ensuring accuracy.
Padja (Apple)
OK, thank you.
Amit (AMD)
On power consumption: can you comment on how power is split between compute, data movement in memory, and the network? And what do you think of the arithmetic intensity of these workloads?
Trevor Cai (OpenAI)
In model training, arithmetic intensity is usually not too bad. We try hard to stay compute-bound.
As for the split of energy between data movement and compute, what we observe is similar to published microbenchmark results on open-source models.
Amit (AMD)
What about inference workloads?
Trevor Cai (OpenAI)
Inference workloads have lower arithmetic intensity, and more of the energy goes into moving weights, especially in latency-sensitive applications.
Isri (WRC)
When a GPU fails, do you swap in a spare GPU, or does the whole group of machines stop working? And how do you implement this at the tooling and hardware level?
Trevor Cai (OpenAI)
The second question is the easier one: we use a lot of C++, Rust, and Python control logic.
On the first question, we have looked at techniques such as hot spares and removing failed components. Which one we use depends on the cluster size, mean time between failures, mean time to repair, and the model being trained.
Isri (WRC)
Can you summarize current practice? When a GPU fails, do you roll back to the previous checkpoint and restart?
Trevor Cai (OpenAI)
I'd rather not go into the details of how we operate our frontier training runs.
Audience question
Do you think the existing GPU architecture, combined with Moore's law and new algorithms, can deliver a 10x performance improvement, or do we need a new paradigm?
Trevor Cai (OpenAI)
I'll leave that one to the experts. I mainly train models, and there are many processor-architecture experts in this room who can answer it better than I can.
BJ (Harvard University)
You talked about two basic pillars of machine learning, models and infrastructure. The third pillar is data, which you have said very little about. In the context of MLPerf benchmarking and related questions, can you talk about data scaling laws and the role of data in system optimization?
Trevor Cai (OpenAI)
Of course. I mentioned the 2020 scaling-law analysis and the small issue we found in the Chinchilla work in 2021-2022. Specifically, we try to decompose the scaling law relating compute to final loss into the number of parameters in the model and the amount of data used for training, assuming you make a single pass over an effectively unlimited stream of data.
We found that scaling data and parameters roughly in equal proportion is better than putting most of the scaling into parameters. So in today's Transformer training recipes, the amount of data matters a great deal.
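-----
For reference, a common way to write the decomposition described here is the parameterization used in the Chinchilla analysis; the fitted constants are omitted since they depend on the exact setup.

```latex
% Loss as a function of parameter count N and training tokens D:
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}

% E is the irreducible loss; A, B, \alpha, \beta are fitted constants.
% Minimizing L under a compute budget C \approx 6\,N\,D gives a
% compute-optimal allocation close to
% N^{*} \propto C^{\,a}, \quad D^{*} \propto C^{\,b}, \quad a \approx b \approx 0.5,
% i.e. parameters and data should be scaled roughly in equal proportion.
```
-----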
Jeff Smith
You may not be able to discuss the specific CPU or GPU architecture changes you envision for new models, but can you talk about what relative improvements your growth model needs to keep its momentum? What do you expect from compute per watt, supply-chain growth, or model design over the next five years? Given that we cannot instantly replicate TSMC and Samsung, or pour 10% of global GDP into the semiconductor industry, where will future improvements come from?
Trevor Cai (OpenAI)
In many ways we approach it the other way around. We look at all the constraints we face, find the most pressing one, and say: if we cannot break through this primary constraint, this is the limit of what we can currently achieve.
Then our way of making progress is to break through those constraints one at a time. As I mentioned briefly, this is very similar to how fabs keep Moore's law going: sustaining the compute growth curve takes more and more effort. Whether the primary bottleneck is logic fabs, memory fabs, power, or available capital, we pay attention to whichever binds first.
Jeff Smith
Can you tell us who you are turning to for help with the most pressing problems, and what you expect the primary bottleneck to be?
Trevor Cai (OpenAI)
OpenAI is already in close contact with all of the industries I mentioned. Some constraints will clearly bind soon, but once the first bottleneck is cleared it is not obvious where the next real bottleneck will appear.
Jen Le
I'm asking from the perspective of capital investment and allocation. Earlier this year we heard a range of estimates of total investment, from trillions of dollars down to 300 billion, and then to 100 billion over the next 60 years. My question has two parts:
1. I assume you will invest more in inference. How do you plan to split investment between training and inference?
2. On inference, would you consider developing your own inference processors, or do you plan to keep running inference on the same GPUs and accelerators you train with?
Trevor Cai (OpenAI)
First, I don't have a background in economics, so I can't answer this in detail. As I said in the talk, inference investment will be demand-driven: we have to watch market demand to decide how much inference capacity to deploy.
A few things drive inference demand: a smarter model attracts many more GPU-hours of usage, and as we scale, the ratio of inference spending to training spending keeps growing. We watch how the market responds to new products and adjust accordingly.
David Weaver (Aina Inc.)
IBM talked today about host-system uptime. If you had to choose between slightly lower performance and more consistent uptime, how much performance would you be willing to give up for better reliability?
Trevor Cai (OpenAI)
The right way to think about it is long-run throughput. You can approximate the trade-off as performance in a given software environment multiplied by expected uptime, under assumptions about failure and recovery times. The goal is to find the best balance between performance and reliability.
In general, most of what I have been discussing does not require a big performance sacrifice. Rather than accept, say, a 5x drop in performance, it is better to recover efficiency through software.
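-----
A back-of-the-envelope version of that calculation: effective long-run throughput is raw throughput times expected availability, where availability follows from an assumed MTBF, restart cost, and work lost since the last checkpoint. All of the numbers below are illustrative assumptions, not OpenAI figures.

```python
def effective_throughput(raw_tflops, mtbf_hours, restart_hours, checkpoint_interval_hours):
    """Long-run useful throughput under periodic failures.
    Each failure costs the restart time plus, on average, half a
    checkpoint interval of recomputed work."""
    lost_per_failure = restart_hours + checkpoint_interval_hours / 2
    availability = mtbf_hours / (mtbf_hours + lost_per_failure)
    return raw_tflops * availability

# Two hypothetical design points: faster-but-flakier vs. slower-but-steadier.
fast_flaky = effective_throughput(
    raw_tflops=1000, mtbf_hours=4, restart_hours=0.5, checkpoint_interval_hours=0.5)
slow_reliable = effective_throughput(
    raw_tflops=950, mtbf_hours=40, restart_hours=0.5, checkpoint_interval_hours=0.5)

print(f"fast but flaky:      {fast_flaky:.0f} TFLOP/s sustained")
print(f"slower but reliable: {slow_reliable:.0f} TFLOP/s sustained")
```
-----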
Amit (AMD)
A question about silent data corruption (SDC). Does OpenAI have metrics for how often SDC leads to unreliable results? And what have you done at the hardware level to mitigate the root causes?
Trevor Cai (OpenAI)
I don't have a metric at hand. SDC is by nature hidden, which makes it very hard to detect. To deal with it, we run a suite of diagnostics that exercise GPUs and look for computational anomalies, and when we find one we use a variety of techniques to pin down the problem.
T (Frontier Research)
You mentioned that scaling makes performance predictable. Does that make the timeline for AGI (artificial general intelligence) predictable?
Trevor Cai (OpenAI)
No, I don't think so. Part of the problem is that the definition of AGI keeps shifting. As the saying goes, artificial intelligence is whatever computers can't do yet. Progress will certainly continue, but predicting an AGI timeline is a different matter.
David
You mentioned that post-training is used to prevent undesirable model behavior. How do you decide which behaviors are undesirable? Are there potential biases in that decision-making? What does the process look like?
Trevor Cai (OpenAI)
Our policy team could give a more detailed answer. Broadly, there are legal and ethical norms that define which behaviors are unacceptable. We are committed to ensuring our models do not engage in illegal acts or acts that are widely considered harmful. The decisions are guided by law and ethics, with the policy team's input as a reference.
Lisa
What compute utilization do you see today in a training or inference cluster? And given today's thermal limits, how does that compare with theoretical specifications?
Trevor Cai (OpenAI)
Compute utilization is hard to quantify precisely, especially relative to theoretical specifications when thermals are the limit. In general, inference utilization tends to be lower than training utilization.
Host
Thank you, Trevor, for sharing these insights and answering all the questions. Thank you.
-----
Reference: Cai, T. (2024, August 26). HC2024-K1: Predictable Scaling and Infrastructure [Video]. YouTube. https://www.youtube.com/watch?v=Gma9cWvkbWo
This article is reprinted from: Andy730