Li Yu's Practical Experience and Future Predictions for Large Language Models
攻城狮M  2024-09-04 11:14   published in China

Core view

1. Machine learning and deep learning

  • Machine learning is like old-school traditional Chinese medicine, while deep learning is a bit like the "alchemy" of fantasy novels: you put materials (i.e., data) into the furnace, refine them according to a recipe (i.e., the algorithm), and finally produce an elixir (i.e., the model). In essence, a language model is still a machine learning model, just with a new architecture and a much larger scale.

2. Hardware limits

  • Bandwidth is the most critical link and the most difficult one to improve.

  • Memory matters even more than computing power. Memory capacity on a single chip will likely stagnate at about 200GB over the next few years, which means model size will be limited by memory to some extent.

  • Buying and operating your own GPU servers is not much cheaper than renting GPUs, mainly because most of the profit is captured by NVIDIA.

3. Model scale and future trend

  • Limited by memory and data size, models from 100B to 500B parameters may become the mainstream.

  • The "killer application" of smart phones is short video. However, the "killer application" of large models is still difficult to predict, because new application forms will emerge only with the gradual change of users' habits.

  • As for music models: in the future many people will express their thoughts and emotions through music, which may have a great impact. But this is not only a technical problem; it is also a commercial one.

4. Model training and optimization

  • Moore's law still applies: training cost decreases exponentially over linear time. As a consequence, the value of a trained model may be halved after one year.

  • Pre-training and post-training are equally important. Two years ago pre-training was a technical problem; now it has evolved into an engineering problem, and post-training is the current technical problem.

  • Evaluation is 50% of any practical problem involving a model.

5. Data and algorithms

  • As long as sufficient data can be collected, automation can be realized. Conversely, if you expect a model to complete a task, the first thing to consider is how to collect enough data.

  • The algorithm determines the lower limit of the model, and the data determines the upper limit of the model.

  • The general capabilities of vertical models are critical; extremely narrow models hardly exist.



  • Title: Practical Experience and Future Predictions for Large Language Models

  • Speaker: Li Yu, co-founder of BosonAI

  • Time: August 23, 2024

  • Location: Shanghai Jiaotong University

  • Video link: https://www.bilibili.com/video/BV1dHWkewEWz? spm_id_from = 333.999.0.0 & vd_source =...

  • (The content has been edited for readability)


In the first part, I will cover some technical content: the current state of language models and my predictions for the future. Language models have three major elements: first, computing power; second, data; third, algorithms.

Language models, and machine learning in general, essentially compress data into a model. Through compute and algorithms, data is fed into the model so that it acquires certain capabilities: it can find features in new data similar to those in the original data, adjust for them, and output the results you need.

If you don't know anything about this, let me make an analogy. Many years ago, when deep learning first appeared, I once said that machine learning was like old-school traditional Chinese medicine, while deep learning was a bit like the "alchemy" of fantasy novels. If any of you have read such novels, you may be familiar with this system. Today's language models are very similar to the process of alchemy: you put materials (i.e., data) into the furnace and refine them according to a recipe (i.e., the algorithm) to finally produce an elixir (i.e., the model).

Data is the material you are looking for. In the novels, the protagonist spends most of his time searching for these materials, whether hunting in the mountains or bidding at auctions. Producing data is likewise genuinely difficult: it is manual labor, but it is essential. And you may need more data than you think, because you don't know in advance which parts will be used and which will be discarded.

Then there is computing power. You need a furnace to refine the data. The importance of computing power is that the hotter the fire (that is, the more advanced the equipment), the better the refined result.

The algorithm is your recipe. Unlike in the novels, the recipe (i.e., the algorithm) improves every year, and controlling the details is very important. Even if others tell you how to operate, you will still find real scenarios very different. This is a bit like launching a rocket: you need to tune it by hand, and if it is not tuned properly, the rocket explodes.

Finally, a significant difference between language models and the previous wave of deep learning is that last time we refined a specific medicine to cure a specific disease, whereas now we hope the refined elixir has a "soul" and can solve many problems. This represents real technical progress.


Hardware.

Next, let's look at what changes will take place over the next few years in the three most important aspects: computing power, data, and algorithms. These changes follow regular patterns; they are not sudden jumps.

Bandwidth.

The first thing I want to emphasize is bandwidth, which is very important. People may pay more attention to raw GPU compute, but bandwidth is actually the most critical part and the most difficult to improve. A single machine can rarely complete current model training on its own, so distributed training is required, and the bottleneck is usually bandwidth. At present each optical link carries 400Gbps, and bandwidth doubles with each generation: the previous generation was 200Gbps, the current one is 400Gbps, and the next will reach 800Gbps.
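To make the bandwidth bottleneck concrete, here is a back-of-the-envelope sketch (my own illustration, not from the talk) of how long one gradient synchronization takes at a given link speed; the ring all-reduce traffic formula and the 70B-model, 8-GPU numbers are illustrative assumptions.

```python
# Rough estimate of gradient all-reduce time over a network link.
# Illustrative sketch; model size and link speed are assumed values.

def allreduce_seconds(param_count, bytes_per_param, link_gbps, num_gpus):
    """Ring all-reduce moves roughly 2*(n-1)/n of the gradient bytes per GPU."""
    grad_bytes = param_count * bytes_per_param
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    link_bytes_per_s = link_gbps * 1e9 / 8   # Gbps -> bytes per second
    return traffic_bytes / link_bytes_per_s

# A 70B-parameter model with 2-byte gradients, 8 GPUs on a 400 Gbps link:
t = allreduce_seconds(70e9, 2, 400, 8)   # about 4.9 seconds per sync
```

Doubling the link speed halves this time, which is why each generation of interconnect matters so much for distributed training.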

This is the GB200 system NVIDIA announced about six months ago (its release has since been postponed). Its advantage is that the design overcomes the volume and heat-dissipation problems of traditional GPU servers. If you have seen a GPU server before, you know it occupies a lot of space: although it may only be an 8-card machine, on a data-center rack it takes a full cabinet. A cabinet used to hold many blade servers or other reasonably thin servers, but once you switch to GPUs, only two machines fit in one rack, mainly because of power supply and heat dissipation. NVIDIA says that by optimizing the design and using water cooling, 72 GPU cards can be compressed into one rack.

Previously, water cooling was seldom used, mainly because it brings many problems. If a valve is poorly made, a leak can take down the whole rack, and water cooling places higher demands on infrastructure, since inlet and outlet piping must be planned. Its advantage is that water carries away far more heat: compared with air, water has higher density and stronger heat-removal capability. With water cooling, compute density can rise, meaning a data center can host more machines. It is currently very hard to procure large air-conditioning equipment for machine rooms, and much of the time you cannot install more machines simply because the air conditioning cannot remove enough heat. Switching to water cooling improves heat removal and allows a more compact overall design. Another benefit is that the cooling hardware becomes much thinner, so the machines become flatter.

Once the machines become flatter, the distance between chips shrinks, which helps inter-chip communication. Optical-fiber communication is extremely fast, but in tightly synchronized computation the latency induced by fiber length still affects performance. For example, when designing a data center you must calculate fiber lengths precisely, because small differences in fiber length change the light propagation time and thus affect distributed-training performance. Packing GPUs together and shortening the distance between chips as much as possible is therefore an effective way to improve communication efficiency.
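The fiber-length point can be checked with simple arithmetic (my own numbers, not the speaker's): light in fiber travels at roughly two thirds of c, so every extra meter of fiber adds about 5 nanoseconds of one-way delay.

```python
# Light propagation delay in optical fiber (refractive index ~1.5).
C_FIBER_M_PER_S = 2e8  # approximate speed of light in fiber, meters/second

def fiber_latency_ns(length_m):
    """One-way delay in nanoseconds for a fiber of the given length."""
    return length_m / C_FIBER_M_PER_S * 1e9

delta = fiber_latency_ns(10)   # a 10 m length mismatch adds ~50 ns
```

At the nanosecond timescales of synchronized collective operations, a few tens of nanoseconds of skew between links is already visible, which is why fiber lengths are matched so carefully.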

This is similar to the development trend of multi-core processing. In the past, when a single-core processor reached a bottleneck, we turned to multi-core and encapsulated multiple cores in the same chip. Now, multi-core is no longer enough. We begin to adopt multi-card architecture. In the past, GPUs in multi-card architectures were usually distributed in one room, but now the trend is to put these GPUs together as much as possible. Similarly, the expansion of chip area faces the challenges of manufacturing and heat dissipation, so it has become an industry trend to put relevant components together as much as possible.

Of course, there is also the communication speed between GPU and CPU; it, too, doubles, but only every few years, which is indeed slower.

Memory.

After bandwidth comes memory. Memory matters even more than computing power, because the core of today's language models is compressing a huge amount of data into the model. The model itself is usually very large, up to hundreds of GB, and a large number of intermediate variables must be held at runtime, which requires a great deal of memory. At present we can package 192GB of memory onto a chip, and the next generation of memory will have higher bandwidth, but memory has become a bottleneck: it takes up chip area, a chip has limited space, and once part of it is devoted to memory there is no room for other components. So it is very likely that per-chip memory capacity will stagnate at about 200GB for the next few years, unless there is a major technological breakthrough. If that happens, model sizes will be limited to some extent by memory, because efficiency drops significantly when a model is enlarged beyond what fits. Memory size, not computing power, thus determines the upper limit of the model: if memory is insufficient, the model cannot be scaled up.
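A quick sketch of why roughly 200GB per chip caps model size (my own arithmetic; the 20% overhead figure for cache and activations is an assumption, not from the talk):

```python
import math

CHIP_MEMORY_GB = 192  # current per-chip capacity mentioned above

def inference_memory_gb(param_count, bytes_per_param, overhead=1.2):
    """Model weights plus an assumed ~20% for KV cache and activations."""
    return param_count * bytes_per_param * overhead / 1e9

def chips_needed(param_count, bytes_per_param):
    """How many 192GB chips are required just to hold the model at serving time."""
    return math.ceil(inference_memory_gb(param_count, bytes_per_param) / CHIP_MEMORY_GB)

# A 500B model at 2 bytes per parameter needs on the order of 1200 GB:
n = chips_needed(500e9, 2)
```

Every additional chip the model is sharded across adds inter-chip communication, which is exactly the bandwidth cost discussed earlier, so memory and bandwidth limits compound each other.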

Although NVIDIA leads this field overall, in memory technology NVIDIA has actually lagged behind AMD, and in earlier years even behind Google's TPU.


Computing power.

Once you have solved the bandwidth and memory problems, you can focus on computing power. Following Moore's law, process nodes keep shrinking and frequencies keep rising. One advantage of machine learning is that it can use low-precision arithmetic: 8-bit floating point is now very mature, and 4-bit floating point is emerging. The advantage of 4-bit floating-point numbers is that they occupy less hardware and use less bandwidth, because each value takes fewer bits. In recent years, several generations of NVIDIA chips have derived significant hardware advantages from reducing floating-point precision.

Resource.

When you want to expand to a larger scale, you will find that resources become the bottleneck. At a certain scale, power supply becomes a problem. When designing our data center, we seriously considered building our own power plant and found that its cost might be lower than paying utility bills; we genuinely spent several months studying how to build one. Imagine: if a chip consumes 1,000 watts, then 1,000 chips need 1 megawatt of electricity, while an entire campus may not consume 1 megawatt.
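The power arithmetic in the paragraph above, spelled out:

```python
# 1,000 chips at 1,000 W each draw 1 MW, as stated in the talk.
chip_watts = 1_000
num_chips = 1_000
total_megawatts = chip_watts * num_chips / 1e6
```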

Theoretically, in a fair market, every time computing power doubles, the price should remain unchanged; that is the benefit of market competition. This was the case for many years, but recently, due to NVIDIA's monopoly, prices have not dropped. In the short term, when compute doubles, the price may not stay flat but instead rise by 1.4x. In the long run, however, with intensifying competition and the effect of Moore's law, compute cost will eventually fall, so computing power will become cheaper and cheaper.

Other options.

Of course, besides NVIDIA chips there are other options, especially for inference. For training, however, the threshold for other chips is still high and may need several more years of development, so NVIDIA remains a monopoly in the training-chip market.


In terms of hardware, Moore's law still applies: the cost of training decreases exponentially over linear time, or equivalently, at constant cost training becomes faster and models larger.

Therefore, the conclusion is that after you train a model today, its value may be halved one year later. Rather than focusing only on the size of the model you can train now, you should consider its long-term value. Large models do not necessarily have good cost-performance, so evaluate the actual value a model brings over the long term to ensure it holds its worth.
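The "value halves each year" claim is ordinary exponential decay; a minimal sketch (the dollar figure is a made-up example, and the one-year half-life is the talk's rule of thumb):

```python
def model_value(initial_cost, years, half_life_years=1.0):
    """Replacement cost of a trained model if training cost halves every year."""
    return initial_cost * 0.5 ** (years / half_life_years)

# A model that cost $10M to train is worth ~$2.5M two years later:
v = model_value(10_000_000, 2)
```

This is why the payback period matters: a model must earn back its training cost well within its half-life to be worth training at all.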


The model.

Language model.

Next, the models themselves, taking language models as the example. Currently the pre-training data for language models is usually between 10T and 50T tokens. A Chinese character is about two tokens; an English word is about 1.5 tokens. The data volume of open-source models is basically above 10T tokens, which is relatively sufficient. Although the amount of text in human history far exceeds this, from the perspective of data diversity and quality, 10T to 50T is the more suitable scale: even if more data were available, after cleaning and processing it would be hard to significantly improve the model. The current mainstream model size ranges from 100B to 500B parameters. Smaller models can also be effective, but first-line models usually fall in this range. Models above 500B can be trained, but serving them is very difficult. Google has never launched a model above 500B, and OpenAI has not launched a model with an effective size above 500B (excluding MoE models, when converted to dense equivalents). In the future, models from 100B to 500B may become the mainstream due to memory and data-size limits. Closed-source models can be larger, but they usually use MoE, and their effective activated size is probably around 500B.
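The token arithmetic in this paragraph can be written down directly (the per-unit ratios are the rule-of-thumb figures stated above):

```python
def estimate_tokens(chinese_chars=0, english_words=0):
    """~2 tokens per Chinese character, ~1.5 tokens per English word."""
    return 2.0 * chinese_chars + 1.5 * english_words

# How much raw text is 10T tokens of pure English? Roughly 6.7 trillion words:
words_for_10T = 10e12 / 1.5
```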

The voice model.

The voice model has two main advantages. First, when we speak, the voice carries a great deal of information, such as emotion and intonation; even the speaker's dialect and personal traits can be judged from voice. Voice signals also include background music, ambient sound, and even rhythm when singing. Traditional processing methods can no longer capture all this information. A speech model can therefore leverage the power of the text model to mine this information while retaining most of the speech signal. During output, the model can adjust intonation and emotion to suit the input, making the output more natural and better matched to human needs.

Secondly, the real-time responsiveness of voice models has improved significantly. The delay from speaking to model-generated speech used to reach one second, but can now be shortened to about 300 milliseconds. This low latency matters because human conversations involve frequent interruptions and quick responses; reducing latency makes the conversational experience smoother and more realistic. In addition, users can customize the style and character of the voice through text, which is more flexible and practical than traditional recording.

Music model.

Another area where commercialization has gone well, and which has developed well in China, is music generation; many music tools have been released recently. I think progress here has never been a technical problem. Although music is technically harder than speech, because music is more complex than ordinary speaking, the real issue is copyright.

Now people are gradually solving the copyright problem. Large companies buy copyrights, while small companies take advantage of their flexibility ("the barefoot are not afraid of those wearing shoes") and enter the market directly. So there are already many good music tools on the market. Take viral short-video songs as an example: although it is hard to produce a genuine hit, if you are not a music professional, the result sounds perfectly fine.

I once saw a colleague write a song over lunch. The lyrics went roughly like this: I have only one friend at the company; he went out to eat an hour and a half ago and hasn't come back; I wonder if something happened to him; should I call his girlfriend? Being too socially anxious to make the call, he wrote a song to express the feeling.

Any feeling of yours can be expressed through music. Music is a form of expression and a form of personal interaction. Expressing feelings through music used to be difficult; writing a poem, by contrast, may be easier.

Now that the tools of expression are within reach, I think many people will express their thoughts and feelings through music in the future. We used emoticons and punctuation before; now we can use music. Lowering the threshold is likely to unlock this, and music may be used far more in social media and person-to-person interaction. I think this may have a very big impact. But it is not only a technical problem; it is also a commercial one.


The image model.

You have probably seen some very vivid generated images in recent days, such as the TED-talk-style pictures people have made. Image generation is the earliest and most effective technology in the AIGC field, and you can now generate images of more than a million pixels. It is often said that a picture should have a soul. Earlier text-to-image tools produced results that, even with a clear style, looked fake. Now the generated pictures are very close to real ones and almost hard to distinguish. They still lack some soul, and are not yet mature in emotional nuance and subtle expression, but a breakthrough in this area is likely soon.

The video model.

Sora has attracted wide attention since its launch, but it is still at a relatively early stage. Mainstream video generation remains very expensive because processing video data is extremely complex. Over the past year a huge amount of data has been cleaned; the training cost of a video model may actually be lower than the cost of data preprocessing. There is therefore no particularly excellent open-source model on the market, and the field is still early. The main problem is that generating a single image is relatively easy, but generating a coherent video is very hard. For example, in a generated video of a person, details such as hair and eyes drift from frame to frame, and viewers are very sensitive to that.

Multimodal model.

Across these modalities, the current trend is multimodal technology. Text is combined with the other modalities because it has the highest information density and is the easiest to obtain. Capabilities learned by text models can be extended to the other modalities, including images, video, and audio. This has two advantages: first, it borrows the text model's generalization ability; second, generation of voice, images, and video can be controlled and customized through text. In the past, professional tools were needed for these tasks.

For example, writing code is a professional skill: controlling certain functionality used to require writing the code yourself. Now you can ask ChatGPT to write it; you just tell it the requirements. In the future, content in the other modalities will likewise be controllable through text, and interacting via natural language may become the norm.


To sum up, I would score current language models between 80 and 85. Audio models are acceptable and usable, around 70 to 80 points, and progress there is relatively fast. Video is weaker overall, around 50 points. In some specific settings, such as generating a presenter from a single photo, where the upper body barely moves, results are workable, but general video generation capability is still limited.

My inference here is that human-computer interaction may change. Before ChatGPT, interaction with phones was mainly swiping and tapping. That is the simplest method: operating via menus and buttons is the least effortful choice for humans, minimizing typing. ChatGPT broke this assumption: users turned out to be willing to type long passages to complete a task, because a pre-designed interface cannot cover every need. It may cover 80% of requirements, but for specific needs the system cannot anticipate every detail, so users type their detailed requirements as long text. Still, long text input is not as easy as speaking. So although some people find voice messages inconvenient because they may need replaying, voice is indeed more convenient for input.

Voice technology has indeed improved, and in the future people may increasingly accept describing a task through a long voice message. Voice control used to be poor because it could only handle simple commands, such as opening a window, and users were used to tapping buttons whose positions are clear. For complex tasks that require detailed communication with a model, however, voice may become the more natural channel. So this comes down to user habits. Some people note that this technological wave has not yet produced its "killer application," the broadly popular application form that a new technology eventually spawns.

For example, do you know what the "killer application" of smartphones turned out to be? Does anyone know? What phone application do we use most? It is short video. Five years ago you could hardly have imagined people swiping through few-second videos so obsessively. So what is the "killer application" this time? The last wave of star applications has largely declined, including Character.AI, Inflection, and Adept, which was sold. At present only Perplexity's search is still holding up, and the next generation of "killer applications" is still uncertain. Probably the technology matures first, user habits gradually shift, and new application forms emerge. So it is hard to predict today what the future "killer application" will be.


Application.

The third topic is applications. What can AI models be used for? The essence of artificial intelligence is to help humans complete tasks, so its core application is to provide unlimited human labor.

I classify applications into three categories. The first is liberal-arts white-collar work: jobs that interact with people and the world mainly through natural language, including writing articles, personal assistants, call centers, phone handling, text processing, and story planning in games and entertainment. In education, teachers also communicate with students through natural language. Language models perform relatively well here: a task that takes a liberal-arts white-collar worker an hour can usually be done by a language model at 80% to 90% quality.


The second is engineering white-collar work, such as writing code and solving problems. You can assume current models are far from replacing you. There are models that help write code, but you need to know what they actually do. In the past, when writing code, I might search online, copy code from Stack Overflow, rename the variables, and run it. Now the model does that for you, because it crawled Stack Overflow during training: it retrieves relevant code from what it has absorbed, adjusts variable names, and provides examples, saving you the time of searching and renaming. But that is not really writing code. For complex code that takes you an hour to write, the model cannot yet replace an engineering white-collar worker, and asking a model to take a complex project from design to implementation remains a distant goal.

The last category is blue-collar work, the hardest field to automate. The only bright spot is self-driving. Self-driving can make progress because it happens in a relatively closed environment: traffic conditions are relatively fixed. In Shanghai things change fast, but elsewhere road conditions may barely change in ten years, so driving on stable, closed roads is comparatively simple. Autonomous driving has not fully solved the problem, but great progress has been made, because there are huge numbers of vehicles, each fitted with sensors, collecting massive amounts of data used to train models. Tesla and other companies have adopted end-to-end approaches; with abundant camera and vehicle data, and road conditions that change little, the technology has gradually matured.

But for blue-collar jobs such as bussing tables and moving goods, which involve complex interaction with the physical world, it is very hard for robots to understand the objects in a room. Robots can handle some simple tasks, but they need large amounts of similar data to support their understanding of the environment. The bottleneck is data acquisition and sensor deployment: without enough sensors there is not enough data, and without enough deployed robots, data collection is also limited. Wide deployment in the physical world will take a long time; it may take at least five years to build the infrastructure and accumulate enough data to make automating blue-collar work possible.


For simple liberal-arts white-collar tasks, current technology is already capable, though complex tasks remain a challenge and models can complete them only to a degree. Even simple engineering white-collar tasks are still difficult, something like a "moon landing program." For blue-collar work, apart from self-driving and factories in specific scenarios (settings that change little and where large amounts of data can be collected), where progress is fast, even simple tasks are hard to automate, and complex tasks harder still. Yet blue-collar workers are the largest labor group in the world, so technology still needs time to transform that domain. The whole progression will take many years, which means there are still many opportunities to participate and build over the next 10 to 20 years.


To sum up on applications: as long as enough data can be collected, automation can be realized. This has always been the central challenge of AI, since algorithms and statistical models require large amounts of training data. If an occupation or industry can collect enough data, it can be automated. So if you want a model to complete a task, the first thing to consider is how to collect enough data. Many traditional enterprises choose to install large numbers of sensors first and gradually accumulate useful data over several years of operation. This is a law of development; it cannot be rushed.


Experience: Model

Pre-training and post-training are equally important.

In the roughly year and a half since I started the company, I have learned some more detailed lessons.

First, people may think pre-training is the important part, for example training a model with hundreds of billions of parameters. But pre-training has now become an engineering problem, and post-training is the technical problem. Two years ago pre-training was the technical problem; now it is engineering. The post-training phase is still very hard, and high-quality data and better algorithms can significantly improve the model.

So-called high-quality data means the data should contain structured information, have a certain diversity, and match practical applications, so that it meets the model's needs. High-quality data is not just Internet data; it must be screened and optimized for the specific application. RLHF (reinforcement learning from human feedback), popularized by OpenAI, has received wide attention, but reading the related algorithms I personally find some of them far-fetched. Yann LeCun also said recently that this set of techniques is still relatively primitive, although there have been great changes in the past year or two. I cannot give a clear answer as to which algorithm is better, because everyone's data is different, so the applicability of an algorithm differs too. An objective function from a paper may work well in some applications but fail in yours, because your data does not match the paper's data and the objective function's structural assumptions may not hold. This problem cannot be solved by technology alone; further research and development are needed.

Here is a chart for role-playing tasks using the LLaMA 3 70B model. Role-playing covers a variety of characters, including teachers, insurance salespeople, and virtual characters in games. We post-trained the LLaMA 3 base model and tuned two versions, V1 and V2. Currently V2 outperforms all other models on role-playing tasks, while the LLaMA 3.1 405B model ranks fourth and the tuned LLaMA 3.1 70B ranks 53rd. As a startup our funds are limited, whereas the LLaMA team spent 50 million US dollars on data labeling and has a large training team.

Nevertheless, I think there are problems with their data annotation, and their investment in algorithms is not enough. So my advice is that when developing large language models, you can focus on the post-training phase rather than the pre-training phase. Post-training matters more for the practical application of the model, and it includes algorithmic innovation. Pre-training has become an engineering problem requiring large amounts of computing resources and manpower. Although the threshold is still high and tuning an 8B model differs from tuning a 70B model, overall the threshold has come down a lot.


The general capabilities of vertical models are very important.

The second point: you may want to build a vertical model because general models have some problems. A general model needs to process exponentially more data to improve across the board. Taking OpenAI as an example, even if it cares about a specific task such as role playing, improving its general model requires processing a huge amount of data, making the model large and hard to optimize. The idea of models focused on vertical fields was therefore widely accepted a year ago, but in practice we found that this view is not entirely correct: there is no real "vertical model".

For example, we once had a customer who asked us to develop a Game Master model. Although this sounds simple, in practice the model needs to understand complex instructions and reason over a large number of rules. It also needs to understand mathematics and compute various in-game quantities. So even in a seemingly vertical field, general capabilities are still indispensable.

Our recent role-playing results on the V2 model show that the improvement from V1 to V2 mainly comes from better general intelligence, not just more knowledge. The V2 model has made significant progress in role playing and performs well in evaluations. This shows that a model can stand out in a vertical field only when its general intelligence improves enough to compete with GPT-4o and Claude 3.5.

To get ahead in a specific discipline or field, the model's general ability is still very important. Extremely specialized models hardly exist; in most cases, it is the model's strength in general capabilities that lets it win on a specific task. This is another important lesson we learned in practice.


The evaluation model is difficult and critical.

Another problem is the difficulty of evaluating models. This may sound suspicious given the rankings I just showed, but many evaluation results do not truly reflect a model's actual performance, because the evaluations are often too simplified. In real applications, model behavior is very complex, and simple evaluation criteria cannot accurately measure a model's strengths and weaknesses. Over the past year we have seen many models that score well on leaderboards but perform poorly in actual use, often because the evaluation method is not thorough enough to cover the complex situations that arise in practice. Evaluation is crucial in application development: first make sure your evaluation method is sound, and only then worry about everything else.

Because our models interact through natural language, the ambiguity of language makes evaluation very difficult. It is not easy to determine whether the model's logic is correct or whether its language style meets the requirements. In general, we avoid manual evaluation because the cost is too high, while using other models as judges can introduce bias. For example, GPT-4 tends to favor long, flowery text during evaluation and prefers specific words. Evaluation is 50% of any practical problem: once the evaluation is in place, you can optimize against it, and completing the evaluation also means you have obtained some data.
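The length bias of model judges can be checked mechanically. The sketch below uses a deliberately biased toy judge (a hypothetical stand-in for a real model judge) to show why a high win rate can reflect judge bias rather than answer quality:

```python
def naive_judge(answer_a, answer_b):
    # Toy stand-in for a model judge with length bias:
    # it simply prefers the longer answer.
    return "A" if len(answer_a) > len(answer_b) else "B"

def win_rate(judge, pairs):
    # Fraction of pairs where the judge picks answer A.
    wins = sum(1 for a, b in pairs if judge(a, b) == "A")
    return wins / len(pairs)

# Model A gives the same content as B, just padded with filler.
pairs = [("The answer is 42. " + "Let me elaborate at length. " * 5,
          "The answer is 42.")] * 10
rate = win_rate(naive_judge, pairs)
# rate is 1.0 here: the padded answers "win" on length alone, so
# a judge's verdicts should be checked against length before trusting them.
```

A simple control is to compare win rates on length-matched pairs, or to have the judge score both orderings and average.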


Experience: data

The algorithm determines the lower limit of the model, and the data determines the upper limit of the model.

At present, we are still far from AGI. AGI can learn autonomously, while our current models are still "cramming education". Therefore, the upper limit of a model depends on the data. For example, Claude 3.5, a relatively small model, can beat many larger models such as GPT-4 on various leaderboards, and it performs quite well in practical use. I have talked with their team, and their data processing is excellent; they invested a lot of time and energy in data preparation. So if you want a model to perform well in a certain field, you must prepare the relevant data. The technology itself is not that fancy: 70% or 80% of the time is still spent on data preparation.
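Much of that data-preparation time goes into unglamorous cleaning steps. Here is a minimal sketch of one such stage, with illustrative thresholds (real pipelines add language identification, quality classifiers, and fuzzy deduplication):

```python
import hashlib
import re

def clean_corpus(docs, min_words=5):
    """Normalize whitespace, drop too-short documents, and drop
    exact duplicates (after normalization) by content hash."""
    seen, kept = set(), []
    for doc in docs:
        text = re.sub(r"\s+", " ", doc).strip()
        if len(text.split()) < min_words:
            continue  # too short to be useful training text
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already kept
        seen.add(digest)
        kept.append(text)
    return kept

docs = [
    "Hello   world, this is a sample document.",
    "Hello world, this is a sample document.",  # duplicate after normalization
    "too short",                                # dropped by the length filter
]
cleaned = clean_corpus(docs)  # one surviving document
```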


Experience: computing power

In terms of computing power, buying and running your own GPU servers is not much cheaper than renting GPUs, mainly because most of the profit is captured by NVIDIA. NVIDIA's profit margin is as high as 90%: a GPU that costs about $3,000 to build sells for $30,000. No matter how good your relationship with NVIDIA is, they will not offer discounts, so GPUs today are like luxury goods. If you calculate costs over three years, you will find that GPUs account for 50% of the overall operating cost. Since NVIDIA has already taken that 50% as profit, optimizing the remaining costs makes little difference.
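The 50% figure can be reproduced with rough arithmetic. The numbers below are assumptions for a hypothetical 8-GPU server, not vendor quotes:

```python
def cost_shares(gpu_capex, other_capex, annual_opex, years=3):
    """Split three-year total cost of ownership into GPU hardware,
    other hardware, and operating expenses (power, hosting, staff)."""
    total = gpu_capex + other_capex + annual_opex * years
    return {
        "gpu": gpu_capex / total,
        "other_hw": other_capex / total,
        "opex": annual_opex * years / total,
    }

# Assumed: $240k of GPUs, $60k of CPUs/network/chassis,
# $60k per year of power and operations over three years.
shares = cost_shares(gpu_capex=240_000, other_capex=60_000, annual_opex=60_000)
# shares["gpu"] comes out to 0.5: GPUs alone are half the three-year cost,
# which is why optimizing the other half yields limited savings.
```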

In this chart, the upper row is self-built and the lower row is rented. Renting from small GPU clouds is acceptable, but large clouds are unaffordable. Although I worked at Amazon for seven and a half years, AWS's prices are too high for us; we can only use small clouds. These small clouds are often AI clouds run by companies that used to mine Bitcoin and changed careers. They control costs well because they already have electricity resources. The cost of purchasing GPUs accounts for 50%; other hardware costs a little more, but power costs are not too high. Bitcoin-mining companies needed large amounts of power, say 20 megawatts, which they secured early on. Operating self-built servers is relatively expensive because GPUs fail often. We have a data center in Toronto with three people working in shifts on maintenance; when a fault occurs it needs to be repaired quickly, and the working conditions are very poor. Cloud providers do earn a little money, but the margin is only about 20%, so overall the difference is not big.

The only real advantage of self-built servers is in CPU compute, storage, and network bandwidth: these are relatively cheap to own, while cloud services charge a lot for them. Technology in this area has changed little in the past decade. For example, for the cost of storing data on AWS, you could buy hardware with roughly ten times the capacity: the cost of storing 10PB on AWS for a few years equals the cost of purchasing a 100PB storage cluster. So when your data volume is very large, self-built servers have a clear advantage. We did not anticipate this at first, and only later found that we could save real money here.
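The storage claim can also be sanity-checked with rough arithmetic. The prices below are assumptions (list-style object-storage pricing, commodity drive pricing, and a 3x overhead factor for servers, replication, and operations), not measured figures:

```python
def cloud_storage_cost(pb, per_gb_month=0.021, years=3):
    # Decimal PB -> GB; ignores tiering, request, and egress charges.
    return pb * 1_000_000 * per_gb_month * 12 * years

def self_built_cost(pb, per_tb_drive=20.0, overhead=3.0):
    # Raw drives plus an assumed 3x factor for servers, replication, ops.
    return pb * 1_000 * per_tb_drive * overhead

cloud_10pb = cloud_storage_cost(10)  # about $7.6M over three years
own_100pb = self_built_cost(100)     # about $6.0M, for ten times the capacity
```

Under these assumptions, a self-built 100PB cluster costs less than renting 10PB of cloud storage for three years, which matches the order of magnitude of the claim above.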


A language model is still a machine learning model, just with a new architecture and a much larger scale. The increase in scale brings many difficulties, but in essence you can still understand it with traditional machine learning methods: the model still depends on data, and evaluation is still very important, so much previous experience still applies. We should not mythologize the new technology, but we do need to realize that scaling up 100x brings much greater difficulties. At present, the main issue is that pre-training has become an engineering problem because of scale, while for post-training, although the scale is also larger, the algorithms have not been explored deeply enough, and the directions for improving them deserve careful study. These are some of our technical lessons.


The second part may be more interesting, involving some personal experiences.

If you are not interested in AI itself, I can talk about my experience after graduating from Jiaotong University. I have done many different things, which I call a "punch-in life", including publishing papers. I spent six or seven years at Jiaotong University, two years at HKUST, and five years at CMU. At Berkeley and Stanford, I stayed half a year each. These are all the schools I have been to.

As for big companies, I worked at Baidu for two years. Dai Wenyuan, who once spoke here, was actually my tech lead there. I worked at Amazon for seven years. This is my second startup, and I have been at it for two and a half years.


I'd like to share my takeaways from these different experiences. What are the goals in a big company, a PhD program, and a startup? In a big company, your basic goal is promotion and a raise; that is the baseline, though not the ultimate goal. The goal of a PhD is to graduate, while the goal of a startup is to exit, either by going public or by being acquired. These are the basic needs you think about every day.

In a big company, you need to solve the problems the company cares about. This is very important: a big company must be clear about what it wants to do, and your work should align with the company's goals. If what you like does not match the company's direction, you will feel very uncomfortable. In a PhD, your task is to solve problems of research value. A startup requires you to solve real problems that users are willing to pay for; if no one pays, your company cannot survive, though it is also possible that investors are willing to fund you. In short, you always need to solve problems, but in different environments the way of solving them and the problems themselves differ.

Another dimension is motivation. A big company is fine with ordinary motivation: as long as you don't come from money, you can work along step by step. A PhD needs stronger motivation, because you are not doing it to earn money. A startup needs even stronger motivation, otherwise it is hard to persist. I'll explain why later.


A worker.

What are the benefits of being an employee? In a relatively simple environment, you can learn all kinds of industry knowledge: how technology is turned into products, and how products are designed, operated, and managed. Another advantage is that you can finish scheduled tasks on time, avoid anxiety at night, and have relatively stable income and free time. Especially once you have children, many life needs consume a lot of time, such as buying a house, educating children, and taking care of parents, so as an employee you have relatively more time. Even under a 996 schedule, which people complain about a lot, at least after you get off work at 9:00 you still have a relatively fixed rest time. In contrast, the other two options may mean working 24/7.

But the disadvantage of being an employee is that you may stay stuck in the mindset of a professional manager, which is closely related to the simplified environment that companies and schools create. From primary school onward, school is a very simple social environment, and so is a company. The company abstracts the complex world into simple tasks and hands them down layer by layer. The further down you are, the more you may feel like a cog in the machine, and the advantage of a cog is that it only needs to find its nut and fit in, without worrying about the complexity of the machine or the outside world; the large machine handles all of that.

However, this simplified environment has its downsides. The longer you stay working in a simplified world, the more tired you may feel and the less you learn, and you may remain stuck in the mindset of an employee or a professional manager, unable to think about problems more broadly or face harder difficulties. So it has both good and bad sides.


PhD.

The advantage of a PhD is that you can focus on exploring one field for several years, although during this period there is no income, and no promotions or raises. After completing a PhD, you acquire the R&D capabilities of an individual or a small team: many people can complete projects independently, and some can lead students and teams in research and development. In addition, during a PhD you spend perhaps half your time writing and speaking, which is very important, because in the education system of my generation this was often neglected. Many companies require a PhD for research positions even when the work does not really require one; in any case, a PhD still commands a premium in the job market, so having one does not hurt.

What are the disadvantages? Few laboratories can participate in very large development projects, especially truly large-scale R&D. Another problem is that you need to fit the research topic and your advisor's style: whether you are interested in the topic, and whether you can adapt to your advisor's way of doing research. This is a process of mutual adaptation, depending on your adaptability and your advisor's. If you cannot adapt, the process will be very painful. In a company you can solve such problems by changing teams, but during a PhD this adjustment is much harder. It is best if you truly love research, otherwise it is hard to persist. You need to be clear about why you write papers and what you hope to achieve through research; you should have a bigger goal and truly love the field in order to make research breakthroughs. So this choice is not for everyone.


Start a business.

The advantage of starting a business is that you get to enjoy being a pirate. As a pirate, you constantly look for ships to raid; capturing one brings satisfaction, and failing may mean failure outright. The process of starting a business is constantly watching market dynamics, talking to people, and looking for opportunities. Once an opportunity appears, you must go all in, otherwise you will miss it; if you don't go all in, you are easily eliminated. The survival of a startup often hangs on a single moment, and this kind of thrill may not be found anywhere else, because this is the only legal form of piracy.

Another advantage is that you face complex social reality directly. Entrepreneurs interact with society directly; no one abstracts or clarifies the problems for you, so you must understand society yourself and learn quickly. A complex environment trains your ability to abstract: you must simplify complex phenomena yourself, which I consider one of the benefits of entrepreneurship. This experience makes other challenges feel relatively simple afterwards. Besides, enduring hardship is part of growth. For example, having a child is a hard process, and having more children can bring great pain, especially waking up every three hours during infancy, which makes you doubt whether you can keep going.

I have asked many successful people, including Zhang Yiming and some who may yet become the richest person in the world, how they felt when starting their businesses. They also doubted their ventures, but they survived in the end. In a startup, you bear all the difficulties yourself, and no one solves them for you. In school, your advisor can share the load; in a company, your boss may take the blame for you. But as a founder, you must face every difficulty yourself, with no room to escape. Only if you truly love entrepreneurship itself can you persist; otherwise you may give up in the face of difficulty.

I said that the motivation required for a startup is higher than for a PhD, and for a PhD higher than for a job. The core reason is that these activities differ in delayed gratification. In a company, promotions and raises usually bring an immediate bonus or praise after you complete a task; a PhD's research results may take several years to be recognized; a startup may take five years to produce clearly positive feedback. Without immediate positive feedback, you must love the pursuit itself and encourage yourself, otherwise it is hard to persist.


Motivation.

Which road to choose often depends on motivation. You need a very strong motivation, not just a passing interest. Shallow desires and fears are easily satisfied; real motivation must come from deep desire or deep fear. You need to look at yourself from an onlooker's perspective and ask whether there is something deep inside that you are unwilling to share, and explore what you are pursuing or what you fear. Desire should be rooted in something primal, such as fame and power. Consider whether a gorilla cares about these desires: if I let a gorilla be the head gorilla, would it be willing? Probably yes. If I offered it good food, it would also be willing. What really matters is that you are willing to follow your desires. These desires are not complicated; the key is to face your own desires and fears honestly. Fear can lead to depression and make you feel life-and-death pressure.

When I was young I was somewhat depressed and felt an emptiness in life. I often thought about what I was pursuing in my whole life and felt a kind of fear of nothingness. The vast majority of people, 99.99% of them, will leave no trace in history. Chinese tradition held that the meaning of life lies in one's descendants, but now many people choose not to have children, and this traditional idea has also broken down. So what do you do then?

The core is to turn desire and fear into positive motivation. This is very important: your motivation must be correct and conform to your values. It may sound strange, but in fact neither escape nor indulgence can truly satisfy desire or relieve fear. The only way through these challenges is to turn them into positive motivation and ultimately solve problems that align with social values.

People are social animals. Although many people may think they can do whatever they want at a certain age, in fact you need to choose a positive direction that conforms to mainstream values. The core of movies, at home or abroad, whether the theme is positive or negative, even those that glorify villains, is positive motivation. Film is, in essence, an abstract expression of human nature.


With motivation, you need to decide what problem to solve. The problem you solve may come straight from your motivation: if you fear something, solve that fear; if you want money, solve the problem of making money; if you want fame, you can become an internet celebrity and solve the problem of getting famous. Or you can satisfy the motivation indirectly: if a problem has academic value, consider doing a PhD; if it has commercial value, consider starting a business. If neither attribute is ideal, at least choose a direction that is valuable for personal growth, and being an employee is also a choice.

To be specific: on the academic side, why language models work so effectively is not yet fully understood. Deep learning as a whole is not fully understood either, yet language models have made remarkable progress, which in itself has high academic value. On the commercial side, the question is whether language models can incubate new applications and create killer apps; if no new application appears, you can consider applying language models to an existing product. Through this process you also learn how to complete a relatively small project. Even if the task is simple, you should review your motivation regularly and make new choices.

Every time you make a choice, you should return to your motivation.


A very useful method.

Finally, a very useful method is to review yourself from the perspective of a mentor, a superior, or an objective observer. Every week, review what you have done and why certain goals were not achieved. This question matters because the reason may be laziness; you need to face it and consider how to make yourself more diligent. For example, find a study partner, go to the library together every day, and supervise each other. Another case: if you think you are not smart enough for something, you have two choices: give it up and do something you are good at, or accept it and invest twice as much time as others to get it done. Either way, you need to figure out whether the problem is laziness, ability, or something else. This is the essential question. If you can be hard on yourself, you will become a very capable person; if you can't, you will progress slowly. In the end, it comes down to how hard you are willing to be on yourself.

You can build a habit, for example spending 30 minutes every Monday night summarizing your progress. Once the habit forms, you can keep it for decades. You should also do a quarterly review: check whether last quarter's goals were met and plan the next quarter's tasks. This helps with planning for midterms, finals, or summer internships.

The advantage of this method is that it helps you think about goals over a longer horizon, such as what you want to achieve within your four undergraduate years. This kind of thinking helps clarify direction. In entrepreneurship, choosing well is often considered more important than working hard; you need to know what your goal is. In addition, every year or every five years, reflect on your motivation. If you were unhappy last year and made no progress, it may be because your motivation was not strong enough, or because it was out of step with reality. If the former, keep working at it; if the latter, re-examine and adjust your direction. It is important to check your motivation regularly and consider how to solve your current problem. Reviewing your motivation and long-term goals every five years helps you avoid falling into a "punch-in life", drifting along without a clear direction. You are lucky if you can define your goal early; if not, it doesn't matter, and you can adjust over the next few years.


That is basically it: two parts, one on the future of the technology, and one on my own journey in recent years.

I think this is the best of times. New technology brings many new opportunities, and I can see the impact language models will have on society. Even if no more advanced technology appears, today's Transformer technology will have a great impact on the world over the next few years. This is not just my personal opinion: I have asked many Fortune 500 CEOs, and they hold the same view, supported by their internal data. This will bring many new opportunities, and whether you are an undergraduate, a master's student, a PhD, or someone just entering the workplace, you can all benefit from the changes technology brings in the coming years.

But this is also the worst of times, because competition is getting fiercer and fiercer. Everyone here may need to work ten times harder than our generation did, so the achievements and rewards of our generation may not apply to you. We may have been lucky; your challenge is greater. Most of the time, the achievements described in reports are the dividends of the era, not necessarily of the individual. The good news is that the era's dividends still exist; the bad news is that everyone has to work harder.

Thank you. That is all I want to share with you.



This article is reprinted from: Andy730
