Key points
- Use flash memory to accelerate the adoption of large language models (LLMs) on edge devices.
- Running an LLM on a client device requires overcoming the Memory Wall problem.
- Loading only part of the LLM into GPU VRAM reduces the system memory requirement.
- The low latency and high throughput of flash memory enable more efficient parameter loading and computation.
- Loading and compute efficiency can be further improved with a sparsity prediction algorithm and row-column bundling.
Memory Wall, computing power, and bandwidth
Memory Wall effect:
- Model parameters: 410x / 2 yrs
- Hardware memory: 2x / 2 yrs
The growth gap among computing power, DRAM capacity, and on-chip interconnect bandwidth:
- Compute: 3x / 2 yrs
- Memory bandwidth: 1.6x / 2 yrs
- On-chip interconnect bandwidth: 1.4x / 2 yrs
What does this tell us?
Compared with the shortage of raw compute, data-path read bandwidth and communication efficiency are the more critical factors limiting AI training.
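To make the gap concrete, here is a small back-of-the-envelope calculation (the growth rates come from the figures above; the 6-year horizon is only an illustration):

```python
# Back-of-the-envelope view of the Memory Wall: compound the per-2-year
# growth rates quoted above over three 2-year periods (6 years).
periods = 3

growth_per_2yrs = {
    "model parameters": 410,
    "hardware memory": 2,
    "compute": 3,
    "memory bandwidth": 1.6,
    "on-chip interconnect": 1.4,
}

for name, rate in growth_per_2yrs.items():
    print(f"{name:22s} x{rate ** periods:,.1f} over {2 * periods} years")

# The parameter count outgrows memory capacity by (410 / 2) ** 3,
# i.e. a gap of more than eight million times over the same window.
print(f"parameter/memory gap   x{(410 / 2) ** periods:,.0f}")
```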
Challenges of model inference on client-side devices
Problems encountered when AI applications are deployed on end-side devices:
Although SLMs (small language models) have already been compressed significantly, they still far exceed the DRAM capacity of end-side devices.
The figure shows that Apple (whose memory-related research is well known in the industry) and Windows both face challenges in integrating large models.
The consumer (toC) device market is very price-sensitive, so adding VRAM to support client-side inference is considered uneconomical.
As shown in the following figure, the price of an RTX 2000 graphics card alone is close to half the price of the PC on the left.
If DRAM cannot be added, the only lever left is the model size itself, which first requires analyzing the model's actual inference workflow.
LLM inference workflow
LLM workflow from input to output
Model architecture:
- After the input passes through the embedding layer, it enters a block that repeats 18 times. Each block consists of two main parts (a minimal sketch of this structure follows the list):
- An MLP layer (GeLU activation) with RMS normalization.
- An attention layer with RMS normalization.
- Finally, the model outputs probabilities through a linear layer and a softmax layer.
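As a rough sketch of this loop structure (numpy; single-head attention without masking or caching, illustrative parameter names, not the actual Gemma implementation):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMS normalization: scale by the root-mean-square of the activations.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, w):
    # Heavily simplified single-head self-attention.
    q, k, v = x @ w["wq"], x @ w["wk"], x @ w["wv"]
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (scores @ v) @ w["wo"]

def mlp(x, w):
    # MLP with a GeLU activation (tanh approximation).
    h = x @ w["w_up"]
    h = 0.5 * h * (1.0 + np.tanh(0.7978845608 * (h + 0.044715 * h ** 3)))
    return h @ w["w_down"]

def forward(token_ids, params, n_layers=18):
    x = params["embedding"][token_ids]               # embedding layer
    for blk in params["blocks"][:n_layers]:          # block repeated 18 times
        x = x + attention(rms_norm(x), blk)          # attention + RMSNorm
        x = x + mlp(rms_norm(x), blk)                # MLP (GeLU) + RMSNorm
    return softmax(rms_norm(x) @ params["lm_head"])  # linear + softmax
```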
Model size proportion of the embedding layer / attention layers / MLP layers:
- The embedding layer accounts for 20% of the model size.
- The attention layers account for 8%.
- The MLP layers account for 72%, and these layers are sparse.
Note: from the perspective of how the model works, the key effort in compressing model size should focus on the embedding layer and the MLP layers.
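As a rough size estimate under these proportions (assuming about 2.5B total parameters for Gemma 2B and 16-bit weights; both assumptions are mine, for illustration only):

```python
# Rough footprint of each component for a ~2.5B-parameter model, using the
# 20% / 8% / 72% split above and assuming 2 bytes per parameter (fp16/bf16).
total_params = 2.5e9
bytes_per_param = 2

split = {"embedding": 0.20, "attention": 0.08, "mlp": 0.72}

for name, frac in split.items():
    gib = total_params * frac * bytes_per_param / 2**30
    print(f"{name:10s} ~{gib:.1f} GiB")

resident = (split["embedding"] + split["attention"]) * total_params * bytes_per_param / 2**30
print(f"resident (embedding + attention) ~{resident:.1f} GiB; the rest streams from flash")
```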
Common DRAM residency strategies
- The proportion of each layer type varies from one LLM to another.
- Based on the layer structure, some layers of an LLM can reside in GPU VRAM.
Example: in the Gemma 2B model, 28% of the parameters (the 20% embedding layer plus the 8% attention layers) reside on the GPU, while the 72% belonging to the MLP layers is loaded on demand (the non-resident part).
The figure shows this optimization: some layers of the LLM (the embedding layer and the attention layers) reside in GPU VRAM, while the remaining MLP layers are loaded from the SSD as needed. This enables efficient inference without occupying a large amount of GPU memory. The hybrid storage architecture combines the compute efficiency of the GPU with the capacity of the SSD.
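A minimal sketch of such a residency policy (numpy; the file layout, shapes, and class name are hypothetical, and a real system would upload the materialized tensors to VRAM and prefetch asynchronously):

```python
import numpy as np

class HybridWeightStore:
    """Keep embedding + attention weights resident; load MLP weights on demand."""

    def __init__(self, resident_path, mlp_path, n_layers=18,
                 hidden=2048, ffn=16384, dtype=np.float16):
        # Resident part (~28% of the model): loaded once, stays in (V)RAM.
        self.resident = np.load(resident_path, allow_pickle=True).item()
        # Non-resident part (~72%): memory-mapped so data is pulled from the
        # SSD only when a layer's MLP weights are actually touched.
        self.n_layers, self.hidden, self.ffn = n_layers, hidden, ffn
        self.layer_elems = 2 * hidden * ffn            # up + down projection
        self.mlp = np.memmap(mlp_path, dtype=dtype, mode="r",
                             shape=(n_layers, self.layer_elems))

    def mlp_weights(self, layer):
        # Materialize (copy) just this layer's MLP weights; in a GPU setup
        # this is the point where the tensor would be uploaded to VRAM.
        flat = np.array(self.mlp[layer])
        w_up = flat[: self.hidden * self.ffn].reshape(self.hidden, self.ffn)
        w_down = flat[self.hidden * self.ffn:].reshape(self.ffn, self.hidden)
        return w_up, w_down
```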
Loading and running model parameters from flash
Loading large language models from NVMe devices
The process of loading a large language model (LLM) from an NVMe storage device, and the time consumed at each step.
GPU unit:
- The GPU processes the embedding layer and the attention layers at 0.2 ms/layer.
- The GPU communicates with the CPU unit through the CUDA kernel API; this takes 18 ms/layer.
CPU unit:
- The application running the LLM accesses data through xNVMe (an enhanced NVMe interface).
- The CPU communicates with the SSD over PCIe.
SSD unit:
- The SSD stores the complete Gemma 2B model.
- Loading each layer from the SSD takes 63 ms/layer; with the xNVMe interface this drops to 24 ms/layer, a reduction of about 60%.
The figure shows the specific process of loading the LLM from the NVMe device and highlights how xNVMe improves loading efficiency. Without xNVMe, the SSD load time is 63 ms/layer; with xNVMe it drops to 24 ms/layer, a reduction of about 60%. GPU processing itself is fast, but overall performance is bounded by moving data from the SSD to the CPU and then to the GPU, so applying xNVMe significantly improves model loading efficiency.
Note: the key to xNVMe is optimizing the data I/O path through the operating system. The classic path requires scheduling across the user/kernel boundary, while xNVMe's core contribution is issuing I/O directly from user space, which is where the acceleration comes from.
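xNVMe itself is a C library with its own user-space I/O paths; as a rough stand-in for "fewer kernel/page-cache hops per read", the sketch below contrasts a buffered read with an O_DIRECT read of one layer's worth of data (Linux-only, alignment requirements apply; this is an illustration, not the xNVMe API):

```python
import mmap
import os

LAYER_BYTES = 64 * 1024 * 1024   # hypothetical size of one layer's weights

def read_layer_buffered(path, offset, nbytes=LAYER_BYTES):
    # Classic path: the read goes through the kernel page cache, with an
    # extra copy from kernel space into the user buffer.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(nbytes)

def read_layer_direct(path, offset, nbytes=LAYER_BYTES):
    # Leaner path: O_DIRECT bypasses the page cache and copies straight into
    # a page-aligned user buffer. Offset and size must be block-aligned.
    buf = mmap.mmap(-1, nbytes)                     # page-aligned buffer
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        os.readv(fd, [buf])
    finally:
        os.close(fd)
    return buf
```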
WD, as a traditional storage manufacturer, studies inference optimization from the perspective of flash characteristics. Don't forget that it is also invested in CXL: its positioning in interconnect technology is not just I/O path optimization, but a change to the communication fabric of existing servers.
Further, from Flash to GPU?
From another perspective: can data be transferred directly from flash to GPU memory, skipping the hop through the CPU?
Several optimizations are proposed for streaming parameters from flash memory to VRAM while maintaining acceptable inference performance.
Questions:
- Can we stream parameters from flash memory to VRAM while maintaining acceptable inference performance?
- Many LLMs (large language models) are highly sparse. Can we exploit this to load parameters selectively and avoid redundant computation?
Optimization strategies (see the sketch after this list):
- Squeeze:
- LRP (Low Rank Predictor): predicts which neurons will remain active and which will be zero, so the zeroed neurons can be omitted.
- Speed up:
- Row-column bundling: bundles the up- and down-projection weights of each neuron, which reduces the number of reads from the SSD.
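A minimal sketch of row-column bundling (numpy; illustrative layout): for FFN neuron i, the i-th column of the up projection and the i-th row of the down projection are stored back-to-back, so one contiguous read fetches everything needed for that neuron.

```python
import numpy as np

def bundle_ffn(w_up, w_down):
    # w_up: (hidden, ffn), w_down: (ffn, hidden).
    # Neuron i uses column i of w_up and row i of w_down; store them
    # contiguously so a single sequential read brings in both.
    hidden, ffn = w_up.shape
    bundles = np.empty((ffn, 2 * hidden), dtype=w_up.dtype)
    bundles[:, :hidden] = w_up.T          # column i of w_up
    bundles[:, hidden:] = w_down          # row i of w_down
    return bundles                        # one bundled row per neuron on disk

def load_active_neurons(bundles, active_idx):
    # One gather over bundled rows replaces two scattered reads
    # (one into w_up, one into w_down) per active neuron.
    hidden = bundles.shape[1] // 2
    rows = bundles[active_idx]
    w_up_cols = rows[:, :hidden].T        # (hidden, n_active)
    w_down_rows = rows[:, hidden:]        # (n_active, hidden)
    return w_up_cols, w_down_rows
```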
Why does exploiting LLM sparsity avoid redundant computation?
The sparsity of an LLM (large language model) means that some neurons or parameters are not activated or used in a given computation. For example, when certain inputs pass through the model, only a subset of neurons participates in the computation while the rest remain "inactive"; these inactive neurons contribute nothing to the output of the current inference task. This phenomenon is called sparsity.
The key points of using sparsity to avoid redundant computation are as follows (a minimal code sketch follows this list):
1. Reduce unnecessary computation:
Sparsity means that not every neuron needs to compute an output at every inference step. If we can predict which neurons will not participate in the current inference pass (for example, with a low rank predictor, LRP), we can skip their computation entirely. This removes unnecessary work and saves compute and time.
2. Selective parameter loading:
If the weights of certain neurons are not activated during inference, those weights can be ignored, which matters especially on devices with limited memory (such as GPU VRAM). By selectively loading only the parameters required by the current inference task, read operations from the storage device (such as the SSD) to the GPU are greatly reduced. This lowers both data-transfer overhead and memory footprint.
3. Accelerate inference:
Skipping redundant computation and avoiding unnecessary parameter loads directly improve inference speed. Active neurons make up only a fraction of the model, so sparsity lets us focus on the parts that actually need to participate, greatly reducing the total computational load and speeding up inference.
4. Reduce hardware pressure:
In practice, GPU and CPU resources are limited, especially on client devices. Sparsity lets the model avoid unnecessary memory usage and computation and makes better use of hardware resources. This matters most when the LLM is large and the device hardware is constrained.
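A minimal sketch of sparsity-aware inference for one MLP layer (numpy; the low-rank predictor is a stand-in that would be trained elsewhere, and the shapes and threshold are illustrative):

```python
import numpy as np

def predict_active(x, lrp_a, lrp_b, threshold=0.0):
    # Low-rank predictor (LRP): a cheap hidden -> r -> ffn projection that
    # guesses which FFN neurons will be non-zero after the activation.
    score = (x @ lrp_a) @ lrp_b                    # (ffn,)
    return np.nonzero(score > threshold)[0]

def sparse_mlp(x, w_up, w_down, lrp_a, lrp_b):
    # x: (hidden,), w_up: (hidden, ffn), w_down: (ffn, hidden)
    active = predict_active(x, lrp_a, lrp_b)       # 1. predict active neurons
    w_up_act = w_up[:, active]                     # 2. load only those columns
    w_down_act = w_down[active, :]                 #    ...and rows (from SSD)
    h = np.maximum(x @ w_up_act, 0.0)              # 3. ReLU over active neurons only
    return h @ w_down_act                          # dense compute on a small slice

# Illustrative usage with random weights and small shapes.
rng = np.random.default_rng(0)
hidden, ffn, r = 512, 4096, 32
x = rng.standard_normal(hidden).astype(np.float32)
w_up = rng.standard_normal((hidden, ffn)).astype(np.float32)
w_down = rng.standard_normal((ffn, hidden)).astype(np.float32)
lrp_a = rng.standard_normal((hidden, r)).astype(np.float32)
lrp_b = rng.standard_normal((r, ffn)).astype(np.float32)
print(sparse_mlp(x, w_up, w_down, lrp_a, lrp_b).shape)   # (512,)
```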
Note: research on model sparsity is key to running these models on resource-limited terminals and edge devices!
What innovations can storage hardware or software vendors try around model sparsity?
Hardware manufacturers
Dedicated accelerators: develop hardware accelerators optimized for sparse matrix operations, with dedicated circuits that skip zero-value computations to save processing time and energy.
Memory architecture: adopt memory architectures that store sparse data structures more efficiently, such as compressed-sensing memory (CSM) or sparse RAM, which reduce the storage cost of zero values at the physical level.
Software vendors
Sparse data formats: provide efficient formats for representing sparse tensors, such as CSR (compressed sparse row), CSC (compressed sparse column), and COO (coordinate list), together with high-performance library functions for these formats (see the example below).
Optimized algorithms: develop linear-algebra libraries optimized for sparse data, covering core operations such as sparse matrix multiplication and solving systems of equations.
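As a minimal illustration of the software side, here is the CSR format in scipy (CSC and COO are analogous); the matrix and its density are made up for the example:

```python
import numpy as np
from scipy import sparse

# A weight matrix in which roughly 90% of the entries are zero.
rng = np.random.default_rng(0)
dense = rng.standard_normal((1000, 1000)) * (rng.random((1000, 1000)) > 0.9)

w_csr = sparse.csr_matrix(dense)          # compressed sparse row
x = rng.standard_normal(1000)

y = w_csr @ x                             # sparse matrix-vector product
assert np.allclose(y, dense @ x)

# CSR stores only the non-zeros plus index arrays, so the footprint shrinks
# roughly in proportion to the density (~10% here).
print(f"nonzeros: {w_csr.nnz} of {dense.size}")
print(f"bytes:    {w_csr.data.nbytes + w_csr.indices.nbytes + w_csr.indptr.nbytes}"
      f" vs {dense.nbytes} dense")
```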
Summary
Summary and future outlook
Summary:
- Enabled the Gemma model to run on a machine with 4 GB of GPU VRAM, using an LRP (low rank predictor) to detect sparsity.
- Using xNVMe, reduced data loading time by roughly a factor of three.
- Integrated staged loading and the prediction algorithm with the xNVMe load/store system.
Future exploration:
- Train the LRP on larger datasets to obtain higher accuracy.
- Study larger LLMs, such as the Llama 2 7B model, which uses the ReLU activation function and exhibits about 90% sparsity.
- Apply windowing so that parameters are loaded only for the most recent tokens.
--- [End of article] ---
Source: Wang Zhiyu public account