Large language models have changed the world. The OpenAI series of GPT models has led to ChatGPT, which has seen significant adoption in 2023. Alongside those closed-source approaches, a vast landscape of open-source models has emerged, such as Mistral's 7B and MoE models, the LLaMA family, and others.
But even though there has been a lot of progress on the training side, running these models for inference still requires powerful hardware. Progress here has been impressive too (running those LLMs can now often be done on a single GPU-equipped laptop!), but we would eventually want such models to run on smartphones and other edge devices.
Late December 2023, a research team from Apple released the LLM in a Flash paper. It "tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory but bringing them on demand to DRAM" (Alizadeh et al., 2023). In other words, parts of the model are stored in different memory and loaded dynamically, only when they are needed.
In this article, I'm exploring the paper. It begins with why running large language models on edge hardware is difficult. Then, I'm looking at the LLM in a Flash paper and the three main categories of optimization proposed by the authors:

- Reducing the amount of data transferred from flash memory into DRAM.
- Increasing the size of the chunks being transferred, to improve throughput.
- Optimizing memory management within DRAM itself.
Then, I briefly summarize results. This way, you'll learn why optimizing memory management can really be beneficial if you want to run large language models on memory-constrained hardware.
Let's go! 😎
First, let's set the stage by exploring why running large language models on edge hardware is costly. For that, we'll need to take a look at how GPU memory works. I found this great article by Thushan Ganegedara from June 2023 which, together with the CUDA Programming Guide (Nvidia, n.d.), explains the basics very well.
Let's see:

- High-bandwidth memory (HBM) is the GPU's DRAM and essentially its global memory. Its size is what you see when running `nvidia-smi`, with large GPUs such as A100s often in the range of 40-80 GB (Dao et al., 2022).
- On-chip SRAM is much smaller, but it sits directly next to the compute units.

In terms of speed vs size: SRAM is much faster than HBM but orders of magnitude smaller, and HBM in turn is much faster than the memory and storage that sit outside the GPU.
Okay, now that we understand what these components are, let's sketch how a model is typically used for inference from the perspective of the GPU:

- The model's weights are loaded from storage into CPU memory and then copied into the GPU's DRAM (i.e., its HBM).
- For every forward pass, the necessary weights and activations are moved from HBM into the on-chip SRAM, where the actual computations happen.
- Results are written back to HBM, and the final outputs are returned to the CPU.
In order to load specific models, significant amounts of GPU RAM (i.e. HBM, i.e. DRAM) are required. For example, for Mistral 7B, a minimum of 16 GB is what you'll need to make it run in the first place, although 64GB is recommended (Hardware Corner, 2023).
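To get a feeling for where such numbers come from, here is a quick back-of-the-envelope sketch. The parameter count is approximate and the estimate covers the weights only; activations, the KV cache and framework overhead come on top:

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Estimate the memory needed to hold the model weights alone."""
    return num_params * bytes_per_param / (1024 ** 3)

# Mistral 7B has roughly 7.3 billion parameters.
params = 7.3e9

print(f"float32: {weight_memory_gib(params, 4):.1f} GiB")  # ~27 GiB
print(f"float16: {weight_memory_gib(params, 2):.1f} GiB")  # ~14 GiB
print(f"int8:    {weight_memory_gib(params, 1):.1f} GiB")  # ~7 GiB
```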
The setting above discusses inference when done on GPUs. Let's now assume that you want to run your model at the edge. That is, you're not interested in running the model in a central place (such as in the cloud), where requests to the model are queued and where responses are sent back.
No, you want to run the model on the device yourself, such as your smartphone. This is a requirement for making privacy-friendly LLM applications, such as assistants that learn all kinds of things about you. You don't want that information on some third-party server.
Typically, you would load your model into the GPU, but at the edge, inference is often done on the CPU because no compatible GPU is available. Here, too, your DRAM comes into play, although it's not HBM - most likely it's DDR-based RAM.
Dynamic random access memory (DRAM) is a type of semiconductor memory that is typically used for the data or program code needed by a computer processor to function. DRAM is a common type of random access memory (RAM) that is used in personal computers (PCs), workstations and servers. Random access allows the PC processor to access any part of the memory directly rather than having to proceed sequentially from a starting place. RAM is located close to a computer's processor and enables faster access to data than storage media such as hard disk drives and solid-state drives (Techtarget, 2019).
Whether you're running the model on your CPU or on your GPU, the same amount of memory (in DDR DRAM, and if running on a GPU also in its HBM-based DRAM) is necessary for running the model successfully. The problem is that such large amounts are typically unavailable on edge devices: as far as I know, even some of the latest iPhones have 8 GB of RAM, meaning that Mistral 7B cannot run on them as we speak.
The LLM in a Flash paper written by Alizadeh et al. (2023) is an attempt to improve this situation. The authors, who all work for Apple (so I am not surprised by their interest in this problem), propose a core idea for allowing models larger than the available DRAM to run on edge devices:
What if it is possible to store all model weights in flash memory, then load the critical weights into DRAM, followed by dynamically loading the other weights when necessary?
The idea looks as follows:
The memory architecture as proposed by Alizadeh et al. (2023).
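As a mental model, you can think of this as keeping the full weight matrices in a memory-mapped file on flash and only materializing the rows you actually need in DRAM. Below is a minimal sketch of that idea in Python - my own illustration, not the authors' implementation, and the file `ffn_up.npy` is hypothetical:

```python
import numpy as np

# The full weight matrix stays on flash; mmap_mode ensures it is not
# read into RAM up front.
weights_on_flash = np.load("ffn_up.npy", mmap_mode="r")

def load_rows(row_indices):
    """Copy only the requested rows from flash into DRAM."""
    return np.ascontiguousarray(weights_on_flash[row_indices])

# Only the rows that are actually needed are brought into memory.
dram_resident = load_rows([3, 17, 42])
```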
While quite a good idea (as the results show that it is feasible to run models larger than the available DRAM this way), the authors describe some roadblocks (Alizadeh et al., 2023). These are primarily related to the slow transfer speed between flash memory and DRAM, and to the limited size of DRAM itself. They propose three main strategies (with various sub-strategies) to deal with those roadblocks:

- Reducing the amount of data transferred from flash memory into DRAM.
- Reading larger chunks to increase transfer throughput.
- Optimized memory management within DRAM.
Let's analyze these in more detail.
Transfer-speed wise, the path between flash memory and DRAM is a bottleneck. One very simple way to reduce the impact of this bottleneck is to reduce the amount of data transferred from flash memory into DRAM. In the paper, the authors describe that it is in fact possible to do this without compromising on model quality.
They describe three sub-strategies which they tested for doing so: selective weight persistence, anticipating ReLU sparsity, and a sliding window for managing neuron data.
Firstly, the authors propose to be selective when choosing what weights to persist - i.e., to store in DRAM.
We opt to retain the embeddings and matrices within the attention mechanism of the transformer constantly in RAM. For the Feed-Forward Network (FFN) portions, only the non-sparse segments are dynamically loaded into DRAM as needed (Alizadeh et al., 2023).
According to the authors, for the OPT 7B and Falcon 7B models, this means that approximately 1/3 of the model is stored in DRAM by default. The other 2/3 is loaded dynamically.
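In code, selective persistence could look roughly like the sketch below, which partitions a model's parameters by name into a DRAM-resident group (embeddings and attention) and an on-demand group (the FFN). The name patterns are made up for illustration and will differ per model:

```python
def partition_parameters(named_params):
    """Split parameters into DRAM-resident and dynamically loaded groups."""
    resident, on_demand = {}, {}
    for name, tensor in named_params:
        # Hypothetical naming convention: embeddings and attention stay in DRAM.
        if "embed" in name or "attn" in name:
            resident[name] = tensor
        else:
            # FFN weights are loaded from flash only when needed.
            on_demand[name] = tensor
    return resident, on_demand
```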
This already reduces the memory requirements quite significantly, although we're not there yet, as we haven't loaded all necessary components at this point in order to complete our forward pass.
Okay, good. We need to load the other weights dynamically. But how do we do that? One way is by benefiting from the inherent sparsity of feed-forward layers in LLMs, due to the ReLU activation function.
The ReLU activation function naturally induces over 90% sparsity in the FFN’s intermediate outputs, which reduces the memory footprint for subsequent layers that utilize these sparse outputs (Alizadeh et al., 2023).
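You can measure this kind of sparsity yourself with a few lines of NumPy. Note that the random inputs below produce only about 50% zeros; the 90%+ figure comes from the activation statistics of actually trained LLMs:

```python
import numpy as np

def relu_sparsity(pre_activations: np.ndarray) -> float:
    """Fraction of entries that are exactly zero after applying ReLU."""
    post = np.maximum(pre_activations, 0.0)
    return float((post == 0.0).mean())

# Random pre-activations; a trained FFN would be far sparser.
x = np.random.randn(4, 11008)
print(f"sparsity: {relu_sparsity(x):.2%}")
```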
The authors propose to "employ a low-rank predictor to identify the zeroed elements post-ReLU". What this means can be seen in the image below (and low-rank benefits are explained in the context of LoRA here).
What is happening is that the model is slightly altered. A low-rank predictor (essentially two matrices of dimensionality `N x r` and `r x M`, where `r` is much smaller than both `N` and `M`) is added to the output of each attention layer, next to the feed-forward layer that is already there. The low-rank nature means that an `N x M` matrix can be computed by multiplying both matrices without adding many extra parameters to the model (which would offset the benefits). In other words, by adding these layers, extra learned behavior can be added to the model.
But what learning behavior?
More interesting than the fact that the predictor is low-rank is that it ends in a Sigmoid-activated output (i.e., all outputs are between 0 and 1). Combined with a threshold (if `sigmoid(prediction) > 0.5`, the class outcome is `1`; it is `0` otherwise), this gives you a classification: will this neuron be active or silent? Since this tensor has the same dimensionality as the ReLU activation (both `M`), the prediction can be used as a mask for which neurons will be active.
And only these weights will be loaded dynamically from flash memory into DRAM before the up projection is computed.
In other words, the selective persistence strategy means that roughly 1/3 of the 7B models used is loaded up front; anticipating ReLU sparsity then determines which part of the remaining 2/3 is loaded when needed.
Low rank predictor besides the FC up projection (Alizadeh et al., 2023).
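A minimal sketch of what such a predictor could look like, with made-up dimensions and random (untrained) matrices purely for illustration - in the paper, the predictor is trained so that its mask matches the actual post-ReLU sparsity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N, M, r = 4096, 11008, 128  # hidden size, FFN intermediate size, low rank

# The low-rank predictor: an N x r and an r x M matrix.
A = np.random.randn(N, r) * 0.02
B = np.random.randn(r, M) * 0.02

def predict_active_neurons(attention_output: np.ndarray) -> np.ndarray:
    """Predict which FFN neurons will be nonzero after ReLU."""
    logits = attention_output @ A @ B      # shape (M,)
    return sigmoid(logits) > 0.5           # boolean mask over the M neurons

mask = predict_active_neurons(np.random.randn(N))
# Only the weights for mask.nonzero() need to be fetched from flash.
```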
Anticipating ReLU sparsity tells you which weights to load dynamically. Naïvely, if your predictor tells you to load specific neurons (and hence weights), you could load all of them from flash memory into DRAM.
But what if certain weights were already loaded?
It would be inefficient, a waste even, to load them another time.
The same is true for keeping weights in DRAM if they are not necessary at a specific point in time.
That's why the authors propose using a sliding window technique for managing which neurons are active, and hence which weights are present in DRAM, at any point during the inference process.
We define an active neuron as one that yields a positive output in our predictive model. Our approach focuses on managing neuron data by employing a Sliding Window Technique. This methodology entails maintaining neuron data only for a recent subset of input tokens in the memory (Alizadeh et al., 2023).
This sliding window technique can be compared with just-in-time loading and works as follows. Suppose the full input sequence is `0:l` and `0:k` is a sliding window over some number of tokens `k << l`. Then `s_agg(k)` is the amount of neuron data stored in DRAM for the sliding window. The goal is to keep `s_agg(k)` approximately equal all the time, even when sliding the window across the sequence. When the window slides one token further (to `1:k+1`), it is possible to predict which new neurons must be loaded via neuron activation prediction (using the low-rank predictor). It then follows that only `s_agg(k+1) - s_agg(k)` new neurons need to be loaded from flash memory into DRAM, as can be seen in the image below (the neurons in dark blue are loaded; the neurons from the initial window are kept in memory). Interestingly, the authors found that fewer and fewer extra neurons need to be loaded for longer sequences, which means that the sliding window approach is really useful for loading neurons just in time. By loading only the necessary neuron data with a sliding window, neurons arrive precisely when you need them, while total DRAM usage is kept as low as possible by also deleting neurons that are no longer necessary.
Sliding window data management (Alizadeh et al., 2023).
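To make the bookkeeping concrete, here is a small sketch of a sliding-window cache - my own simplification of the idea, not the paper's implementation. It tracks which neurons are needed by the last few tokens, and returns the set that must still be loaded from flash and the set that can be evicted from DRAM:

```python
from collections import deque

class SlidingWindowNeuronCache:
    """Keeps neuron data in DRAM only for the last `window_size` tokens."""

    def __init__(self, window_size: int):
        self.window = deque(maxlen=window_size)  # per-token sets of active neurons
        self.in_dram = set()                     # neurons currently held in DRAM

    def step(self, predicted_active: set):
        """Advance one token; return (neurons to load, neurons to evict)."""
        self.window.append(predicted_active)     # deque drops the oldest set itself
        needed = set().union(*self.window)
        to_load = needed - self.in_dram          # fetch these from flash
        to_evict = self.in_dram - needed         # free these from DRAM
        self.in_dram = needed
        return to_load, to_evict

cache = SlidingWindowNeuronCache(window_size=5)
to_load, to_evict = cache.step({3, 17, 42})
```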
Also known as reading larger chunks. Interestingly, the authors found that one large chunk of data is transferred from flash memory faster than the same amount of data split into several smaller chunks.
Flash memory systems perform optimally with large sequential reads. For instance, benchmarks on an Apple MacBook Pro M2 with 2TB flash demonstrate speeds exceeding 6GiB/s for a 1GiB linear read of an uncached file. However, this high bandwidth is not replicated for smaller, random reads due to the inherent multi-phase nature of these reads, encompassing the operating system, drivers, interrupt handling, and the flash controller, among others (Alizadeh et al., 2023).
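If you're curious how this behaves on your own machine, a rough micro-benchmark could look like the sketch below. The file name and sizes are arbitrary (the file must exist and be larger than the amount read), and results depend heavily on hardware, caching and the operating system:

```python
import os, random, time

PATH = "testfile.bin"          # hypothetical large file on flash storage
CHUNK = 32 * 1024              # 32 KiB random reads
TOTAL = 256 * 1024 * 1024      # read 256 MiB in total

def sequential_read():
    with open(PATH, "rb") as f:
        read = 0
        while read < TOTAL:
            data = f.read(4 * 1024 * 1024)  # 4 MiB sequential chunks
            if not data:
                break
            read += len(data)

def random_read():
    size = os.path.getsize(PATH)
    with open(PATH, "rb") as f:
        for _ in range(TOTAL // CHUNK):
            f.seek(random.randrange(0, size - CHUNK))
            f.read(CHUNK)

for name, fn in [("sequential", sequential_read), ("random", random_read)]:
    start = time.perf_counter()
    fn()
    print(f"{name}: {TOTAL / (time.perf_counter() - start) / 2**20:.0f} MiB/s")
```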
For this reason, the authors propose two strategies for making chunk sizes larger, thus attempting to improve transfer throughput. The first is bundling matrix columns and rows; the other is coactivation bundling.
The first strategy, bundling matrix columns and rows, boils down to the observation that in Transformer FFN projection layers, the activation of neuron `i` corresponds to "the usage of the ith column from the upward projection and the ith row from the downward projection". Hence, if neuron `i` is predicted to be active, both sets of weights need to be loaded. Then, why not store them together and read them as one larger chunk, increasing throughput? That is what is meant by bundling matrix columns and rows.
Consequently, by storing these corresponding columns and rows together in flash memory, we can consolidate the data into larger chunks for reading (Alizadeh et al., 2023).
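One way to picture this (my own illustration, not the paper's exact storage format) is to interleave column `i` of the up projection with row `i` of the down projection, so that both can be fetched with a single contiguous read per active neuron:

```python
import numpy as np

d_model, d_ffn = 512, 2048  # small, illustrative dimensions
W_up = np.random.randn(d_model, d_ffn).astype(np.float16)    # columns = neurons
W_down = np.random.randn(d_ffn, d_model).astype(np.float16)  # rows = neurons

# Per neuron i, store its up-projection column and down-projection row
# next to each other: one read of 2 * d_model values per active neuron.
bundled = np.concatenate([W_up.T, W_down], axis=1)  # shape (d_ffn, 2 * d_model)

def read_neuron(i: int):
    """One contiguous read yields both weight vectors for neuron i."""
    chunk = bundled[i]
    return chunk[:d_model], chunk[d_model:]  # up column, down row
```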
The second strategy, which failed, is related to bundling weights of neurons that coactivate - or, in their words,
We had a conjecture that neurons may be highly correlated in their activity patterns, which may enable further bundling (Alizadeh et al., 2023).
To validate this conjecture, the behavior of neurons over a validation set was computed. It was indeed the case that a power-law behavior was followed (i.e., some neurons activate very often, while there is a long tail of not-so-often activating neurons as well). Now, if you call the neurons that coactivate with a particular neuron its closest friends, then you could bundle an often-activating neuron together with its closest friends and load them as a single chunk - potentially reducing the number of transfers.
Unfortunately, that did not work as expected, because
[T]his resulted in loading highly active neurons multiple times [due to the fact that a small set of neurons is active most of the times, and hence is loaded again and again because they also have closest friends that are very active] and the bundling worked against our original intention. It means, the neurons that are very active are ‘closest friend‘ of almost everyone (Alizadeh et al., 2023).
For this reason, this strategy was omitted in further experiments.
Transferring less data from flash memory to DRAM and improving throughput during transfer are two strategies which focus on moving data from A to B. The third strategy proposed by Alizadeh et al. (2023) is optimizing data management in DRAM itself. In other words, if you've moved data from A to B, gains are possible if management of B is done well. Let's take a look at how this works.
If you are transferring (parts of) the weights of your model from flash memory into DRAM, you're effectively going to reallocate parts of the existing weights in memory in order to structure things well. Also, you often need to allocate more DRAM before you can actually put the data there. This incurs time and hence slows down the inference process. For this reason, Alizadeh et al. (2023) propose to preallocate all necessary memory up front and to manage it with a simple data structure, rather than repeatedly reallocating DRAM during inference.
This management structure uses a few variables to keep track of which neuron rows are currently in use within the preallocated matrix, so that adding a neuron boils down to appending its data at the end, and deleting a neuron boils down to copying the last used row into the freed slot - both of which avoid costly reallocations and copies of the whole structure.
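A minimal sketch of how such a preallocated structure could behave - again my own simplification, not the authors' implementation: one big matrix allocated once, a counter for how many rows are in use, appends at the end, and deletions that copy the last used row into the freed slot.

```python
import numpy as np

class NeuronBuffer:
    """Preallocated DRAM buffer for dynamically loaded neuron data."""

    def __init__(self, max_neurons: int, row_size: int):
        self.data = np.empty((max_neurons, row_size), dtype=np.float16)
        self.ids = np.empty(max_neurons, dtype=np.int64)  # neuron id per row
        self.used = 0                                     # number of rows in use
        self.row_of = {}                                  # neuron id -> row index

    def add(self, neuron_id: int, row: np.ndarray):
        """Append a neuron's data; no reallocation takes place."""
        self.data[self.used] = row
        self.ids[self.used] = neuron_id
        self.row_of[neuron_id] = self.used
        self.used += 1

    def remove(self, neuron_id: int):
        """Delete by moving the last used row into the freed slot."""
        i = self.row_of.pop(neuron_id)
        last = self.used - 1
        if i != last:
            self.data[i] = self.data[last]
            moved_id = int(self.ids[last])
            self.ids[i] = moved_id
            self.row_of[moved_id] = i
        self.used -= 1
```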
While I would refer to the LLM in a Flash paper for a more thorough discussion of the results, it's interesting to note that:
The practical outcomes of our research are noteworthy. We have demonstrated the ability to run LLMs up to twice the size of available DRAM, achieving an acceleration in inference speed by 4-5x compared to traditional loading methods in CPU, and 20-25x in GPU (Alizadeh et al., 2023).
In other words: models up to twice the size of the available DRAM can be run, with inference that is 4-5x faster than naive loading when running on CPU, and 20-25x faster when running on GPU.
Alizadeh, K., Mirzadeh, I., Belenko, D., Khatamifard, K., Cho, M., Del Mundo, C. C., ... & Farajtabar, M. (2023). LLM in a flash: Efficient Large Language Model Inference with Limited Memory. arXiv preprint arXiv:2312.11514.
Ganegedara, T. (2023, June 15). Hey GPU, what’s up with my matrix? Medium. https://towardsdatascience.com/hey-gpu-whats-up-with-my-matrix-cb7f6d7ae7d6
Dao, T., Fu, D., Ermon, S., Rudra, A., & Ré, C. (2022). Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35, 16344-16359.
Hardware Corner. (2023, December 12). Mistral LLM: All versions & hardware requirements – Hardware corner. Refurbished Computers: Laptops, Desktops, and Buying Guides. https://www.hardware-corner.net/llm-database/Mistral/
Techtarget. (2019, November 7). What is DRAM (Dynamic random access memory)? How does it work? Storage. https://www.techtarget.com/searchstorage/definition/DRAM
Nvidia. (n.d.). CUDA C++ programming guide. NVIDIA Documentation Hub - NVIDIA Docs. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html