The FLOPs Calculus of Language Model Training

Training a large Transformer requires many [flip-]FLOPs

In this article, I will offer you a very useful tool to reason about large Transformer LMs. This tool will help you roughly answer questions like: how long will training a model of a given size take on a given cluster, and how many GPUs do you need to train it in a reasonable time?

It turns out that quick back-of-the-envelope calculations can be sufficient to answer these questions if you use a simple equation that ties together the training compute C, the model size N (the number of parameters), and the dataset size D (the number of training tokens).

Without further ado, meet the Transformer FLOPs Equation:

C ≈ 6ND.

A slightly more sophisticated version of the equation expresses the compute C as the product of the cluster’s throughput 𝜏 and the training time T:

𝜏T = 6ND.

Let’s apply the Transformer FLOPs Equation to some middle-school style problem solving:
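Here is a minimal sketch of such a calculation in Python. The numbers are assumed round figures in the ballpark of HyperCLOVA (an 82B-parameter model trained on about 150B tokens on 1024 A100 GPUs); treat them as illustrative inputs rather than official specifications.

```python
# Back-of-the-envelope estimate from the Transformer FLOPs Equation
# tau * T = 6 * N * D. All inputs are assumed round figures.

N = 82e9              # model size: number of parameters
D = 150e9             # dataset size: number of training tokens
n_gpus = 1024         # A100 GPUs in the cluster
peak_flops = 312e12   # theoretical bfloat16 peak of one A100, FLOP/s

C = 6 * N * D                  # total training compute, FLOPs
tau = n_gpus * peak_flops      # naive cluster throughput, FLOP/s
T_seconds = C / tau

print(f"compute: {C:.2e} FLOPs")
print(f"naive training time: {T_seconds / 86400:.1f} days")
```

This naive estimate comes out to under three days, far shorter than what such training runs actually take.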

As I explain later in this post, the underestimation comes from naively plugging in the theoretical peak throughput 𝜏, which is not achievable in distributed training, nor when models do anything other than large matrix multiplications. If you correct 𝜏 accordingly (I will discuss this later), the FLOPs equation becomes much more accurate. The other correction is that with activation checkpointing², which is a must for the largest models, the required compute C goes up to ≈ 8ND.

To derive the Transformer FLOPs equation we will have to make a key assumption.

The FLOPs that matter the most are weight FLOPs, that is, the ones performed when intermediate states are multiplied by weight matrices.

The weight FLOPs are the majority of Transformer FLOPs, meaning that we can put aside the FLOPs required for bias vector addition, layer normalization, residual connections, non-linearities, softmax and even attention. If you find this hard to believe, you have a point: while the other FLOPs are less numerous, they require a lot of memory access and will in practice matter quite a bit. I will return to this later.
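As a rough sanity check of the attention part of this claim, one can compare the per-token attention-matmul FLOPs with the per-token weight FLOPs, using the 6-FLOPs-per-weight accounting derived just below. The hyperparameters here are illustrative assumptions, and the count ignores the causal mask, which would roughly halve the attention term.

```python
# Rough per-token, per-layer comparison of attention FLOPs vs weight FLOPs
# for training (forward + backward is about 3x the forward pass).
# Hyperparameters are illustrative assumptions.

d_model, seq_len = 1600, 1024

weight_flops = 6 * 12 * d_model ** 2        # ~12 * d_model^2 weights per Transformer block
attn_flops = 3 * (2 * seq_len * d_model     # Q @ K^T attention scores
                  + 2 * seq_len * d_model)  # attention-weighted sum of values

print(f"attention / weight FLOPs per token per layer: {attn_flops / weight_flops:.1%}")
# ~10% for these settings, and less with a causal mask
```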

The beauty of matrix multiplications is that each of them adds a predictable and easy to compute number of FLOPs to the training total:

weight FLOPs for multiplying by a matrix W = 6 × (batch size) × (number of entries in W)
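Here is the same accounting spelled out for a single linear layer; the shapes are illustrative assumptions. The forward pass performs one matrix multiplication, and the backward pass performs two (one for the input gradient and one for the weight gradient), each costing 2 FLOPs, a multiply and an add, per batch element per weight.

```python
# Weight-FLOP accounting for one weight matrix W of shape (d_in, d_out)
# applied to a batch of B examples (or tokens). Shapes are illustrative.

B, d_in, d_out = 8192, 4096, 4096
W_size = d_in * d_out

forward = 2 * B * W_size          # y = x @ W
backward_input = 2 * B * W_size   # dL/dx = dL/dy @ W.T
backward_weight = 2 * B * W_size  # dL/dW = x.T @ dL/dy

total = forward + backward_input + backward_weight
assert total == 6 * B * W_size
print(f"{total:.2e} weight FLOPs for this layer and this batch")
```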

This Weight FLOPs Equation can take some time to wrap one’s head around. To understand where it comes from, consider a single weight w that connects an input unit i to an output unit j.

For each example in the batch, the weight w generates exactly 6 FLOPs combined in the forward and backward pass:

- forward pass: w multiplies the activation of unit i (1 FLOP), and the product is added into the pre-activation of unit j (1 FLOP);
- backward pass, input gradient: the gradient at unit j is multiplied by w (1 FLOP) and added into the gradient of unit i (1 FLOP);
- backward pass, weight gradient: the gradient at unit j is multiplied by the activation of unit i (1 FLOP) and added into the gradient accumulator of w (1 FLOP).

The Weight FLOPs Equation directly follows from the fact that we need 6 FLOPs per example per weight. And from this equation follows the Transformer FLOPs Equation. To understand this, think about how many multiplications by each weight matrix the Transformer performs for each input token, no matter how many input sequences the batch consists of. The answer is exactly 1 for each weight matrix! So the total number of weight FLOPs for each token is 6 times the model size N, Q.E.D.
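To make the 6N-per-token rule concrete, here is a rough parameter count for a GPT-style decoder and the resulting weight FLOPs per training token. The hyperparameters below are roughly GPT-2 XL sized but should be treated as illustrative assumptions, and the count ignores biases and layer-norm parameters.

```python
# Rough parameter count for a GPT-style Transformer and the per-token
# training weight FLOPs it implies. Hyperparameters are illustrative.

n_layer, d_model, vocab = 48, 1600, 50257

# Each block: ~4 * d_model^2 for attention (Q, K, V, output projections)
# plus ~8 * d_model^2 for the MLP (two matrices with 4x expansion).
params_per_block = 12 * d_model ** 2
N = n_layer * params_per_block + vocab * d_model  # plus token embeddings

print(f"N ~= {N / 1e9:.2f}B parameters")
print(f"weight FLOPs per training token ~= {6 * N / 1e9:.0f} GFLOPs")
```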

The other FLOPs (softmax, layer norm, activations, etc.) should be even more negligible in terms of raw counts, but there is a catch: GPU memory bandwidth becomes the bottleneck when these operations are performed. In practice these elementwise operations can take non-negligible time. I therefore find it helpful to think about the weight FLOPs (WFLOPs) throughput that a particular implementation can deliver on particular hardware.
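The easiest way I know to get this number for a given setup is to time a training step and divide the weight FLOPs it performs by the wall-clock time. A minimal sketch, where the model size, batch size and step time are assumed placeholder measurements:

```python
# Effective weight-FLOPs (WFLOPs) throughput of a training setup:
# weight FLOPs performed per second of wall clock, per device.
# All inputs are assumed placeholder measurements.

N = 1.5e9                 # model parameters
tokens_per_step = 32768   # global batch size in tokens (e.g. 32 sequences of 1024)
step_time = 4.3           # measured seconds per optimizer step
n_gpus = 1                # devices used for the step

wflops_per_step = 6 * N * tokens_per_step
throughput = wflops_per_step / step_time / n_gpus

print(f"{throughput / 1e12:.0f} teraWFLOP/s per GPU")
```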

The theory of counting Transformer FLOPs is elegant, but as seen in the HyperCLOVA example, applying it naively results in a significant underestimation of the time required to train a language model. For more precise reasoning, we need a better idea of what actual WFLOPs throughput looks like.

I have done a little case study on an A100 GPU. According to Nvidia documentation, it can deliver up to 312 bfloat16 teraFLOP/s, that is, 3.12e14 operations per second! Nvidia docs also show how ~250 teraFLOP/s can actually be achieved by doing 4096 x 8192 x 4096 matrix multiplications, and I was able to reproduce that. But what practical throughput can we get when training neural networks?
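For reference, here is the kind of timing snippet I would use to reproduce such a number; it is my own sketch rather than the exact benchmark from the Nvidia docs, and the iteration counts are arbitrary.

```python
import torch

# Rough bfloat16 matmul throughput measurement on a single GPU,
# using the 4096 x 8192 x 4096 shape discussed above.

m, k, n = 4096, 8192, 4096
a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)

# Warm up so cuBLAS selects its kernels before timing starts.
for _ in range(10):
    out = a @ b
torch.cuda.synchronize()

n_iters = 100
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(n_iters):
    out = a @ b
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000  # elapsed_time is in milliseconds
flops = 2 * m * k * n * n_iters           # one multiply-add = 2 FLOPs
print(f"{flops / seconds / 1e12:.0f} teraFLOP/s")
```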

Here are the throughputs:

The MLP throughput looks encouraging, but for the actual GPT-2 implementation from HuggingFace Transformers the throughput was merely 68 teraWFLOP/s. I have not looked deeper into the exact breakdown, but a likely explanation is that the memory-intensive computations, such as residual connections, activations, layer normalization, attention masking and attention softmax, add up to a significant cost when combined.

In this article I have shared with you the Transformer FLOPs Equation that makes reasoning about extremely large language models easy. The equation ties together the throughput 𝜏, the training time T, the model size N and the number of training tokens D:

𝜏T = 6ND.

Looking at publicly available white papers, the throughput 𝜏 is likely to be anywhere between 50 and 150 teraWFLOP/s per A100 GPU.

My favorite corollary of this equation is that, assuming constant throughput and a fixed number of training tokens D, the training time grows linearly with the model size. So if you double the model size, you have to either use twice as many GPUs or wait twice as long. Easy mathematics that you can do in your head!

The FLOPs calculus for LSTMs would look very similar to that of the Transformer: the total number of FLOPs grows linearly with the model size. The achievable throughput, however, is a key factor explaining their demise. To train LSTMs on long sequences while fitting in GPU memory, one has to reduce the batch size, and with a small batch size the GPU throughput of the sequential LSTM computations falls dramatically. For example, for a 32x1600x6400 matrix multiplication the throughput is below 20 teraFLOP/s, more than 10 times slower than for 8192x1600x6400! Recurrence comes at a price: the computation for later tokens must wait until the computations for previous tokens are done, which makes the computation less parallel and thus less GPU-friendly.

I hope you found this article useful! Many thanks to Harm de Vries, Amine el Hattami, Torsten Scholak, Nicolas Chapados, Sebastien Paquet, and my other fabulous colleagues at ServiceNow Research for discussions that greatly helped me in researching this topic.

[1] Disclaimer: this is meant to be a popular article, not an academic contribution. While I’m trying to give credit where credit is due, please don’t freak out if you find it insufficiently rigorous; just get in touch and I will try to address the issue.

[2] Note that Nvidia reports different teraFLOP/s numbers, namely 138 teraFLOP/s versus the 100.8 teraWFLOP/s that I calculated. The major source of the difference is that they include the FLOPs needed for the extra forward pass that recomputes the activations. Activation recomputation (checkpointing) allows back-propagation without storing all intermediate states in memory and is now routinely used to train the largest models. The extra forward pass requires 2ND FLOPs. The rest of the difference comes from them including attention and output layer FLOPs. In this article I take the view that activation recomputation FLOPs do not directly contribute to learning and thus reduce the system’s effective WFLOPs throughput. But if you are an HPC person and what you want to showcase is optimal device utilization, it makes sense to include these FLOPs in the total.
