Training in half-precision

Some time ago, I read about this new GPU from NVIDIA that’s supposedly a notch higher. It wasn’t just an update to an old line; it was something new. Well, that’s judging from the amount of press it got and, of course, the new model name: RTX.

It’s also been years since I lost most of my interest in gaming, and so GPUs were only things to make my models train faster and better… things that were mostly too expensive to think too much about owning. When I got into graduate school, I bought a used GTX 1070 just so I wouldn’t have to join the resource competition with my classmates. It was good enough for our mini-projects. Back then we weren’t aiming for research-grade outputs.

But RTX has this feature that can theoretically double its capacity for training deep models: its capability to compute at half-precision. Double capacity??? Obviously, I had to take a look.

Half-precision what?

Basic Computer Science says that computers store information in bits. For example, the number 2 is stored as 0010, while the number 4 is stored as 0100. In fact, computers can represent the first 16 numbers (in programming, that’s 0 up to 15) using just four bits. In other words, if all we’ll ever need in our lives is those 16 numbers, then 4 bits is enough. But of course that isn’t true!
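
Here’s a quick way to poke at this in Python, in case you want to see the bit patterns yourself:

# Bit patterns of a few small numbers in 4 bits
for n in (2, 4, 15):
    print(n, format(n, "04b"))   # 2 -> 0010, 4 -> 0100, 15 -> 1111

print(2 ** 4)                    # 4 bits give 16 distinct values: 0..15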

The standard for computing libraries is 32 bits (single precision). That means we can represent all the numbers from 0 up to 4,294,967,295, or, if it’s signed, from -2,147,483,648 up to 2,147,483,647*. That’s a massive jump from just 16 values with 4 bits! But it also means it’s costly. In practice, we crunch *millions* of numbers. Deep neural networks don’t just *crunch* those numbers; they also have to store the intermediate results in memory since they’re all involved in the training process. My current network has 8 convolutional layers with residual blocks and two fully-connected layers at the end. It has around 2 million parameters; that’s 64 million bits, or 8 million bytes. And that’s just the network parameters. We haven’t even talked about the inputs and the successive outputs at each convolutional layer, and then we also have to store the gradients in memory. There are a lot of calculations here, but to cut things short, I always max out my 8GB VRAM GPU even with that moderately-sized network and small inputs (2x500). *(to be fair, I do use a large batch size: 512, and my method doubles that to 1024…)*

*: this is a naïve calculation and I’m aware that there are more technicalities such as IEEE standards that govern floating point values, but that’s the general idea. More bits = higher range of values represented = higher precision.
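
To double-check the arithmetic above, here’s the back-of-the-envelope version in plain Python (a rough sketch that ignores activations, gradients, and optimizer state):

# 32-bit integer ranges mentioned above
print(2 ** 32 - 1)               # 4294967295 (unsigned max)
print(-(2 ** 31), 2 ** 31 - 1)   # signed range

# Memory for ~2 million parameters stored at 32 vs 16 bits
params = 2_000_000
print(params * 32 // 8)          # 8000000 bytes (~8 MB) at FP32
print(params * 16 // 8)          # 4000000 bytes (~4 MB) at FP16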

Anyway, the point is, training deep networks is fun until you hit a ceiling, and often the only way to improve performance is to make the network deeper (more layers, which means more gradients and outputs to store) or wider (more parameters in each layer, which has much the same effect). Deep learning is theoretical and conceptual at first, but as you go along it becomes more of an experimental exercise, so it can get frustrating if you’re limited by computing power.

Now, if GPUs (which dramatically improve training time by parallelizing matrix multiplications) could handle 16-bit numbers (half precision), it’s intuitive that the memory requirement would also go down by roughly half. We can’t expect it to work out exactly like that, since there’s a lot of other technical hardware/software stuff happening in there (that I do not know…). But even a 25% reduction in consumption is a large boost, and I’ll take it in a heartbeat!

Note: Half-precision doesn’t sound nice at first, but they say the math used in deep learning doesn’t require high precision and does just fine with 16-bit or even 8-bit floats.

NVIDIA’s RTX Line

The RTX GPUs (2060, 2070, 2080…) are built on NVIDIA’s new Turing architecture, which supports calculations at half precision (FP16, or Floating Point 16). With their previous GPUs, calculations other than single precision weren’t optimized. In other words, you couldn’t really do mixed-precision training, the mechanism where the GPU does some calculations at FP16 and others at FP32. It’s a way to optimize GPU efficiency.

[Figure: mixed-precision training results from NVIDIA’s blog post [1], https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/]

You might think this could cause instability, but back in 2016 NVIDIA released a blog post [1] showing that they finally allow mixed-precision calculations, and they gave an example of using such a method to train AlexNet, a CNN (they performed convolutions at INT8!). See the figure above. What an improvement!

Tensor Cores

These cores found in the new Turing and Volta GPUs perform matrix multiplications (which are the costly ones!) in FP16 and then accumulate those results in FP32. Copying directly from NVIDIA’s documentation: “Each Tensor Core performs D = A x B + C, where A, B, C and D are matrices. A and B are half precision 4x4 matrices, whereas D and C can be either half or single precision 4x4 matrices. In other words, Tensor Core math can accumulate half precision products into either single or half precision outputs.”
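
Just to wrap my head around that D = A x B + C description, here’s a small dtype-level sketch in PyTorch (my own illustration, not how the hardware instruction is actually invoked; it assumes a CUDA GPU is available):

import torch

# Mimic "multiply in FP16, accumulate in FP32" at the dtype level.
A = torch.randn(4, 4, device="cuda", dtype=torch.half)   # half-precision input
B = torch.randn(4, 4, device="cuda", dtype=torch.half)   # half-precision input
C = torch.randn(4, 4, device="cuda", dtype=torch.float)  # single-precision accumulator

prod = A @ B                  # product computed on FP16 inputs
D = C + prod.float()          # accumulate the result into FP32
print(D.dtype)                # torch.float32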

Tensor cores are different from the regular CUDA cores, which calculate at FP32. The previous GPUs only had CUDA cores, while the new ones contain both. Here is a chart showing the FP16 performance of modern GPUs:

[Chart: FP16 performance of modern GPUs, from https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664] (GFLOPS: GigaFLOPS, or giga floating-point operations per second)

Up there towering above the rest is the Titan RTX, while the previous king, the 1080 Ti, is way down the chart. Wow, what a difference tensor cores can make. Before today, I didn’t know there was such a large gap between the 1080 Ti and the RTX cards. And they’re just a year apart???

Deep learning framework support for training at half-precision

Here comes the important part for practitioners like me. Can we actually utilize those tensor cores without going through hurdles and machine crashes and headaches trying to decode CUDA errors?

Apparently, there’s this thing called Automatic Mixed Precision (AMP) that is already baked into the popular deep learning frameworks such as TensorFlow and PyTorch.

Also copying from NVIDIA’s documentation, mixed-precision training involves the following:

Using mixed precision training requires three steps:

  1. Converting the model to use the float16 data type where possible.
  2. Keeping float32 master weights to accumulate per-iteration weight updates.
  3. Using loss scaling to preserve small gradient values.

Frameworks that support fully automated mixed precision training also support:

  • Automatic loss scaling and master weights integrated into optimizer classes
  • Automatic casting between float16 and float32 to maximize speed while ensuring no loss in task-specific accuracy
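
To get a feel for why the loss scaling step matters, here’s a tiny toy example (my own illustration, not from NVIDIA’s docs): a gradient value that FP32 represents just fine underflows to zero in FP16 unless we scale it up first.

import torch

# A small FP32 value underflows when cast to FP16...
small_grad = torch.tensor(1e-8, dtype=torch.float32)
print(small_grad.half())              # tensor(0., dtype=torch.float16): underflow!

# ...but survives if we scale it up first, then unscale in FP32.
scale = 2.0 ** 16                     # an example scale factor
scaled = (small_grad * scale).half()  # now representable in FP16 (non-zero)
print(scaled)
print(scaled.float() / scale)         # unscale in FP32: back to ~1e-8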

This requires CUDA version 10.1 or above and NVIDIA driver version 418 or newer.
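
A quick way to check what your setup reports before trying AMP (just a convenience sketch; the authoritative requirements are the ones quoted above):

import torch

# Sanity-check the environment before trying AMP.
print(torch.__version__)              # PyTorch version
print(torch.version.cuda)             # CUDA version PyTorch was built against
print(torch.cuda.is_available())      # is a GPU visible at all?
print(torch.cuda.get_device_name(0))  # which GPU (e.g., an RTX card)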

A test-drive with RTX 2080 Ti and PyTorch 1.5.1 (Nightly)

I’m lucky enough to be able to use our lab’s RTX 2080 Ti for a while, and just from running my original code on it, my per-epoch runtime has already decreased by 12x!! (now just ~30s, from ~360s on my local computer with a GTX 1070). Now I want to see how much lower this can go using FP16. Based on an initial skim of various pages, it seems like it’s already really easy to do. In PyTorch, it apparently only takes a few lines to activate AMP.

It’s only available in the nightly build of PyTorch, so I had to update my packages first. Then, as they promised, it was a breeze to turn on mixed precision ops. Really, these are all the lines I needed to add to my code:

from torch.cuda import amp

# Created once, e.g. in the trainer's __init__
self.grad_scaler = amp.GradScaler()

if self.use_fp16:
    # Run the forward pass and loss computation in mixed precision
    with amp.autocast():
        out, loss = _forward(batch)
    # Scale the loss before backprop so small gradients don't underflow
    self.grad_scaler.scale(loss).backward()
    # Unscale the gradients and take the optimizer step
    self.grad_scaler.step(self.model.optim)
    # Adjust the scale factor for the next iteration
    self.grad_scaler.update()

autocast takes care of all the conversions during the forward pass and the loss calculation, while the gradient scaler, GradScaler, fixes possible underflow problems due to the loss of precision. There are a few extra lines to add if I use a different training process (e.g., multiple loss functions), but those also look simple and well-documented.
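
For example, here’s a rough sketch of how I understand the multiple-loss case would look (the model, losses, and data below are made-up placeholders, not my actual code):

import torch
from torch.cuda import amp

# Hypothetical minimal setup, just to show the multiple-loss pattern.
model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = torch.nn.MSELoss()
scaler = amp.GradScaler()

x = torch.randn(8, 16, device="cuda")
y = torch.randn(8, 4, device="cuda")

optimizer.zero_grad()
with amp.autocast():
    out = model(x)
    loss_a = criterion(out, y)    # first loss term
    loss_b = out.abs().mean()     # second (made-up) loss term

# Scale and backprop each loss; retain_graph keeps the graph alive
# for the second backward pass.
scaler.scale(loss_a).backward(retain_graph=True)
scaler.scale(loss_b).backward()

scaler.step(optimizer)  # unscales gradients, then steps
scaler.update()         # adjusts the scale factor for the next iteration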

Now, moment of truth!

I ran my network with two sets of parameters: a small one with ~2.5M parameters and a batch size of 512, and a larger one with ~8M parameters and a batch size of 256.

Running my network with amp turned on used up only a bit more than 6000MB of GPU memory. For the small network, I got a memory reduction of 27%, and 33% on the large one. Not the 50% reduction I was hoping for, but hey, that was being hopeful. A 30% decrease is welcome! What I like even better, though, is the reduced time to go through an epoch: 40%!! That’s closer to half and IT’S GREAT. I usually stick with the small network because I don’t get much of a boost with the larger one. As I said, on my personal computer an epoch takes me almost six minutes. Now… not even a quarter of one. Good lord, I now sometimes feel like I’m not even doing deep learning 🤣 and maybe I don’t deserve to call my thesis a thesis anymore. It’s crazyyy!

How’s the performance, though? Thankfully, the loss values are pretty much the same. Here are some logs from the small network:

[Log: first 10 epochs without amp on the small network]

[Log: first 10 epochs with amp on the small network]

And here are logs from the large one. The difference between the two runs is at the third decimal place. Really, should I be bothered?

[Logs: first epochs with and without amp on the large network]

I also didn’t observe any big performance difference on the classification tasks (I do unsupervised learning to learn representations, which I then use as inputs to different tasks).

Conclusion

Technology is amazing. There’s all this talk about Moore’s Law and how it’s maybe slowing down? I’m really not into those engineering things, but when you experience a change as stark as this, it’s mind-boggling. Much appreciation, too, for the folks who coded up AMP for all of us who can’t be bothered to learn CUDA programming.

Now if only we can have more of these GPUs in our lab, we’d probably produce not only more research papers but more graduates! 😭

References

[1] Mixed-Precision Programming with CUDA 8. NVIDIA Developer Blog. https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/