Updated on Nov 9, 2020; Posted on Jul 5, 2019
Updated on November 10th 2020
With the newly announced RTX 3000 series, the GPU “bang for your buck” blog post needs an update. The newly introduced GPUs are competitively priced and have great specs.
Looking at the updated chart, the consumer cards are still on top. The original Pascal cards, however, are beaten by the new RTX 3080. The other RTX 3000 series cards also sit near the top of the chart. Sadly, there is a drawback as of right now: the RTX 3000 series suffers from delivery problems. Prices are thus not stable yet, and getting one of these new GPUs can be a real challenge.
Though graphics processors were originally meant for gamers, any computer science nerd knows that they are extremely valuable in other domains as well. Now that crypto-miners are starting to move to ASICs, prices and stock for GPUs are stabilising again. The deep learning community can take a breath, relax, buy a new batch of GPUs and run some training sessions.
NVIDIA knows that gamers are not the sole target demographic for their products anymore. In September of 2018, they released the NVIDIA Tesla T4: a server-grade inference card for deep learning. The Tesla V100, meant for training, is part of their deep learning line-up as well. These cards are fitted with so-called Tensor Cores for neural network performance. The same Tensor Cores are also present in the latest generations of consumer cards like the RTX 2060, 2070, 2080 and 2080 Ti and the “SUPER” cards. If you Google around, you will find that the 2080 Ti is most often recommended for machine learning (at this point). In this post, we’ll be looking at the full range of machine learning hardware to figure out what is what and find the best “bang 4 buck” card there is.
Now, when looking for a “bang for your buck” graphics card, you need to take note of a couple of pitfalls. First of all, there is a profound difference between the pricing of the consumer cards and the server-grade cards that NVIDIA sells. For example, take the Tesla V100: it is a server-grade card based on the Volta GV100 architecture. There is also a consumer-grade card, the Titan V, based on the same architecture with nearly identical specifications. Both boast 5120 CUDA cores, a TDP of 250 Watts, and around 15 TFLOPS of single-precision floating point performance. The V100 does have more memory: 16GB of HBM2 running at a slightly higher clock speed, compared to the 12GB of the Titan V. The main difference lies in the price: the Titan V sets you back around $3,000, the Tesla V100 around $10,000! What could possibly justify this enormous price difference? NVIDIA would argue that the Tesla V100 has all the perks of a server-grade card: a 3-year warranty, a rating for use in server racks for long periods of time and… well, that’s basically it. Except for one thing: the EULA that goes with the required drivers explicitly forbids the use of consumer cards in data centers. This is why AWS, Azure and Google Cloud do not offer Titan Vs. Basically, you pay around $7,000 more for the right to use the card in a data center. We doubt that a Titan V would catch fire if used in a server configuration. Either way, if you are a researcher or deep learning hobbyist, stay away from the server-grade cards. Unless you are planning on deploying a data center with graphics cards, there is absolutely no need for them.
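To make the price gap concrete, here is a minimal sketch that turns the rough figures quoted above into a “bang for your buck” number. The metric (FP32 GFLOPS per dollar) and the function name are our own choices for illustration, and the prices are the approximate ones mentioned in the text, not live market data.

```python
# Approximate figures quoted in the text above (prices in USD).
cards = {
    "Titan V": {"price_usd": 3_000, "fp32_tflops": 15.0},
    "Tesla V100": {"price_usd": 10_000, "fp32_tflops": 15.0},
}


def gflops_per_dollar(card):
    """Single-precision GFLOPS obtained per dollar spent (higher is better)."""
    return card["fp32_tflops"] * 1_000 / card["price_usd"]


for name, card in cards.items():
    print(f"{name}: {gflops_per_dollar(card):.1f} GFLOPS per dollar")
# Titan V: 5.0 GFLOPS per dollar
# Tesla V100: 1.5 GFLOPS per dollar
```

With these numbers, the consumer card delivers roughly three times the single-precision throughput per dollar. Plugging in current prices and FP32 figures for other cards would give the same kind of comparison.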
Second, take the TFLOPS-rating NVIDIA boasts on their website with a grain of salt. We will be looking at the Tesla V100 and T4, as these are the cards that NVIDIA mainly markets for deep learning. Basically, there are two numbers that NVIDIA keeps using, and both have their fair share of asterisks:
Deep learning performance: For the Tesla V100, this is apparently 125 TFLOPS, compared to the 15 TFLOPS single-precision performance. That is an insane number. How do they get there? Well, it’s based on NVIDIA’s “mixed precision performance”. Basically, using some mathematical trickery, NVIDIA managed to combine the advantages of both FP32 and FP16 training: fast results and accurate convergence. The 640 Tensor Cores introduced in the Tesla V100 are specifically built to accelerate half-precision training, which allows them to reach these insane performance results. The pragmatic deep learning experts will now think: “Great, how do I turn it on?”. Well, if you are using TensorFlow, you need the “NVIDIA NGC TensorFlow 19.03 container” and have to run it with Docker. Then, enable an environment variable and there you go! To quote NVIDIA: “We are also working closely with the TensorFlow team at Google to merge this feature directly into the TensorFlow framework core.” (Other libraries like PyTorch and MXNet also have support.) So you bring in your own model and start training, but you only see a 10% performance increase… why? Well, you basically need to contact NVIDIA support to figure out that the “mixed-precision training” feature only benefits simple 2D convolution operations. Anything beyond the basic building blocks of popular image recognition models like ResNet-50 (coincidentally, this model is used in almost all of NVIDIA’s charts) does not yield the promised 3x performance increase. Technically, if you manage to get mixed-precision training to work, which is far from a given, and have a simple model that relies heavily on 2D convolutional layers, you might see a large speed-up. In any other case, you will not. Despite this, NVIDIA markets the performance of the V100 using that 125 TFLOPS number - they call it the “DEEP LEARNING” performance metric.
On the inference side of things, something similar is happening. We’re looking at the NVIDIA Tesla T4. NVIDIA boasts extreme numbers, but they are all based on FP16 performance or quantised performance. The first one might actually work for you, but again, only if your model consists of operations that meet NVIDIA’s list of requirements. When using different methods for quantisation, you are unlikely to reach the advertised performance numbers. The most ridiculous number listed on their website is the INT4 performance: there is simply no support for INT4 inference using NVIDIA’s TensorRT library at all. It does not exist (presumably you would need to write native CUDA code for this?). This performance metric is completely theoretical. INT8 is supported in the same way FP16 is: somewhat. Always look for the FP32 (single-precision) performance. This is the default for TensorFlow and, at this point in time, supports all of TensorFlow’s built-in ops.
Either way, even if your model does benefit from the Tensor Cores, the documentation around these features is very limited. It comes with an endless list of minimum software requirements (regarding CUDA, TensorFlow, TensorRT, etc.) and you will need to sign up for the NVIDIA Developer Program. Support in TensorFlow is expected soon, but will have the same limitations. We would advise you to take a really good look at the small grey text below NVIDIA’s comparison graphs. It will usually tell you that 1) they used something other than the default FP32 training and 2) they put the card into the beefiest computer system imaginable. For example, on their extensive performance comparison page, all reported performance metrics are based on something other than FP32-based training, even though the large majority of deep learning researchers and professionals use single-precision training and inference to this day.
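As a small illustration of why half precision needs the “mathematical trickery” mentioned earlier, the sketch below shows one commonly documented ingredient of mixed-precision training, loss scaling. This is our own toy example and not NVIDIA’s implementation: it uses only Python’s standard library, the gradient value and scale factor are hypothetical, and Python floats (64-bit) stand in for the higher-precision copy.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_gradient = 1e-8   # hypothetical gradient value, chosen for illustration
loss_scale = 1024.0    # hypothetical scale factor (a power of two)

# Stored directly in FP16, the gradient is too small and underflows to zero.
print(to_fp16(tiny_gradient))  # 0.0

# Scaling before the FP16 round trip keeps the value representable;
# dividing by the same factor afterwards, in higher precision, recovers it
# up to a small rounding error.
scaled = to_fp16(tiny_gradient * loss_scale)
recovered = scaled / loss_scale
print(recovered)
```

The takeaway is that FP16 alone silently loses small values, which is why mixed-precision training keeps part of the computation in higher precision, and why it does not come for free for arbitrary models.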
Unfortunately, the alternatives right now are scarce. Google is starting to release some of their TPU-based products, but they are not as great as Google would like you to think. Though the theoretical bang for buck could be up to 5 times better, in practice, support is not great and you are easily tied to Google infrastructure. TPUs are not ready for primetime yet, but they are promising. Other chips, such as FPGAs, are starting to hit the market (Intel is doing some work here, other lesser known companies as well, Google it!). All of them have questionable performance metrics. Again, there are a lot of INT8 and FP16 performance metrics. We even saw some companies comparing performance between INT8 models on their hardware and FP32 models on Tesla GPUs - don’t fall for it! Lastly, most of these chips are not supported by TensorFlow out of the box. Intel has their own framework (like TensorRT) called OpenVINO. It requires your model to be converted and does not have support for all the native TensorFlow operations, of course. The same goes for AMD’s ROCm + TensorFlow toolchain. We have yet to find a platform that is fully compatible with everything TensorFlow already has to offer. There are simply too many caveats at this point. You will likely end up working for days until you finally figure out that your specific model is either not compatible with the hardware or - even worse - is compatible but does not run as fast as it would on a cheaper NVIDIA GPU. On the other hand, if you only work with out-of-the-box, well-known models like VGG-16, Inception or something of the like, you might be in luck!
Now that we know not to buy server-grade cards (unless you’re Google) and have an understanding of how to navigate the NVIDIA product line, we can make some purchase recommendations. First of all, the most expensive cards in NVIDIA’s line-up are almost never the best in terms of “bang for your buck”. The Titan cards, and even the newest 2080 Ti (or the just released 2080 “Super”), are most likely overpriced. Compared to second-hand flagship cards, they cost twice as much for about a 20% performance increase. You are better off getting a second-hand top-of-the-line gaming card from the previous generation, unless you have special memory requirements. We are not just making this up! We compiled a somewhat complete table of graphics cards and have taken into account multiple metrics of performance:
The table shows that the GTX 1080 provides the best bang 4 buck compared to other cards. We would still recommend the GTX 1080 Ti, because of the larger memory. Especially for models with 3D convolutions - or models that are just large - the extra memory capacity can yield performance gains greater than those shown if you use a high batch size. The current generation cards are a good choice if you prefer brand new cards or cannot buy second-hand (1080s and 1080 Tis are not sold first-hand anymore). Interestingly, the new “Super” cards - especially the 2060 and 2070 SUPER - are great value as well. The RTX 2060 and GTX 1660 offer great value for smaller models. An RTX 2080 Ti used to be the go-to first-hand card for larger models, even though it is rather expensive for the performance it has to offer. As of July 2019, we recommend the Super cards, as they have been released at the exact same price, but with much better performance. We haven’t been able to reliably determine the second-hand price for the non-Super cards yet. Once those prices are in, they might become a good choice as well. As expected, server-grade cards have extremely low bang 4 buck. We would recommend against getting a Tesla V100 in any situation unless money is of no concern and you require the best performance possible.
Please note we updated our recommendations after this blog post originally came out in 2019. Scroll to the top of this article to read our updated recommendation.
Keep the following in mind when shopping for a video card: