
I'm pretty sure this isn't using the Tensor cores on the GPU.

If you look here (https://github.com/ultralytics/yolov5/blob/master/README.md), the speed of inference on a V100 for YOLOv5s should be 2 ms per image, or 500 img/s, not the 44.6 img/s being reported here.

This is important as it is more than an order of magnitude off.



My guess is they are using tensor cores as they report FP16 throughput, but they seem to be measuring at batch size 1, which is hugely unfair to the GPUs.

For inference workloads you usually batch incoming requests together and run once on GPU (though this increases latency). A latency/throughput tradeoff curve at different batch sizes would tell the whole story.
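To make the tradeoff concrete, here's a toy model of how a fixed per-launch overhead plus a per-image compute cost plays out across batch sizes. The numbers are made-up placeholders, not measurements from either system:

```python
# Toy latency/throughput model: fixed_ms is the assumed per-launch
# overhead, per_img_ms the assumed marginal compute per image.
fixed_ms, per_img_ms = 1.5, 0.5  # placeholder values, not measured

for batch in (1, 4, 16, 64):
    latency_ms = fixed_ms + per_img_ms * batch  # time to finish one batch
    throughput = batch / latency_ms * 1000      # images per second
    print(f"batch={batch:3d}  latency={latency_ms:6.1f} ms  "
          f"throughput={throughput:7.1f} img/s")
```

Latency grows roughly linearly with batch size while throughput climbs toward an asymptote, which is exactly why a single (latency, throughput) point at batch 1 doesn't tell the whole story.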

Also, they are using INT8 on CPU and neglect to measure the same on GPU. All the GPU throughputs would roughly double.

tl;dr just use GPUs

Edit: to the comments below, I agree low-latency can be important to some workloads, but that's exactly why I think we need to see a latency-throughput tradeoff curve.

Unfortunately, I'm pretty sure that modern GPUs (A10/A30/A40/A100) basically dominate CPUs even when latency is constrained, and the MLPerf results give a good (fair!) comparison of this:

https://developer.nvidia.com/blog/extending-nvidia-performan...

The GPU throughputs are much, much higher than the CPU ones, and I don't think even NM's software can overcome this gap. Not to mention they degrade the model quality...

The last question is whether CPUs are more cost-effective despite being slower, and the answer is still... no. The instances used in this blog post cost:

- C5 CPU (c5.12xlarge): $2.04/hr
- T4 GPU (g4dn.2xlarge): $0.752/hr

NM's best result @ batch-size-1 costs more, at lower throughput, at lower model quality, at ~same latency, than a last-gen GPU operating at half its capacity. A new A10 GPU using INT8 will widen the perf/$ gap by another ~4x.
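For reference, the perf/$ arithmetic behind this kind of claim is just throughput-per-price. The prices are the on-demand rates quoted above; the throughput figures below are illustrative placeholders, not measured numbers:

```python
# Perf/$ sketch using the on-demand prices quoted above.
# Throughputs are placeholders -- substitute your own measured img/s.
prices = {"c5.12xlarge (CPU)": 2.04, "g4dn.2xlarge (T4)": 0.752}      # $/hr
throughput = {"c5.12xlarge (CPU)": 45.0, "g4dn.2xlarge (T4)": 500.0}  # img/s (assumed)

for name, usd_per_hr in prices.items():
    imgs_per_dollar = throughput[name] * 3600 / usd_per_hr
    print(f"{name}: {imgs_per_dollar:,.0f} images per dollar")
```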

Also full disclosure I don't work at NVIDIA or anything like that so I'm not trying to shill :) I just like ML hardware a lot and want to help people make fair comparisons.


Disclosure: I work for Neural Magic.

Hi ml_hardware, we report results for both throughput and latency in the blog. As you noted, the throughput performance for GPUs does beat out our implementations by a bit, but we did improve the throughput performance on CPUs by over 10x. Our goal is to enable better flexibility and performance for deployments through more commonly available CPU servers.

For throughput costs, this flexibility becomes essential. The user could scale down to even a single core if they wanted to, with a much larger improvement in cost-performance. We walk through these comparisons in more depth in our YOLOv3 blog: https://neuralmagic.com/blog/benchmark-yolov3-on-cpus-with-d...

INT8 wasn't run on GPUs because of issues with operator support in the conversion from PyTorch graphs to TensorRT (PyTorch currently doesn't support INT8 on GPU). We are actively working on this, though, so stay tuned as we run those comparisons!

The models we're shipping will see performance gains on A100s as well, due to their support for semi-structured sparsity. Note, though, that A100s are priced higher than the commonly available V100s and T4s, which needs to be factored in. We generally limit our current benchmarks to what is available in the top cloud services, to represent what is deployable for most people on servers. This focus on availability is why we don't consider MLPerf a great source for most users. MLPerf has done a great job standardizing benchmarking and improving numbers across the industry; still, the submitted systems are hyper-engineered for MLPerf numbers, and most customers cannot realize them due to the cost involved.

Finally, note that the post-processing for these networks is currently limited to CPUs due to operator support. This limitation will become a bottleneck for most deployments (it already is, both for GPUs and for our YOLOv5s numbers). We are actively working on speeding up the post-processing by leveraging the cache hierarchy in the CPUs through the DeepSparse engine, and are seeing promising early results. We'll be releasing those soon to show even better comparisons.
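For readers unfamiliar with what "post-processing" covers here: YOLO-family detectors are dominated by non-maximum suppression (NMS), which filters overlapping boxes down to one detection per object. A minimal pure-Python sketch of the idea (illustrative only, not the DeepSparse implementation):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep

# Two overlapping boxes plus one disjoint box: the lower-scoring
# overlapping box is suppressed.
print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)],
          [0.9, 0.8, 0.7]))  # [0, 2]
```

The pairwise IoU loop is branchy and memory-bound, which is why it maps poorly to GPUs and why cache-friendly CPU implementations can help.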


It depends on your application. If you are running on a smartphone, or on an AR headset, or on a car, or on a camera, etc, you generally do not have the latency budget to wait for multiple frames and run at high batch size.


V100 GPUs have non-tensor-core fp16 operations too, I think


Yes. Non-tensor-core fp16 ops are the default. Tensor cores are essentially 4x4 fp16 MAC units, and matrix dimensions must be multiples of 8 [1] for them to be used.

[1]: https://docs.nvidia.com/deeplearning/performance/mixed-preci...
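A quick sketch of the round-up-to-a-multiple-of-8 padding rule frameworks apply so that GEMM shapes satisfy this requirement (`pad_to_multiple` is just an illustrative helper name):

```python
def pad_to_multiple(n: int, m: int = 8) -> int:
    """Round n up to the next multiple of m (m=8 for fp16 tensor cores)."""
    return ((n + m - 1) // m) * m

print(pad_to_multiple(30))  # 32
print(pad_to_multiple(64))  # 64 (already aligned, no padding)
```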


That's true... in fact, seeing V100 FP16 < T4 FP16 makes me believe you're right; the V100 should be much faster if the tensor cores were being used.


Batch size 1 improves latency, especially for businesses/services with fewer users. Latency matters.

Also, your CPU cost numbers are way off: you're using an expensive provider like AWS instead of, say, Vultr (https://www.vultr.com)

And many businesses/services can't saturate the hardware you describe. It's just too much compute power. With CPUs you can scale down to fit your actual needs: all the way down to a single AVX-512 core doing maybe 24 inferences per second (costing a few dollars PER MONTH).


I was providing costs for the exact instance types that NeuralMagic used in their blog post, if we’re allowed to change that then I can also find cheaper GPU providers.

I can agree with you that on super, super small inference deployments, maybe you can lower monthly spend by using CPUs. But I must ask: who is the target customer that is both spending <$100/month and also trying to optimize this? I feel like big players will have big workloads that will be most cost-effective on GPUs.


Disclosure: I work for Neural Magic.

Hi deepnotderp, as noted by others, the speeds listed here compare Ultralytics' GPU throughput numbers against Neural Magic's GPU latency numbers. We did also include throughput measurements, though, where YOLOv5s was around 3 ms per image on a V100 at fp16 in our testing. All benchmarks were run on AWS instances for repeatability and availability, which is likely where the 2 ms vs 3 ms discrepancy comes from (slower memory transfer on the AWS machine vs the one Ultralytics used). Note, though, that a slower overall machine will affect CPU results as well.

We benchmarked using the available PyTorch APIs mimicking what was done for Ultralytics benchmarking. This code is open sourced for viewing and use here: https://github.com/neuralmagic/deepsparse/blob/main/examples...
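For anyone wanting to reproduce this kind of measurement, a generic latency-benchmark loop in the spirit of such scripts (a sketch, not Neural Magic's actual code, which is linked above) looks like:

```python
import time

def benchmark(fn, warmup=5, iters=50):
    """Return mean latency of fn() in milliseconds."""
    for _ in range(warmup):   # warm caches / lazy init before timing
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1000

# Stand-in workload; replace with a real model's forward pass.
print(f"{benchmark(lambda: sum(range(10000))):.3f} ms/iter")
```

The warmup loop matters: the first few iterations of a real model pay one-time costs (allocation, kernel compilation) that would skew a cold-start average.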



