Hi ml_hardware, we report results for both throughput and latency in the blog. As you noted, the throughput performance for GPUs does beat out our implementations by a bit, but we did improve the throughput performance on CPUs by over 10x. Our goal is to enable better flexibility and performance for deployments through more commonly available CPU servers.
For throughput costs, this flexibility becomes essential: a user can scale down to even a single core, with a much larger improvement in cost performance. We walk through these comparisons in more depth in our YOLOv3 blog: https://neuralmagic.com/blog/benchmark-yolov3-on-cpus-with-d...
INT8 wasn't run on GPUs because we ran into operator-support issues when converting PyTorch graphs to TensorRT (PyTorch currently doesn't support INT8 on GPU). We are actively working on this, though, so stay tuned as we run those comparisons!
The models we're shipping will see performance gains on A100s as well, due to their support for semi-structured sparsity. Note, though, that A100s are priced higher than the more commonly available V100s and T4s, which needs to be factored in. We generally limit our current benchmarks to hardware available in the top cloud services, to represent what most people can actually deploy on servers. This usability is why we don't consider MLPerf a great source for most users. MLPerf has done a great job of standardizing benchmarking and improving numbers across the industry. Still, the submitted systems are hyper-engineered for MLPerf numbers, and most customers cannot realize those numbers given the cost involved.
Finally, note that post-processing for these networks is currently limited to CPUs due to operator support. This limitation will become a bottleneck for most deployments (it already is for GPUs, and for our YOLOv5s numbers). We are actively working on speeding up post-processing by leveraging the CPU cache hierarchy through the DeepSparse engine, and are seeing promising early results. We'll be releasing those soon to show even better comparisons.