
I like that the author started by measuring and thinking about bandwidth, which makes sense for streaming through a big file, so I'd have continued that way towards a different design & conclusion.

Continuing with standard Python (pydata) and OK hardware:

- 1 cheap SSD: 1-2 GB/s

- 8 cores (3 GHz) x 8-wide SIMD: 1-3 TFLOPS?

- 1 PCIe card: 10+ GB/s

- 1 cheapo GPU: 1-3 TFLOPS?

($$$: cross-fancy-multi-GPU bandwidth: 1 TB/s)

For streaming workloads like word count, the ratio of floating-point ops (a proxy for actual work) to bytes read is unclear, but the hardware above supports roughly 1000:1. Where the author is hitting the roofline on either axis would be a fun detour, so I'll switch to what I'd expect of pydata Python.
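Back-of-envelope, that ratio falls out of the numbers above in a few lines (the GB/s and TFLOPS figures are the rough estimates from the list, not measurements):

```python
# Roofline-style arithmetic intensity check, using the rough
# figures above (assumptions, not benchmarks).
ssd_gbps = 1.5     # ~1-2 GB/s for one cheap SSD
cpu_tflops = 2.0   # ~1-3 TFLOPS for 8 cores x 8-wide SIMD

# FLOPs available per byte streamed off one SSD:
flops_per_byte = (cpu_tflops * 1e12) / (ssd_gbps * 1e9)
print(f"{flops_per_byte:.0f} FLOPs per byte read")  # prints "1333 FLOPs per byte read"

# A streaming job is compute-bound only if it does more work per byte
# than this ratio; word count does far less, so it stays I/O-bound.
```

So a job needs on the order of a thousand operations per byte before the CPU, rather than the SSD, becomes the limit.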

It's fun to do something like run regexes on logs with cuDF one-liners (the GPU port of pandas) and figure out the bottleneck. 1 GB/s sounds low; I'd expect the compute side to be more like 20+ GB/s for in-memory data, so they'd need to chain 10+ SSDs to achieve that, and there's a good chance the PCIe card would still be fine. At 2-5x more compute, the PCIe card would probably become the new bottleneck.
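The chaining math above can be sketched directly (the 20 GB/s regex throughput and 25 GB/s PCIe link are my assumed figures, not benchmarks):

```python
import math

# Hypothetical figures echoing the comment (assumptions, not benchmarks):
ssd_gbps = 1.5       # one cheap SSD
compute_gbps = 20.0  # assumed in-memory GPU regex throughput
pcie_gbps = 25.0     # "10+ GB/s" card; a x16 Gen4 slot is nearer 25-30

# SSDs to chain so storage keeps up with compute:
ssds = math.ceil(compute_gbps / ssd_gbps)
print(ssds)  # prints 14

# With storage matched, PCIe vs compute decides the bottleneck:
def bottleneck(compute, pcie=pcie_gbps):
    return "pcie" if pcie < compute else "compute"

print(bottleneck(20.0))      # compute-bound at the assumed throughput
print(bottleneck(3 * 20.0))  # at ~3x compute, the card takes over
```

Which is why a 2-5x compute bump flips the pipeline from compute-bound to PCIe-bound under these assumptions.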


