I/O is still often the bottleneck. My laptop can move about 11 GB/s through RAM, less with unpredictable access patterns (like a hash map), and 7600 GB/s through the CPU. It has no NVMe drive, so the hard drive manages under 1 GB/s. Unless the work you're doing is particularly expensive per byte of data, you're going to be limited at a minimum by RAM I/O, and maybe by disk I/O.
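To make that concrete, here is a rough back-of-envelope sketch in Python. It assumes each stage behaves like a simple pipe with a fixed throughput and that the slowest pipe sets the pace, which ignores caches, prefetching, and overlap. The throughput figures are the ones quoted above; `stage_times`, `bottleneck`, and the `cpu_passes` knob are hypothetical names made up for illustration, not measurements or a real benchmark.

```python
# Back-of-envelope model: each stage is a pipe with a fixed throughput,
# and the slowest pipe determines the total time.
# Figures are the rough ones quoted in the comment above.
THROUGHPUT_GB_PER_S = {
    "disk": 1.0,     # upper bound quoted for a non-NVMe drive
    "ram": 11.0,
    "cpu": 7600.0,
}


def stage_times(data_gb, cpu_passes=1.0, stages=("disk", "ram", "cpu")):
    """Return the seconds each stage needs to handle `data_gb` gigabytes.

    `cpu_passes` is a hypothetical knob: how many times heavier the
    per-byte CPU work is than the baseline behind the "cpu" figure.
    """
    times = {}
    for name in stages:
        work_gb = data_gb * (cpu_passes if name == "cpu" else 1.0)
        times[name] = work_gb / THROUGHPUT_GB_PER_S[name]
    return times


def bottleneck(data_gb, cpu_passes=1.0, stages=("disk", "ram", "cpu")):
    """Return the name of the slowest stage under this simple model."""
    times = stage_times(data_gb, cpu_passes, stages)
    return max(times, key=times.get)


if __name__ == "__main__":
    # Data already in RAM: how heavy must the CPU work get before it matters?
    for passes in (1, 100, 1000):
        print(passes, bottleneck(10, passes, stages=("ram", "cpu")))
```

Under these assumptions the CPU only becomes the limit once the per-byte work is roughly 700 times heavier than the baseline (7600 / 11 ≈ 691), which is why spending extra CPU to save memory traffic can pay off.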
FWIW, all my recent performance wins have come either from reducing RAM I/O or from restructuring work to reduce contention in the memory controller, even at the cost of adding significantly more work to the CPU.