We're in a strange place with optimization. Hardware folks build very complex memory access paths (multiple caches, multiple accessors) to try to speed up the average case of unoptimized code. Then we try to trick that mechanism into performing well for our particular computations. Maybe some programmable architecture would help here? Maybe not.