> I mean, memory topology varies greatly by uarch...

You're absolutely right. That's why I said that if you're using libraries, they generally handle this burden for you. Compilers also do this, and they do it very well.

If you're writing your own routines, the best way is to read the arch docs, maybe some low-level sites like Chips and Cheese, run some synthetic benchmarks, and write your code in a semi-informed way.
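For a flavor of what a synthetic benchmark can look like, here's a minimal pointer-chasing sketch. It is one common technique for seeing memory latency, not necessarily what anyone in this thread used. The names `make_cycle` and `ns_per_load` are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a random single-cycle permutation (Sattolo's algorithm). Every load
// depends on the previous one, so the hardware prefetcher cannot run ahead.
std::vector<std::size_t> make_cycle(std::size_t n, unsigned seed = 42) {
    assert(n >= 2);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Follow the chain for `steps` dependent loads; return average ns per load.
double ns_per_load(const std::vector<std::size_t>& next, std::size_t steps,
                   std::size_t& sink) {
    assert(steps > 0);
    std::size_t idx = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) idx = next[idx];
    const auto t1 = std::chrono::steady_clock::now();
    sink = idx;  // keep the result live so the loop is not optimized away
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(steps);
}
```

Sweep `n` from a few KiB worth of elements up to hundreds of MiB and the measured latency should step up roughly where the working set falls out of each cache level. Where those steps land depends on the uarch, which is the whole point.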

After writing the code, a suite of cachegrind, callgrind and perf is in order. See if there are any other bottlenecks, and tune your code accordingly. Add hints for your compiler, if possible.
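As an example of compiler hints, here's a small sketch using `__restrict__` and `__builtin_assume_aligned`. Both are GCC/Clang extensions, so this assumes one of those compilers. The `axpy` routine and the 32-byte alignment are illustrative choices, not something from a specific codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// y[i] += a * x[i].
// The hints are promises to the compiler, not runtime checks: if x and y
// overlap or are misaligned, the behavior is undefined.
void axpy(std::size_t n, float a, const float* __restrict__ x,
          float* __restrict__ y) {
    // Assumption: both buffers are 32-byte aligned. Verify in debug builds.
    assert(reinterpret_cast<std::uintptr_t>(x) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(y) % 32 == 0);

    const float* xa =
        static_cast<const float*>(__builtin_assume_aligned(x, 32));
    float* ya = static_cast<float*>(__builtin_assume_aligned(y, 32));

    // With no aliasing and known alignment, the compiler is free to
    // vectorize this loop without emitting runtime overlap checks.
    for (std::size_t i = 0; i < n; ++i) ya[i] += a * xa[i];
}
```

Whether a hint like this actually helps is something to confirm with the profilers above rather than assume.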

I was able to reach insane saturation levels with Eigen plus some hand-tuned code. For the next level, I needed to change my matrix ordering, but it was already fast enough (30 minutes down to 45 seconds, a 40x speedup), so I left it there.
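To illustrate why matrix ordering matters, here's a plain C++ sketch, assuming "ordering" means storage order (row-major vs. column-major). In Eigen that is a template option (`Eigen::RowMajor` / `Eigen::ColMajor`, column-major by default); the `RowMajorMatrix` type below is a hypothetical stand-in so the example has no dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major storage: element (r, c) lives at data[r * cols + c].
struct RowMajorMatrix {
    std::size_t rows, cols;
    std::vector<double> data;

    RowMajorMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c), data(r * c, 0.0) {}

    double& at(std::size_t r, std::size_t c) {
        assert(r < rows && c < cols);
        return data[r * cols + c];
    }
};

// Inner loop walks along a row: consecutive addresses, unit stride.
double sum_row_order(const RowMajorMatrix& m) {
    double s = 0.0;
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            s += m.data[r * m.cols + c];
    return s;
}

// Inner loop walks down a column: each access jumps `cols` elements ahead,
// so large matrices touch a new cache line on almost every load.
double sum_col_order(const RowMajorMatrix& m) {
    double s = 0.0;
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            s += m.data[r * m.cols + c];
    return s;
}
```

Both functions visit the same elements and differ only in access pattern. If the hot loop of an algorithm naturally walks columns, switching the storage order (or the loop order) is what turns the strided version into the unit-stride one.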

Sometimes there is no replacement for blood, sweat and tears in this thing.

I haven't played with custom interconnects (Slingshot, etc.) yet, so I can't say much about them.