Hacker News

This is all very interesting, and the author has clearly done a lot of research. But where are the benchmarks? I'd love to see some sort of replicable evidence that using this command helps things that much.

Linux's NUMA policy does seem broken for this use case. If all the memory on node 0 is used up (but plenty is free in node 1) and a thread in node 0 attempts to allocate memory, why not, instead of swapping out pages from node 0, simply move them to node 1? Alternatively just allocate memory in node 1, as the author suggests. I'm not a kernel programmer. Anyone who's more familiar with this care to answer?



I'm only semi-familiar with the issues, but on the surface swapping across nodes rather than to disk seems like it has to be a win. I think the problem might be that on a long running system there is rarely any truly "free" memory. Rather, one chooses to dump cached pages. It's possible that the cost of copying across nodes plus the eventual reread of the cache makes it a negative? Although I'd have to think that it's better than a certain write.

I can see the logic where allocating on a non-local node is potentially a mistake. Depending on how many times the memory will be accessed, it may well be worth the immediate hit to swap a page to disk and keep all your accesses local. For the swap, you at least have evidence that it hasn't been used that recently, thus may never be used again. It would be sad to work yourself into a corner where lots of long lived processes are constantly cross-allocating.
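The intuition that a cross-node copy beats a disk write can be sketched as back-of-the-envelope arithmetic. All of the figures below are assumptions for illustration (interconnect bandwidth, disk latency), not measurements:

```python
# Toy cost comparison (all numbers are assumptions for illustration,
# not measurements): moving a 4 KiB page across the interconnect
# versus writing it to swap and re-reading it later.
PAGE = 4096  # bytes

# Assumed figures: ~6 GB/s effective cross-node copy bandwidth,
# ~100 us for a random 4 KiB swap write or read on rotational disk.
cross_node_copy_us = PAGE / (6e9 / 1e6)  # microseconds to copy one page
disk_write_us = 100.0                    # page-out
disk_read_us = 100.0                     # eventual page-in (or cache reread)

swap_cost_us = disk_write_us + disk_read_us
print(f"cross-node copy: {cross_node_copy_us:.2f} us/page")
print(f"swap round trip: {swap_cost_us:.0f} us/page")
print(f"swap is ~{swap_cost_us / cross_node_copy_us:.0f}x more expensive")
```

On these assumed numbers the disk round trip is a couple of orders of magnitude slower than the copy, which is why "a certain write" looks worse than migrating the page, even after paying to reread evicted cache.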

Edit to add:

Looks like a good reference paper here: http://www.kernel.org/pub/linux/kernel/people/christoph/.../numamemory.pdf I've only skimmed it, but it makes it sound like 'page migration' is already in place.

I'm particularly interested in the idea of migration partly because it might help provide an answer to my recent StackOverflow question: http://stackoverflow.com/questions/3784434/inserting-pages-i...


That was odd. I must have submitted at the same time that the edit window ended, and it took me to a broken page. Anyway, the proper link is: http://www.kernel.org/pub/linux/kernel/people/christoph/pmig...


Hi,

I didn't feel that benchmarks were necessary in this case, since the result is directly visible: under a given workload, the system either swaps or it doesn't. We did run benchmarks, but only to verify that performance was comparable with and without the setting in place, swap behavior aside, to ensure the change doesn't introduce a regression.

Regards,

Jeremy


The problem is that there is a latency hit required for a thread running on node 0 to access memory on node 1. Furthermore, this uses HyperTransport on AMD or QPI on Intel, which has limited bandwidth, so if you get too many off-node memory accesses, performance begins to suffer.

The real solution to this issue is for MySQL to become NUMA aware and place threads and the cached data blocks those threads are accessing more intelligently on nodes that have enough space. Other more robust databases like Oracle already do this, having been running on NUMA architectures for decades now.


> The problem is that there is a latency hit required for a thread running on node 0 to access memory on node 1. Furthermore, this uses HyperTransport on AMD or QPI on Intel, which has limited bandwidth, so if you get too many off-node memory accesses, performance begins to suffer.

Does it suffer more than hitting the disk?


No need to be snarky. You take a one-time large hit in performance to page out some data, as opposed to many small continual hits in performance going across the interconnect between nodes.

Which would you rather suffer? A one-time 50 ms hit to page some data out, or many thousands of 500 microsecond hits and interconnect saturation over time? The kernel engineers looked at these trade-offs and determined it was better to page out data. After all, the kernel does not know how long you'll need your data, and if it allowed memory to be allocated haphazardly all over a NUMA system, then after many hours or days you could end up with a very slow-running system where every other thread had to access memory on a different node.
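Taking the two figures in this comment at face value (they're illustrative, not measured), the break-even point between the one-time page-out and repeated remote accesses is simple arithmetic:

```python
# Break-even between a one-time page-out and repeated remote accesses,
# using the numbers from the comment above (illustrative, not measured).
page_out_ms = 50.0    # one-time cost to page the data out
remote_hit_ms = 0.5   # 500 microseconds per cross-node access

break_even = page_out_ms / remote_hit_ms
print(f"break-even after {break_even:.0f} remote accesses")
```

Past roughly 100 accesses the remote placement has cost more than the page-out did; "many thousands" of hits, as described above, would be far past that point on these numbers.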

I find it rather puzzling that DBAs think they know more about how a kernel should page memory than a kernel developer like Linus Torvalds.

The answer seems clear - if your software relies on huge amounts of memory, make it NUMA aware. Oracle did this a long time ago and I don't see any strange swap activity on our 8-socket 48-core 128GB NUMA systems (AMD Opteron).


> You take a one time large hit in performance to page out some data as opposed to many small continual hits in performance going across the interconnect between nodes.

You misunderstand the OP. He's not saying "Just put it in Node 1 and access it from there." He's saying to swap it out to Node 1, and then when it is needed in Node 0 again, swap it back to Node 0. That's certainly cheaper than swapping to disk and back.


I see, so he's essentially proposing a memory to memory swap functionality as opposed to just memory to disk. It sounds like a workable solution, although it would require some engineering in the kernel paging algorithms. You'd also need to make changes to the scheduler so that you could intelligently schedule threads on the node where their memory is. It sounds doable, but it seems that this is a lot of work for kernel engineers to do that could be done by software that is NUMA aware.
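The decision logic being proposed can be sketched as a toy simulation. This is pure illustration of the policy under discussion, with made-up node layouts and no real paging or NUMA calls (the function name and structure are my own, not anything from the kernel):

```python
# Toy simulation of "swap to the other node instead of to disk".
# No real paging here -- just the eviction-placement decision
# the parent comments are debating.

def place_page(free_pages_by_node, local_node):
    """Pick where an evicted page from local_node should go.

    Returns ("node", n) to migrate the page to node n, or
    ("disk", None) to fall back to conventional swap.
    """
    candidates = [
        (free, node)
        for node, free in free_pages_by_node.items()
        if node != local_node and free > 0
    ]
    if candidates:
        free, node = max(candidates)  # prefer the emptiest remote node
        return ("node", node)
    return ("disk", None)

# Node 0 is full but node 1 has room: migrate instead of swapping.
print(place_page({0: 0, 1: 5000}, local_node=0))  # ('node', 1)
# Everything is full: nothing left but the disk.
print(place_page({0: 0, 1: 0}, local_node=0))     # ('disk', None)
```

As the comment notes, the real engineering cost isn't this decision itself but the scheduler integration, so that threads follow (or are followed by) their migrated pages.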



