Hacker News

It should be kept in mind that standard memory (e.g. DDR3) doesn't actually support concurrent accesses (and at least on x86, one or two cores at most can easily saturate the full bandwidth). So any time you have shared data accessed by multiple threads, the accesses are actually serialised, and that part performs no better than the single-threaded case; the biggest speedup occurs when, the majority of the time, the threads are running in their own caches and not sharing any data.


While you are kind of correct in that a memory channel doesn't do multiple concurrent transfers, single-threaded memory-bound workloads definitely leave significant main memory performance on the table in most cases. (And if the workload isn't memory-bound, it leaves even more on the table.)

Main memory is high-latency, and single threads usually don't saturate the bandwidth because they are not issuing back-to-back pipelined requests to memory. The CPU has a single shared memory controller that sees requests from all cores/SMTs. A desktop x86 might support 16 outstanding cache misses that can come from different cores or SMT "threads", and it is designed to keep the memory channels busy and the request pipelines full by scheduling in work from all the cores/SMTs, doing prefetching according to the big picture, etc.

Even multiple threads accessing the same data don't actually look different from the DIMM's point of view: all the x86 synchronization takes place inside the CPU caches and is invisible in the main-memory traffic/requests. (This is true even on multi-socket systems; the caches just talk to each other over an inter-CPU bus.)

Also, desktop/server systems typically have more than one independent memory channel; beefy servers have a whole lot of them.


> While you are kind of correct in that a memory channel doesn't do multiple concurrent transfers, single-threaded memory-bound workloads definitely leave significant main memory performance on the table in most cases.

You have to distinguish between memory-latency-bound and memory-throughput-bound. If your program has a simple, regular access pattern (which the CPU's prefetcher recognizes), then it's not that hard to saturate the memory bandwidth of a typical two-channel, DDR3-1600 system (25.6 GB/s peak throughput) with a single thread.


In order to get maximum throughput, you also might need to do SIMD loads (and stores).

On Haswell this means two AVX loads per clock cycle, i.e. 64 bytes (two 32-byte loads). I think with scalar code you get only 16 bytes per cycle (two 8-byte loads), and 32 bytes per cycle with SSE.

So a single Haswell core at 3 GHz can load about 192 GB/s (178 GiB/s)!

On Ivy Bridge, SSE is enough to saturate the memory bus.

Regardless, that's quite a bit more than what the lower cache levels and DRAM can supply. It's not very hard to saturate the whole CPU-local memory bus with just one core.


> So a single Haswell core at 3 GHz can load about 192 GB/s! On Ivy Bridge, SSE is enough to saturate the memory bus.

I'd claim instead that it's surprisingly hard to saturate the memory bus with a single core [1]. Presuming your math is right, you've given the rate at which Haswell can load from L1 to registers if we can ignore latency and concurrency. That is, the CPU can sustain this number only if the latency is low enough relative to the number of "requests in flight". In practice, these constraints prevent a single core from saturating the memory bus except in the case of dual-channel memory, perfect prediction and prefetching, and no TLB misses.

More commonly, the latency of reads from memory combined with the limited number (10) of "line fill buffers" reduces the throughput from RAM. The relationship governing actual bandwidth is known as "Little's Law": Concurrency = Latency * Bandwidth. Considering its fundamental importance in gauging actual performance, it's strikingly rare to see it discussed in CS theory. John McCalpin, the author of the STREAM benchmark, has an excellent explanation of how it applies to memory bandwidth: http://sites.utexas.edu/jdm4372/2010/11/03/optimizing-amd-op...

[1] Here's my write up of trying to do so on Sandy Bridge: https://software.intel.com/en-us/forums/topic/480004


Your write-up was pretty interesting. Especially the instruction mixes; LFENCEs of all things helped?! Thanks.

I haven't tried to optimize fill/copy rates since the late nineties. Getting the last 10-20% appears to be rather tricky. Luckily, maximum bandwidth per core is seldom an issue.

Sometimes I've thought about designs where one core does memory prefetching ahead of a loosely synchronized second core doing the actual computation, to get maximal serial computational throughput. The idea is that the computational core would fetch data from the shared L2/L3 at all times, for those cases where the access patterns are not predictable by the hardware prefetcher.


> LFENCEs of all things helped

I don't really understand what was happening here, but it was a definite positive effect. I think the issue is that the early preloads filled the queue, preventing the stores from happening. The bright side is that getting good performance on Haswell is easier, as it supports 2 loads and 1 store per cycle (although note that address generation is still a bottleneck unless you make sure that the store is using a "simple" address).

> I've thought about designs of having one core do memory prefetching ahead of loosely synchronized other core doing actual computation

I'm interested in this approach too, but haven't really tried it out. "srean" suggested this also: https://news.ycombinator.com/item?id=8711626

I think it could work well if you know your access patterns in advance; the hard part would be keeping the cores in sync. In the absence of real-time synchronization, you'd need a really high-throughput queue to make this work. Interesting write-up and links here: http://www.infoq.com/articles/High-Performance-Java-Inter-Th... It's a much easier problem if you can batch up the preloads rather than doing them one at a time.

I have had success with a somewhat similar technique within a single core: issue a batch of prefetches, then do processing on the previous data, then do the batch of loads corresponding to the prefetches, then repeat. You can get a 2-3x improvement in throughput on random access if you are able to tolerate the slightly increased latency. The key is figuring out how to keep the maximum number of requests in flight at all times.


What do you mean when you say that standard memory doesn't support concurrent accesses? Assume you have 2 threads running on 2 (physical) cores on a single-socket machine. In what case would the data access be serialized? How much do you think this serialization would affect the latency of a read from memory?

For current Intel, all cores in a socket share L3 cache, but have independent L1 and L2. If the two are using the same address read-only, both will have the data in their own L1. If they are read-write, performance will drop, but there still won't be any accesses to RAM, and the data will stay in L3.

If you are talking about random access, there can be concurrent access on either 2, 3, or 4 "channels". This is physically partitioned, so sometimes data can truly be accessed in parallel and sometimes it can't. But the maximum wait time for a request to be held up is small relative to the total latency of a memory access.

The "two cores at most" part of the saturation statement is trivially untrue, although correct for many systems. On the Sandy Bridge system I just checked (quad-channel DDR3 1600), I did get a small increase on the Stream benchmark going from 2 to 3 physical cores: 15 GB/s Copy for one core, 23 GB/s for two, 25 GB/s for three.

But this overstates the problem somewhat, as perfectly predictable access with prefetch isn't particularly common. For random access, it is difficult to saturate RAM bandwidth unless you pay very close attention to the parallelization of your requests. For dependent access (each request depends on the previous) I'm doubtful that you'd be able to reach saturation on most processors.

I agree with your sentiment though --- it's tough to get the performance increase you'd hope for just by adding threads, and paying close attention to data access patterns is key. But I'd argue (and I'm sure you'd agree) that managing data access is crucial to single-threaded performance as well.


Sure, but assuming an Intel Haswell i7, L1 cache is 64K (shared by two threads with hyper-threading) -- and the total of L1+L2+L3 is anywhere from 2 to 20 megabytes.

So if you "do little" with "much data", then that (might) matter, whereas if you "do much" with "little data" it might not.


I had restrained myself from responding with a 'this' to @userbinator's comment. What you say is true, but it is not trivial. This is not to say that you implied it to be trivial.

The key is that, with some thought and a change in approach, one can sometimes transform between the two regimes:

    "do little" with "much data"  -->    "do much" with "little data" __at_a_time__
Often it is not at all obvious how to do this. It is fairly common in discussions on parallelism for naysayers to appear and claim that most things/algorithms aren't parallelizable, and that's the end of it.

Creativity lies in rethinking algorithms so that the transformation above applies. Sometimes this is reasonably easy, for example for matrix multiplication. It is quite detrimental to write it out as a loop over large rows x large columns; the speedup will be abominable. The solution is also fairly obvious here, which is why it is everyone's favorite example; in general, however, solutions are rarely that obvious. Here we only reordered the data access so that it touches the same things often; there was no fundamental change in the algorithm. These are the easy cases.

Once I had a fight with my co-worker about how to structure code. I was arguing against building the operation as a huge matrix-times-vector and offloading it to architecture-specific BLAS. I lost that argument; it was not clear at the time which approach would be better. When the implementation was done, the speedup was missing in action. Large Mat x Vecs are just bad primitives to use on modern CPUs (too little computation per data touched): one lesson learned. After all, BLAS cannot do magic; it's still going to run on the same CPU (unless your BLAS offloads it to some other sub-architecture like a GPU, and even then bandwidth is a major concern).

Profiling showed that this was choking on memory bandwidth. The bus just cannot keep up with 1 core hammering it with read requests, let alone 4. At this point one could have washed one's hands of it and said that the algorithm is memory-bandwidth limited and that's the end of the story. But those limitations hold only if you have to stick with that particular algorithm or the specific order of data access. Usually those restrictions are artificial.

I had far better luck by breaking it up into pieces so that the next time around most of the data was pulled from cache. This required changing the algorithm a little bit, and also proving that the new algorithm converges to the same thing.

It still pays a significant cost due to synchronization on de-allocation. My next target is to switch to stack-based allocation/deallocation, because the allocation pattern for the most part follows a stack discipline in this application. For multi-threaded allocations that is the best kind of allocation pattern you can have. I wish alloca were included in POSIX and that it returned usable error values; otherwise it's pretty much a case of call and pray.

For compute-heavy stuff hyperthreads are rarely useful. The bottleneck is typically floating point operations, and the FPU becomes the scarce resource. Hyperthreading actually ends up hurting because all the threads want to do the same thing: some floating point operation (ideal for a GPU, not for hyperthreading). Sometimes it is possible to make one hyperthread take care of streaming in the data and uncompressing it in memory while the other takes care of the floating point operations. I have not tried this out myself, but it ought to work (in theory :)


Thanks for this comment. That's really interesting. It seems like caching is becoming more and more important at every level of the system. Even GPUs are getting DRAM caches now. So "do much with little data at a time" is the way to go.

P.S. I work on Hadoop. "Do little with much data" is a pretty good high level description of what we do. Our single-node efficiency is not what I'd like it to be, but it's improving with next generation SQL and other execution engines.


Threads can share data in caches (depending on the core and cache level), which is especially relevant for parallel computing use cases. Hyper-threading can run another thread on a core while the first is stalled on a read, which makes reasoning about performance even trickier.



