
I'd rather say "don't use Intel x86 assembly unless you are measuring very, very carefully". First, few architectures have so many weird traps and interactions as x86 does. Second, most of these things become visible once you start measuring (although you have to be very careful when measuring performance, as it is an art in itself).

The x86 these days is effectively a poor-man's VLIW machine: you have a number of execution units with out-of-order execution. This means that you can look at your instruction stream as a stream of VLIW instructions for all of the CPU's units. The performance of your code might depend not just on the exact number of instructions, their execution times, and their latencies, but also on the interdependencies between your instructions and scheduling constraints. Getting that right is a nightmare, and even if you do get it right on your chip, there are at least several major x86 architectures in use, each one with its own peculiarities.

As a counterexample, I recently wrote some ARM (Thumb-2, for the Cortex-M4) code. It wasn't very difficult, and the code I wrote was much better than what gcc could produce.



I second that.

I write assembly for DSPs on a near-daily basis. Until a few months ago, a C compiler didn't even exist for the target architecture.

Even when you write C code for an embedded platform that does have a decent C toolchain, you cannot truly understand what you're doing without spending a lot of time looking at the generated assembly, and writing some of it yourself.

However, this kind of architecture has nothing to do with the x86. RISC, no cache, in most cases no MMU (and rarely any DMA), an extremely simple and straightforward pipeline, etc. I've written assembly for several such architectures, but compared to them I find x86 assembly intimidating.


I also do DSP development and looking at the x86 architecture it almost seems pointless to try to predict what the processor is doing. Out of order execution? Register renaming? The only way I can be sure assembly is faster than generated C is by profiling, which isn't even exact anyway without isolating the operating system.

For us, latency and determinism are the most important requirements of any processor, and it's very hard to get that on x86 without writing your own OS. But that negates the entire advantage of using the x86 platform with so much available code written on top of other kernels.


This brings back memories of when I wrote a lot of DSP code. The architecture I was on had a decent C compiler, but it wouldn't correctly take advantage of some of the special instructions. By rewriting a viterbi decoder in assembly to take advantage of their special subtract-compare-branch instructions, I recall getting a huge performance increase. But the compiler did a decent job at optimizing for multiply-accumulate instructions especially if you gave it the right hints with pragmas.


I wrote x86 assembly demos for fun in the golden age of asm, between the eras of the 8088 and the P6... It became so much about sprinkling random instructions here and there to get better performance that one almost needs a QuickCheck-like synthesizing-and-benchmarking hybrid compiler that automagically mutates code to find what is actually fastest via genetic-algorithm-style heuristics.


In my experience, performance is dominated almost completely by memory throughput and/or memory latency (depending on whether the prefetcher succeeds).

All the instruction latencies, VLIW optimizations, and so forth have a minor effect.

However, the cases where these things do have an effect, are exactly the cases where it might make sense to go down to ASM level.

The first avenue to improve performance, though, is improving memory access patterns. Compacting, aligning and rearranging data in cache windows. Even creating and maintaining redundant copies of data which is compacted or organized in a way that aids the cache can help.

Another big improvement I've witnessed is helping the prefetcher. For example, I've had code that had to iterate a few linked lists to work on all elements. When I reorganized that code to interleave the iteration of 8 linked lists at a time (rather than iterating them all serially), I went up from being memory-latency-bound, to being closer to memory-throughput-bound. The prefetcher could predict the next 8 ptrs I'll dereference, and prefetch 8 cache lines, instead of 1 at a time.


I recently wrote some ARM (Thumb-2, for the Cortex-M4) code. It wasn't very difficult, and the code I wrote was much better than what gcc could produce.

Better in terms of code size, speed, or both? (Hm, does code size even matter anymore? Maybe for the CPU cache.)


Better in terms of both, and code size matters on all architectures. On microcontrollers it's because you really have limited memory, and on larger machines it's because your code cache imprint matters for performance.


Since memory latency improves more slowly than CPU instruction throughput, code size actually matters more and more over time.


That's what I thought too when I saw the numbers: OK, he managed to get it 18% faster, but how much bigger was the result? I'm guessing far more than 18% bigger. I've seen cases where replacing a function with a smaller one, even one that microbenchmarked slower, actually led to the application as a whole performing much faster because more code fit in the cache. Ditto for individual instructions: (unfortunately, though Intel seems to be slowly changing this) many of the shortest instructions aren't the fastest when considered in isolation, so compilers tend to avoid them, but the 2-3x difference there is far less than the cost of a single cache miss.

64K of L1 might seem like a huge amount for a function to fit into, but that 64K is shared with every other often-used piece of code. It's for this reason that loop unrolling is no longer recommended, either.

The first one is alignment of branches: the assembly code contains no directives to align basic blocks on particular boundaries, whereas gcc happily emits these for some basic blocks. I mention this first as it is mere conjecture; I never made an attempt to measure the effects for myself.

And if you were, you would likely not see much on a modern x86; in fact it would probably make things slower, as it may confuse the branch predictor --- instructions are variable-length, so their addresses naturally are not aligned, and that enables e.g. lower-order bits to be used as cache tags.


Yes, actually a better word would be code density. A weird encoding like x86's behaves like compressed code that gets decoded inside the CPU into another instruction set that actually gets executed. This property, along with backwards compatibility, is what kept the x86 instruction set alive.

With the move to 64-bit, code had to be recompiled anyway, so there was a good opportunity to reshape the x86 instruction set significantly. Instead, the changes were relatively small; the instruction set kept its flavour (variable-length encoding, non-orthogonal operands, ...). I believe there were many reasons for that choice (made by AMD), but code density was a key factor.


Yes, but that's why you have an L1 cache.

And code is usually much smaller than data: typically it's a small code snippet operating on a lot of data, not the other way around.


> Does code size even matter anymore? Maybe for the CPU cache

Cortex-M4 implies microcontroller. Quickly looking, the cheapest M4 chips have 32KB of flash for program memory.


And little to no instruction caching.


It wasn't very difficult, and the code I wrote was much better than what gcc could produce.

This is at least partially because a lot more work has been put into gcc's x86 backend. If the compiled code was that bad then presumably you saw quite a bit of low-hanging fruit in terms of potential improvements to the optimiser.


A random thought just crossed my mind -- how about a genetic algorithm that experiments with all these options and evolves towards the best performance?

Run speed is objective and measurable, which is desirable for a fitness criterion.

Somehow, you'd have to specify which instructions can be moved around and which can't -- or maybe not, if you could also assess whether the program executed "successfully."

I don't really see it as practical, but it might be fun. Might uncover optimizations that no one has seen before.


Take a look at this paper from last year:

Stochastic Superoptimization (http://cs.stanford.edu/people/eschkufz/research/asplos291-sc...)


> Run speed is objective and measurable

Unfortunately, in most modern OSes run speed on x86 is variable, and profiling is itself intrusive. You might be able to see a general trend toward a faster algorithm, but small differences will still be lost in statistical noise.

The second problem is that the underlying processor is a moving target. The x86 programming model doesn't directly map to hardware anymore, with out of order execution, branch predictors and register renaming. One variant of the processor with a particular caching algorithm might have totally different results than the next.


It's not a genetic algorithm, but FFTW's planning system actually does perform experiments to optimize for a particular machine's hardware.


Genetic algorithms are just a particular kind of stochastic hill-climbing graph-search. (EDIT: It's actually a general graph, not a tree.)




