Redundancy is a well-accepted (at least in aeronautics) way of providing robustness. When designing onboard systems for aircraft, it is not unusual for a function with a high design assurance level to be implemented by several similar devices of a lower assurance level, combined through a majority vote.
There is a tremendous number of hardware products and industrial systems where the processing is performed on small and cheap components (microcontrollers, digital signal processors).
Of course there are very complex components in the microcontroller category (some even offer enough resources to run Linux), but if you stick to the $1-$5 range the specs are very limited.
Here are two examples: the first costs around $3 and the second less than $1.
I develop on such platforms. There is an interesting challenge in programming these tiny processors and constantly optimizing CPU cycles and memory usage, but in the long run it becomes quite strenuous: everything is low-level, and I miss the expressiveness and flexibility of more abstract languages.
If I recall correctly, Intel tried a few years ago to sell a system-on-chip combining an Atom CPU with a FPGA from Altera. I believe it didn't work very well, especially with regards to communication and synchronization between the two cores.
It didn't work very well, but there is a good reason: nobody wanted a comparatively slow, low-performance CPU paired with a small FPGA over a slow PCI Express link. There are already hundreds of big FPGA boards with PCIe connectors that can be paired with big CPUs. It was a non-product from the get-go.
I write assembly for DSPs on a near-daily basis. Up until a few months ago there didn't even exist a C compiler for the target architecture.
Even when you write C code for an embedded platform that does have a decent C toolchain, you cannot truly understand what you're doing without spending a lot of time looking at the generated assembly, and writing some of it yourself.
However, this kind of architecture has nothing to do with the x86. RISC, no cache, in most cases no MMU (and rarely any DMA), an extremely simple and straightforward pipeline, etc. I've written assembly for several such architectures, but compared to them I find x86 assembly intimidating.
I also do DSP development, and looking at the x86 architecture it almost seems pointless to try to predict what the processor is doing. Out-of-order execution? Register renaming? The only way I can be sure assembly is faster than generated C is by profiling, and even that isn't exact without isolating the operating system.
For us, latency and determinism are the most important requirements of any processor, and it's very hard to get that on x86 without writing your own OS. But that negates the entire advantage of using the x86 platform with so much available code written on top of other kernels.
This brings back memories of when I wrote a lot of DSP code. The architecture I was on had a decent C compiler, but it wouldn't correctly take advantage of some of the special instructions. By rewriting a Viterbi decoder in assembly to use its special subtract-compare-branch instructions, I recall getting a huge performance increase. But the compiler did a decent job of optimizing for multiply-accumulate instructions, especially if you gave it the right hints with pragmas.
I wrote x86 assembly demos for fun in the golden age of asm, between the eras of the 8088 and the P6... It became so much about putting random instructions here and there to get better performance that one almost needs a QuickCheck-like synthesizing-plus-benchmarking hybrid compiler that automagically mutates code to find what is actually fastest, using genetic-algorithm-like heuristics.
I wrote the firmware (~20 kLOC) for an electrical power supply that's part of one such power train.
This kind of development is very demanding, because you can't afford to leave any bugs in your program but at the same time you're always shipping late because of tight schedules and often blurry specifications.
On top of that, hardware and software development cycles run concurrently, so there is no target hardware available while you're writing your code.
In these conditions the only way I found to have something that works is to keep it very simple:
1. Simple algorithms
2. Simple data structures
3. Few abstractions
4. No dynamic memory allocations
It's useful when you have a circuit that needs a high-current burst but is powered from a low-current (but high-capacity) power source.
One application for this is lasers. I'm part of a team that works on a converter that uses wall power as an input to continuously charge a few dozen ultracaps, then empties them all at once to fire very short pulses in the hundreds of kW.
What I find the most impressive about this is that the mechanical engineers have managed to fit this in a 2U rack.
Hundreds of kW isn't that high for pulsed lasers, particularly since Q-switching gives you very short pulses and mode-locking ridiculously short ones. The total energy of a 1 ps pulse at 100 kW is negligible.
I was lead tech (and first hire) at a hardware startup which launched two unsuccessful products then ran out of money (despite some seed funding).
After that I left.