Compressed instructions really hurt the parallelism of CPUs. See the M1 for an example of what happens when your decoders can fetch independently of each other because instruction offsets are deterministic. I think a better model is for code to be compressed with an efficient encoder at cache-line granularity (e.g. zstd with the dictionary defined to be your instruction set).
I'm not talking about parallel decode. I'm talking about the fetchers feeding the decoders. How does the fetcher know where the instruction boundaries are if you have variable-length instructions? My understanding is that x86 is the worst here - while x86 gets broken down into RISC-like uops so the overall x86 hit isn't large, the fetcher is extremely complex & slow to try to get back some parallelism that x86 encoding loses. The compressed version of these instructions seems similarly complicated since the length of the instruction is dependent on the data operands to the instruction.
Can you help me correct my understanding of the situation if that's where I'm wrong or what additional details I may be missing? I'm not a CPU architecture expert by any means - just an enthusiast.
The issue with x86 is that instructions can be any length, and that their length is a nontrivial function of their encoding. This means there's an unavoidable sequential scan over, effectively, every byte of the input when doing decode. (This isn't implemented as a sequential scan like it would be in software.)
With RISC-V, compressed instructions are exactly 16 bits, and all instructions are 16-bit aligned. Further, instruction length is fully determined by the low two bits of the instruction (11₂ for 32 bits; anything else for 16).
For an 8-wide decoder, this means routing is fully determined by 15 bits after a single AND reduction. This can be expanded early during a fetch pass; the worst case is an 8-way MUX for the last instruction, and even doing decode fully in parallel at every possible offset would at worst require interleaving eight 32-bit decoders with eight (much simpler) 16-bit decoders. None of that really matters a great deal, since RISC-V is so trivial to decode.
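As a rough software sketch of that boundary logic (the function name is made up; the loop is sequential for clarity, whereas hardware would compute the per-parcel length bit for the whole fetch group in parallel and feed the routing MUXes from it):

```python
def rvc_boundaries(fetch_group: bytes) -> list[int]:
    """Return the byte offsets of instruction starts in a fetch group.

    RISC-V rule: if the low two bits of a 16-bit parcel are 0b11,
    the instruction is 32 bits long; otherwise it is a 16-bit
    (compressed) instruction.
    """
    offsets = []
    pos = 0
    while pos + 2 <= len(fetch_group):
        offsets.append(pos)
        # Low two bits of the first (little-endian) byte of the parcel.
        if fetch_group[pos] & 0b11 == 0b11:
            pos += 4  # 32-bit instruction
        else:
            pos += 2  # 16-bit compressed instruction
    return offsets
```

Note that each step looks at only one bit of precomputed information per parcel, which is what makes the hardware version cheap.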
Early on the focus was on the hypothetical CISC advantage, hence wanting ‘compressed’ variable-length instructions, even though the ISA hasn't ended up especially space efficient now that the actual complex stuff is legacy. But as to why their encoding is so difficult, put that down to lack of foresight. x86 was designed before parallel decode became a thing, and the complexity accumulated over time.
Intel have had quite a few ISAs and all the famous ones are bad, so there doesn't have to be a good reason beyond that.
x86_64 instructions can be anything from 1 byte to 15 bytes long. You have to decode the entire instruction, except any trailing constant/offset, to even know how long the instruction is.
RISC-V instructions can be exactly two lengths: 2 bytes or 4 bytes. You can tell how long an instruction is by looking at the first 2 bits: 11 -> the instruction is 4 bytes long, anything else (00, 01, 10) -> the instruction is 2 bytes long.
These things are so different that you can't just say "oh, they're both variable length".
Also, the 2-byte instructions in RISC-V are completely optional. Each one exactly duplicates some 4-byte instruction. If you want to build an 8-wide machine like the M1 and you think having two instruction lengths will make that too hard, you can just use the fixed-length RISC-V base instruction set instead. The only downside is that your programs will be a bit bigger, and you'll need to compile your own software (as Apple, for example, does).
My understanding is that the typical way to do parallel decode of a variable length ISA is to speculatively decode from every valid offset, then throw away those which turned out to be wrong.
So x86 implementations start decoding at offsets [n, n+1, n+2, n+3, etc.]. Then if it turns out that, say, the instruction at offset n was 2 bytes, throw away whatever work was done for decoding at n+1 and continue with the result from n+2, etc.
For RISC-V with the C (compressed) extension, instruction boundaries are 16-bit aligned, so you can use a similar strategy to the x86 one above, except you only speculatively decode at offsets [n, n+2, n+4, etc.]. See e.g. the SonicBOOM paper.
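The discard strategy might look roughly like this (a software sketch with a made-up name; real hardware runs the first phase in parallel rather than in a loop, and the `wasted` count is exactly the thrown-away work discussed below):

```python
def speculative_decode(fetch_group: bytes):
    """Decode at every 16-bit offset, then keep only the results that
    lie on the true boundary chain starting at offset 0.

    Returns (kept, wasted): a list of (offset, length) pairs that were
    kept, and the number of speculative decodes that were discarded.
    """
    n = len(fetch_group)
    # Phase 1: "parallel" decode at every half-word offset.  The length
    # comes from the low two bits of the parcel (0b11 -> 32-bit insn).
    length_at = {}
    for pos in range(0, n - 1, 2):
        length_at[pos] = 4 if fetch_group[pos] & 0b11 == 0b11 else 2
    # Phase 2: walk the chain from offset 0, discarding everything else.
    kept, pos = [], 0
    while pos in length_at and pos + length_at[pos] <= n:
        kept.append((pos, length_at[pos]))
        pos += length_at[pos]
    wasted = len(length_at) - len(kept)
    return kept, wasted
```

Every 32-bit instruction in the stream produces one discarded speculative decode at its misaligned midpoint, which is the power cost the next paragraph worries about.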
Of course, speculatively decoding instructions only to throw them away costs power. I have no hard numbers, but I do wonder why RVC didn't specify that an "instruction bundle" is 32-bit aligned and consists of either a 32-bit instruction or a pair of 16-bit instructions? That would have made decoding simpler, though at a cost of worse density.
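The bundle idea can be sketched like this (a toy model of the hypothetical scheme, not any real ISA; `decode_bundle` is made up, reusing RVC's two-bit length marker). The point is that each 32-bit bundle is decodable with no dependence on its neighbours, so there is nothing to speculate on and nothing to throw away:

```python
def decode_bundle(bundle: bytes) -> list[tuple[int, int]]:
    """Decode one hypothetical 32-bit-aligned bundle: either a single
    32-bit instruction or a pair of 16-bit instructions.  Returns
    (offset, length) pairs; neighbouring bundles are never consulted.
    """
    assert len(bundle) == 4
    if bundle[0] & 0b11 == 0b11:
        return [(0, 4)]  # one 32-bit instruction fills the bundle
    # Otherwise the bundle must hold two 16-bit instructions; a 32-bit
    # encoding starting in the second half would be illegal here.
    if bundle[2] & 0b11 == 0b11:
        raise ValueError("illegal bundle: 32-bit encoding in second half")
    return [(0, 2), (2, 2)]
```

The density cost shows up as the illegal case: a lone 16-bit instruction can't be followed by a 32-bit one in the same bundle, so the compiler would sometimes have to pad or pick the 4-byte form.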
The problem with this approach is that you get more and more wasted decode work as you make the decoder wider. Then again, uop caches exist for a reason and generally work well; they are an excellent choice if you have a hard-to-decode ISA.