Matt Pharr is the co-author of the popular and excellent book “Physically Based Rendering”, among other things.
The story of how he ended up building a shader-style SIMD compiler at Intel is interesting and approachable because he’s not a compiler expert but a graphics programmer. Some of the internal politics at Intel sound hair-raisingly awful and may explain why the company is where it is today.
This is a GPU-style (SPMD) compiler targeting CPU SIMD that made different design choices from LLVM and GCC.
Autovectorisation of numerical code has a long and not totally successful history. It turns out to be really difficult to determine which instructions are independent enough to pack their arguments into vector registers.
Where CUDA differs from autovectorisation is by giving that data-independence problem to the programmer. The semantics of the kernel are that the invocations are independent of one another, so the compiler doesn't need to prove it, and if you've written code where that has unwanted behaviour it's your problem.
Where ispc differs from llvm is by immediately mapping that implicitly vectorised code onto vector operations in the compiler front end and then optimising the vector form. llvm keeps the instructions in the scalar style until very near codegen.
I think ispc made the better choice here. It makes control flow easier to optimise and basic blocks harder to optimise than llvm's approach and that looks like the right tradeoff. However there's an element of the grass is always greener here and it's an expensive experiment to change llvm over to predicated vector IR to find out whether it is better in practice.
The semantics of the CUDA (or OpenCL) kernels are identical to the OpenMP Fortran "parallel do" and OpenMP C/C++ "parallel for" (which derive from similar syntax proposed by C.A.R. Hoare in 1978 and first implemented in the language Occam).
As you say, the difference between autovectorization and OpenMP or CUDA is that for the latter the programmer specifies the independent sequences of operations, so the compiler knows with certainty when they can be safely scheduled to execute concurrently, whether on different processor cores or on different SIMD lanes of the same core. As a result, the performance of the compiled code is predictable and does not vary wildly after minor changes in the source code.
Moreover, the compiler can report errors when the programmer introduces by mistake dependencies between sequences of operations that are intended to be independent.
The C programming language was often thought of as "portable assembler" code. It abstracted over an architecture of machines similar to the PDP-11. As it turned out, for many years this was close enough to the mainstream computer architectures in use to be very successful.
ISPC does the best job of any language I've seen at abstracting SIMD hardware. It is very well thought out and very usable.
But ISPC never got traction. There was no corporate support; it didn't get incorporated into any high-usage technology stacks; it has struggled to find a sustaining community of users.
It should be noted that compilers able to target SIMD lanes when compiling SPMD programs (SIMT in NVIDIA's non-standard terminology) rely on the ability to apply an execution mask to each operation, in order to implement conditional expressions or statements.
This masking feature is intrinsic in the Larrabee New Instructions a.k.a. AVX-512 a.k.a. AVX10, which are available in AMD Zen 4, Intel server/workstation CPUs and in a few obsolete laptop/desktop Intel CPUs, and it is also intrinsic in Armv9-A CPUs and in all GPUs.
In the original SSE ISA and in its early extensions, implementing SIMD lane masking was very inefficient.
However, in 2008 the Intel Penryn New Instructions, a.k.a. SSE4.1, introduced a set of BLEND instructions, which perform SIMD lane masking.
On any Intel/AMD CPU that does not support AVX-512/AVX10, but which supports SSE4.1 or AVX/AVX2, the equivalent of an AVX-512 instruction with masking is obtained by pairing an instruction without masking with a BLEND instruction.
Because of this, the ISPC compiler supports all such CPUs with SSE4.1 or newer ISAs.
Nevertheless, in comparison with AVX-512, using BLEND is worse than just doubling the number of instructions, because when masking is implemented inside an instruction it also reduces the power consumption for the inactive lanes and it also ensures that the inactive lanes do not generate exceptions.
Worth reading the entire series, IMO. They are also nicely linked with "next part" at the bottom (and this linking is not unique to that volta/ispc series of blog posts on Matt's blog).
I remember being very interested in ispc when it appeared, since the model was just right. It wasn't obvious to me that it was going to stick around, though (and I used to be irrationally wary of Intel software), so I decided against using it at my day job.
I find it very weird that GPU shader languages (and CUDA) have solved the vectorization problem so successfully: it's almost unheard of for people to write raw SIMD vector code there (not even sure you can). Yet on the CPU, a far more flexible architecture, it's still almost completely unsolved.
Autovectorization is unreliable, more like a 'take what you can get' or 'toy example' nice boost than something to be relied on.
People even say that intrinsics tend to suck bad compared to hand-written assembly, which is the only true way. Almost nobody uses tools like ispc, which I find puzzling.
There's a degree of convergent evolution between CPUs and GPUs. The basic question is "what shall we do while waiting for memory". Branch prediction and speculative execution gives you a CPU, swap to another thread gives you a GPU.
The wide vector unit vs scalar unit choice is sort of orthogonal to that. I'd guess you reach for a GPU if the problem decomposes into lots of similar independent operations, at which point vector units are also what you want. And a CPU when there are deep data dependencies, where you're less likely to be able to pack the vector.
Implicitly vectorised code on a CPU might be interesting given the increasing variety of avx instructions. I'm interested in scalar workloads on a GPU but there's a credible chance performance won't work out well enough.
Shaders run on different hardware than regular code, both in the amount of independent work in flight at any time and in a memory hierarchy designed for very different workloads. I think it's the memory that explains why one has been more successful.
A more practical reason in terms of ergonomics is probably that by the time you have a problem that fits a different programming model but still compiles to SIMD you're basically in GPU territory already.