Note that compilers able to target SIMD lanes when compiling SPMD programs (SIMT in NVIDIA's non-standard terminology) rely on being able to apply an execution mask to each operation in order to implement conditional expressions and statements.
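As a rough illustration of what that lowering looks like, here is a scalar C emulation of a masked vector unit. It is only a sketch: the 8-lane width and the helper names (`masked_neg`, `masked_double`, `spmd_select`) are made up for the example, and a real compiler would emit vector instructions rather than loops.

```c
#include <stdbool.h>
#include <stddef.h>

#define LANES 8  /* hypothetical vector width, chosen for illustration */

/* Masked operation: only lanes whose mask entry is set are written. */
static void masked_neg(float *dst, const float *src, const bool *mask) {
    for (size_t i = 0; i < LANES; i++)
        if (mask[i]) dst[i] = -src[i];
}

/* A second masked operation, used for the other branch. */
static void masked_double(float *dst, const float *src, const bool *mask) {
    for (size_t i = 0; i < LANES; i++)
        if (mask[i]) dst[i] = 2.0f * src[i];
}

/* SPMD source, written per lane:
 *     if (x < 0) y = -x; else y = 2 * x;
 * Lowered form: both branches are issued over the whole vector,
 * each one under its own execution mask. */
void spmd_select(float *y, const float *x) {
    bool then_mask[LANES], else_mask[LANES];
    for (size_t i = 0; i < LANES; i++) {
        then_mask[i] = x[i] < 0.0f;    /* a vector compare produces the mask */
        else_mask[i] = !then_mask[i];  /* complement mask for the else branch */
    }
    masked_neg(y, x, then_mask);
    masked_double(y, x, else_mask);
}
```

Each lane ends up with the result of exactly one branch, even though both branches were executed across the full vector.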
This masking feature is intrinsic to the Larrabee New Instructions a.k.a. AVX-512 a.k.a. AVX10, which are available in AMD Zen 4, in Intel server/workstation CPUs and in a few obsolete Intel laptop/desktop CPUs. It is also intrinsic to Armv9-A CPUs and to all GPUs.
In the original SSE ISA and in its early extensions, implementing SIMD lane masking was very inefficient.
However, in 2008 the Intel Penryn New Instructions, a.k.a. SSE4.1, introduced a set of BLEND instructions, which perform SIMD lane masking.
On any Intel/AMD CPU that does not support AVX-512/AVX10, but which supports SSE4.1 or AVX/AVX2, the equivalent of an AVX-512 instruction with masking is obtained by pairing an instruction without masking with a BLEND instruction.
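The pairing can be sketched in plain C, again emulating the vector unit with loops. The helper names (`add_all`, `blend_lanes`, `masked_add_via_blend`) and the 8-lane width are hypothetical; `blend_lanes` only plays the role that a BLEND-family instruction would play in real code.

```c
#include <stdbool.h>
#include <stddef.h>

#define LANES 8  /* hypothetical vector width, chosen for illustration */

/* Unmasked add: every lane is computed, whether it is needed or not. */
static void add_all(float *dst, const float *a, const float *b) {
    for (size_t i = 0; i < LANES; i++)
        dst[i] = a[i] + b[i];
}

/* Blend: per lane, take `on` where the mask is set and `off` elsewhere. */
static void blend_lanes(float *dst, const float *off, const float *on,
                        const bool *mask) {
    for (size_t i = 0; i < LANES; i++)
        dst[i] = mask[i] ? on[i] : off[i];
}

/* Emulates a merge-masked add: inactive lanes of dst keep their old value. */
void masked_add_via_blend(float *dst, const float *a, const float *b,
                          const bool *mask) {
    float tmp[LANES];
    add_all(tmp, a, b);                /* operation 1: unmasked add   */
    blend_lanes(dst, dst, tmp, mask);  /* operation 2: select by mask */
}
```

With native masking, the same effect would come from a single masked add, with no temporary and no second operation.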
Because of this, the ISPC compiler supports all such CPUs with SSE4.1 or newer ISAs.
Nevertheless, compared with AVX-512, using BLEND costs more than just a doubled instruction count: when masking is implemented inside an instruction, it also reduces power consumption for the inactive lanes and ensures that those lanes do not generate exceptions.
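To make the exception point concrete, here is one workaround a compiler could plausibly use when it has only unmasked operations plus a blend: replace the operands of inactive lanes with harmless values before the operation. This is my own illustrative assumption, not something taken from a specific compiler, and the name `masked_div_via_blend`, the 8-lane width and the returned operation count are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define LANES 8  /* hypothetical vector width, chosen for illustration */

/* Emulates a merge-masked divide using only unmasked operations and blends.
 * Returns the number of vector-wide operations issued, for comparison only. */
int masked_div_via_blend(float *dst, const float *a, const float *b,
                         const bool *mask) {
    float safe_b[LANES], tmp[LANES];

    /* Operation 1 (blend): give inactive lanes a harmless divisor, so the
     * unmasked divide below cannot divide by zero in lanes nobody asked for. */
    for (size_t i = 0; i < LANES; i++)
        safe_b[i] = mask[i] ? b[i] : 1.0f;

    /* Operation 2: unmasked divide over all lanes. */
    for (size_t i = 0; i < LANES; i++)
        tmp[i] = a[i] / safe_b[i];

    /* Operation 3 (blend): keep the old dst value in inactive lanes. */
    for (size_t i = 0; i < LANES; i++)
        dst[i] = mask[i] ? tmp[i] : dst[i];

    return 3;
}
```

In this sketch the emulation needs three vector-wide operations where a natively masked divide would need one, and every lane still does the work of a division.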