For what it's worth, at least for HPC-ish distributed computing, this sort of thing turns out not to be terribly worthwhile. We already have a standard for distributing computation, shared memory, I/O, and process startup in MPI (and, for instance, DMTCP to migrate the distributed application if necessary, though I think DMTCP needs a release).
I don't know what its current status is, but the HPC-ish Bproc system has/had an rfork [1]. Probably the most HPC-oriented SSI system, Kerrighed died, as did the Plan 9-ish xcpu, though that was a bit different.
The biggest benefit is arguably that codes designed for "telefork" (and perhaps remote threads) can also scale down to a single shared-memory machine and run way more efficiently than if they had been written against MPI, while adding little or no overhead when running on a cluster, assuming the codes are designed properly.
Just doing a fork may be sufficient for something embarrassingly parallel, but the interesting problems are tightly coupled. Obviously MPI scales down to a single node (a distributed system anyway, these days), typically as real forked processes, but possibly with all the ranks in a single process, given an appropriate implementation.
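To illustrate the embarrassingly-parallel case where a plain fork really is enough: a minimal sketch using Python's stdlib multiprocessing (fork-based on Unix) rather than MPI. The `work` function is a made-up placeholder for an independent unit of computation.

```python
# Sketch: embarrassingly parallel work via plain forked processes.
# Each worker gets an independent chunk of the input and there is no
# inter-process communication -- which is exactly what makes it easy.
from multiprocessing import Pool

def work(x):
    # Placeholder for an independent unit of computation.
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:               # 4 forked worker processes
        results = pool.map(work, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The moment the workers need to exchange data mid-computation (the tightly coupled case), this model stops being sufficient and you are back to message passing or shared memory.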
Citation needed, as they say, for "run way more efficiently", particularly as the conventional wisdom favours shared memory in a single process (e.g. OpenMP) on a single node.
“Acknowledgements: ... NUMA and Amdahl’s Law, for holding OpenMP back and keeping MPI-only competitive in spite of the ridiculous cost of Send-Recv within a shared-memory domain.” — Jeff Hammond, ‘MPI+MPI’
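The point of that quote is that Send-Recv between ranks on the same node still pays a serialize-and-copy cost, while threads in one process just read the same memory. A toy contrast in stdlib Python terms (standing in for MPI vs. OpenMP, not using either):

```python
# Toy contrast: a process must have data serialized and copied through a
# pipe, while a thread in the same process reads the shared object directly.
import threading
from multiprocessing import Pipe, Process

def child(conn):
    # Receives a pickled copy of the list, sends back a pickled result.
    conn.send(sum(conn.recv()))

if __name__ == "__main__":
    data = list(range(1000))

    # Thread: shared address space, no copy of `data`.
    totals = []
    t = threading.Thread(target=lambda: totals.append(sum(data)))
    t.start(); t.join()

    # Process: `data` is pickled, pushed through the pipe, and unpickled.
    parent_conn, child_conn = Pipe()
    p = Process(target=child, args=(child_conn,))
    p.start()
    parent_conn.send(data)
    result = parent_conn.recv()
    p.join()

    assert totals[0] == result == sum(range(1000))
```

Whether the copy actually dominates depends on message sizes and the MPI implementation (many use shared-memory transports intra-node), which is part of why the OpenMP-vs-MPI-only question stays contested.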
1. https://www.penguinsolutions.com/computing/documentation/scy...