
It would be interesting to try to emulate a many-core CPU as a GPU program and then run an OS on it.

This sounds like a dumb idea, and it probably is. But consider a few things:

* NVIDIA GPUs have exceptional memory bandwidth, and memory can be a slow resource on CPU-based systems (perhaps limited more by latency than by bandwidth).

* The clock speed isn't that slow; it's in the GHz range. Still, the clocks per emulated instruction may not be great.

* You can still do pipelining, maybe enough to get the clocks-per-instruction down.

* Branch prediction can be done with ample resources. RNN-based predictors are a shoo-in.

* Communication between "cores" should be fast.

* A many-core emulated CPU might not do too badly for some workloads.

* It would have good SIMD support.

Food for thought.



Generally speaking, emulating special-purpose hardware in software slows things down a lot, so I don't think relying on a software branch predictor is going to get performance anywhere close to what you'd see in, say, an ARM A53. And since you have to trade off clock cycles spent in your branch predictor against clock cycles in your main thread, I think it would be a net loss. Remember that even though NVidia calls each execution port a "core", it can only execute one instruction across all of them at a time. The advantage over regular SIMD is that each shader processor tracks its own PC and only executes the broadcast instruction if it's appropriate, allowing diverging control flow across functions in ways that normal SIMD+mask would have a very hard time with outside the lowest level of a compute kernel.

That also means that you can really only emulate as many cores as the NVidia card has streaming multiprocessors, not as many as it has shader processors or "cores".

Also, while it's true that GPUs have huge memory bandwidth, they achieve it by trading off against memory latency. You can think of GPUs as throughput-optimized compute devices and CPUs as latency-optimized compute devices and not be very misled.

So I expect the single-threaded performance of an NVidia general-purpose computer to be very low in cases where the memory and branch patterns aren't obvious enough to be predictable by the compiler. Not unusably slow, but something like the original Raspberry Pi.

Each emulated core would certainly have very good SIMD support, but at the same time, pretending they're just SIMD would sacrifice the extra flexibility that NVidia's SIMT model gives you.


Remember that even though NVidia calls each execution port a "Core" it can only execute one instruction across all of them at a time.

There are clever ways around this limitation; see the links in my post elsewhere in this thread.

https://news.ycombinator.com/item?id=16892107


Those are some really clever ways to make sure that all the threads in your program are executing the same instruction, but they don't get around the problem. Thanks for linking that video, though.


The key to the Dietz system (MOG) is that the native code the GPU runs is a bytecode interpreter. The bytecode "instruction pointer", together with other state, is just data in registers and memory that the native interpreter acts on. So for each thread, the instruction pointer can point at a different command: the interpreter runs the same native instructions, but the results differ per thread. Effectively, you are simulating a general-purpose CPU running a different instruction on each thread. There are further tricks required to make this efficient, of course, but you are running a different general-purpose instruction per thread (it actually runs MIPS assembly, as I recall).


This is more or less what I'm talking about. I wonder what possibilities lie in applying the huge numerical-computation capacity of a GPU to the predictive parts of a CPU, such as memory-prefetch prediction, branch prediction, etc.

Not totally dissimilar to the thinking behind NetBurst, which seemed to be all about having a deep pipeline and keeping it fed with quality predictions.


I'm not sure if your idea in particular is possible, but who knows. There may be fundamental limits to speeding up computation based on speculative look-ahead, no matter how many parallel tracks you have, and it may run into memory-throughput issues.

But take a look at the MOG code and see what you can do.

Check out H. Dietz's stuff. Links above.


Support for specialized CPU functions won't happen and doesn't make sense.

However, it is quite feasible to emulate, on a GPU, a networked group of general-purpose CPUs (i.e., run MIMD [1] programs on a SIMD [2] architecture). This, MOG [3], has been a project of Henry G. Dietz of the University of Kentucky. Unfortunately, the project seems to have stalled at a "rough" level. He claims that he can run MIMD programs at 1/4 efficiency while also running SIMD programs at near-full efficiency. His video is instructive [4].

Edit: Note that this isn't intended for deep-learning applications as such, but rather for traditional supercomputing applications (weather prediction, other physics simulations, etc.).

[1] https://en.wikipedia.org/wiki/MIMD
[2] https://en.wikipedia.org/wiki/SIMD
[3] http://aggregate.org/MOG/
[4] https://www.youtube.com/watch?v=FZ6efZFlzRQ


How would that work when the individual cores all have to be running the same instruction at any given time? Where does the ability to emulate a CPU come from?



