
It would be interesting to try to emulate a many-core CPU as a GPU program and then run an OS on it.

This sounds like a dumb idea, and it probably is. But consider a few things:

* NVIDIA GPUs have exceptional memory bandwidth, and memory can be a slow resource on CPU-based systems (perhaps limited more by latency than by bandwidth).

* The clock speed isn't that slow; it's in the GHz range. Still, the clocks per emulated instruction may not be great.

* You can still do pipelining, maybe enough to get the clocks-per-instruction down.

* Branch prediction can be done with ample resources. RNN-based predictors are a shoo-in.

* Communication between "cores" should be fast.

* A many-core emulated CPU might not do too badly for some workloads.

* It would have good SIMD support.

Food for thought.



Generally speaking, emulating special-purpose hardware in software slows things down a lot, so I don't think relying on a software branch predictor is going to get performance anywhere close to what you'd see in, say, an ARM A53. And since you have to trade off clock cycles spent in your branch predictor against clock cycles in your main thread, I think it would be a net loss. Remember that even though NVidia calls each execution port a "core", it can only execute one instruction across all of them at a time. The advantage over regular SIMD is that each shader processor tracks its own PC and only executes the broadcast instruction if it's appropriate, allowing diverging control flow across functions in ways that normal SIMD+mask would have a very hard time with outside the lowest level of a compute kernel.

That also means that you can really only emulate as many cores as the NVidia card has streaming multiprocessors, not as many as it has shader processors or "cores".

Also, while it's true that GPUs have huge memory bandwidth, they achieve it by trading off against memory latency. You can think of GPUs as throughput-optimized compute devices and CPUs as latency-optimized compute devices and not be very misled.

So I expect the single-threaded performance of an NVidia general-purpose computer to be very low in cases where the memory and branch patterns aren't obvious enough to be predictable by the compiler. Not unusably slow, but something like the original Raspberry Pi.

Each emulated core would certainly have very good SIMD support, but at the same time, pretending they're just SIMD would sacrifice the extra flexibility that NVidia's SIMT model gives you.


Remember that even though NVidia calls each execution port a "Core" it can only execute one instruction across all of them at a time.

There are clever ways around this limitation; see the links in my post elsewhere in this thread.

https://news.ycombinator.com/item?id=16892107


Those are some really clever ways to make sure that all the threads in your program are executing the same instruction, but they don't get around the problem. Thanks for linking that video, though.


The key to the Dietz system (MOG) is that the native code the GPU runs is a bytecode interpreter. The bytecode "instruction pointer", together with other state, is just data in registers and memory that the native interpreter acts on. So for each thread, the instruction pointer can point at a different command: the interpreter runs the same native instructions, but the results differ per thread. Effectively, you are simulating a general-purpose CPU running a different instruction on each thread. There are further tricks required to make this efficient, of course, but you are running a different general-purpose instruction per thread (it actually runs MIPS assembly, as I recall).


This is more or less what I'm talking about. I wonder what possibilities lie in applying the huge numerical-computation capacity of a GPU to the predictive parts of a CPU, such as memory-prefetch prediction, branch prediction, etc.

Not totally dissimilar to the thinking behind NetBurst, which seemed to be all about having a deep pipeline and keeping it fed with quality predictions.


I'm not sure if your idea in particular is possible, but who knows. There may be fundamental limits to speeding up computation based on speculative look-ahead, no matter how many parallel tracks you have, and it may run into memory-throughput issues.

But take a look at the MOG code and see what you can do.

Check out H. Dietz's stuff. Links above.


Support for specialized CPU functions won't happen and doesn't make sense.

However, it is quite feasible to emulate, on a GPU, a networked group of general-purpose CPUs (i.e., run MIMD [1] programs on a SIMD [2] architecture). This, MOG [3], has been a project of Henry G. Dietz of the University of Kentucky. Unfortunately, the project seems to have stalled at a "rough" level. He claims that he can run MIMD programs at 1/4 efficiency while also running SIMD programs at near-full efficiency. His video is instructive [4].

Edit: Note that this isn't intended for deep-learning applications as such, but rather for traditional supercomputing applications (weather prediction, other physics simulations, etc.).

[1] https://en.wikipedia.org/wiki/MIMD
[2] https://en.wikipedia.org/wiki/SIMD
[3] http://aggregate.org/MOG/
[4] https://www.youtube.com/watch?v=FZ6efZFlzRQ


How would that work when the individual cores all have to be running the same instruction at any given time? Where does the ability to emulate a CPU come from?



