Disclaimer: it's my project, but I run an open source project called deeplearning4j, whose algorithms have a hardware abstraction layer built into them called nd4j. You get NumPy on the JVM, with hardware support delivered as a jar file. Deeplearning4j itself is built on top of that. I'd love to help spread deep learning to different runtimes.
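To illustrate the idea of a hardware abstraction layer (this is a hypothetical sketch, not nd4j's actual API — the `NDBackend`/`CpuBackend` names below are made up): you code against one array interface, and the backend underneath is swappable.

```java
// Hypothetical sketch of a hardware abstraction layer, NOT nd4j's real API.
// nd4j selects its backend from whichever jar is on the classpath; here we
// fake that with a plain interface and a CPU implementation.
interface NDBackend {
    // c = a * b for row-major square matrices of size n x n
    double[] gemm(double[] a, double[] b, int n);
}

class CpuBackend implements NDBackend {
    public double[] gemm(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++)
                    c[i * n + j] += aik * b[k * n + j];
            }
        return c;
    }
}

public class HalDemo {
    public static void main(String[] args) {
        // A CUDA or OpenCL backend would slot in here without changing callers.
        NDBackend backend = new CpuBackend();
        double[] identity = {1, 0, 0, 1};
        double[] m = {1, 2, 3, 4};
        System.out.println(java.util.Arrays.toString(backend.gemm(identity, m, 2)));
        // → [1.0, 2.0, 3.0, 4.0]
    }
}
```

The point is that user code never mentions CUDA or OpenCL directly; only the backend implementation changes.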
Need to run empirical benchmarks. CUDA is usually faster, but I'd like to run my own benchmarks with nd4j. We have our own benchmark setup that works for every backend, which lets us do some interesting things. CUDA itself usually wins on data transfer latency, though[1].
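For anyone rolling their own numbers on the JVM, the classic pitfall is skipping JIT warm-up. A minimal backend-agnostic timing sketch (plain Java, no nd4j — the workload and iteration counts are arbitrary placeholders):

```java
public class BenchSketch {
    // Toy workload standing in for a real op (gemm, conv, etc.)
    static double workload(int n) {
        double acc = 0;
        for (int i = 1; i <= n; i++) acc += Math.sqrt(i);
        return acc;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // Warm-up: let the JIT compile the hot path before measuring.
        for (int i = 0; i < 10; i++) workload(n);

        int runs = 20;
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            workload(n);
            long elapsed = System.nanoTime() - t0;
            if (elapsed < best) best = elapsed;
        }
        System.out.printf("best of %d runs: %.3f ms%n", runs, best / 1e6);
    }
}
```

Reporting the best of several timed runs is one simple way to reduce noise; for serious JVM benchmarking a harness like JMH handles this properly.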
Looking forward to running these ourselves once our OpenCL support lands (only the kernels are written so far =/)
I plan on basing the OpenCL work on our CUDA work, which is fairly well established at this point (mainly doing optimizations now; not much change in architecture).
I wouldn't be surprised if this turned out not to be accidental. I mean, it certainly doesn't hurt Nvidia for OpenCL to continue to be seen as the "slower option". So I'm sure efforts to improve their OpenCL implementation aren't considered a business priority.
True, though if you compare roughly equivalent Nvidia and AMD GPUs, my impression is that CUDA on Nvidia still outperforms OpenCL on AMD for deep learning. Is this right?
Here's the current work being done on opencl: https://github.com/deeplearning4j/nd4j/tree/master/nd4j-jocl...
We'd love to get this finished. There's a bit more to do yet, though... we're definitely looking for contributors here. You'll get OpenCL neural nets for free.