Palli
FYI: I see multi-GPU was done with Flux in 2018 (though I guess not with Distributed, then):
Also interesting:
On the Nvidia blog in 2017:
On average, the CUDAnative.jl ports perform identical to statically compiled CUDA C++ (the difference is ~2% in favor of CUDAnative.jl, excluding nn).