Performance issue with broadcasting of custom array wrapper wrapping a CuArray

There have been some developments (see "Code using Flux slow on GPU", post #10 by maleadt), but I'm not sure whether anybody is actively working on it. For now, Adapt.jl plus some quirks in CuArrays.jl/GPUArrays.jl covers most uses, except for custom wrappers and for arrays nested in multiple wrappers.
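
For a custom wrapper, the usual way to opt in to Adapt.jl is to add an `Adapt.adapt_structure` method that rebuilds the wrapper around the adapted parent array. Here is a minimal sketch that runs on the CPU. `MyWrapper` and the `ToFloat64` adaptor are hypothetical names made up for illustration, standing in for your wrapper type and for a GPU adaptor:

```julia
using Adapt

# Hypothetical wrapper type, for illustration only.
struct MyWrapper{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end

Base.size(w::MyWrapper) = size(w.data)
Base.getindex(w::MyWrapper, i::Int...) = w.data[i...]

# Tell Adapt.jl how to rebuild the wrapper around an adapted parent array.
Adapt.adapt_structure(to, w::MyWrapper) = MyWrapper(adapt(to, w.data))

# Hypothetical adaptor standing in for a GPU adaptor: it converts the
# underlying storage to Float64 instead of moving it to a device.
struct ToFloat64 end
Adapt.adapt_storage(::ToFloat64, x::AbstractArray) = Float64.(x)

w = MyWrapper(Float32[1 2; 3 4])
w64 = adapt(ToFloat64(), w)   # still a MyWrapper, now wrapping Float64 storage
```

As far as I can tell, this only handles converting the wrapped storage. It does not by itself make broadcasting over the wrapper dispatch to the GPU implementation, which is the part that is still open.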