I think this is something that actually isn’t too far off, with a bit of work. There already seems to be at least one package investigating this idea (https://github.com/JuliaDiffEq/AutoOffload.jl), and GPUifyLoops and GPUArrays both implement operations (loops vs. array ops) that can essentially be written once and executed on different devices without substantial changes.
In my mind, all that one would need to do to achieve efficient, automated offload of computations to whatever devices are available is the following:
- A unified mechanism to query all available compute devices and their topologies, then load the appropriate packages if installed (for example, Hwloc.jl plus the detection methods from CUDAdrv/CUDAapi)
- A means to annotate or statically/dynamically analyze code for data access and compute patterns (probably the hardest part, though solutions definitely exist in the current literature)
- A package that ties the above together and makes the actual resource allocations and compute assignments, probably with some scheduling for longer-running, dynamic computations (also difficult to do well, but for simple problems, “obvious” solutions may exist)
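For the first point, here is a minimal sketch of what a unified device query might look like. Everything here (`ComputeDevice`, `detect_devices`, the `:cpu`/`:gpu` kinds) is an invented name for illustration, not an existing API; a real implementation would pull topology from Hwloc.jl and probe GPUs via CUDAdrv/CUDAapi. The only real calls used are from Base:

```julia
# Hypothetical device registry -- these names are illustrative, not a real API.
struct ComputeDevice
    kind::Symbol          # :cpu or :gpu
    name::String
    memory_bytes::Int
end

function detect_devices()
    devices = ComputeDevice[]
    # The host CPU is always available; query basic facts from Base.Sys.
    push!(devices, ComputeDevice(:cpu, Sys.cpu_info()[1].model, Int(Sys.total_memory())))
    # Only probe for GPUs if the relevant package is actually installed;
    # Base.find_package returns `nothing` when a package is absent.
    if Base.find_package("CUDAdrv") !== nothing
        # Here one would load CUDAdrv and enumerate GPUs into `devices`.
    end
    return devices
end

devs = detect_devices()
```

The point of gating on `Base.find_package` is that the query layer stays loadable on machines without any GPU stack installed.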
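For the second point, the annotation route is easier to sketch than full static analysis. Below is a hypothetical trait-style annotation, where the user tags a kernel with its access pattern and a trivial "analyzer" decides whether it is a GPU candidate. All names (`AccessPattern`, `AnnotatedKernel`, `gpu_candidate`) and the size thresholds are made up for illustration:

```julia
# Hypothetical access-pattern annotations -- invented names, illustrative thresholds.
abstract type AccessPattern end
struct ElementwisePattern <: AccessPattern end   # embarrassingly parallel
struct ReductionPattern   <: AccessPattern end   # needs tree/atomic reduction

struct AnnotatedKernel{P<:AccessPattern,F}
    f::F
end
AnnotatedKernel(::P, f::F) where {P,F} = AnnotatedKernel{P,F}(f)

# A trivial "analysis": elementwise kernels pay off on a GPU at modest sizes,
# reductions only at much larger sizes (the numbers are purely illustrative).
gpu_candidate(::AnnotatedKernel{ElementwisePattern}, n::Int) = n > 10_000
gpu_candidate(::AnnotatedKernel{ReductionPattern},   n::Int) = n > 1_000_000

k = AnnotatedKernel(ElementwisePattern(), x -> x .^ 2)
```

Dispatching on the pattern type keeps the decision logic extensible: adding a new pattern is one subtype plus one `gpu_candidate` method, with no changes to existing code.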
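For the third point, one of those "obvious" solutions for simple problems is a greedy list scheduler. The sketch below assigns each task to the currently least-loaded device, taking tasks in decreasing cost order (the classic longest-processing-time-first heuristic). The function name and the cost model are assumptions; a real scheduler would also account for data-transfer costs and measured runtimes:

```julia
# Hypothetical greedy scheduler -- illustrative only; ignores transfer costs.
function assign_tasks(task_costs::Vector{Float64}, ndevices::Int)
    load = zeros(ndevices)
    assignment = Int[]
    # Longest-processing-time-first: place big tasks while loads are balanced.
    for cost in sort(task_costs; rev=true)
        dev = argmin(load)        # least-loaded device so far
        load[dev] += cost
        push!(assignment, dev)
    end
    return assignment, load
end

assignment, load = assign_tasks([5.0, 3.0, 2.0, 2.0], 2)
```

For longer-running, dynamic computations this would need to become an online scheduler that re-estimates costs as tasks complete, but the static version already covers the simple case the bullet describes.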