I can only speak from my own experience. Firstly, each of the programming paradigms you mention has its own intricacies that need to be respected and properly used to get the most out of it. For example, the way you would optimize a multi-threaded program is often quite different from the way you would optimize a GPU program or a distributed program, beyond the basic Julia gotchas.
However, the way Julia and multiple dispatch work allows for a kind of functional abstraction: one can define a new array type, e.g. CuArrays.CuArray or DistributedArrays.DArray, and then define some common functions on these types to hide most of the implementation details of GPU and distributed programming. For example, if your program can be written as a series of simple maps and map-reductions, then this is straightforward to support in Julia without code duplication, as all these array types define map and mapreduce. Similarly, mul!, dot and some very basic linear algebra are supported by these array types. So all you have to do in this case is make sure you don’t over-constrain the inputs of your functions, constraining them at most to ::AbstractArray, which the above array types are subtypes of. One of the best examples of this is perhaps IterativeSolvers.cg!, which works for all these array types because it only uses functions that have been defined for all of them.
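To make that concrete, here is a minimal sketch of what "not over-constraining the inputs" can look like. The helper names sumsq and residual! are made up for illustration, and the broadcast line is my own addition beyond the functions named above. The code only runs on plain Arrays here; passing a GPU or distributed array instead would depend on that package defining the same functions.

```julia
using LinearAlgebra

# Hypothetical helper: squared Euclidean norm written only with mapreduce,
# so it works for any array type that defines mapreduce.
sumsq(x::AbstractArray) = mapreduce(abs2, +, x)

# Hypothetical helper: residual r = b - A*x using mul! and broadcasting.
# The arguments are constrained only to abstract array types.
function residual!(r::AbstractVector, A::AbstractMatrix,
                   x::AbstractVector, b::AbstractVector)
    mul!(r, A, x)   # r = A * x, computed in place
    r .= b .- r     # r = b - A * x
    return r
end

A = [2.0 0.0; 0.0 3.0]
x = [1.0, 1.0]
b = [2.0, 3.0]

r = residual!(similar(b), A, x, b)
println(sumsq(r))
```

Nothing in these two functions mentions a concrete array type, which is the same property that lets something like cg! be reused across array types.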
However, if your code is more involved and cannot be written in terms of those defined functions only, then you will have to use dispatch to do your own magic. This can involve a fair bit of code duplication which can be reduced by a careful definition of your building block functions and macros to be reused in all implementations.
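A rough sketch of that pattern, under heavy assumptions: FakeDeviceVector is a made-up wrapper standing in for a GPU or distributed array type, and pointwise, total and backend are hypothetical names. The point is only that the shared building block is written once and dispatch picks the implementation.

```julia
# Hypothetical wrapper type standing in for a device- or cluster-backed array.
struct FakeDeviceVector{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(v::FakeDeviceVector) = size(v.data)
Base.getindex(v::FakeDeviceVector, i::Int) = v.data[i]

# Shared building block, reused by every implementation below.
pointwise(a) = a^2 + 1

# Generic fallback: a plain scalar loop, fine for ordinary CPU arrays.
function total(x::AbstractVector)
    acc = zero(eltype(x))
    for a in x
        acc += pointwise(a)
    end
    return acc
end

# Specialised method selected by dispatch for the wrapper type. A real GPU
# or distributed version would call that package's own primitives here.
total(x::FakeDeviceVector) = mapreduce(pointwise, +, x.data)

# Small helper that reports which code path dispatch would take.
backend(::AbstractVector) = :generic_loop
backend(::FakeDeviceVector) = :specialised

cpu = [1.0, 2.0, 3.0]
dev = FakeDeviceVector(cpu)
println((backend(cpu), total(cpu)))
println((backend(dev), total(dev)))
```

Only the outer total method is duplicated; pointwise is shared, which is the kind of duplication-limiting building block meant above.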
Perhaps as Julia grows, more of these functions and abstractions will already be defined for you, so your off-the-shelf options will grow. But at least for now, AFAIK, if you want to do something somewhat complicated on the GPU and/or on multiple machines, you may have to get your hands dirty with the details of each programming paradigm.
Packages like DiffEq are made to work with GPUArrays without actually having any extra code for handling GPUs, so the overhead can be essentially zero even on large projects. You just have to use the right atomics.
Thank you both. I can move all the ‘expensive’ calculations to arrays as you mentioned. That will leave only a few things which are O(1) relative to the rest, and the GPU is not really needed for them (it would probably even make them slower because of loading into and reading from GPU memory), so I think I can make it work.