Hi, I’m building an ML application and want to pass around a data structure that contains several scalars/arrays/vectors/matrices.
The dimensions will be determined at runtime, depending on the input dataset, etc.
The numeric type may also be chosen at runtime, for performance (lower or higher precision).
Finally, the array type should be determined by system capabilities (CPU/GPU/TPU/NPU, etc.).
After reading the documentation online, I came up with this initial proposal. Please give me any comments on how it can or should be improved, and see the list of questions below.

    using Flux, CUDA
    import StaticArrays: MMatrix
    import GPUArrays: AbstractGPUArray

    struct MyGenericData{T <: AbstractArray, D <: Number, N}
        x1::T
        x2::T

        # CPU
        function MyGenericData{T, D, N}() where T <: Array where D <: Number where N
            new{T, D, N}(Array{D}(rand(D, N, N)),
                         Array{D}(rand(D, N, N)))
        end

        # CUDA GPU
        function MyGenericData{T, D, N}() where T <: CuArray where D <: Number where N
            new{T, D, N}(cu(rand(D, N, N)), cu(rand(D, N, N)))
        end

        # CPU static (slow for large N)
        function MyGenericData{T, D, N}() where T <: MMatrix where D <: Number where N
            new{T, D, N}(MMatrix{N, N, D}(rand(D, N, N)),
                         MMatrix{N, N, D}(rand(D, N, N)))
        end
    end

    function logic(data::MyGenericData)
        return sum(data.x1 * data.x2)
    end

    function logic(d1::MyGenericData, d2::MyGenericData)
        return sum(d1.x1 * d2.x1)
    end

    d1 = MyGenericData{Array, Int32, 64}()
    d2 = MyGenericData{CuArray, Int32, 64}()

    r1 = logic(d1)
    r2 = logic(d2)
    r3 = logic(d1, d1)
    r4 = logic(d2, d2)

    # warning about GPU/CPU crossover
    r5 = logic(d1, d2)

Some questions:

On CPU, what is the right way to hint Julia about the array size? When I use StaticArrays (MMatrix), the code becomes very slow when the data is large enough, and `@code_llvm` shows that operations are unrolled per cell instead of being a single array operation.

For GPU, should I explicitly specify types of data like CuArray, or is it sufficient to use AbstractGPUArray to support other GPU frameworks?

How can I make `logic(d1, d2)` accept only parameters that are both on the same device?

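
To make the same-device question concrete, the kind of restriction I have in mind is sketched below. This is a CPU-only illustration under my own assumptions: `DeviceData` is a simplified stand-in for `MyGenericData`, and `FakeGPUMatrix` is a hypothetical wrapper that only plays the role of a GPU array type, so I am not sure it carries over cleanly to `CuArray`.

```julia
# Simplified stand-in for MyGenericData: A is the array type of both fields.
struct DeviceData{A <: AbstractMatrix}
    x1::A
    x2::A
end

# Hypothetical wrapper that plays the role of a GPU array on a CPU-only machine.
struct FakeGPUMatrix{T} <: AbstractMatrix{T}
    data::Matrix{T}
end
Base.size(a::FakeGPUMatrix) = size(a.data)
Base.getindex(a::FakeGPUMatrix, i::Int, j::Int) = a.data[i, j]

# Matches only when both arguments share the same array type A.
# Note that A includes the element type, so mixed precision is rejected too.
logic(d1::DeviceData{A}, d2::DeviceData{A}) where {A} = sum(d1.x1 * d2.x1)

cpu = DeviceData(ones(Float32, 2, 2), ones(Float32, 2, 2))
gpu = DeviceData(FakeGPUMatrix(ones(Float32, 2, 2)),
                 FakeGPUMatrix(ones(Float32, 2, 2)))

logic(cpu, cpu)    # same array type: dispatches
logic(gpu, gpu)    # same array type: dispatches
# logic(cpu, gpu)  # different array types: throws a MethodError
```

With this signature the mixed call fails at dispatch rather than producing a warning, which is the behaviour I am after, but I do not know whether it is the idiomatic way to express it.
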
How to explicitly free the data from GPU instead of waiting for GC?

How can I make data transfers between CPU and GPU happen only when explicitly requested, and be disallowed otherwise?

If I have more fields, like `x1, ..., x20`, should I define them separately, or in an array (with elements of matrix type `T`), or add an extra dimension and store them in one array? What are the advantages and disadvantages of each approach, e.g. for indexing and loop fusion? Each of these fields will be used separately for a different purpose.

If the constructor has some shared logic (like the `rand(D, N, N)` part here), how can I refactor it so that the shared part is in one place?

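
For the shared-constructor question, the direction I was considering looks roughly like the sketch below. The names `PairData`, `make_pair`, and `to_device` are hypothetical and not part of my code above; I assume `to_device` would be `identity` on the CPU and something like `cu` for CUDA, but only the CPU path is exercised here.

```julia
# Simplified stand-in for MyGenericData with two fields of the same array type.
struct PairData{A <: AbstractMatrix}
    x1::A
    x2::A
end

# The shared part (random initialisation) lives in one place; only the
# conversion function `to_device` differs between backends.
function make_pair(to_device, ::Type{D}, n::Integer) where {D <: Number}
    init() = to_device(rand(D, n, n))   # generate on the CPU, then convert
    return PairData(init(), init())
end

# CPU: no conversion needed, so pass `identity`.
d = make_pair(identity, Float32, 4)
```

Here the array type parameter is inferred from whatever `to_device` returns, so there is a single constructor path instead of one per backend. I am unsure whether generating on the CPU first and then converting is acceptable for large arrays.
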
Could the array type be replaced by a GPU boolean flag as a value type, i.e. `gpu::Val{true}`?

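
To clarify what I mean by a flag as a value type, here is a minimal sketch using `Val` from Base. `FlaggedData` and `on_gpu` are hypothetical names, and the example runs on the CPU only.

```julia
# The first type parameter is a Bool flag rather than an array type.
struct FlaggedData{GPU, A <: AbstractMatrix}
    x1::A
    x2::A
end

# Outer constructor: Val(true) / Val(false) sets the flag parameter.
FlaggedData(::Val{GPU}, x1::A, x2::A) where {GPU, A <: AbstractMatrix} =
    FlaggedData{GPU, A}(x1, x2)

# Read the flag back from the type.
on_gpu(::FlaggedData{GPU}) where {GPU} = GPU

d = FlaggedData(Val(false), zeros(2, 2), zeros(2, 2))
on_gpu(d)   # false
```

One thing I notice in this sketch is that the flag is independent of the actual array type, so nothing stops me from building a `FlaggedData{true}` that holds plain CPU matrices. That may be an argument for keeping the array type as the parameter instead.
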
Are you familiar with existing projects that have a similar data structure? The examples I saw for Flux GPU use single variables that are either on CPU or GPU, or there is `Flux.Functors: @functor`, which transfers fields recursively but is limited in what it can do. Perhaps I can extend it to support my use case?