# Cross-platform (CPU/GPU) data structure for linear algebra

Hi, I’m building an ML application and want to pass around a data structure that contains several scalars/arrays/vectors/matrices.

The dimensions will be determined at runtime, depending on the input dataset etc.
The numeric type may also be determined at runtime, for performance (lower or higher precision).
And finally, the array type should be determined by system capabilities: CPU/GPU/TPU/NPU etc.

After reading the documentation online, I came up with this initial proposal. Please give me all your comments on how it can/should be improved, and see the list of questions below.

```julia
using Flux, CUDA
import StaticArrays: MMatrix
import GPUArrays: AbstractGPUArray

struct MyGenericData{T <: AbstractArray, D <: Number, N}
    x1::T
    x2::T

    # CPU
    function MyGenericData{T, D, N}() where {T <: Array, D <: Number, N}
        new{T, D, N}(Array{D}(rand(D, N, N)),
                     Array{D}(rand(D, N, N)))
    end

    # CUDA GPU
    function MyGenericData{T, D, N}() where {T <: CuArray, D <: Number, N}
        new{T, D, N}(cu(rand(D, N, N)), cu(rand(D, N, N)))
    end

    # CPU static - slow for large N
    function MyGenericData{T, D, N}() where {T <: MMatrix, D <: Number, N}
        new{T, D, N}(MMatrix{N, N, D}(rand(D, N, N)),
                     MMatrix{N, N, D}(rand(D, N, N)))
    end
end

function logic(data::MyGenericData)
    return sum(data.x1 * data.x2)
end

function logic(d1::MyGenericData, d2::MyGenericData)
    return sum(d1.x1 * d2.x1)
end

d1 = MyGenericData{Array, Int32, 64}()
d2 = MyGenericData{CuArray, Int32, 64}()
r1 = logic(d1)
r2 = logic(d2)
r3 = logic(d1, d1)
r4 = logic(d2, d2)

r5 = logic(d1, d2)  # mixed CPU/GPU (see question 3)
```

Some questions:

1. On CPU, what is the right way to hint Julia about the array size? When I use StaticArrays (`MMatrix`), the code becomes very slow when the data is large enough, and `@code_llvm` shows that operations are unrolled per element instead of compiled to a single array operation.

2. For GPU, should I explicitly specify concrete data types like `CuArray`, or is it sufficient to use `AbstractGPUArray` to support other GPU frameworks?

3. How do I make `logic(d1, d2)` accept only parameters that are both on the same device?

4. How do I explicitly free data from the GPU instead of waiting for the GC?

5. How do I make transfers between CPU and GPU explicit only, with implicit copies disallowed?

6. If I have more fields, like `x1, ..., x20`, should I define them separately, store them in an array (with elements of matrix type `T`), or add an extra dimension and keep everything in one array? What are the advantages and disadvantages of each approach, e.g. for indexing and loop fusion? Each of these fields will be used separately for a different purpose.

7. If the constructors share some logic (like the `rand(D, N, N)` part here), how do I refactor it so the shared parts live in one place?

8. Could the array type parameter be replaced by a boolean GPU flag as a value type, i.e. `gpu::Val{true}`?

9. Are you familiar with existing projects that have a similar data structure? The Flux GPU examples I saw use single variables that are either on CPU or GPU, or there is `@functor` from Functors.jl, which transfers fields recursively but is limited in what it can do. Perhaps I can extend it to support my use case?
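Regarding question 7, one common pattern is to hoist the shared `rand(D, N, N)` logic into a single outer constructor and pass in the array adapter as an argument. A minimal standalone sketch (the `MyData` struct and the `to` converter argument here are hypothetical, simplified from the struct in the question):

```julia
# Sketch for question 7: the shared random-init logic lives in one place;
# `to` adapts the result to the target array type (identity for CPU, cu for CUDA, ...).
struct MyData{T <: AbstractArray}
    x1::T
    x2::T
end

function MyData(to, D::Type{<:Number}, N::Integer)
    init() = to(rand(D, N, N))   # the single shared piece of constructor logic
    MyData(init(), init())
end

d_cpu = MyData(identity, Float32, 8)   # CPU
# d_gpu = MyData(cu, Float32, 8)       # CUDA GPU (requires CUDA.jl and a device)
```

This also removes the need for one inner constructor per back-end: adding a new device only means passing a different adapter function.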

`AbstractGPUArray` should be sufficient if you want to dispatch to vendor-neutral GPU implementations (note that not all back-ends build on this abstract type, though most do). If you use CUDA-specific kernels, of course, you’ll need the `CuArray` type.
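As a sketch of that vendor-neutral dispatch (assuming GPUArrays.jl is installed; `process` is a hypothetical function, not part of any library):

```julia
using GPUArrays: AbstractGPUArray

# Generic fallback: matches plain Arrays and anything else CPU-side.
process(x::AbstractArray) = sum(abs2, x)

# GPU path: matches CuArray, ROCArray, oneArray, ... without naming a vendor.
# A custom GPU kernel or GPU-specific algorithm could go here instead.
process(x::AbstractGPUArray) = sum(abs2, x)
```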

`CUDA.unsafe_free!`: it eagerly returns the array's memory to CUDA.jl's pool instead of waiting for the GC; just make sure you never touch the array afterwards.

This is already the case: you need to explicitly call the `Array` or `CuArray` constructors, or use `copyto!`. No automatic copying happens.

I want to support a vendor-neutral implementation, but assert through dispatch that all parameters passed to my function are on the same device (i.e. in `logic(d1, d2)` both `d1` and `d2` are on the CPU or both on the GPU, never mixed).
What’s the proper way to do this?

With vendor-neutral I meant supporting CUDA, AMDGPU, oneAPI, etc., so not the CPU. If you use `AbstractGPUArray` as the GPU type in dispatch, you’ll only match GPU code. Of course, the story becomes more complicated once you use wrapped arrays, but that isn’t specific to GPU programming.
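For the same-device constraint in question 3, requiring both arguments to share the struct's array type parameter gets the assertion for free: Julia simply has no method for mixed combinations. A minimal standalone sketch (hypothetical `SameDeviceData` and `logic2` names, simplified from the struct in the question; a different eltype stands in for a GPU array so the sketch runs without a GPU):

```julia
struct SameDeviceData{T <: AbstractArray}
    x1::T
    x2::T
end

# Both arguments must share the same concrete array type T,
# so a CPU/GPU mix fails with a MethodError instead of silently copying.
logic2(d1::SameDeviceData{T}, d2::SameDeviceData{T}) where {T <: AbstractArray} =
    sum(d1.x1 * d2.x1)

a = SameDeviceData(rand(4, 4), rand(4, 4))
b = SameDeviceData(rand(Float32, 4, 4), rand(Float32, 4, 4))  # stand-in for a GPU-backed value

logic2(a, a)      # fine: both share T == Matrix{Float64}
# logic2(a, b)    # MethodError: no matching method for mixed array types
```

The same pattern works with `CuArray`, `ROCArray`, etc. as `T`, since dispatch only cares that the two type parameters are identical.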