CUDA/AMD/CPU device-agnostic array creation best practice

Dear Julianners,

What is the best practice for device-agnostic array initialisation?
I believe this kind of package has been written by many people already.

For example:

get_worker_type() = :CUDA  # in reality a function that gets the worker's context; simplified here
move2hw(arr) = move2hw(arr, get_worker_type())           # pick the device from the worker context
move2hw(arr, device) = device == :CUDA ? cu(arr) : arr   # also extended with AMD arrays...
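For illustration, usage of the sketch above could look like this (hypothetical example of mine; cu comes from CUDA.jl):

using CUDA  # provides cu

W = move2hw(randn(Float32, 128, 128))     # device picked from the worker context
b = move2hw(zeros(Float32, 128), :CPU)    # explicit device override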

I think there should be a best practice for initialising the arrays on the right device. I used a Symbol here, but I know this should be resolved at compile time, so I will definitely switch to that.

I believe 10 or 100 of us have written switch cases like that… but I haven't found a simple package that solves this issue yet.

The issue is that you want to initialize 10-20 arrays and you don't want to rewrite the code in each case to initialize them in a CUDA, AMD or CPU context.
The flag looks promising for initializing the whole algorithm in the appropriate device's context. But the question arises: is there any standardized way, or do I have to reinvent the best practice in my algorithm again?

I see there are not too many comments on it yet.

Is there a better practice than this?

What would you improve on this then?

using CUDA, AMDGPU  # for CuArray/cu and ROCArray

# no-op when the array is already on the requested device
move2hv(arr::Array, dev::Val{:CPU}) = arr
move2hv(arr::CuArray, dev::Val{:CUDA}) = arr
move2hv(arr::ROCArray, dev::Val{:AMD}) = arr
# otherwise convert to the requested device's array type
move2hv(arr, dev::Val{:CPU}) = Array(arr)
move2hv(arr, dev::Val{:CUDA}) = cu(arr)
move2hv(arr, dev::Val{:AMD}) = ROCArray(arr)

Usage:

init!(data, device::Val{DEVICE_TYPE}=Val(:CPU)) where DEVICE_TYPE = begin
    mv2hw = arr -> move2hv(arr, device)
    input_placeholder = randn(Float32, 1000, 256, 768) |> mv2hw
    # note the parentheses: without them .|> would only apply to the second array
    params, states = (randn(Float32, 1000, 256, 600), randn(Float32, 1000, 256, 200)) .|> mv2hw
    # some more initialisation
end
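Calling it would then look something like this (sketch; assumes CUDA.jl / AMDGPU.jl are loaded and a `data` object exists):

init!(data)               # defaults to Val(:CPU)
init!(data, Val(:CUDA))   # same code, the arrays become CuArrays
init!(data, Val(:AMD))    # or ROCArrays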

What do you think?

Back in the old days, we had GPUArrays working with CuArrays (name modified for security reasons), CLArrays and JLArrays. That was a very promising period :rofl:

Dagger.jl (and DaggerGPU.jl) specifically do this for CUDA and AMDGPU computations automatically, if you use Dagger to execute your computation. If you pair this with KernelAbstractions.jl for writing vendor-agnostic kernels, then you can basically let Dagger take care of all the conversions and data movement for you. See https://github.com/maleadt/juliacon21-gpu_workshop/blob/746ce9cb523cdf0cfe66cbb0cecae182753fe639/case_studies/ImageProcessingDagger.ipynb for an example of this.
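For reference, a minimal Dagger sketch (my illustration, not taken from the workshop notebook; without DaggerGPU processors configured it simply runs on the CPU):

using Dagger

a = rand(Float32, 1024)
b = rand(Float32, 1024)

# Dagger picks the processor for the task and moves `a` and `b` there as needed
t = Dagger.@spawn map(+, a, b)
c = fetch(t)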


We still have that, the CuArray, ROCArray, and ZeArray are built on the AbstractGPUArray from GPUArrays.jl. The JLArray also exists within the GPUArrays package, although I don’t know of anyone who uses it, since you can just use a regular Array. CLArray is dead due to transpilation from Julia to OpenCL C being a pain, and no one wants to do the work to bring it back.
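This also means you can often write one method against the shared supertype instead of one per backend; a small sketch (the on_gpu helper is hypothetical):

using GPUArrays: AbstractGPUArray

# one method covers CuArray, ROCArray, etc.; the fallback handles plain CPU arrays
on_gpu(::AbstractGPUArray) = true
on_gpu(::AbstractArray) = false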


You know what I don't understand here? Where do you specify the context?
With this line:
lilly_gpu = GpuArray(map(RGB{Float32}, lilly))
I think in this scenario we just hard-wire the initialization code, while I want something like this:

# ctx = initialise!(data, device=:CPU)
ctx = initialise!(data, device=:CUDA)
run_algorithm(ctx, input)

The algorithm code is device agnostic thanks to KernelAbstractions.jl.
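For context, a vendor-agnostic kernel with KernelAbstractions.jl can look roughly like this (a sketch; the launch API differs a bit between KernelAbstractions versions):

using KernelAbstractions

@kernel function add3!(c, a, b)
    i = @index(Global)
    @inbounds c[i] += a[i] + b[i]
end

# c, a, b can be Array, CuArray or ROCArray; the backend is taken from the data
function run_add3!(c, a, b)
    backend = get_backend(c)
    add3!(backend)(c, a, b; ndrange = length(c))
    KernelAbstractions.synchronize(backend)
    return c
end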

Worth noting: I realised that cu(array…) is type unstable, so I modified these lines:

move2hw(arr::Array, dev::Val{:CPU}) = arr
move2hw(arr::CuArray, dev::Val{:CUDA}) = arr
move2hw(arr::ROCArray, dev::Val{:AMD}) = arr
move2hw(arr, dev::Val{:CPU}) = Array(arr)
move2hw(arr, dev::Val{:CUDA}) = CuArray(arr)
move2hw(arr, dev::Val{:AMD}) = ROCArray(arr)
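One way to check this on your own inputs (a hypothetical check; @inferred throws if the return type cannot be inferred):

using CUDA, Test

A = randn(Float64, 8, 8)
# cu converts Float64 to Float32 by default, while CuArray(A) keeps the element type,
# so the two calls can infer differently depending on the input
@inferred CuArray(A)
@inferred cu(A)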

I checked out Dagger.jl through your demo, jpsamaroo, about 4-7 months ago. For me it was a bit heavy and had higher latency than I could explain, and I think I gave up on it too quickly. So I have dropped it for now, until I really need that scaling. :confused:
I started to focus on well-written algorithms, optimising every single detail to run fast on CUDA/AMDGPU. Also worth noting that it would be great if KernelAbstractions supported two kinds of indexing for writing optimised kernels, as I realised that the optimal column-major vs row-major traversal is opposite on CPU and GPUs. So we lose close to 10-30x speed due to these indexing problems, depending on whether the kernel is optimised for CPU or GPU; in certain cases it can yield a 100x slowdown, which is crazy, and we are talking about basic algorithms.

That line isn’t actually used for the vendor-agnostic code; if you look at the examples that actually use Dagger, they use the CPU array lilly. lilly_gpu is only used for the negative and brightness examples, which aren’t actually the point of the demo (this code was used for the JuliaCon 2021 GPU workshop, so it uses some other APIs).

That’s fair, although it has definitely improved significantly in the last few months.

How do you figure? Arrays on the CPU and GPU in Julia are (by default) column-major, so the access pattern is the same on either kind of device. Doing row-wise accesses in Julia on the GPU would also be slow, just like they are on the CPU.


How do you figure? Arrays on the CPU and GPU in Julia are (by default) column-major, so the access pattern is the same on either kind of device. Doing row-wise accesses in Julia on the GPU would also be slow, just like they are on the CPU.

I measured a hell of a lot, and sadly the issue really is there and it is VERY serious. I could copy about 15 tests to show it, but here are some results for a basic c += a + b:

# 3D going over tests (I, N, M)
# testNMI:    CPU      |     GPU
# Run1:   3.286851861  |   14.22890929
# Run2:   0.32920117  |   1.835222924
# Run3:   0.335266557  |   1.836910909
# testMNI:    CPU      |     GPU
# Run1:   0.293491544  |   1.299087763
# Run2:   0.108282616  |   0.589418815
# Run3:   0.107899356  |   0.587017632
# testNIM:    CPU      |     GPU
# Run1:   0.726314078  |   2.172605271
# Run2:   0.509902442  |   1.366021527
# Run3:   0.509840066  |   1.364689692
# testMIN:    CPU      |     GPU
# Run1:   0.288986778  |   1.182367412
# Run2:   0.093925633  |   0.440663573
# Run3:   0.093540108  |   0.441261982
# testINM:    CPU      |     GPU
# Run1:   1.460204066  |   0.754965978
# Run2:   1.257153308  |   0.030782206
# Run3:   1.228373336  |   0.028862503
# testIMN:    CPU      |     GPU
# Run1:   1.616498055  |   0.781305138
# Run2:   1.310087368  |   0.030445782
# Run3:   1.322033711  |   0.030818496

The kernels cycle over the indices in different orders.
So each case is the same test, but the results clearly differ depending on the traversal order. The version that is best on the CPU performs badly on the GPU (more than 10 times worse), and the version that is best on the GPU performs about 30 times worse on the CPU. In these scenarios we didn't even use CUDA.Const. So I believe this is really important. Also worth noting that we observed this: Slow SUM on 2D, which is another 2-10x speed difference. For me this means that well-written code can run as if I had 10-100 GPUs.
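To make the traversal-order point concrete, here is a hypothetical CPU-side sketch of two orderings of the same c .+= a .+ b loop (not the exact benchmark code above):

# innermost loop over the first (fastest-varying) dimension: column-major friendly
function add_col!(c, a, b)
    N, M, I = size(c)
    for i in 1:I, m in 1:M, n in 1:N
        @inbounds c[n, m, i] += a[n, m, i] + b[n, m, i]
    end
    return c
end

# innermost loop over the last dimension: strided access, column-major unfriendly
function add_row!(c, a, b)
    N, M, I = size(c)
    for n in 1:N, m in 1:M, i in 1:I
        @inbounds c[n, m, i] += a[n, m, i] + b[n, m, i]
    end
    return c
end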

So I would say KernelAbstractions should help with indexing, as it is definitely something that could still be improved. But I will only be able to post more on this topic later :frowning:

To get back to the topic…
I really want to know whether these 6 functions would be a good way to do the initialization if I don't use Dagger.jl yet. :slight_smile: I don't know if there is a package that has already put a full stop at the end of this question. :smiley: