Map Performance with CuArrays

I’m trying to benchmark different methods of calling the same in-place function on subsets of a large data block using CuArrays. I’m using fft as the example function because I can baseline against planning an fft over one dimension of a large array. Using this as a minimal working example, I created a testbench with the following methods:

  1. Create a list of 1D CuArrays and map f(x) over each element of the list.
  2. Create a 2D CuArray and mapslices f(x) over each row of the array.
  3. Baseline: apply f(x) to the full 2D array in a single batched call.

using FFTW
using BenchmarkTools
using CuArrays
CuArrays.allowscalar(false)

N = 100000
fftsz = 1024

# list of random vectors
A1_d = [CuArray(rand(ComplexF32, fftsz)) for i in 1:N]

# random 2D array
A2_d = CuArray(rand(ComplexF32, N, fftsz))

p1 = plan_fft!(A1_d[1])
p2 = plan_fft!(A2_d, 2)  # batched plan over the length-fftsz dimension, to match (1) and (2)

function onefft!(data, plan)
    plan * data
end

#1
@btime CuArrays.@sync map(x -> onefft!(x, p1), $A1_d)
#2
@btime CuArrays.@sync mapslices(x -> onefft!(x, p1), $A2_d, dims=2)
#3
@btime CuArrays.@sync onefft!($A2_d, p2)

The results are:

629.991 ms (200011 allocations: 3.81 MiB)
3.618 s (13198542 allocations: 440.96 MiB)
20.603 ms (10 allocations: 176 bytes)

I would expect these approaches to perform roughly the same. However, the strangest things I notice are:

  • Why do (1) and (2) allocate memory? All the data should already be on the GPU, and every operation should be in-place.
  • Why are (1) and (2) so different from one another? Even if map carries some penalty, it seems like (1) and (2) should be identical.

These are CPU allocations; use CuArrays.@time to see if there are GPU allocations (likely not). There’s probably some inefficiency in the CuFFT wrappers leading to a couple of small CPU allocations per FFT here.
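
For reference, a minimal sketch of that check, using the variables from the first post; CuArrays.@time reports GPU allocations alongside the usual CPU statistics:

CuArrays.@time map(x -> onefft!(x, p1), A1_d)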

It looks like mapslices is badly implemented. I’m not surprised; we don’t have a GPU-optimized implementation in CuArrays/GPUArrays, so this falls back to the Base implementation, which surprisingly doesn’t use scalar iteration but is still known to be slow.

As for the difference between (1) and (3): it’s always recommended to reduce the number of distinct calls into the GPU so that you saturate the device with as much work as possible. There may also be some synchronizing behavior in how the wrappers are implemented (e.g., copying memory back to check whether an operation succeeded), which serializes the GPU at each individual FFT. So prefer batched interfaces whenever possible.

Thank you for the help. I’ll try and avoid mapslices for now.

This is the core of the question I was hoping to get some community advice on.

I used an FFT example here because I knew that I could explicitly create a batch example as a benchmark (3) with near-optimal performance. In a more general case, it seems to me like the batch approach is somewhat divergent from the Julia design philosophy of writing many simple functions. I’ll try and explain what I mean.

If I follow the CUDA.jl introduction, I get extremely good performance when broadcasting simple functions onto the GPU (e.g., writing a scalar addition function and broadcasting it to add N-dimensional arrays). From my limited CUDA knowledge, I believe this means these simple atomic actions are compiled into “GPU kernels” that are distributed across the GPU in parallel. In other words, this broadcast is a single “distinct call into the GPU.”
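
A minimal sketch of the pattern I mean (my own example, not taken from the tutorial):

using CuArrays

add_five(x) = x + 5f0                    # simple scalar function
A = CuArray(rand(Float32, 1024, 1024))
B = add_five.(A)                         # compiles to a single GPU kernel launch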

This doesn’t seem to scale up to slightly more complicated functions like the example I posted here as (1) (an arbitrary 1D function broadcast across an N-dimensional array). From the Julia user’s perspective, it isn’t clear why the map call I use isn’t a single distinct call into the GPU the way the simple broadcast call is in the CUDA.jl example.

From a Julia perspective I’d rather write a general function foo

map(x -> foo(x, args...), bunch_of_datasets)

instead of an equivalent batched “macro” function bar,

bar(bunch_of_datasets, args...)

Is this fundamentally a GPU and/or CUDA limitation? Is there anything we can do from the Julia user side to help inform when map/broadcast benefit performance and when it hurts? Are there design guides or principles from the Julia/GPU perspective that we can use? As you can tell, I’m still pretty new to all this.

But you were mapping across a CPU collection of GPU arrays, so isn’t it expected that this results in multiple separate calls to the GPU? If you map across a GPU dataset, you can use arbitrary functions and are not constrained to the simple atomic actions you mention. Furthermore, with broadcast fusion and “dots” you can trivially create new, complex kernels that execute on the GPU in a single step. If you need more flexibility, you need to resort to writing your own kernels (which can be arbitrarily complex) or use something like GPUifyLoops for the vendor-agnostic equivalent.
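
To illustrate the fusion point with a small sketch (my own example): all three element-wise operations below fuse into one GPU kernel.

using CuArrays

A = CuArray(rand(Float32, 10^6))
B = CuArray(rand(Float32, 10^6))
C = sin.(A) .+ exp.(B) .* 2f0    # one fused kernel, one call into the GPU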

This is exactly what I wanted to do originally, but I ran into technical difficulties. I’m probably missing something simple. If I create a multidimensional array (2), then it seems like I need to use mapslices, which is broken.

For the first approach I originally tried the following:

A1_d = CuArray([rand(ComplexF32, fftsz) for i in 1:N])

but got

ERROR: type does not have a fixed size

I also tried

A1_d = CuArray([CuArray(rand(ComplexF32, fftsz)) for i in 1:N])

but it crashed the REPL.

Do you have an approach in mind to fix (1) to be a GPU-only data structure I can map across?

You are not creating a multidimensional array here, you are creating an array of arrays. That is not supported by CuArrays, and you need to create a 2D array instead.
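
For example, one way to build that 2D array is to concatenate on the CPU and upload the result in one go (a sketch using the variables from the first post):

vecs = [rand(ComplexF32, fftsz) for i in 1:N]
A2_d = CuArray(reduce(hcat, vecs))   # fftsz×N matrix, one dataset per column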

I think my tests (2) and (3) do create multidimensional arrays on the GPU. My test (2) didn’t work because of the issues you pointed out with mapslices. Is there another way to map or broadcast a function across the rows, columns, etc. of a multidimensional array that would make this work?

What kind of function do you want to map? fft calls into CUFFT, and functions like that need an API to explicitly support processing rows/columns of a dataset (either a batched interface like this, or a dims keyword as with sum or reduce). It would be easier to help if you have a more concrete use case, as the one you posted here (fft) is solved through use of its batched API. I can imagine a maprows (or similar) function, but that would be of limited use on the GPU because you would then be performing a coarse operation to process all elements of the dimension using a single thread. Typically, you need specifically-designed parallel kernels for workloads like that (e.g. a scan / prefix sum).
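
For instance, with the 2D array from the first post, the dims-style batched calls look like this (a sketch using the earlier variables):

fft(A2_d, 2)         # batched FFT of every row, a single call into CUFFT
sum(A2_d, dims=2)    # batched reduction along the same dimension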

If you map across a GPU dataset, you can use arbitrary functions and are not constrained to the simple atomic actions you mention. Furthermore, with broadcast fusion and “dots” you can trivially create new, complex kernels that execute on the GPU in a single step.

I can imagine a maprows (or similar) function, but that would be of limited use on the GPU because you would then be performing a coarse operation to process all elements of the dimension using a single thread. Typically, you need specifically-designed parallel kernels for workloads like that (e.g. a scan / prefix sum).

I guess my confusion comes from the fact that these two comments seem a little contradictory. Your maprows idea is what I was trying to do with mapslices, but if that is executed serially then it isn’t helpful, as you say. If I need to design specific parallel kernels, rather than mapping/broadcasting arbitrary functions, then I must be missing what Julia adds vs CUDA.

Here’s the CPU code for something that is a little more representative:

using FFTW
using LinearAlgebra

N = 1000
M = 1024
L = 1024
# many datasets
datasets = [rand(M, L) for i in 1:N]

p = plan_fft(datasets[1], 1)

function process_dataset(A, p)
    # the specific code here shouldn't be important
    # some signal processing
    B = p * A
    # some linear algebra
    B = B * Diagonal(B) * B'
    # some custom function
    B = some_custom_function.(B)
    return B
end

function some_custom_function(x)
    return sin(exp(x)) + 5
end

function combine_datasets(A)
    B = reduce(+, A)
    return B
end

# wrap static arguments for broadcasting
f(x) = process_dataset(x, p)

# independent processing on each dataset
data_tmp = f.(datasets)

# processing across datasets
data_out = combine_datasets(data_tmp)

Am I correct in concluding that I can’t broadcast on the GPU the way I can on the CPU, at least not efficiently? Based on what has been suggested here, it seems I would need to write a custom process_many_datasets that batch-processes the datasets. To me this seems more “MATLAB-like” and less “Julia-like.”

Do we understand where the limitation is (hardware, software, compiler)?

Am I thinking about this wrong?

OK, that example clarifies a lot. We don’t efficiently support what you’re doing there: broadcasting a function that does its own broadcasting over each inner dataset. If that inner broadcast is coarse enough, i.e., if it does enough work to saturate the GPU, a mapslices-like approach might work (assuming we optimize the implementation of that API). But generally it will be more efficient to create a function that you can broadcast over the entire dataset (not an array of arrays) in one go, e.g., like the batched fft interface in your first post. Some of your operations are element-wise anyway, and the final reduce you can perform in a batched manner with the dims keyword.
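
As a rough sketch of that restructuring (my own example, untested; it skips the B * Diagonal(B) * B' step, which would need a batched linear-algebra routine):

using CuArrays, FFTW

data = CuArray(rand(ComplexF32, M, L, N))   # all N datasets stacked into one 3D array
p = plan_fft(data, 1)                        # one batched plan covering every dataset
B = p * data                                 # one call into CUFFT
B = some_custom_function.(B)                 # one fused element-wise kernel
out = dropdims(sum(B, dims=3), dims=3)       # combine the datasets with a batched reduction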

Don’t expect magic from the GPU. We implement broadcast to compile to a single kernel; you’re broadcasting a function that does an FFT and further broadcasts, and that’s never going to work as a single kernel on the GPU. If you stick to this approach, optimizing mapslices would be the way to go. Otherwise, you’d need to think about restructuring your operations to be dataset-wide, which would have advantages on the CPU too.

Except for the fact you’re writing high-level Julia code instead of CUDA C, you mean? :slightly_smiling_face:


This explanation was extremely helpful.

To summarize:

The issue was that a single call to a function like process_dataset does not saturate the GPU, so I was trying to parallelize that call over many datasets using broadcast (or map). GPU parallel processing is not that general. The solution is to follow the common dims= style interface with a function that explicitly processes many datasets in a single batch.

Also, mapslices needs to be optimized before it can be used effectively even for serial processing on the GPU; see example (2) above.

:+1:

Are you looking for something like this: https://github.com/FluxML/Hydra.jl ?

Re: options (1) and (2) with map and mapslices: shouldn’t those kernels run asynchronously on the GPU? If so, I would expect some inefficiency compared to a monolithic kernel that processes the entire data in a single batch, but not a >30-fold slowdown.

My specific problem is as follows. I have a 44×44 matrix of floats (call it P) that I need to multiply by a 44×1 vector (R). I have 20,000 such products to compute. Storing them as vectors of matrices, map((x,y) -> x*y, P, R) takes 20,000 times longer than P[1]*R[1], so there is seemingly no asynchronous parallelism. I tried pre-allocating the output and using map!, but saw no improvement. I tried asyncmap and got “GPU program failed to execute”. I’m not sure how I could re-factor this to multiply two 3D arrays in that way.

Thanks for any advice.

You probably want to call CUDA’s built-in batched matrix multiplication routine. You can get at the batched routines through NNlib.jl, specifically NNlib.batched_mul!.
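
A minimal sketch of that approach (my own example; it assumes NNlib’s batched_mul! dispatches to CUBLAS’s strided-batched GEMM for CuArrays):

using CuArrays, NNlib

P = CuArray(rand(Float32, 44, 44, 20_000))   # the 20,000 matrices as one 3D array
R = CuArray(rand(Float32, 44, 1, 20_000))    # the 20,000 vectors as 44×1 matrices
out = similar(R)
NNlib.batched_mul!(out, P, R)                # one batched GEMM instead of 20,000 launches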

Thanks for the pointer to NNlib!

Still, I’m curious why the kernel calls do not seem to be asynchronous.

They definitely are; why do you assume they aren’t? Kernel launches aren’t free, of course, so if your kernel’s execution time is shorter than the launch time, you’ll mostly be measuring the cumulative launch overhead.
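
To see this with a small sketch (my own example): timing one tiny in-place kernel versus a loop of 100 mostly measures launch latency, so the loop scales roughly linearly even though the launches themselves are asynchronous.

using CuArrays, BenchmarkTools

x = CuArray(rand(Float32, 44))
@btime CuArrays.@sync ($x .+= 1f0)                       # one tiny kernel: mostly launch latency
@btime CuArrays.@sync for i in 1:100; $x .+= 1f0; end    # ≈100× the launch latency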