TensorCast & CUDA

I’m having problems using TensorCast on CuArrays. When I run the following with normal arrays, it works fine:

using TensorCast
C = ones(10,2)
L = ones(10,3)
@reduce D[m,a] := sum(p) C[p,a] + L[p,m]

3×2 Array{Float64,2}:
 20.0  20.0
 20.0  20.0
 20.0  20.0

But if I do the same with CUDA arrays, it produces an error:

using TensorCast
using CUDA
CUDA.allowscalar(false)
C = cu(ones(10,2))
L = cu(ones(10,3))
@reduce D[m,a] := sum(p) C[p,a] + L[p,m]

ERROR: LoadError: scalar getindex is disallowed

It’s possible to do the CUDA version by using @cast to build an intermediate array, as follows:

using TensorCast
using CUDA
CUDA.allowscalar(false)
C = cu(ones(10,2))
L = cu(ones(10,3))
@cast T[p,m,a] := C[p,a] + L[p,m]
D = reshape(sum(T, dims=1), (3,2))

3×2 CuArray{Float32,2,CuArray{Float32,3,Nothing}}:
 20.0  20.0
 20.0  20.0
 20.0  20.0

But it’s not at all clear to me why these would be different, so I have a few questions:

  • Is there another way to do this operation with @reduce that I’m missing?
  • Is there a performance difference between these (assuming they both worked)?
  • Would the intermediate allocation of T be happening under the hood anyway?

I think the problem is that D is being initialized as an Array, not a CuArray.

Here’s the expansion; the function orient is usually just a reshape that places the dimensions along the :s.

julia> @pretty @reduce D[m,a] := sum(p) C[p,a] + L[p,m]
begin
    local dotterel = orient(PermuteDims(C), (*, :, :))
    local ant = orient(PermuteDims(L), (:, *, :))
    D = dropdims(sum(@__dot__(dotterel + ant), dims = 3), dims = 3)
end

Because reshaping a transposed Array usually gives something very slow, orient sometimes makes a copy, and perhaps here it is incorrectly making a CPU copy. I thought this was fixed in https://github.com/mcabbott/TensorCast.jl/pull/10, but perhaps not. Do you mind opening an issue?
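For intuition, here’s a minimal Base-Julia sketch (using PermutedDimsArray as a stand-in for TensorCast’s PermuteDims) of why reshaping a permuted matrix is the slow step: there is no fast reshape for the wrapper, so Julia returns a lazy Base.ReshapedArray, which is slow to broadcast over — hence the temptation to copy, and a plain-Array copy is exactly what would strand a CuArray on the CPU.

```julia
C = ones(10, 2)                   # stand-in for the real C
P = PermutedDimsArray(C, (2, 1))  # lazy transpose (2×10), like PermuteDims
R = reshape(P, (1, 2, 10))        # no fast reshape exists for the wrapper, so this
                                  # is a Base.ReshapedArray, slow to broadcast over
Rc = copy(R)                      # materializing restores speed — but if the copy
                                  # were made as Array(R), a CuArray would land on the CPU
```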

Your work-around puts the index to be summed first instead of last, in which case no transposing is required. The macro @reduce is unfortunately not smart enough to notice that possibility.
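To spell that out (a sketch in plain Base Julia, not TensorCast’s own code): because the summed index p is already the first dimension of both C and L, the whole computation needs only reshapes, one broadcast, and a dims-wise sum — exactly the CUDA-friendly pattern your @cast version produces. Shown here with ordinary Arrays; wrapping both inputs in cu should behave the same, since no transpose or copy is involved.

```julia
C = ones(10, 2)
L = ones(10, 3)
# p stays in dimension 1 of both arrays, so no transposing is needed:
T = reshape(C, 10, 1, 2) .+ reshape(L, 10, 3, 1)  # T[p,m,a] = C[p,a] + L[p,m]
D = dropdims(sum(T; dims = 1); dims = 1)          # 3×2, every entry 20.0
```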

It will always make an intermediate allocation like T, broadcasting before reducing. There is a lazy option which uses LazyArrays.jl to avoid this allocation, but I’m not confident it has kept up with changes in that package, or that it will work with a CuArray.
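If you want to experiment with it, the syntax (in the TensorCast versions I’ve used — treat the option placement as an assumption and check the docs for your version) appends the option after the expression:

```julia
using TensorCast, LazyArrays
# option placement is an assumption; see the TensorCast docs for your version
@reduce D[m,a] := sum(p) C[p,a] + L[p,m] lazy
```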