# TensorCast & CUDA

I’m having problems using TensorCast on `CuArray`s. The following works fine with ordinary arrays:

``````julia
using TensorCast
C = ones(10,2)
L = ones(10,3)
@reduce D[m,a] := sum(p) C[p,a] + L[p,m]

3×2 Array{Float64,2}:
20.0  20.0
20.0  20.0
20.0  20.0
``````

But if I do the same with CUDA arrays, it produces an error:

``````julia
using TensorCast
using CUDA
CUDA.allowscalar(false)
C = cu(ones(10,2))
L = cu(ones(10,3))
@reduce D[m,a] := sum(p) C[p,a] + L[p,m]

ERROR: LoadError: scalar getindex is disallowed
``````

It’s possible to do the CUDA version with `@cast` as an intermediate step, as follows:

``````julia
using TensorCast
using CUDA
CUDA.allowscalar(false)
C = cu(ones(10,2))
L = cu(ones(10,3))
@cast T[p,m,a] := C[p,a] + L[p,m]
D = reshape(sum(T, dims=1), (3,2))

3×2 CuArray{Float32,2,CuArray{Float32,3,Nothing}}:
20.0  20.0
20.0  20.0
20.0  20.0
``````

But it’s not at all clear to me why these would be different, so I have a few questions:

- Is there another way to do this operation with `@reduce` that I’m missing?
- Is there a performance difference between the two approaches (assuming they both worked)?
- Would the intermediate allocation of `T` be happening under the hood anyway?
---

I think the problem is that `D` is being initialized as an `Array`, not a `CuArray`.

Here’s the expansion; the function `orient` is usually just a `reshape` which places the array’s dimensions at the `:`s (with trivial dimensions at the `*`s).

``````julia
julia> @pretty @reduce D[m,a] := sum(p) C[p,a] + L[p,m]
begin
    local dotterel = orient(PermuteDims(C), (*, :, :))
    local ant = orient(PermuteDims(L), (:, *, :))
    D = dropdims(sum(@__dot__(dotterel + ant), dims = 3), dims = 3)
end
``````

Because reshaping a transposed `Array` usually gives something very slow, `orient` sometimes makes a copy, and perhaps here it is incorrectly making a CPU array. I thought this was fixed in https://github.com/mcabbott/TensorCast.jl/pull/10 but perhaps not. Would you mind opening an issue?

Your work-around puts the index to be summed first instead of last, in which case no transposing is required. The macro `@reduce` is unfortunately not smart enough to notice that possibility.
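For comparison, the same index-first broadcast can be written in plain Base/CUDA without TensorCast — just a sketch, assuming `reshape` of a `CuArray` stays on the GPU:

``````julia
using CUDA
CUDA.allowscalar(false)

C = cu(ones(10, 2))
L = cu(ones(10, 3))

# With the summed index p first, reshape (not permutedims) lines up the
# dimensions so that T[p, m, a] = C[p, a] + L[p, m] by broadcasting.
T = reshape(C, 10, 1, 2) .+ reshape(L, 10, 3, 1)   # size (10, 3, 2)
D = dropdims(sum(T; dims = 1); dims = 1)           # size (3, 2), all entries 20.0
``````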

It will always make an intermediate allocation like `T`, broadcasting before reducing. There is an option `lazy` which uses LazyArrays.jl to avoid this allocation, but I’m not confident it has kept up with changes in that package, or that it will work with a `CuArray`.
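If I remember the option syntax right, it would look something like this — untested, and as the caveat above says, it may not currently work on a `CuArray`:

``````julia
using TensorCast, LazyArrays, CUDA

C = cu(ones(10, 2))
L = cu(ones(10, 3))

# Trailing `lazy` asks TensorCast to sum over a lazy broadcast,
# instead of materialising the intermediate array first.
@reduce D[m,a] := sum(p) C[p,a] + L[p,m] lazy
``````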
