TensorCast & CUDA

I’m having problems using TensorCast on CuArrays. When I run the following with normal arrays, it works fine:

using TensorCast
C = ones(10,2)
L = ones(10,3)
@reduce D[m,a] := sum(p) C[p,a] + L[p,m]

3×2 Array{Float64,2}:
 20.0  20.0
 20.0  20.0
 20.0  20.0

But if I do the same with CUDA arrays, it produces an error:

using TensorCast
using CUDA
CUDA.allowscalar(false)
C = cu(ones(10,2))
L = cu(ones(10,3))
@reduce D[m,a] := sum(p) C[p,a] + L[p,m]

ERROR: LoadError: scalar getindex is disallowed

It’s possible to do the CUDA version by using @cast to build an intermediate array, as follows:

using TensorCast
using CUDA
CUDA.allowscalar(false)
C = cu(ones(10,2))
L = cu(ones(10,3))
@cast T[p,m,a] := C[p,a] + L[p,m]
D = reshape(sum(T, dims=1), (3,2))

3×2 CuArray{Float32,2,CuArray{Float32,3,Nothing}}:
 20.0  20.0
 20.0  20.0
 20.0  20.0

But it’s not at all clear to me why these would be different, so I have a few questions:

  • Is there another way to do this operation with @reduce that I’m missing?
  • Is there a performance difference between these (assuming they both worked)?
  • Would the intermediate allocation of T be happening under the hood anyway?

I think the problem is that D is being initialized as an Array, not a CuArray.

Here’s the expansion; the function orient is usually just a reshape that places the dimensions along the :s.

julia> @pretty @reduce D[m,a] := sum(p) C[p,a] + L[p,m]
begin
    local dotterel = orient(PermuteDims(C), (*, :, :))
    local ant = orient(PermuteDims(L), (:, *, :))
    D = dropdims(sum(@__dot__(dotterel + ant), dims = 3), dims = 3)
end

Because reshaping a transposed Array usually gives something very slow, orient sometimes makes a copy, and perhaps here it is incorrectly making a CPU copy. I thought this was fixed in https://github.com/mcabbott/TensorCast.jl/pull/10, but perhaps not. Do you mind opening an issue?
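For intuition, here’s a minimal Base-Julia sketch (using PermutedDimsArray as a stand-in for TensorCast’s PermuteDims) of why reshaping a permuted matrix is the slow step: there is no fast reshape for the wrapper, so Julia returns a lazy Base.ReshapedArray, which is slow to broadcast over — hence the temptation to copy, and a plain-Array copy is exactly what would strand a CuArray on the CPU.

```julia
C = ones(10, 2)                   # stand-in for the real C
P = PermutedDimsArray(C, (2, 1))  # lazy transpose (2×10), like PermuteDims
R = reshape(P, (1, 2, 10))        # no fast reshape exists for the wrapper, so this
                                  # is a Base.ReshapedArray, slow to broadcast over
Rc = copy(R)                      # materializing restores speed — but if the copy
                                  # were made as Array(R), a CuArray would land on the CPU
```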

Your work-around puts the index to be summed first instead of last, in which case no transposing is required. The macro @reduce is unfortunately not smart enough to notice that possibility.
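To spell that out (a sketch in plain Base Julia, not TensorCast’s own code): because the summed index p is already the first dimension of both C and L, the whole computation needs only reshapes, one broadcast, and a dims-wise sum — exactly the CUDA-friendly pattern your @cast version produces. Shown here with ordinary Arrays; wrapping both inputs in cu should behave the same, since no transpose or copy is involved.

```julia
C = ones(10, 2)
L = ones(10, 3)
# p stays in dimension 1 of both arrays, so no transposing is needed:
T = reshape(C, 10, 1, 2) .+ reshape(L, 10, 3, 1)  # T[p,m,a] = C[p,a] + L[p,m]
D = dropdims(sum(T; dims = 1); dims = 1)          # 3×2, every entry 20.0
```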

It will always make an intermediate allocation like T, broadcasting before reducing. There is a lazy option which uses LazyArrays.jl to avoid this allocation, but I’m not confident it has kept up with changes in that package, or that it will work with a CuArray.
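If you want to experiment with it, the syntax (in the TensorCast versions I’ve used — treat the option placement as an assumption and check the docs for your version) appends the option after the expression:

```julia
using TensorCast, LazyArrays
# option placement is an assumption; see the TensorCast docs for your version
@reduce D[m,a] := sum(p) C[p,a] + L[p,m] lazy
```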