I’m having problems using TensorCast on CuArrays. When I run the following with normal arrays, it works fine:
using TensorCast
C = ones(10,2)
L = ones(10,3)
@reduce D[m,a] := sum(p) C[p,a] + L[p,m]
3×2 Array{Float64,2}:
20.0 20.0
20.0 20.0
20.0 20.0
But if I do the same with CUDA arrays, it produces an error:
using TensorCast
using CUDA
CUDA.allowscalar(false)
C = cu(ones(10,2))
L = cu(ones(10,3))
@reduce D[m,a] := sum(p) C[p,a] + L[p,m]
ERROR: LoadError: scalar getindex is disallowed
The CUDA version does work if I use @cast to build the intermediate array and then sum it explicitly:
using TensorCast
using CUDA
CUDA.allowscalar(false)
C = cu(ones(10,2))
L = cu(ones(10,3))
@cast T[p,m,a] := C[p,a] + L[p,m]
D = reshape(sum(T, dims=1), (3,2))
3×2 CuArray{Float32,2,CuArray{Float32,3,Nothing}}:
20.0 20.0
20.0 20.0
20.0 20.0
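For comparison, my understanding is that this @cast workaround amounts to the explicit broadcast below (just a sketch of what I think is equivalent, not something taken from the TensorCast docs), which also runs without scalar indexing:

using CUDA
C = cu(ones(10,2))
L = cu(ones(10,3))
# build T[p,m,a] = C[p,a] + L[p,m] by broadcasting over reshaped arrays
T = reshape(C, 10, 1, 2) .+ reshape(L, 10, 3, 1)
# sum out p and drop the singleton dimension, giving a 3×2 CuArray of 20.0s
D = dropdims(sum(T, dims=1), dims=1)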
But it’s not at all clear to me why the @reduce and @cast versions should behave differently, so I have a few questions:
- Is there another way to do this operation with @reduce that I’m missing?
- Is there a performance difference between these (assuming they both worked)?
- Would the intermediate allocation of T be happening under the hood anyway? (A sketch of how I’d try to check this is below.)
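In case it helps, this is how I would try to inspect what @reduce actually generates (assuming @macroexpand shows enough of the expansion to see where the scalar indexing comes from):

using TensorCast
# print the code the macro produces for the failing expression
@macroexpand @reduce D[m,a] := sum(p) C[p,a] + L[p,m]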