TensorCast & CUDA

Here’s the expansion; the function orient is usually just a reshape, to place the dimensions along the :s:

julia> @pretty @reduce D[m,a] := sum(p) C[p,a] + L[p,m]
begin
    local dotterel = orient(PermuteDims(C), (*, :, :))
    local ant = orient(PermuteDims(L), (:, *, :))
    D = dropdims(sum(@__dot__(dotterel + ant), dims = 3), dims = 3)
end
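Concretely, for the first local above, orient amounts to something like this reshape (a rough sketch with made-up sizes, not the real internals):

C = rand(4, 3)                        # C[p, a] with p = 4, a = 3
x = reshape(permutedims(C), 1, 3, 4)  # roughly orient(PermuteDims(C), (*, :, :))
size(x)                               # (1, 3, 4): a trivial dimension where the * is, ready to broadcast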

Because reshaping a transposed Array usually gives something very slow, orient sometimes makes a copy instead, and perhaps that copy is incorrectly a plain CPU Array. I thought this was fixed in https://github.com/mcabbott/TensorCast.jl/pull/10, but perhaps not. Would you mind opening an issue?
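For illustration (sizes made up): the reshape of a lazy transpose is only a wrapper, which is why a copy gets made at all.

using LinearAlgebra                              # for the lazy transpose wrapper

At = transpose(rand(4, 3))                       # lazy Transpose, no copy made
R  = reshape(At, 1, 3, 4)                        # a ReshapedArray wrapper, slow to broadcast over
Rc = reshape(permutedims(rand(4, 3)), 1, 3, 4)   # copy first: a plain, fast Array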

Your work-around puts the index to be summed first instead of last, in which case no transposing is required. The macro @reduce is unfortunately not smart enough to notice that possibility.
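If I understand the work-around, it amounts to something like this (a sketch with made-up sizes): keeping p first means plain reshapes suffice, and since only reshape and broadcasting are involved I would expect it to stay on the GPU with CuArrays.

C = rand(4, 3); L = rand(4, 5)                   # C[p, a], L[p, m]
c3 = reshape(C, 4, 1, 3)                         # (p, 1, a) -- no permutedims needed
l3 = reshape(L, 4, 5, 1)                         # (p, m, 1)
D = dropdims(sum(c3 .+ l3, dims = 1), dims = 1)  # D[m, a], size (5, 3)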

It will always make an intermediate allocation like T, broadcasting before reducing. There is an option lazy which uses LazyArrays.jl to avoid this allocation, but I’m not confident that it has kept up with changes in that package, or that it will work with a CuArray.
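The idea of the lazy option is roughly this (again just a sketch; I haven’t checked it against current LazyArrays, nor on a CuArray):

using LazyArrays

c3 = reshape(rand(4, 3), 4, 1, 3)          # (p, 1, a)
l3 = reshape(rand(4, 5), 4, 5, 1)          # (p, m, 1)
bc = BroadcastArray(+, c3, l3)             # lazy: the full (p, m, a) array is never materialised
D  = dropdims(sum(bc, dims = 1), dims = 1) # only the reduced result is allocated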
