Help to improve performance of gradient calculation on tensor operations

If you must make slices, then I think using SliceMap (+ JuliennedArrays) or TensorCast will usually be quicker than writing into a Buffer (i.e. a Zygote.Buffer, assuming that is what you mean here).

But can’t you just operate on the whole arrays here? I think what you have written is this:

r[i, j, k] = sum(s) a[s, j, k] * b[s, i, k] / sqrt(sum(s') a[s', j, k]^2) / sqrt(sum(s'') b[s'', i, k]^2)

in which the major operation is just a batched matrix multiplication (with one factor transposed), multiplied by some normalisation factors (which can be applied by broadcasting).
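To make that concrete, here is a sketch (with made-up sizes S, I, J, K) comparing the element-wise formula above against the whole-array version: a transposed matrix product per batch slice, divided by the column norms via broadcasting. The loop over k stands in for a proper batched multiply (e.g. NNlib.batched_mul would do the k loop for you), but it already avoids all the per-element slicing:

```julia
using LinearAlgebra

S, I, J, K = 4, 3, 5, 2          # hypothetical sizes
a = randn(S, J, K)
b = randn(S, I, K)

# Naive version, element by element, following the formula above
r1 = zeros(I, J, K)
for k in 1:K, j in 1:J, i in 1:I
    r1[i, j, k] = sum(a[s, j, k] * b[s, i, k] for s in 1:S) /
                  (norm(a[:, j, k]) * norm(b[:, i, k]))
end

# Whole-array version: r[:,:,k] = b[:,:,k]' * a[:,:,k],
# normalised by the column norms of a and b, broadcast over i and j
na = sqrt.(sum(abs2, a; dims=1))   # size 1×J×K
nb = sqrt.(sum(abs2, b; dims=1))   # size 1×I×K
r2 = similar(r1)
for k in 1:K
    r2[:, :, k] = (b[:, :, k]' * a[:, :, k]) ./ (nb[1, :, k] .* na[1, :, k]')
end

r1 ≈ r2  # the two agree
```

Since every step here is an ordinary matrix product or broadcast, Zygote can differentiate it without any Buffer at all.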
