Hi all,
I am implementing a function to perform a generalization of matrix multiplication to a general N-dimensional array or tensor. This product is denoted as \times_m to multiply a conformable matrix A with a tensor \mathcal{X} according to dimension n.
A working example is given below (note, I already tried several things to make it more performant: @inbounds, efficient looping over array elements etc).
function tensormult4!(X, A, Y, T, dim)
    @assert size(X, dim) == size(A, 2)
    Tmax = last(CartesianIndices(T))
    m, n = size(A)
    @simd for k in 1:n
        fill!(T, zero(eltype(T)))
        @simd for I in CartesianIndices(X)
            @inbounds T[min(Tmax, I)] += A[k, I[dim]] * X[I]
        end
        @simd for I in CartesianIndices(Y)
            if I[dim] == k  # remove this?
                @inbounds Y[I] = T[min(Tmax, I)]
            end
        end
    end
    return Y
end
a working example:
A = rand(4, 4)
X = rand(3, 4, 2)
Y = similar(X)
T = sum(X, dims=2)  # T is a temporary placeholder
tensormult4!(X, A, Y, T, 2)
Code seems to run not too slowly, but I suspect there is quite a bit of low hanging fruit.