Hi all,

I am implementing a function to perform a generalization of matrix multiplication to a general N-dimensional array or tensor. This product is denoted as \times_m to multiply a conformable matrix A with a tensor \mathcal{X} according to dimension n.

A working example is given below (note, I already tried several things to make it more performant: @inbounds, efficient looping over array elements etc).

```
function tensormult4!(X, A, Y, T, dim)
@assert size(X, dim) == size(A, 2)
Tmax = last(CartesianIndices(T))
m, n = size(A)
@simd for k in 1:n
fill!(T, zero(eltype(T)))
@simd for I in CartesianIndices(X)
@inbounds T[min(Tmax, I)] += A[k, I[dim]] * X[I]
end
@simd for I in CartesianIndices(Y)
if I[dim] == k # remove this?
@inbounds Y[I] = T[min(Tmax, I)]
end
end
end
return Y
end
```

a working example:

```
A = rand(4, 4)
X = rand(3, 4, 2)
Y = similar(X)
T = sum(X, dims=2) # T is a temporary placeholder
tensormult4!(X, A, Y, T, 2)
```

Code seems to run not too slowly, but I suspect there is quite a bit of low hanging fruit.