Tensor multiplication

arrays

#1

Hi all,

I am implementing a function to perform a generalization of matrix multiplication to a general N-dimensional array or tensor. This product is denoted as \times_m to multiply a conformable matrix A with a tensor \mathcal{X} according to dimension n.

A working example is given below (note, I already tried several things to make it more performant: @inbounds, efficient looping over array elements etc).

function tensormult4!(X, A, Y, T, dim)
    @assert size(X, dim) == size(A, 2)
    Tmax = last(CartesianIndices(T))
    m, n = size(A)
    @simd for k in 1:n
        fill!(T, zero(eltype(T)))
        @simd for I in CartesianIndices(X)
            @inbounds T[min(Tmax, I)] += A[k, I[dim]] * X[I]
        end
        @simd for I in CartesianIndices(Y)
            if I[dim] == k  # remove this?
                @inbounds Y[I] = T[min(Tmax, I)]
            end
        end
    end
    return Y
end

a working example:

A = rand(4, 4)
X = rand(3, 4, 2)
Y = similar(X)
T = sum(X, dims=2)  # T is a temporary placeholder

tensormult4!(X, A, Y, T, 2)

Code seems to run not too slowly, but I suspect there is quite a bit of low hanging fruit.


#2

This doesn’t run for me at the moment. Can you say more precisely what you want to achieve?

Many such operations can be written with Einsum.jl or TensorOperations.jl, here is one reading of how your Y is related to X and A:

using Einsum
@einsum Y[i,a,k] := A[a,b] * X[i,b,k]  # sum over b=1:4

#3

My package Grassmann.jl is designed to work with high dimensional conformal tensor algebras, but is still a work in progress. Due to sparse multivector representation and design, it outperforms “higher dimensional arrays” in multiple applications, as more sparse methods get implemented.


#4

There was a mistake in initializing T, which is now fixed in the example.

I want to build indeed similar code as Einsum.jl and TensorOperations.jl. I not that my code seems to be faster than @einsum but substantially slower than @tensor.

At the moment, the reason is mainly autodidactic. It is not clear from the manual how TensorOperations.jl is so fast and how I can improve my own code.


#5

Good, now it works. But I think your loops can be much simpler, there is no need for temporary storage T.

Einsum just generates a set of loops, which you can examine by writing @macroexpand @einsum Y[i,a,k] = A[a,b] * X[i,b,k]. (They may have a few let blocks which IIRC were needed for Julia 0.6.).

TensorOperations is much more complicated, but its goal here is essentially to call mul!(A,B,C) on the arrays. You could simulate this by writing reshape(permutedims(X, (2, 1, 3)),size(X,2),:) to make a matrix which you can multiply by A, and then un-doing that at the end. This should be faster on large arrays, BLAS is optimised in lots of clever ways. But for these small ones it seems to be beaten by Einsum:

julia> using TensorOperations, Einsum
julia> ein(X,A,Y) = @einsum Y[i,a,k] = A[a,b] * X[i,b,k]
julia> ten(X,A,Y) = @tensor Y[i,a,k] = A[a,b] * X[i,b,k]

julia> @btime tensormult4!($X, $A, $Y, $T, 2);
  300.207 ns (0 allocations: 0 bytes)

julia> @btime ein($X, $A, $Y);
  72.467 ns (0 allocations: 0 bytes)

julia> @btime ten($X, $A, $Y);
  1.760 μs (42 allocations: 2.66 KiB)

If you want to treat N-dimensional arrays at once, then things will get more complicated… you may be able to write things like reshape(X,size(X,1),:), but you may have to get into ntuple indices and all that, especially if you want an argument dims like this (which may want to be Val(dims) etc.)


#6

I rewrote the function using a reshape rather than for-loops, this gives a better running time, but of course much more memory use. I think it works faster than Einsum, but still much slower than TensorOperations.

function tensormult5!(X, A, Y, dim)

    @assert size(X, dim) == size(A, 2)
    m, n = size(A)
    s = collect(1:ndims(X))
    d = [size(Y)...]
    for i in 1:(2-1)
       s[i], s[i+1] = s[i+1], s[i]
    end
    T = A * reshape(permutedims(X, s), size(X, dim), :)
    Y[:] = permutedims(reshape(T, d[s]...), s)
    return Y
end

So benchmarking gives:

julia> using BenchmarkTools

julia> using TensorOperations, Einsum

julia> ein(X,A,Y) = @einsum Y[i,a,k] = A[a,b] * X[i,b,k]
ein (generic function with 1 method)

julia> ten(X,A,Y) = @tensor Y[i,a,k] = A[a,b] * X[i,b,k]
ten (generic function with 1 method)

julia> X = rand(100, 200, 300);

julia> A = rand(200, 200);

julia> Y = similar(X);

julia> dim = 2;

julia> @btime tensormult5!($X, $A, $Y, $dim);
  63.654 ms (24 allocations: 137.33 MiB)

julia> @btime ein($X, $A, $Y);
  1.277 s (0 allocations: 0 bytes)

julia> @btime ten($X, $A, $Y);
  39.209 ms (328 allocations: 11.88 KiB)

Is reshaping arrays for such applications the most efficient? Because in general I have to copy the complete array so that it corresponds to the right multiplication. I would expect that for-loops might be more efficient because no data is copied.


#7

The trick for performance is having the correct loops.

For example, optimized matrix multiplication is actually done with at least five loops, instead of the naive three:
The three expected loops on a small computational kernel. This kernel is small enough so that the entire output fits on the CPU registers. Eg, create/update 8x6 block blocks at a time on a computer with avx2 (or a 16x14 block at a time with avx512).
Then you loop over this kernel.

You’d need a similar approach for higher dimensional tensors.

Also, I think you can avoid copying data with the BLAS version.

You could take a look at this:
http://tensor-compiler.org

IIRC, someone was working on something like that for Julia.


#8

Reshaping is quite efficient, but permutedims makes a copy. You can experiment with PermutedDimsArray which is its lazy cousin, however if you use that in here:

then it seems like * falls back to some slow-but-generic code, giving times similar to the @einsum version. But using that for copying into Y, or else permutedims!(Y,..., will save you a copy.

As @Elrod says, the efficient BLAS * is much more complicated than 3 loops, for reasons of cache-efficiency. TensorOperations is doing something similar, and you can also use its back-end Strided.jl alone. Adding @strided I think replaces permutedims with a lazy version specific to that package, and likewise replaces * with something a bit more flexible. Both this and BLAS should be multi-threaded; I believe that Strided uses Julia’s threads (e.g. Threads.nthreads()) but don’t swear to it.


#9

Thanks for the hints @mcabbott and @Elrod!

I gather that my code cannot improved much implementation wise, but should be rather optimized at the level of the algorithm.

I tried @strided, but this does not seem to do anything for time and memory use. As far as I understand, this is what makes TensorOperations.jl so fast.


#10

There was a nice talk on this by Peter Ahrens, including the relationship between Tortilla (which doesn’t seem to be public yet) and Taco: https://youtu.be/Rp7sTl9oPNI