Discussion: an operator for mul!


#1

Hi guys!

After my last post related to optimization (Help to reduce number of allocations) I realized how much I can gain if I use mul! to avoid allocation on every loop iteration. However, the code becomes very “unreadable”. Take a look:

        # Pp = Fk_1*Pu*Fk_1' + Q
        mul!(aux1, Fk_1, Pu)
        mul!(aux2, aux1, Fk_1')
        @. Pp = aux2 + Q

        # K = Pp*Hk'*pinv(R + Hk*Pp*Hk')
        mul!(aux3, Pp, Hk')
        mul!(aux4, Hk, aux3)
        mul!(K, aux3, pinv(R + aux4))

        # Pu = (I - K*Hk)*Pp*(I - K*Hk)' + K*R*K'
        mul!(aux1, K, Hk)
        @. aux2 = I18 - aux1
        mul!(aux5, aux2, Pp)
        mul!(aux6, aux5, aux2')
        mul!(aux3, K, R)
        mul!(aux5, aux3, K')
        @. Pu = aux6 + aux5

compared to

        Pp = Fk_1*Pu*Fk_1' + Q
        K  = Pp*Hk'*pinv(R + Hk*Pp*Hk')
        Pu = (I - K*Hk)*Pp*(I - K*Hk)' + K*R*K'

Hence, I am wondering if we could develop a new operator to replace mul!. I thought very little about this but what about:

     Fk_1*->(aux)Pp

to replace

    mul!(aux, Fk_1, Pp)
    aux

(or anything else that can make me write the entire expression on a single line)

Pp .= Fk_1*->(aux1)Pu*->(aux2)Fk_1' + Q

#2

Infix operators with three operands are difficult to do… A nice approach (although purely hypothetical) would be to have the output placed ‘above’ the operator, something like a *ᶜ b i.e. (a *\^c b) for mul!(c, a, b)


#3

Yes, this seems nice! I am not sure how difficult it will be to implement that. Because, the way it is proposed by me and you, I think it will require changes in the parser… Can any dev comment about it?


#4

Check out InplaceOps.jl.

using InplaceOps

@! temp1 = Fk_1 * Pu;
@! temp2 = temp1 * Fk_1';
@. Pp = temp2 + Q;

#5

Or use a macro … :slight_smile: that looks good as well. @Ronis_BR, the modified operator gets parsed right now as a regular operator so the parser would need modifications.


#6

Putting aside the fact that this would be a breaking change, it would also only work for a very small subset of possible variable names — Unicode has very few superscript characters.

I think macros ala InplaceOps are probably the best solution here.


#7

I am working on creating a similar add! and mul! operations for statically typed Grassmann numbers, which are equivalent representations to 2^n-dimensional matrices, so I have been wondering the same thing.

It looks like InplaceOps does what I am looking for, so I only need to make the syntax compatible.


#8

@stevengj Yes; I was actually not aware of InplaceOps…


#9

I was not aware also of InplaceOps. But how could we use those kind of macros so that we can have many operations in one line? Let’s say, from this:

R = A*B*C*D*E*F

to this:

mul!(aux1, A, B)
mul!(aux2, aux1, C)
mul!(aux3, aux2, D)
mul!(aux4, aux3, E)
mul!(R, aux4, F)

that’s why I though something that you can specified the auxiliary variable for each multiplication:

R = A*->(aux1)B*->(aux2)C*->(aux3)D*->(aux4)E*->(aux5)F

But now I think I can create a macro to parse something like this. Shouldn’t be difficult right?


#10

LazyArrays.jl can already do C .= A ⋆ B (or Mul(A, B)) to call non-allocating matrix multiplication. It can even fuse C .= α .* A ⋆ B .+ β into a single routine call without allocation. (Note that you need explicit import using LazyArrays: ⋆; this is probably still an experimental feature)


#11

Here’s another possible solution if you don’t want extra dependancies or a whole change of philosophy. It’s still a little ugly but not terrible I think:

A = [0 1; 1  0];
B = [1 0; 0 -1];
C = [0 0; 0  0];

*̂(A, (B, C)) = mul!(C, A, B) #  *̂ is typed *\hat<TAB>

julia> A *̂ (B, C)
2×2 Array{Int64,2}:
 0  -1
 1   0

One can play around with the ordering or use a different unicode modifier than \hat but that was just a naïve test implementation.

A big problem (shared with mul!) is that its not obvious from a first glance what C is for when you look at A *̂ (B, C).


Edit:
Here’s something that might be a bit more readable, since it is suggestive of the idea that data is flowing into C

struct MulStore{T,N}
    A::Array{T,N}
    B::Array{T,N}
end

*̂(A, B) = MulStore(A, B)
Base.:(=>)(prod::MulStore, C) = mul!(C, prod.A, prod.B)

julia> (A *̂ B) => C
2×2 Array{Int64,2}:
 0  -1
 1   0

The operator precedence works out such that one doesn’t even need to put parens around the product (though its probably still a good idea)

julia> A *̂ B => C
2×2 Array{Int64,2}:
 0  -1
 1   0

Hence, R = A*B*C*D for example becomes

(((((A *̂ B) => aux1) *̂ C) => aux2) *̂ D) => R

Some simple benchmarks I ran suggest this has zero a very small but non-zero runtime performance cost but who knows what happens in a more complicated example.

Another issue is that this breaks the convention of putting a ! in front of mutating functions and it would be considered a ‘pun’ on the pair operator => which is less offensive than type piracy on => but still discouraged (albeit for stylistic reasons).


#12

What you want to do (in general) is to lazily build up a computational graph, run analysis on it to figure out the order to do the multiplications, what temporaries need to be allocated, what specialized kernels that can be used (gemm), and then finally execute the plan for evaluating the graph. This is more complicated than what can be done by just adding an operator so punting on it for packages (like LazyArrays) seems like a good idea.


#13

Maybe some overkill for your use case, but TensorOperations.jl does, since v1.0, provide a cache to store temporaries:

using TensorOperations, LinearAlgebra, BenchmarkTools
N = 100
A = randn(N,N); B = randn(N,N); C = randn(N,N);
D = randn(N,N); E = randn(N,N); F = randn(N,N);
aux1 = similar(A); aux2 = similar(A); aux3 = similar(A); aux4 = similar(A);
R = similar(A);

@btime mul!($R, mul!($aux4, mul!($aux3, mul!($aux2, mul!($aux1, $A, $B), $C), $D), $E),  $F);
  137.656 μs (0 allocations: 0 bytes)

@btime @tensor $R[a,b] = $A[a,1] * $B[1,2] * $C[2,3] * $D[3,4] * $E[4,5] * $F[5,b];
  147.277 μs (68 allocations: 3.31 KiB)

So some overhead (those allocations are not for the temporaries, they are independent of N), but pretty close.


#14

I’d think this probably the nicest solution here, though for the uninitiated I think using numbered contracted indices its probably a bit confusing. I think it’s more clear to write

@tensor R[a,b] = A[a,i] * B[i,j] * C[j,k] * D[k,l] * E[l,m] * F[m,b];

as i,j,k,l,m are more sensible dummy indices (at least to my eye) than number literals.

It’s also worth mentioning that @tensor performs poorly compared to the direct nested mul! solution for small N. For N=2, doing it with @tensor around 50x slower on my machine than nestedmul!and 20 times worse than regular*` with temporaries.

It doesn’t beat * on my machine until N=5, at which point it happens to be 4x better than standard * and 4x worse than nested mul!


#15

Note that this doesn’t seem to work for what the OP wants which is multiple nested multiplications: https://github.com/JuliaArrays/LazyArrays.jl/issues/15


#16

Sure, here is a prototype which transforms the code

macro mul!(expr)
    ((typeof(expr) ≠ Expr) | (expr.head ≠ :(=)) | (expr.args[2].args[1] ≠ :*)) && (return expr)
    length(expr.args[2].args) == 3 && (return :(mul!($(expr.args[1]),$(expr.args[2].args[2:3]...))))
    out = []
    push!(out,:(mul!(aux1,$(expr.args[2].args[2:3]...))))
    k = length(expr.args[2].args)-2
    for j ∈ 2:k-1
        push!(out,:(mul!($(Symbol(:aux,j)),$(Symbol(:aux,j-1)),$(expr.args[2].args[j+2]))))
    end
    push!(out,:(mul!($(expr.args[1]),$(Symbol(:aux,k-1)),$(expr.args[2].args[k+2]))))
    return Expr(:block,out...)
end

which produces the code example you provided above

julia> @macroexpand @mul! R=A*B*C*D*E*F
quote
    (Main.mul!)(Main.aux1, Main.A, Main.B)
    (Main.mul!)(Main.aux2, Main.aux1, Main.C)
    (Main.mul!)(Main.aux3, Main.aux2, Main.D)
    (Main.mul!)(Main.aux4, Main.aux3, Main.E)
    (Main.mul!)(Main.R, Main.aux4, Main.F)
end

You could make assignments for aux1,aux2,... inside the macro also, if you wanted


#17

Thanks! I can use your code as a start point :slight_smile:

What I am thinking is to also have something like:

@mul! for k = 1:1000
...
end

Then I create a special multiplication symbol that will be replaced by mul! and a auxiliary variable will be declared outside the for to avoid unnecessary allocations.