# Compiler matmul optimization prevents constant folding?

I have a use case where I iterate over a container of small vectors. Say vs is a Vector{SVector{2, Float64}}, M is a global constant SMatrix{2,2,Float64,4}, and an illustrative use case is

function f(vs)
    total = zero(eltype(vs))
    for v in vs
        total += v + M*v + M*M*v
    end
    return total
end


To my surprise, the compiler does not fold in the constant M*M, as it right-associates the expression to M*(M*v) according to

*(A::AbstractMatrix, B::AbstractMatrix, x::AbstractVector) = A * (B*x) # from matmul.jl


Specifying (M*M)*v explicitly fixes the problem, and invokes StaticArrays's multiplication definitions. Is this the desired behavior, or should I open an issue? (I set aside that had M not been a global constant, M*M would still be a loop invariant.)

This is expected. Sure, the compiler could always be smarter about this, and maybe it should, but it's pretty normal.

We'd basically need a transformation that can re-associate multiplications depending on what it knows about constants.

(M*M)*v will give different answers than M*(M*v), so the compiler canâ€™t be allowed to arbitrarily reorder them without fast math flags.
So some default order has to be picked for chained operations.
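To make the non-interchangeability concrete, here is a minimal sketch (values are hypothetical and chosen to overflow deterministically, so the difference does not depend on rounding luck):

using LinearAlgebra

# 1×1 matrices and a length-1 vector, chosen so A*B overflows.
A = [1e308;;]
B = [10.0;;]
x = [0.1]

left  = (A * B) * x   # A*B overflows to [Inf] first, so the result is [Inf]
right = A * (B * x)   # B*x == [1.0] exactly (10.0*0.1 rounds to 1.0), so no overflow

# Per the matmul.jl method quoted above, the chained form A*B*x
# right-associates and matches `right`, not `left`.
chained = A * B * x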

Note that it still wonâ€™t do this even for types with truly associative multiplication.


I do take issue with this definition. * is defined to be left-associative (EDIT: dubious, see discussion below). There is an obvious performance reason for the above definition: a matrix multiplication chain with a vector at one end is always efficiently (optimally?) reduced from the vector end. However, there are also obvious performance reasons for @fastmath transformations, but we don't use those without explicit permission because they violate defined semantics.

I think a bug report should be opened against the above right-associative function since it is invoked from the plain syntax A*B*x.

For your literal issue, I would use the total += v + M*v + (M*M)*v definition since it does the compile-time M*M calculation you want. However, with a slightly broader perspective, I would write your expression as total += v + M*(v + M*v). This is the same number of operations as your (M*M)*v version after constant folding, but is usually slightly better behaved, numerically. It also doesn't require constant-folding or storage of M*M.
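A quick numerical check of this Horner-style rewrite, using plain Arrays and hypothetical values in place of the original SMatrix/SVector setup:

M = [0.5 0.1; 0.2 0.4]   # hypothetical stand-in for the constant SMatrix
v = [1.0, 2.0]

expanded = v + M*v + (M*M)*v   # original form, with explicit (M*M)
horner   = v + M*(v + M*v)     # rewrite: same operation count, no M*M needed

expanded ≈ horner   # agree up to floating-point rounding

The two forms are algebraically identical, so they match to rounding error; the rewrite just factors M out of the last two terms.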


If the matmuls inline, it could in theory, but I'm guessing it is too complicated for LLVM to pull off.

But a higher-level transform would be cool.
It'd be great if more high-level, array-compiler-style transforms were possible.

Do you have a source for that? I've never seen such a claim. In fact, the docstring for *(A, B::AbstractMatrix, C) explicitly says it'll choose whichever multiplication order results in fewer operations:


*(A, B::AbstractMatrix, C)
A * B * C * D

Chained multiplication of 3 or 4 matrices is done in the most efficient sequence, based on the sizes of the arrays. That is, the
number of scalar multiplications needed for (A * B) * C (with 3 dense matrices) is compared to that for A * (B * C) to choose which
of these to execute.

If the last factor is a vector, or the first a transposed vector, then it is efficient to deal with these first. In particular x' * B
* y means (x' * B) * y for an ordinary column-major B::Matrix. Unlike dot(x, B, y), this allocates an intermediate array.

If the first or last factor is a number, this will be fused with the matrix multiplication, using 5-arg mul!.


It wouldn't surprise me if we falsely promised left-associativity somewhere in the manual though.


Yes, I suppose my question boils down to what is assumed to be the default associativity (and perhaps what transformations are feasible when fastmath is permitted). @mikmoore for my particular use case, that's an elegant solution! Though I think for code readability and maintainability (there are quite a few geometry gymnastics in my full case) I'd stick with (M*M). (If it's of interest, if I remember right the performance difference was on the order of 10%, and that's when the inner loop does considerably more work than the example I used here.)

https://docs.julialang.org/en/v1/manual/mathematical-operations/#Operator-Precedence-and-Associativity

The linked page describes multiplication as left-associative. However, it includes the footnote

The operators +, ++ and * are non-associative. a + b + c is parsed as +(a, b, c) not +(+(a, b), c). However, the fallback methods for +(a, b, c, d...) and *(a, b, c, d...) both default to left-associative evaluation.

So it makes the exception for parsing but doesn't actually state any situations where exceptions are made. The wording of "by default" might suggest that unless a library defines an exception, they do not exist. However, matrix-chain multiplication is part of the standard library and is loaded by default as part of mere Base. The docstring for * with matrix arguments would appear to dictate such an exception.

This is to say that there appears to be some mild disagreement within the docs. The more specific documentation of ?* has an argument to take precedence, however.

M*(M*v) requires far fewer operations than (M*M)*v, assuming M is a square N×N matrix and v is a vector: O(N^2) ops instead of O(N^3).
The problem is, if M is an immutable global constant, we could calculate M2 = M*M ahead of time.
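The operation-count gap can be sketched with rough multiply counts, assuming naive (non-Strassen) algorithms:

# Approximate multiply counts for N×N matrices and length-N vectors:
matmat(N) = N^3        # N×N times N×N
matvec(N) = N^2        # N×N times length-N vector

left_assoc(N)  = matmat(N) + matvec(N)   # (M*M)*v: one matmat, then one matvec
right_assoc(N) = 2 * matvec(N)           # M*(M*v): two matvecs

left_assoc(100), right_assoc(100)        # (1_010_000, 20_000)

Of course, if M*M is hoisted out of a loop (by constant folding or by hand), its N^3 cost is paid only once, which is the whole point of the thread.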

Ah I see, yeah that's highly misleading, because it's talking about associativity at the parser level, not about what multiplication actually means or what gets executed as an optimization. We should probably clarify that section to make it clearer that methods like *(args...) can still choose to re-associate however they like.

(@mikmoore I suppose constant-folding (I + M + M*M)*v is the performance-optimal logical conclusion, if not the most well-behaved? Admittedly, even for associative types, that's a much broader optimization to expect the compiler to recognize.)

Yeah I understand that. I was just explaining that it was explicitly documented in the docstring for * that it was not in fact left-associative; it would instead choose whichever associativity it thinks has the fewest operations.

I think having a compiler pass that can lift the M*M out of the loop would qualify as meeting that semantic because it does result in fewer runtime operations, even if implementing that pass might be hard.

Yes, although I would avoid getting too dependent on constant-folding since this is the worst way to perform this calculation when constant-folding is not possible. I'd recommend instead calculating I + M + M*M as a global constant and applying it directly.
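A minimal sketch of that hoisting, with hypothetical values and names (the original post uses an SMatrix constant; plain Arrays are used here for portability):

using LinearAlgebra   # for the UniformScaling I

const M = [0.5 0.1; 0.2 0.4]   # hypothetical stand-in for the constant matrix
const P = I + M + M*M          # computed once at load time; no folding required

# P*v matches v + M*v + M*(M*v) up to rounding, so the loop body
# reduces to a single matvec. `g` is a hypothetical illustration of f.
g(vs) = sum(v -> P*v, vs)

This trades one stored matrix for a guaranteed single multiply per iteration, with no reliance on the compiler recognizing the invariant.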

My issue with this is that it's getting harrowingly close to @fastmath-style reassociations. Some algorithms are dependent on things like a*b*c being evaluated in a specific order due to under/overflow concerns or for maintaining precision. There's a branch in our current ComplexF64 multiply that takes specific pains to ensure that 3 Float64s are multiplied in a viable order.

I know this isn't being literally proposed for IEEEFloat operands. But when those operands are 1×1 or diagonal matrices of floats we expose the same vulnerabilities. This isn't literally protected by IEEE754 semantics but it still makes me uneasy.

Is it feasible to restrict this transformation to associative types? In the opposite case of e.g. eltype(M) an integer, rational, etc., this optimizes performance while also improving stability, if the subsequent multiplication is with real numbers in v. While not knowing much about compiler inferability, I'd suppose being aware of the data type is easier than analyzing invariants.
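For exact element types the two association orders really are bit-identical (overflow aside), which is what would make such a restriction safe. A small check with Int matrices:

A = [1 2; 3 4]
B = [5 6; 7 8]
x = [1, 1]

# Integer + and * are associative (modulo overflow), so both
# association orders of the chain give exactly the same vector.
(A*B)*x == A*(B*x)   # true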

Fastmath style reassociations are already the reality for matrix multiplication. BLAS implementations already do lots of re-association by default. The situation with linear algebra is very different from the situation with scalar code:

using StaticArrays
using LinearAlgebra  # for norm below

# Abstractly-typed field, so generic fallback matmul is used:
struct MyAbstractArray{T, N} <: AbstractArray{T, N}
    A::AbstractArray{T, N}
end;
Base.size(A::MyAbstractArray) = size(A.A);
Base.getindex(A::MyAbstractArray, inds...) = getindex(A.A, inds...);

julia> let N = 3
           A = randn(N, N)
           B = randn(N, N)
           C = randn(N, N)

           sA = SMatrix{N, N}(A)
           sB = SMatrix{N, N}(B)
           sC = SMatrix{N, N}(C)

           norm(A * B * C) === norm(sA * sB * sC)
       end
false

julia> let N = 100
           A = randn(N, N)
           B = randn(N, N)
           myA = MyAbstractArray(A)
           myB = MyAbstractArray(B)

           sum(A * B) === sum(myA * myB)
       end
false


If I recall correctly, even if you hit the exact same method, youâ€™ll get a different answer depending on your machine capabilities, like if you hit AVX-512, versus AVX2, or if you have different cache sizes that cause different tiling strategies.

Yes, but these are done at the matrix-matrix level. Here I am arguing for consistent associativity at the expression level. I'm arguing for only([x;;] * [y;;] * [z;;]) == x*y*z == (x*y)*z, which is still compatible with internal reassociations of matrix-matrix multiplication.

I think that writing A * (B * x) to force right-associativity is less onerous than hoping (or counting on the fact) that A * B * x quietly right-associates. It's also less surprising when it does.

The overwhelming majority of the time, people are simply applying floating point operations to implement math-on-reals and haven't given thought to this and they won't notice or care if we re-order stuff for speed. But in very few cases people really do care and this reassociation isn't documented well enough (in my opinion) that people can be aware of it.

Also, I'm reminded of #52333 and the hazard identified there. In that spirit, see this example:

julia> using Octonions

julia> x = Octonion(1,2,3,4,5,6,7,8); y = Octonion(1,-2,3,4,5,6,7,8); z = Octonion(1,2,-3,4,5,6,7,8);

julia> [x;;] * [y;;] * [z;;]
1×1 Matrix{Octonion{Int64}}:
 Octonion{Int64}(-652, -1064, 612, -736, -860, -1128, -1036, -1712)

julia> [x;;] * [y;;] * [z;] # re-association changes the value
1-element Vector{Octonion{Int64}}:
 Octonion{Int64}(-652, -1064, 612, -736, -1148, -888, -1420, -1376)


Reassociation outside of very controlled contexts is a significant hazard.

I think N-ary * is a failed experiment, or very close to one. By association, I'm worried about N-ary + and ++ as well, although those operations are more commonly associative.

Yes, this seems misleading. Imho, the situation is similar to reduce, which claims:

help?> reduce
search: reduce mapreduce

reduce(op, itr; [init])

[...]

The associativity of the reduction is implementation dependent. This means
that you can't use non-associative operations like - because it is undefined
whether reduce(-,[1,2,3]) should be evaluated as (1-2)-3 or 1-(2-3). Use
foldl or foldr instead for guaranteed left or right associativity.


Maybe + and * should be documented similarly (which they are in a sense, but could be misunderstood) and clearly state: "When left- or right-associativity is required, use brackets, i.e., a + (b + c) or (a + b) + c instead of a + b + c".
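For comparison, this is exactly the contract foldl and foldr already provide for reductions, where the association order is pinned down and observable with a non-associative op:

# reduce leaves associativity unspecified; foldl/foldr pin it down:
foldl(-, [1, 2, 3])   # (1 - 2) - 3 == -4
foldr(-, [1, 2, 3])   # 1 - (2 - 3) == 2

The analogous wording for + and * would tell users to write the brackets themselves whenever they need one specific order.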