Paper for discussion: The Linear Algebra Mapping Problem

5 Likes

In this paper, the authors say:

However, it does not make use of the Cholesky factorization for SPD matrices nor of the Bunch Kaufman decomposition for symmetric matrices

But according to the documentation of LinearAlgebra.factorize, Julia does use Bunch-Kaufman for dense symmetric matrices. Did they make a mistake here?
What interests me is section “4.7. Loop-Invariant Code Motion”: it seems no language can move the constant assignment (that is, M = A*B) out of the loop. This is weird; I always thought Julia could somehow do this. But I’m not too surprised, because * is not pure:

julia> which(*,(Matrix,Matrix))
*(A::AbstractArray{T,2} where T, B::AbstractArray{T,2} where T) in LinearAlgebra at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/LinearAlgebra/src/matmul.jl:142
julia> ans.pure
false

#edit: This issue is related: https://github.com/JuliaLang/julia/issues/29285
Currently, it seems Julia has no proper support for LICM (loop-invariant code motion, which is exactly what this section is about).
But Julia can reuse the memory, so at least it won’t allocate a new memory location in every iteration.

Here was my comment on the paper on HN:

https://news.ycombinator.com/item?id=21604036

1 Like

I may have missed something, but the paper seems to do 3 things:

  1. Claim that some kind of algebraically equivalent rearrangement would be advantageous. As pointed out by @StefanKarpinski, this is not very clear. Even for something as simple as real numbers, one can easily get into trouble with this for floating point arithmetic (or conversely, improve numerical accuracy by making a formula seemingly more complicated).

  2. Show that optimal common subexpression elimination is NP-complete. This is theoretically interesting, but practical problems are quite small and could be brute forced (assuming, of course, that this is desirable).

  3. Perform “experiments” which mainly verify that languages which don’t manipulate your matrix algebra don’t manipulate your matrix algebra. This mostly benchmarks whatever BLAS routine was called. Why is this necessary? E.g. Table 5 follows from the fact that most languages won’t replace your inv(A) * b with A \ b. It is unclear why benchmarking would be the best tool to reveal this.
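For readers who haven’t seen the inv(A) * b vs. A \ b point in code, here is a minimal sketch (the matrix and vector values are made up for illustration):

```julia
using LinearAlgebra

A = [4.0 1.0; 1.0 3.0]  # a small SPD matrix (illustrative values)
b = [1.0, 2.0]

x_inv  = inv(A) * b  # forms the explicit inverse: slower and less accurate
x_ldiv = A \ b       # solves via a factorization, as `factorize` would choose

x_inv ≈ x_ldiv  # both solve Ax = b; the `\` form is preferred
```

For a well-conditioned 2×2 system the two agree, but `\` avoids forming the inverse and dispatches to a suitable factorization, which is why the mapping inv(A) * b → A \ b is the transformation the paper tests for.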

FWIW, I have yet to see an inv(A) * b in Julia code in the wild (if you see one, please don’t single out the package here, just make a quiet PR).

5 Likes

I agree that the language should not do arithmetically nontrivial transformations behind the user’s back. Concerns about floating-point arithmetic are important and should be handled properly. However, leaving it at that misses two important points: optimizing this type of code is not trivial for non-experts (and can be tiresome even for experts), and floating-point arithmetic is irrelevant for 90% of applications. I’m not an expert in the history of computing, but it seems to me that in the competition between abstraction and control, abstraction has consistently won, and denying this is dangerous.

I think Julia’s metaprogramming facilities make it well placed to tackle this. I.e. one could have a macro like @fastmath that would enable higher-level optimizations: you’d write the math formulas (which would be valid but slow Julia code) and the macro would rewrite them into mul! and ldiv! calls. LazyArrays.jl has started doing that.
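As a sketch of the kind of rewrite such a macro could perform (the names naive and rewritten! are hypothetical, not LazyArrays.jl API):

```julia
using LinearAlgebra

# What the user would write: clear, but allocates two temporaries.
naive(A, B, C) = A * B * C

# What a rewriting macro might emit: caller-provided workspaces, no
# allocation in the hot path. `tmp` and `out` are illustrative names.
function rewritten!(out, tmp, A, B, C)
    mul!(tmp, A, B)    # tmp = A*B, written in place
    mul!(out, tmp, C)  # out = (A*B)*C, written in place
    return out
end

A, B, C = rand(3, 3), rand(3, 3), rand(3, 3)
tmp, out = similar(A), similar(A)
rewritten!(out, tmp, A, B, C) ≈ naive(A, B, C)  # same result, fewer allocations
```

The point is that both versions are valid Julia; the macro’s job would only be to mechanically thread the preallocated buffers through the formula.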

FWIW, I have yet to see an inv(A) * b in Julia code in the wild (if you see one, please don’t single out the package here, just make a quiet PR)

Keep in mind the huge selection bias here. The majority of users of julia are not package authors.

6 Likes

I very much agree with @StefanKarpinski’s HN comment: Julia is fundamentally an imperative language that is compiled non-interactively. You tell the computer what to do, instead of telling it what you want.

On the other hand, there definitely is space for a DSL for linear algebra problems. The user would write some code involving linear algebra and external functions, and provide sample input (permitting the optimizer to figure out tensor sizes, aliasing and escaping properties, which matrices happen to be symmetric, and which matrices are ill-conditioned and which are harmless). The DSL processor would then compile this into an optimized sequence of BLAS calls. Then, the user would look at the result and hopefully learn how to LICM and how to correctly use BLAS by hand. Basically, training wheels for novices.

Note that this uses very different assumptions from usual compiler optimizations: it is an interactive process, and the optimizing compiler doesn’t need to be correct, because a human will check its output anyway. It just needs to produce human-readable output, crucially including the noalias and noescape assumptions it made. (Noescape assumptions: say that inside some loop, an external function is used on a matrix. The compiler would like to hoist memory allocation for temporaries out of the loop, i.e. allocate a temporary outside the loop and always overwrite it inside. This is only valid if the external function doesn’t leak a pointer. Non-interactive compilation requires either onerous borrow checkers à la Rust, or requires the compiler to reason through the function. Interactive compilation just needs the compiler to figure out that it matters whether that function leaks pointers, and ask the user.)

Julia is certainly a nice language to implement such an optimizer in, and julia is possibly a nice compile target for it.

8 Likes

I am not sure what “optimizing” or “experts” mean in this context, but common subexpression elimination is something that most people find easy to do without any kind of special expertise.

As for the rest, a lot of practical knowledge about numerical matrix algebra has become common even in field-specific advanced undergrad textbooks (basically: think about conditioning, know your basic decompositions, and exploit properties of the problem such as symmetry or sparsity).

I am not sure about this, especially for people who are likely to use linear algebra in any form.

2 Likes

Both are wrong.

LLVM does extensive LICM. Regarding the linked issue, the nuclear option would be to expose a Base.@superpure macro that attaches some combination of readnone, nounwind, willreturn, speculatable.

On the other hand, reusing memory is very hard for julia, due to garbage collection.

Reusing memory of giant objects is important, and would likely profit a lot from ref-counting: Once a giant object goes out of scope, it gets immediately added to a pool, and new giant object allocations attempt to grab from the pool first. This would make e.g. the second temporary in A = B*C*D; E = F*G*H; re-use the memory from the first, if sizes match.

Once gc runs, it can free the giant object pool (preferably using MADV_DONTNEED on linux: That way, we only pay again for allocation when the OS starts garbage collecting; full collections could finally free all the MADV_DONTNEED pools if desired).

AFAIU, refcounting would be pretty hard for us to support.

2 Likes

But I did see a mul! in the LLVM code. What I mean is that M = A*B will first allocate space for M, then in each iteration just write A*B to this space, instead of freeing the memory and rebinding M to a newly generated array. And I know reusing memory is generally hard for Julia when the memory regions are different.
And I know LLVM supports LICM, but that does not mean Julia currently supports it properly. (Well, at least you would need to write a pass, ensure some functions are pure, and then pass this information to LLVM.) That’s what the issue is about.
But I agree with you that LICM is easier to achieve than memory reuse.

floating-point arithmetic is irrelevant for 90% of applications
I am not sure about this, especially for people who are likely to use linear algebra in any form.

Sorry, that was phrased a bit provocatively. For 90% of applications, it doesn’t matter what you do: either your problem is well-conditioned and anything you do will work, or it’s ill-conditioned and you’re doing something wrong anyway.
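To put a rough number on “ill-conditioned” (a toy example of mine, not from the thread):

```julia
using LinearAlgebra

A = [1.0 0.0; 0.0 1e-12]  # singular values 1 and 1e-12
cond(A)  # ≈ 1e12: a solve with A can lose roughly 12 decimal digits
```

With a condition number near 1e12 in Float64, no amount of clever rearrangement rescues the solve; with a condition number near 1, essentially any reasonable formulation works.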

Could you explain why?

Your problem is that a matrix multiplication is not hoisted out of loops?

Loops that contain hoistable expensive operations are PEBKAC. The job of non-interactive optimizing compilers is to make it less onerous to write fast code, by taking care of obvious simple or low-level optimizations, and by making it simpler to support high-level languages (i.e. optimize away the cost of many high-level abstractions). Once you get down to “small” operations, it is good to have an optimizing compiler. Deciding whether an integer deserves to be hoisted (i.e. put into a register and kept) or should be spilled (put on the stack and loaded inside the loop) is something for a machine to do. Deciding whether to hoist a multi-millisecond matmul is too important for machines, that’s a job for humans.

The correct course of action for hoistable matrix multiplications is to educate the user to not write crappy code.

Don’t get me wrong, training wheels for unsophisticated users are super important! But I think that attempting to automatically optimize such big-picture things without human oversight is a fool’s errand.

We would get two kinds of object references: refcounted and garbage collected, and have two separate memory management schemes that interact. Giant objects would be refcounted. That’s certainly nontrivial. But you’re right, I was overly pessimistic: It is not obvious why approximate refcounting of some objects (e.g. Array) would be harder in julia than in other languages (looking at C++ / Rust).

2 Likes

Your problem is that a matrix multiplication is not hoisted out of loops?

But this is exactly what the paper wants to test (I point this out because it is an interesting section of the paper). Let me make it clearer (I am not a native English speaker, so it might be my mistake):

Original code(A user might write):

for i in 1:n
    M = A*B
    X[i,i] = M[i,i]
end

An ideal code after LICM (what the author expects):

M = A*B # M is computed only once
for i in 1:n
    X[i,i] = M[i,i]
end

That is, a matrix multiplication should be hoisted out of the loop (it’s a pure operation, and M doesn’t change across iterations, so why can’t it be hoisted?)
And currently what we get in Julia:

M = ... # preallocate an array similar to A*B
for i in 1:n
    mul!(M, A, B) # A*B is still computed in every iteration, but at least we don't
                  # allocate a new array each time (note M .= A*B would still
                  # allocate a temporary for the right-hand side)
    X[i,i] = M[i,i]
end

As for memory reuse, since Julia’s structs are opaque, I really wonder how a programmer can control this. Since we are not supposed to inspect the memory layout of a struct, it’s unsafe for us to reinterpret an array of one struct type as an array of another. So we can’t do this manually. And even if the compiler can do it for us, can we rely on this behavior? Since we don’t know whether their sizes match (because they are opaque, unless they have the same types), how can we tell whether the compiler will do this or not? Thus I guess this opaque abstraction will eventually break down when considering memory reuse.

Hoisting the matrix multiplication out of the loop is a change in complexity class.

Relying on compiler optimizations to change complexity class is extremely bad style: The complexity class of an algorithm should be apparent. It is not the job of the compiler to silently fix terrible terrible user code; it is the job of a linter to provide training wheels to help users become less incompetent.

Julia structs are not opaque. They have C layout and this is guaranteed by the language.

The only reinterpretation issue is that we want to forbid aliasing of differently typed arrays; i.e. we want to inform LLVM that accesses to Vector{Int} and Vector{Float32} cannot alias.
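For instance, an isbits struct has a fixed, C-compatible layout, so reinterpreting is well defined (Point is a made-up example, not from the thread):

```julia
struct Point  # two Float64 fields, laid out like the equivalent C struct
    x::Float64
    y::Float64
end

v = [Point(1.0, 2.0), Point(3.0, 4.0)]
flat = reinterpret(Float64, v)  # a view of the same memory as four Float64s
collect(flat) == [1.0, 2.0, 3.0, 4.0]  # true: the layout is guaranteed
```

This is exactly the kind of reinterpretation that would be unsafe if struct layout were opaque, and it is supported precisely because it isn’t.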

If we went the approx ref-count way, there would be no issue at all; the simple optimization would be that the allocator holds pools for recently freed giant buffers (which are already madvise-pseudo-freed), and reuses them if a request for a matching size comes in. That would only happen for e.g. > 8MB arrays. It would save the kernel the work of faulting in and zeroing the new giant allocation; but user code that manually reuses memory would still be preferable. Today, giant allocs are punted to libc, that typically punts to the OS kernel by requesting a new memory mapping; small allocs go through julia’s pool allocator, and medium allocs go into libc’s pools. In theory, the OS kernel should be able to do that for us, under the assumption that userspace is always stupid. However, as long as we purely rely on GC for freeing giant arrays, there is no way of optimizing this out, because these giant arrays are freed lazily / late instead of eagerly.

5 Likes

I agree that it doesn’t make sense for the language to hoist an explicit matmul in a case like that. One scenario where LICM for that sort of thing seems like it could be useful and beneficial is if the matmul is part of a function call occurring in a loop and not explicitly written by the user. There’s a counter-argument to be made that that’s just a bad API then and the user should be prompted to precompute the matmul outside the loop. Interestingly, that kind of scenario is much more plausible in an ML framework setting, where the common code is not explicitly written by the user and therefore not fixable by linting. However, that’s pretty far beyond the scope of the paper, which specifically ignores ML frameworks that are good at that sort of thing.

3 Likes