Why is LoopVectorization deprecated?

I’ll see if @floop (from FLoops.jl) works better.

I’d recommend OhMyThreads.jl instead for now. In particular, the @tasks macro is probably what you want.

2 Likes

Thanks. The slowdown was a factor of 3.5 with @floop. I’ll give OhMyThreads a shot.

Same level of slowdown with OhMyThreads. Oh well, this exercise was still worth a shot even if VPlan is not working for me.

1 Like

That is consistent with what I saw from vplan.plan.
VPlan really likes to use gathers and scatters for contiguous loads and stores, which is a great way to make code run slower. That could get fixed in a future LLVM release.

It might be interesting to check this on Bump LLVM to v17 by mofeing · Pull Request #53070 · JuliaLang/julia · GitHub.

Interesting. All my testing was with the latest nightly, which is still LLVM 16, I think.

Note also that LLVM 18 was released this past week. Maybe we can skip LLVM 17?

Probably not. The upgrades are always a pain, and doing the 17 upgrade makes the 18 upgrade easier (we are ~95% of the way to 17).

1 Like

I disagree on some of this. IMO, it is more maintainable to have one implementation than to have one for Matrix{T} and another for Adjoint{T,Matrix{T}}, and then the Cartesian product of this across all arguments.
Most people also often get the optimal loop ordering wrong, as simple heuristics like “fastest index in the innermost loop” are often sub-optimal if you have a good enough vectorizer (LLVM’s LoopVectorizer is not good enough).
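To make the maintenance point concrete, here is a minimal sketch (colsum! is a hypothetical kernel, not from any package; Transpose is used rather than Adjoint just to sidestep conjugation):

using LinearAlgebra

# One generic implementation: sums each column of A into out.
function colsum!(out, A::AbstractMatrix)
    fill!(out, zero(eltype(out)))
    for j in axes(A, 2), i in axes(A, 1)  # i innermost: contiguous for Matrix
        out[j] += A[i, j]
    end
    return out
end

# Hand-specializing on layout instead: for a Transpose, the “same” loop should
# run in the opposite order to traverse the parent contiguously, and every new
# wrapper (in every argument position) multiplies the number of such methods.
function colsum!(out, A::Transpose{<:Any,<:Matrix})
    P = parent(A)
    fill!(out, zero(eltype(out)))
    for i in axes(P, 2), j in axes(P, 1)  # j innermost: contiguous for P
        out[j] += P[j, i]
    end
    return out
end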

Where does the line start/stop with what you want happening at source level, and what you don’t?
Vectorization and unrolling are highly target dependent, so I guess that is where you would draw the line?
In that case, you need to be able to specify this at source level, e.g. with @unroll_and_jam, @vectorize.
Should the source code indicate the order in which the unrolls are “nested”, or should the code generator do this?
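For reference, Base already ships one such source-level annotation: @simd, which asserts that a loop’s iterations may be reordered and evaluated in SIMD batches. A minimal example (mysum is just an illustrative name):

# @simd promises the compiler that reordering this reduction is acceptable.
function mysum(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end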

I like imperative code.
It is often quite readable to describe what you want by example. Saying “chair” while pointing to a chair will generally get the idea across faster than describing the abstract idea of a chair.

The idea is to be strict about what is observable, so that we get a lot of leeway under the “as if” rule and the imperative code can be treated declaratively. I want simple, easy-to-read, maintainable, generic code to perform optimally.
(Note: while LV does handle PermutedDimsArray, Adjoint, etc. correctly, it can only handle primitive eltypes like Float64 and Int16 – so it falls far short of the generic ideal.)

But as someone hypothetically developing a loop and linear algebra optimization library, I do have use both for the ability to manually specify transforms (to quickly try out what really is fastest) and, of course, for introspecting what it is actually doing (note: LV does have some of these features, but they aren’t documented).

I suspect most people will want these features a lot less than they initially think they do, but I would be happy to provide them (as I’d use them a lot myself).

9 Likes

I want the performance characteristics of the code to be legible to humans. I want the code to be able to serve as an exemplar, to teach its readers how to achieve similar goals. I want to minimize the amount of mandatory background knowledge that readers have to bring along. I want the code to be explicit and adaptable.

Imagine for a second a naive matmul code:

function matmul!(dst, A, B)
    #imagine some boundschecking here
    for i = 1:size(dst, 1)
        for j = 1:size(dst, 2)
            tmp = zero(eltype(dst))
            for k = 1:size(B, 1)
                tmp += A[i, k] * B[k, j]
            end
            dst[i, j] = tmp
        end
    end
    dst
end

Suppose LoopModels has succeeded beyond your wildest dreams – your compiler turns this into a workable almost-competitive matmul, with all the cache-oblivious blocking, vectorization, allocation of temporaries, etc.

I must ask: Is this matmul! good code?

And I say no, it’s not. It is really really bad code backed by a hidden really good compiler/optimizer. It is not legible to people who have not read your compiler. Understanding the performance characteristics of this piece of code starts by reading your compiler, not by reading this code.

And this is not just philosophical. Somebody will use a SafeInt that throws on overflow, and performance will crater, because you cannot reorder / block anymore: LLVM cannot prove that addition/multiplication is side-effect free.

So this matmul! is not generically performant, and seeing it does not help to write a good SafeInt matmul.
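A minimal sketch of what such a type could look like, using Base’s checked arithmetic (SafeInt is hypothetical, matching the name above):

# Hypothetical SafeInt: arithmetic throws an OverflowError instead of wrapping.
struct SafeInt
    x::Int
end
Base.:+(a::SafeInt, b::SafeInt) = SafeInt(Base.Checked.checked_add(a.x, b.x))
Base.:*(a::SafeInt, b::SafeInt) = SafeInt(Base.Checked.checked_mul(a.x, b.x))
Base.zero(::Type{SafeInt}) = SafeInt(0)
# Every += in matmul!'s inner loop can now throw, so the compiler must
# preserve the exact iteration order: no reassociation, blocking, or SIMD.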

Drawing bright lines is always hard. I know it when I see it? There will always be borderline cases. This also depends very much on the intended reader.

I am guilty of this myself – there are some compiler transformations that I mentally auto-apply to the code when reading it, without writing them into the source. And then a colleague with a different background reads my code and stumbles – ideally they ask me, or they just snort and think that my code sucks; worst case, they learn to copy the pattern and apply it in cases where it is not applicable (e.g., a pattern whose performance depends on escape analysis – then you had better know what kinds of patterns confuse the compiler).

PS. Loop duplication / hoisting of constant conditionals is a big example:

while condition
    #lots of code
    if loop_constant_bool
        #do something
    else
        #do something else that allows some refs to escape
    end
    #more code
end

In many cases the compiler transforms that into

if loop_constant_bool
  while condition
    #lots of code
    #do something
    #more code
  end
else
  while condition
    #lots of code
    #do something else that allows some refs to escape
    #more code
  end
end

In most settings, it is OK to let the compiler decide.

Sometimes, this is not OK – for example, if escape analysis / allocation-free-ness depends on it! Then ideally one would either do the transform in source, or have some @hoist_duplicate if loop_constant_bool macro.

I am super guilty of relying on that implicit compiler transformation, to the point that performance characteristics of my code are sometimes hard to read.

5 Likes

One thing I’d like, and plan on, is allowing a change in semantics that makes the contents of the written arrays undefined when the code does throw.
In cases where all checks are functions of the loop induction variables only, this allows us to hoist the checks and throw early.
In cases of functions like log or sqrt that throw for negative numbers, we can still perform SIMD evaluation of batches.
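As a minimal sketch of the hoisting case (scale! and the explicit n are illustrative, not an actual API):

# All checks depend only on the induction variable i, so they can be hoisted:
# throw early, before a single element is written, then run a check-free loop
# that is free to use SIMD.
function scale!(dst, src, n)
    checkbounds(dst, 1:n)  # hoisted checks
    checkbounds(src, 1:n)
    @inbounds @simd for i in 1:n
        dst[i] = 2 * src[i]
    end
    return dst
end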
It would be possible to disable this semantic change, as it could make debugging more difficult, and it may be a strike against my response to:

The result of executing the code without this optimization is more legible than with it (the two agree only in the non-erroring case, hence the need to be able to disable it).
There is often a tradeoff between readability of what the code does, and performance of the code.

Introspecting the output should definitely be made easy.
E.g., a simple -S/-S -emit-llvm or @code_native/@code_llvm away.
This is much easier than actually figuring out what OpenBLAS is doing, which is very well obfuscated and takes considerable detective work to unravel.
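For concreteness, on the Julia side that introspection is just (f is a stand-in function):

using InteractiveUtils  # provides @code_llvm and @code_native

f(x) = 2x + 1
@code_llvm debuginfo=:none f(1.0)    # the optimized LLVM IR for this call
@code_native debuginfo=:none f(1.0)  # the machine code actually generated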

I’d thus argue that your matmul!'s performance characteristics (a) wouldn’t require reading the compiler, and (b) would be much more readable than OpenBLAS.

However, a point Mojo has drilled home is that they don’t rely on magic compiler transforms. They stress the predictability of their performance; vectorization, unrolling, and tiling are all done with templated functions (something that is easy enough to do in Julia or C++; Mojo’s value-add is mostly solving the annoying FFI problem of bridging static and dynamic languages).

I guess this is part of the point you’ve been making, and one that resonates with me:
the endless frustration of compilers silently not applying the optimizations you’re expecting them to and know they are capable of, and wrestling with them to do the right thing. I’ve lost too much of my life to this already.

Loop duplication / hoisting is also called loop unswitching. This regressed with LLVM’s new pass manager and, IIRC, caused a long delay in upgrading, because broadcasting in Julia is heavily dependent on that optimization. Loop unswitching is reliable in simple cases, but you can also get it to fail, e.g. by having a broadcast statement with many matrices (or higher-dimensional arrays).
IMO, broadcasting is a case where it isn’t okay to let the compiler decide (which is why SciML has FastBroadcast.jl, which special-cases the non-broadcasting path to force specialized code for the most common case; if compile times weren’t a concern, one could take the other approach of generating all possible broadcast combinations to force full specialization).
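For reference, FastBroadcast.jl’s entry point is its @.. macro, a drop-in for @. (a sketch; see the package docs for the exact options):

using FastBroadcast

a, b, c = rand(100), rand(100), rand(100)
dst = similar(a)
# Like `@. dst = a + b * c`, but compiles a specialized loop for the common
# case where no size-expanding broadcast actually occurs.
@.. dst = a + b * c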

So I see your point on code needing to somehow provide a guarantee that the intended optimizations are actually being applied.
I’ll have to think about this, but am open to suggestions.

I think it’s important to

  1. Be able to avoid regressions. You don’t want people to feel like they ought to recheck all the native code or LLVM IR when they upgrade a toolchain.
  2. Be able to describe the sort of transforms one expects to be applicable.

Regarding “2.”, one should be able to say, for example, that they expect register tiling, with vectorization of some loop other than the reduction loop. With that, LV would currently fail for matmul!(C, A', B) because it misses an important optimization there; one would want to make sure it’s able to find the optimal patterns.
I think the descriptions can be fairly coarse.
Given preconditions, some optimizations can be extremely reliable, so we don’t need to thoroughly list everything.
For example, if type inference succeeds in Julia, devirtualization almost certainly will, too. Thus, people focus only on the type instability problem, assuming runtime dispatch will be solved with it.
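A small illustration of that rule of thumb (the Shape types here are hypothetical):

abstract type Shape end
struct Circle <: Shape
    r::Float64
end
area(c::Circle) = π * c.r^2
total(xs) = sum(area, xs)
concrete = [Circle(1.0), Circle(2.0)]  # Vector{Circle}: concrete eltype
nonconcrete = Shape[Circle(1.0)]       # Vector{Shape}: abstract eltype
# @code_warntype total(concrete)     # inference succeeds => devirtualized
# @code_warntype total(nonconcrete)  # abstract eltype => runtime dispatch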

23 Likes

Sad to hear that LoopVectorization.jl is to be deprecated if no one helps with maintenance. :(

@Elrod Thank you a lot for all your hard work and quality results!!!

May I ask how this will affect Polyester.jl, Octavian.jl, Tullio.jl?

1 Like

Polyester.jl doesn’t depend on LoopVectorization.jl or VectorizationBase.jl, but I think one of the same problems Julia 1.11 introduces impacts it as well, so we shall see.
Octavian.jl depends heavily on LV, so it is dead without it.
Tullio.jl will still work and be multithreaded, but performance will suffer for a lot of CPU code using it without LV.

2 Likes

Would you mind elaborating on the problems introduced in 1.11? (I can’t see a discussion elsewhere in this thread – apologies if I missed it!)

4 Likes

There is some information spread on GitHub:

2 Likes

This was it, it was on the tip of my tongue! Thanks.

Yep. Reliable and well-understood optimizations are fine to leave to the compiler.

However, this does impose a significant knowledge requirement on all readers: they need to know which optimizations work reliably.

In my day job, I work a lot with the JVM. The interaction between loop unswitching and escape analysis (which enables scalar replacement, aka SROA) is a pain, especially since we cannot conclusively decide whether we’re writing for HotSpot C2, Graal CE, or Graal EE, and on which runtime version (thank god it’s not Java 8 anymore). “Write once, crawl like a snail everywhere except on the specific compiler version the author had in mind, then run” :(

The difference is that people know that they don’t know the exact details of what OpenBLAS does. A blackbox with well-understood API boundaries is not a problem. But they may very well get the false impression that they understand the performance of the matmul! (after all, they can read the imperative source code!).

This is different in more declarative languages: When you write a complex SQL query, there is no expectation that anybody can understand its performance characteristics without looking in depth into what the query optimizer of the specific system does with it (indeed, the question is meaningless without all this context!).

3 Likes

I didn’t expect that 1.11 would introduce breaking changes.

From my understanding, Julia only promises non-breaking changes for exported and documented features (“public API”).
LoopVectorization.jl apparently uses a lot of the compiler internals which can change.

12 Likes

The short answer is that I don’t use LV for work, and I don’t really use LV myself except for winning benchmarks to get internet points. I need to draw some hard lines to manage my own time better.
Time management is still a weakness of mine.

I think a successful open source project should be able to survive on its own. LV does not seem to have reached this status.

As such, I have not actually looked into the problem at all.
Odds are, looking into the problem is most of the battle of solving it.
That said, I suspect there are two primary issues:

  1. LV normally avoids passing Array objects. However, in certain circumstances, it may pass them anyway. As of Julia 1.11, LV’s argument-passing approach is no longer valid for Arrays, due to the addition of Memory (see the sketch after this list). This should just require a fix to handle Arrays correctly, e.g. opting out of the decomposition and reconstruction LV does.
  2. LLVM will drop support for typed pointers. This should only require some code changes in VectorizationBase.jl: just move the types from the pointers to the getelementptr and the various load/store instructions. This will reduce the amount of code, as casts are no longer needed. The load/store and getelementptr instructions already needed to know the types of the input and output arguments, so we don’t even need extra information – it’s simply a reduction in complexity. Still, it is a breaking change.
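On the first point, a quick REPL illustration of what changed (these fields are Julia internals on 1.11 and may change; shown only to make the layout change concrete):

# On Julia 1.11+, an Array no longer owns its buffer directly; it wraps a
# Memory object, which is what invalidated LV's argument decomposition.
a = zeros(Float64, 4)
a.ref      # MemoryRef{Float64}: a reference into the underlying buffer
a.ref.mem  # Memory{Float64}: the buffer itself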

I was told LV segfaults when using an assertions-enabled build of LLVM, but I haven’t tested this.
That would also be something to look into; if asserts fire, it is obviously doing something wrong that will likely lead to unexpected behavior in normal operation.

24 Likes