Why is LoopVectorization deprecated?

I’ll see if @floop (from FLoops.jl) works better.

I’d recommend OhMyThreads.jl instead for now. In particular, the @tasks macro is probably what you want.

2 Likes

Thanks. The slowdown was a factor of 3.5 with @floop. I’ll give OhMyThreads a shot.

Same level of slowdown with OhMyThreads. Oh well, this exercise was still worth a shot even if VPlan is not working for me.

1 Like

That is consistent with what I saw from vplan.plan.
VPlan really likes to use gathers and scatters for contiguous loads and stores, which is a great way to make code run slower. That could get fixed in a future LLVM release.

It might be interesting to check this on Bump LLVM to v17 by mofeing · Pull Request #53070 · JuliaLang/julia · GitHub.

Interesting. All my testing was with the latest nightly, which is still LLVM 16, I think.

Note also that LLVM 18 was released this past week. Maybe we can skip LLVM 17?

Probably not. The upgrades are always a pain, and doing the 17 upgrade makes the 18 upgrade easier (we are ~95% of the way to 17).

1 Like

I disagree on some of this. IMO, it is more maintainable to have one implementation than to have one for Matrix{T} and another for Adjoint{T,Matrix{T}}, and then the Cartesian product of this across all arguments.
Most people also often get the optimal loop ordering wrong, as simple heuristics like “fastest index in the innermost loop” are often sub-optimal if you have a good enough vectorizer (LLVM’s LoopVectorizer is not good enough).
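To make the maintenance point concrete, here is a minimal sketch (colsum! is a hypothetical kernel, not from any package; Transpose is used rather than Adjoint just to sidestep conjugation):

using LinearAlgebra

# One generic implementation: sums each column of A into out.
function colsum!(out, A::AbstractMatrix)
    fill!(out, zero(eltype(out)))
    for j in axes(A, 2), i in axes(A, 1)  # i innermost: contiguous for Matrix
        out[j] += A[i, j]
    end
    return out
end

# Hand-specializing on layout instead: for a Transpose, the “same” loop should
# run in the opposite order to traverse the parent contiguously, and every new
# wrapper (in every argument position) multiplies the number of such methods.
function colsum!(out, A::Transpose{<:Any,<:Matrix})
    P = parent(A)
    fill!(out, zero(eltype(out)))
    for i in axes(P, 2), j in axes(P, 1)  # j innermost: contiguous for P
        out[j] += P[j, i]
    end
    return out
end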

Where does the line start/stop with what you want happening at source level, and what you don’t?
Vectorization and unrolling are highly target dependent, so I guess that is where you would draw the line?
In that case, you need to be able to specify this at source level, e.g. with @unroll_and_jam, @vectorize.
Should the source code indicate the order in which the unrolls are “nested”, or should the code generator do this?
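For reference, Base already ships one such source-level annotation: @simd, which asserts that a loop’s iterations may be reordered and evaluated in SIMD batches. A minimal example (mysum is just an illustrative name):

# @simd promises the compiler that reordering this reduction is acceptable.
function mysum(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end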

I like imperative code.
It is often quite readable to describe what you want by example. Saying “chair” while pointing to a chair will generally get the idea across faster than describing the abstract idea of a chair.

The idea is to be strict about what is observable, so that we get a lot of leeway under the “as if” rule and the imperative code can be treated declaratively. I want simple, easy-to-read, maintainable, generic code to perform optimally.
(Note: while LV does handle PermutedDimsArray, Adjoint, etc. correctly, it can only handle primitive eltypes like Float64 and Int16 – so it falls far short of the generic ideal.)

But as someone hypothetically developing a loop and linear algebra optimization library, I do have use both for the ability to manually specify transforms (to quickly try out what really is fastest) and, of course, for introspecting what it is actually doing (note: LV does have some of these features, but they aren’t documented).

I suspect most people will want these features a lot less than they initially think they do, but I would be happy to provide them (as I’d use them a lot myself).

9 Likes

I want the performance characteristics of the code to be legible to humans. I want the code to be able to serve as an exemplar, to teach its readers how to achieve similar goals. I want to minimize the amount of mandatory background knowledge that readers have to bring along. I want the code to be explicit and adaptable.

Imagine for a second a naive matmul code:

function matmul!(dst, A, B)
    #imagine some boundschecking here
    for i = 1:size(dst, 1)
        for j = 1:size(dst, 2)
            tmp = zero(eltype(dst))
            for k = 1:size(B, 1)
                tmp += A[i, k] * B[k, j]
            end
            dst[i, j] = tmp
        end
    end
    dst
end

Suppose LoopModels has succeeded beyond your wildest dreams – your compiler turns this into a workable almost-competitive matmul, with all the cache-oblivious blocking, vectorization, allocation of temporaries, etc.

I must ask: Is this matmul! good code?

And I say no, it’s not. It is really really bad code backed by a hidden really good compiler/optimizer. It is not legible to people who have not read your compiler. Understanding the performance characteristics of this piece of code starts by reading your compiler, not by reading this code.

And this is not just philosophical. Somebody will use a SafeInt that throws on overflow, and performance will crater, because you cannot reorder / block anymore: LLVM cannot prove that addition/multiplication is side-effect free.

So this matmul! is not generically performant, and seeing it does not help to write a good SafeInt matmul.
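A minimal sketch of what such a type could look like, using Base’s checked arithmetic (SafeInt is hypothetical, matching the name above):

# Hypothetical SafeInt: arithmetic throws an OverflowError instead of wrapping.
struct SafeInt
    x::Int
end
Base.:+(a::SafeInt, b::SafeInt) = SafeInt(Base.Checked.checked_add(a.x, b.x))
Base.:*(a::SafeInt, b::SafeInt) = SafeInt(Base.Checked.checked_mul(a.x, b.x))
Base.zero(::Type{SafeInt}) = SafeInt(0)
# Every += in matmul!'s inner loop can now throw, so the compiler must
# preserve the exact iteration order: no reassociation, blocking, or SIMD.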

Drawing bright lines is always hard. I know it when I see it? There will always be borderline cases. This also depends very much on the intended reader.

I am guilty of this myself – there are some compiler transformations that I mentally auto-apply to the code when reading it, without writing them into the source. And then a colleague with a different background reads my code and stumbles – ideally they ask me, or they just snort and think that my code sucks; worst case, they learn to copy the pattern and apply it in cases where it is not applicable (e.g., a pattern whose performance depends on escape analysis – then you had better know what kinds of patterns confuse the compiler).

PS. Loop duplication / hoisting of constant conditionals is a big example:

while condition
    #lots of code
    if loop_constant_bool
        #do something
    else
        #do something else that allows some refs to escape
    end
    #more code
end

In many cases the compiler transforms that into

if loop_constant_bool
  while condition
    #lots of code
    #do something
    #more code
  end
else
  while condition
    #lots of code
    #do something else that allows some refs to escape
    #more code
  end
end

In most settings, it is OK to let the compiler decide.

Sometimes, this is not OK – for example, if escape analysis / allocation-free-ness depends on it! Then ideally one would either do the transform in source, or have some @hoist_duplicate if loop_constant_bool macro.

I am super guilty of relying on that implicit compiler transformation, to the point that performance characteristics of my code are sometimes hard to read.

5 Likes

One thing I’d like, and plan on, is allowing a change in semantics that makes the contents of the written arrays undefined when the code does throw.
In cases where all checks are functions of the loop induction variables only, this allows us to hoist the checks and throw early.
In cases of functions like log or sqrt that throw for negative numbers, we can still perform SIMD evaluation of batches.
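As a minimal sketch of the hoisting case (scale! and the explicit n are illustrative, not an actual API):

# All checks depend only on the induction variable i, so they can be hoisted:
# throw early, before a single element is written, then run a check-free loop
# that is free to use SIMD.
function scale!(dst, src, n)
    checkbounds(dst, 1:n)  # hoisted checks
    checkbounds(src, 1:n)
    @inbounds @simd for i in 1:n
        dst[i] = 2 * src[i]
    end
    return dst
end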
It would be possible to disable this semantic change, as it could make debugging more difficult, and it may be a strike against my response to:

The result of executing the code without this optimization is more legible than with it (the two agree only in the non-erroring case, hence the need to be able to disable it).
There is often a tradeoff between readability of what the code does, and performance of the code.

Introspecting the output should definitely be made easy.
E.g., a simple -S/-S -emit-llvm or @code_native/@code_llvm away.
This is much easier than actually figuring out what OpenBLAS is doing, which is very well obfuscated and takes considerable detective work to unravel.
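For concreteness, on the Julia side that introspection is just (f is a stand-in function):

using InteractiveUtils  # provides @code_llvm and @code_native

f(x) = 2x + 1
@code_llvm debuginfo=:none f(1.0)    # the optimized LLVM IR for this call
@code_native debuginfo=:none f(1.0)  # the machine code actually generated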

I’d thus argue that your matmul!'s performance characteristics (a) wouldn’t require reading the compiler, and (b) would be much more readable than OpenBLAS.

However, a point Mojo has drilled home is that they don’t rely on magic compiler transforms. They stress the predictability of their performance; vectorization, unrolling, and tiling are all done with templated functions (something that is easy enough to do in Julia or C++; Mojo’s value-add is mostly solving the annoying FFI problem of bridging static and dynamic languages).

I guess this is part of the point you’ve been making, and one that resonates with me:
the endless frustration of compilers silently not applying the optimizations you’re expecting them to and know they are capable of, and wrestling with them to do the right thing. I’ve lost too much of my life to this already.

Loop duplication / hoisting is also called loop unswitching. This regressed with LLVM’s new pass manager and, IIRC, caused a long delay in upgrading, because broadcasting in Julia is heavily dependent on that optimization. Loop unswitching is reliable in simple cases, but you can also get it to fail, e.g. by having a broadcast statement with many matrices (or higher-dimensional arrays).
IMO, broadcasting is a case where it isn’t okay to let the compiler decide (which is why SciML has FastBroadcast.jl, which special-cases the non-broadcasting path to force specialized code for the most common case; if compile times weren’t a concern, one could take the other approach of generating all possible broadcast combinations to force full specialization).
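For reference, FastBroadcast.jl’s entry point is its @.. macro, a drop-in for @. (a sketch; see the package docs for the exact options):

using FastBroadcast

a, b, c = rand(100), rand(100), rand(100)
dst = similar(a)
# Like `@. dst = a + b * c`, but compiles a specialized loop for the common
# case where no size-expanding broadcast actually occurs.
@.. dst = a + b * c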

So I see your point on code needing to somehow provide a guarantee that the intended optimizations are actually being applied.
I’ll have to think about this, but am open to suggestions.

I think it’s important to

  1. Be able to avoid regressions. You don’t want people to feel like they ought to recheck all the native code or LLVM IR when they upgrade a toolchain.
  2. Be able to describe the sort of transforms one expects to be applicable.

Regarding “2.”, one should be able to say, for example, that they expect register tiling, with vectorization of some loop other than the reduction loop. With that, LV would currently fail for matmul!(C, A', B) because it misses an important optimization there; one would want to make sure it’s able to find the optimal patterns.
I think the descriptions can be fairly coarse.
Given preconditions, some optimizations can be extremely reliable, so we don’t need to thoroughly list everything.
For example, if type inference succeeds in Julia, devirtualization almost certainly will, too. Thus, people focus only on the type instability problem, assuming runtime dispatch will be solved with it.
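A small illustration of that rule of thumb (the Shape types here are hypothetical):

abstract type Shape end
struct Circle <: Shape
    r::Float64
end
area(c::Circle) = π * c.r^2
total(xs) = sum(area, xs)
concrete = [Circle(1.0), Circle(2.0)]  # Vector{Circle}: concrete eltype
nonconcrete = Shape[Circle(1.0)]       # Vector{Shape}: abstract eltype
# @code_warntype total(concrete)     # inference succeeds => devirtualized
# @code_warntype total(nonconcrete)  # abstract eltype => runtime dispatch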

23 Likes

Sad to hear that LoopVectorization.jl is to be deprecated if no one helps with maintenance. :(

@Elrod Thank you a lot for all your hard work and quality results!!!

May I ask how this will affect Polyester.jl, Octavian.jl, Tullio.jl?

1 Like

Polyester.jl doesn’t depend on LoopVectorization.jl or VectorizationBase.jl, but I think one of the same problems Julia 1.11 introduces impacts it as well, so we shall see.
Octavian.jl depends heavily on LV, so it is dead without it.
Tullio.jl will still work and be multithreaded, but performance will suffer for a lot of CPU code using it without LV.

2 Likes

Would you mind elaborating on the problems introduced in 1.11? (I can’t see a discussion elsewhere in this thread – apologies if I missed it!)

4 Likes

There is some information spread on GitHub:

2 Likes

This was it, it was on the tip of my tongue! Thanks.

Yep. Reliable and well-understood optimizations are fine to leave to the compiler.

However, this does impose a significant knowledge requirement on all readers: they need to know which optimizations work reliably.

In my day job, I work a lot with the JVM. The interaction between loop unswitching and escape analysis (which enables scalar replacement, aka SROA) is a pain, especially since we cannot conclusively decide whether we’re writing for HotSpot C2, Graal CE, or Graal EE, and on which runtime version (thank god it’s not Java 8 anymore). “Write once, crawl like a snail everywhere except on the specific compiler version the author had in mind, then run” :(

The difference is that people know that they don’t know the exact details of what OpenBLAS does. A blackbox with well-understood API boundaries is not a problem. But they may very well get the false impression that they understand the performance of the matmul! (after all, they can read the imperative source code!).

This is different in more declarative languages: When you write a complex SQL query, there is no expectation that anybody can understand its performance characteristics without looking in depth into what the query optimizer of the specific system does with it (indeed, the question is meaningless without all this context!).

3 Likes

I didn’t expect that 1.11 would introduce breaking changes.

From my understanding, Julia only promises non-breaking changes for exported and documented features (“public API”).
LoopVectorization.jl apparently uses a lot of the compiler internals which can change.

12 Likes

The short answer is that I don’t use LV for work, and I don’t really use LV myself except for winning benchmarks to get internet points. I need to draw some hard lines to manage my own time better.
Time management is still a weakness of mine.

I think a successful open source project should be able to survive on its own. LV does not seem to have reached this status.

As such, I have not actually looked into the problem at all.
Odds are, looking into the problem is most of the battle of solving it.
That said, I suspect there are two primary issues:

  1. LV normally avoids passing Array objects. However, in certain circumstances, it may pass them anyway. As of Julia 1.11, LV’s argument-passing approach is no longer valid for Arrays, due to the addition of Memory (see the sketch after this list). This should just require a fix to handle Arrays correctly, e.g. opting out of the decomposition and reconstruction LV does.
  2. LLVM will drop support for typed pointers. This should only require some code changes in VectorizationBase.jl: just move the types from the pointers to the getelementptr and the various load/store instructions. This will reduce the amount of code, as casts are no longer needed. The load/store and getelementptr instructions already needed to know the types of the input and output arguments, so we don’t even need extra information – it’s simply a reduction in complexity. Still, it is a breaking change.
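On the first point, a quick REPL illustration of what changed (these fields are Julia internals on 1.11 and may change; shown only to make the layout change concrete):

# On Julia 1.11+, an Array no longer owns its buffer directly; it wraps a
# Memory object, which is what invalidated LV's argument decomposition.
a = zeros(Float64, 4)
a.ref      # MemoryRef{Float64}: a reference into the underlying buffer
a.ref.mem  # Memory{Float64}: the buffer itself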

I was told LV segfaults when using an assertions-enabled build of LLVM, but I haven’t tested this.
That would also be something to look into; if asserts fire, it is obviously doing something wrong that will likely lead to unexpected behavior in normal operation.

24 Likes