Why is LoopVectorization deprecated?

I guess you mean “maintenance nightmare”…

Yes, corrected.

1 Like

The short answer is that no one’s written one. Chris and I actually did some very initial prototyping of making Julia’s tfuncs natively support vectorization (it turns out the LLVM side already supports this). Doing it properly looks fairly doable and would be a pretty good project for someone who wants to get into Julia and LLVM IR.

15 Likes

Given that Julia is supposed to be a language for high-performance computing, what you describe is definitely necessary.

2 Likes

There appears to be some misunderstanding about what vectorization is, and how it is supported in julia.

So first: vectorized code. When you compute a = b + c with normal Int64 or Float64, then on a modern CPU only a tiny, tiny percentage of the effort goes into the actual computation. Most of it is overhead: the instruction needs to be fetched, decoded, scheduled, etc. Think of e.g. python: as an interpreted language, its loops are slow because the interpretation overhead dwarfs the actual cost of each operation. It is the same on CPUs, which you should think of as a hardware-assisted interpreter/JIT for assembly (or, more precisely, for the machine code of your ISA).

As CPUs get more advanced, this ratio is becoming worse, not better. The simple reason is that there are physics/engineering limits on the speed of sequential operations (end of Dennard scaling); so the CPU becomes more and more fancy (e.g. speculation, pipelining, superscalar / ILP) – you gotta spend your transistor/energy budget somehow.

One way out is to amortize the fixed-per-operation cost over more data. In python or matlab, you do this by writing a = b + c for large arrays; you then pay the interpreter overhead only once, and this dispatches to a hand-written C or assembly function that runs the loop. Much, much faster than writing e.g. a = [b[i] + c[i] for i in range(len(b))] in python.

When we say “loops are fast in julia”, we mean that this makes no big difference in julia – your hand-written julia loop will be compiled to the equivalent of a decently written C loop. So this kind of high-level vectorization is unneeded in julia (it is still sometimes nice for code readability! That’s why e.g. broadcasting exists. But broadcasting is not a performance feature, it’s a convenience feature).
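A minimal sketch of that point (the function names and the BenchmarkTools timing lines are mine): in julia, the explicit loop and the broadcast compile to essentially the same machine code, so neither one is “the fast vectorized path” the way it is in python/matlab.

function add_loop!(a, b, c)
    @inbounds for i in eachindex(a, b, c)
        a[i] = b[i] + c[i]
    end
    return a
end

add_broadcast!(a, b, c) = (a .= b .+ c)

# using BenchmarkTools; b = rand(10^6); c = rand(10^6); a = similar(b)
# @btime add_loop!($a, $b, $c)        # roughly the same time as...
# @btime add_broadcast!($a, $b, $c)   # ...the broadcast version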

OK, back to CPUs. There you do the same thing; this is called SIMD (single instruction, multiple data). Ta-da: you can add 2, 4 or 8 Float64s while paying the considerable fixed-per-operation dispatch cost only once. Indeed, nowadays the cost of adding 4 floats in a single instruction is roughly the same as adding one float.

So, how do we use these cool SIMD instructions?

There are basically four schools:

First: you write assembly. What you see is what you get. This is the most reliable way, but it is not for the faint of heart, and it is only used in exceptional circumstances (e.g. crypto code that must be constant-time).

Second is the old-school C approach. You include immintrin.h or its analogue and get cracking! You know the exact instructions you’re getting, but the compiler helps you with register allocation and so on, and does its best to optimize your code. This is reasonably reliable. But it is annoying on many levels – apart from the fact that you’re coding at a relatively low level, there is a portability issue: your code is specific to a target, and if you write avx512 code and try to run it on a machine that doesn’t support that, your CPU will trap (the low-level analogue of an exception).

The third school is to use a library that abstracts over actual CPU capabilities and offers portable high-level operations. This is exemplified by e.g. java’s new SIMD API, and is what you do in julia if you write c = ntuple(i->VecElement(a[i].value + b[i].value), length(a)) for e.g. a::NTuple{4, VecElement{Int64}}, or the analogous llvmcall (or you use SIMD.jl, which is very convenient syntactic sugar on top of these language features). This is portable, i.e. you won’t have to think about whether your code runs on x64-avx2 or armv7 or aarch64 with whatever vector extension. It is a lot less reliable than the second approach, but mostly works well. The big pitfall is that the code will run, but might not vectorize for some reason. Say, your machine doesn’t support SIMD at all, or only supports 128-bit-wide SIMD – then julia/llvm will compute your thing in a slow sequential way instead of throwing an error in your face. Or you run java SIMD code on a JVM like older graal-CE – you’ll get correct results, but much slower than if you had written sequential code. It is very much up for debate whether slow-but-correct is better than blow-up-in-your-face in cases where the user explicitly requested SIMD.
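To make the third school concrete, here is a minimal sketch using only the language-level primitives mentioned above (the names Vec4 and vadd are mine); SIMD.jl wraps this kind of thing in nicer syntax:

# A 4-wide vector of Float64; tuples of VecElement map to LLVM vector types.
const Vec4 = NTuple{4, VecElement{Float64}}

# Element-wise add; on hardware with wide enough SIMD this can lower to a single
# vector instruction, on weaker hardware it silently falls back to scalar code.
vadd(a::Vec4, b::Vec4) = ntuple(i -> VecElement(a[i].value + b[i].value), 4)

a = ntuple(i -> VecElement(Float64(i)), 4)
b = ntuple(i -> VecElement(2.0), 4)
vadd(a, b)   # (VecElement(3.0), VecElement(4.0), VecElement(5.0), VecElement(6.0))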

The fourth school is called auto-vectorization. You write simple sequential code, and the compiler tries to optimize it into SIMD code. But, like all compiler transformations, this is unreliable and difficult, because the compiler MUST preserve the semantics of your sequential code. Auto-vectorization exists and is done in julia, by virtue of including the requisite LLVM passes in the optimization pipeline.
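A minimal sketch of the fourth school (the function name is mine): plain sequential julia code, with no SIMD anywhere in the source, that LLVM’s auto-vectorizer will typically turn into SIMD code anyway.

# Plain sequential loop; the source says nothing about SIMD.
function axpy!(y, a, x)
    @inbounds for i in eachindex(y, x)
        y[i] = a * x[i] + y[i]
    end
    return y
end

# @code_llvm debuginfo=:none axpy!(zeros(1000), 2.0, rand(1000))
# vector types like <4 x double> in the output mean the loop was auto-vectorized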


Now, what about macros like @simd? This macro transforms your code in order to make auto-vectorization more likely. People are using it like magic performance pixie dust.
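A typical use (a sketch; the function name is mine): a floating-point reduction, where @simd tells the compiler that reassociating the additions is acceptable – which is exactly the permission the vectorizer needs here.

function mysum(x::AbstractVector{Float64})
    s = 0.0
    @simd for i in eachindex(x)
        @inbounds s += x[i]   # @simd allows reordering this reduction
    end
    return s
end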

What about LoopVectorization’s @turbo? This is almost a compiler unto itself.

It produces very fast code by playing very fast and loose with julia semantics. For example:

julia> using LoopVectorization

julia> function shift(B,A)
           for i = 1:(length(A)-1)
               B[i+1] += A[i]
           end
       end
shift (generic function with 1 method)

julia> A=collect(1:10); shift(A,A); A
10-element Vector{Int64}:
  1
  3
  6
 10
 15
 21
 28
 36
 45
 55

julia> function shift(B,A)
           @turbo for i = 1:(length(A)-1)
               B[i+1] += A[i]
           end
       end;

julia> A=collect(1:10); shift(A,A); A
10-element Vector{Int64}:
  1
  3
  5
  7
  9
 11
 13
 15
 17
 19

LoopVectorization assumed that B and A do not alias (i.e. are not the same array), and therefore produced incorrect results.

This is, afaiu, not a LoopVectorization bug but a user error: either the author of the shift function messed up, or the user of the shift function messed up, depending on whether the documentation of shift warns about aliasing or not.


LoopVectorization is in the magic performance pixie dust school. It is not necessary for performance / vectorization, and it is not essential for a high-performance language.

I will never use LoopVectorization; it’s not for me.

But people apparently like their magic performance pixie dust, and adopt it. So I have deep respect for what Chris Elrod achieved with LV and gave to the ecosystem, and it’s a pity that it is going away. But don’t overstate its importance.

42 Likes

@simd ivdep should also produce incorrect results, at least if your loop trip count is at least the vectorization factor times the unroll factor (length(A) >= 33 should be enough for all architectures).
So non-aliasing, as an example, is easy enough to promise without LoopVectorization.jl: just use @simd ivdep for.
It won’t actually help performance by a measurable amount in most cases, because in most cases LLVM will produce an alias check and branch, and the check is fast.
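For instance, a sketch of that applied to the shift example above (the function name is mine):

# @simd ivdep promises the compiler there are no loop-carried dependencies
# through memory. Calling this with B === A breaks that promise, so once the
# trip count is long enough to actually vectorize, the result goes wrong in
# the same way as the @turbo version above.
function shift_ivdep!(B, A)
    @simd ivdep for i = 1:(length(A)-1)
        @inbounds B[i+1] += A[i]
    end
    return B
end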

So it isn’t only that it plays faster and looser. It is also both cleverer in some respects and better tuned for larger vector widths.
LLVM makes a lot of tradeoffs that are only good for vector widths of 2 or less.
LLVM’s auto-vectorizer is mostly stupid and heuristics-based, while LV at least has a naive cost model and then an even more naive search of the solution space.
A lot of work also went into the code generation.
It isn’t easy to write a matrix-multiply micro-kernel using SIMD.jl that performs as well as @turbo, even if you’re manually applying the same transforms it does. Few people have the expertise to do this.
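For reference, a sketch of the kind of kernel meant here, written the easy way with @turbo (essentially the textbook LoopVectorization example); matching its performance by hand with SIMD.jl means doing the loop reordering, unrolling, and register blocking yourself.

using LoopVectorization

function mygemm!(C, A, B)
    # @turbo reorders, unrolls, and vectorizes this naive triple loop.
    @turbo for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end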

Still, it is something I was once proud of and poured a lot of time into. I wish it were important, but it literally isn’t worth much at all. A few hundred dollars, and that is it.

21 Likes

For anyone wondering, this would reduce the compile-time overhead of the third school by reducing the need for llvmcall.
But it wouldn’t cover vector loads or vector stores. At least, not without much larger additions/changes.

2 Likes

Yep. Not your bug, user error.

You should be proud!

It’s not just worth a few hundred dollars. Not monetizable != worthless. In a sane world, it would be trivial for you to get a grant on the strength of LV from any funding agency, or to get industry funding.

Unfortunately, the world is not sane; industry doesn’t fund its upstream, and academia tests your grant-application-writing skill, not the strength of your work. OSS funding generally sucks.

If you do have a personal funding issue, you might get help by asking around? I suck at grant applications and washed out of the academy, but there are a lot of successful academics in the julia ecosystem. Or juliacomputing itself?


I think I would like to see LV more as a domain-specific language + compiler for kernel functions, targeting LLVM, with julia-inspired syntax and excellent julia interop.

And I think there is lots of space for such DSLs. ispc also comes to mind, for slightly different styles of loops and without seamless interop.

By targeting LLVM, this could also have excellent clang-C/C++ and rust interop (excellent == cross-language LTO / inlining; mainstream languages mean more funding opportunities than pure julia).

^this!

As a matter of personal preference, I’d prefer to do all high-level transforms on the source level (reliable, maintainable, transparent, readable) without any automatic loop re-ordering. And yet, you took so much care in the code generation. Can I have one without the other :wink: ?

Can I have an interactive compiler for high-level things? I give you the loops, you give me the transformed (not lowered!) loops and thereby teach me how to write performant code.

I’d also love that for high-level linear algebra. The user writes code and provides sample input, and the tooling helps to interactively improve the code, especially with respect to things like copies/views, hoisting BLAS operations out of loops, and using appropriate algorithms (“your sample matrix happened to be spd. Don’t you want to use cholesky instead of lu here? And don’t you dare use inv. Also, that sparse matrix has way too many nonzero entries; it would be faster to represent it as a dense matrix.”).
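A small illustration of the kind of advice such tooling could give (the matrix here is a stand-in of my own, assuming the standard LinearAlgebra library): for a symmetric positive definite system, cholesky does roughly half the work of lu and also certifies positive-definiteness.

using LinearAlgebra

A = let X = randn(500, 500); X'X + I end   # symmetric positive definite by construction
b = randn(500)

x_lu   = lu(A) \ b                     # works, but ignores the SPD structure
x_chol = cholesky(Symmetric(A)) \ b    # exploits it; roughly half the flops
# x_lu ≈ x_chol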

21 Likes

Monetization-wise, I think LoopVectorization.jl in the context of StaticCompiler.jl would be worth a lot.

Namely, if one could create highly efficient libraries using Julia + LoopVectorization.jl + StaticCompiler.jl, that would be something worth paying for.

I know I’d easily pay $500/year for it for my hobby coding, and I could probably use it in a commercial context as well and pay more.
The problem is that fast code limited to the Julia ecosystem, the way it is now, can hardly be monetized, hence no money was thrown at it.

From my point of view, something like your project should have been one of the pillars of the Julia ecosystem, since the main selling point of Julia, the way I see it, is speed.

This is basically one of the go-to-market strategies of Modular with Mojo.
So I think you were right in your initial thinking; it was the ecosystem around you that didn’t follow a direction coherent with your effort.
Only appreciation for your effort and for how impressive your achievement is.

12 Likes

An excellent idea for didactic purposes. Funding would likely be restricted to academic sources (industry of any kind benefits from novices). Julia being a Lisp facilitates things.

This idea reminded me of Stop Writing Dead Programs - Jack Rusher (Strange Loop 2022).

Not just for didactic purposes.

The idea is excellent, but it isn’t new, nor is it mine. I came upon it in an old djb talk, and he attributes the insight to either Hoare or Don Knuth.

The thing for linear algebra ops would probably be a big-money killer feature for matlab.

I have some hope for VPlan in LLVM (see “Vectorization Plan” in the LLVM documentation).

We should try and see how performance is with JULIA_LLVM_ARGS="-enable-vplan-native-path" julia

3 Likes

Is that something an innocent civilian can try now?

Yeah, that’s just setting an environment variable to change which LLVM passes are run. Note that it was supposed to be JULIA_LLVM_ARGS="-enable-vplan-native-path" julia

1 Like

I’m still not clear on what to do.
Do I put

ENV["JULIA_LLVM_ARGS"] = "-enable-vplan-native-path"

in my startup.jl? What is the julia at the end for?

To start julia. You run it from the command line with that variable defined.

Forgive me. I’m still not getting it. Do I put

setenv JULIA_LLVM_ARGS  "-enable-vplan-native-path"

in .cshrc?

1 Like

Just run Julia from the command line with JULIA_LLVM_ARGS="-enable-vplan-native-path" julia rather than julia.

2 Likes

I tried that (Mac, latest OS, csh, command line) and got this

% JULIA_LLVM_ARGS="-enable-vplan-native-path" julia
JULIA_LLVM_ARGS=-enable-vplan-native-path: Command not found.

Putting the line in my .cshrc file made a difference in my half precision lu, but it made it roughly 5 times slower.

I see you’re using bash. Ran with bash → same slowdown.

Is there any reason that this would conflict with @batch, @simd, …?

I can imagine it conflicting with @batch, because @batch can sneakily replace arrays with StrideArrays.