MLIR is LLVM’s way of raising the white flag and saying it’s not the right level to do this kind of stuff either. LoopVectorization has a bright future, hopefully integrated into Julia’s compiler at some point. That might allow it to operate at a higher level and leave the architecture-specific stuff to LLVM.
Alternatively we could embrace MLIR, but Chris’ progress with LV is such that we might do better by having control over this stuff ourselves. @vchuravy is diving deep with MLIR, so we should have both approaches covered and will be able to make an informed decision.
I can confirm that LoopVectorization also generates essentially a perfect microkernel for Haswell on Julia 1.4, too.
Julia code:
julia> function mymul!(C, A, B, α, β)
           @avx for m in axes(A, 1), n in axes(B, 2)
               ABmn = zero(eltype(C))
               for k in axes(A, 2)
                   ABmn += A[m, k] * B[k, n]
               end
               Cmn = C[m, n]
               C[m, n] = α * ABmn + β * Cmn
           end
           return C
       end
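A minimal way to exercise the kernel above (the array sizes and the α, β values here are just for illustration; with β = 0 the result should simply match A * B):
julia> A = rand(72, 72); B = rand(72, 72); C = zero(A);

julia> mymul!(C, A, B, 1.0, 0.0) ≈ A * B
true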
I’m happy with that. I’d like to make it more robust.
As an extreme stop-gap, it would also be possible to define check_args to simply return false (regardless of what the arguments are) so that it uses an @inbounds @fastmath fallback loop instead.
This may be the better approach for those platforms, since the cost modeling currently assumes more or less the same instruction costs regardless of architecture.
Things have been reasonably consistent since Haswell, and Ryzen seems close enough as well.
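Schematically (see the full @macroexpand output further down), @avx already emits exactly that branch, so a check_args that always returns false would simply route everything through the else arm:
if LoopVectorization.check_args(a, b)
    # call the generated _avx_! kernel
else
    @inbounds @fastmath begin
        # the original loop, untransformed
    end
end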
I wanted to describe how it works, but yes, in those examples 100% of the benefit comes from the fallback loops.
Dot product:
julia> function mydot(a, b)
           s = zero(eltype(a))
           @inbounds @simd for i ∈ eachindex(a,b)
               s += a[i]*b[i]
           end
           s
       end
mydot (generic function with 1 method)
julia> using LoopVectorization, BenchmarkTools
julia> function mydotavx(a, b)
           s = zero(eltype(a))
           @avx for i ∈ eachindex(a,b)
               s += a[i]*b[i]
           end
           s
       end
mydotavx (generic function with 1 method)
julia> x128 = rand(128); y128 = rand(128);
julia> @btime mydot($x128, $y128)
7.993 ns (0 allocations: 0 bytes)
31.004598810919788
julia> @btime mydotavx($x128, $y128)
8.737 ns (0 allocations: 0 bytes)
31.004598810919788
julia> code_native(mydot, Tuple{Vector{Float64},Vector{Float64}}, syntax=:intel, debuginfo=:none)
julia> code_native(mydotavx, Tuple{Vector{Float64},Vector{Float64}}, syntax=:intel, debuginfo=:none)
Let's look at the hot loops, starting with mydot (@inbounds @simd):
There are two primary differences I noticed in what happens when the hot loop runs:
The former is aligned to a 32-byte boundary, and the latter to a 16-byte boundary. IIRC, my CPU's instruction decoder works in 32-byte chunks, but this is probably okay.
Once the jump to the start of the loop isn't taken, the first example immediately sums the vector. The LoopVectorization version instead has to determine the correct cleanup approach and branch accordingly.
Does either of these points support:
I don't think so, but the former (to the extent it matters) may be evidence that @llvm.expect is not enough to communicate which loop is hot.
Point 2 is thanks to LLVM's poor tail handling. If this improved, this advantage would likely disappear.
We pick the same unrolling approach there.
But LLVM basically always picks either 1 or 4, which sounds more like heuristics than a model. But not knowing the internals, I can’t say for sure.
LoopVectorization uses latency vs throughput of the inner loop to decide.
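As a rough illustration of that heuristic (the latency and throughput figures here are my assumptions for a typical recent x86 core, not numbers quoted from the package):
julia> fma_latency = 4;              # cycles before an FMA result can feed the next FMA

julia> fmas_per_cycle = 2;           # two FMA ports

julia> fma_latency * fmas_per_cycle  # independent accumulator chains needed to stay busy
8
which is why unrolling a dot-product-style reduction by 8 can make sense.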
Other examples, like a plain sum or a self-dot, are cases where LLVM also picks 4 but LoopVectorization picks 8. It takes too long for that to pay off:
So I should probably get it to adjust this behavior, as it won't be much longer before these loops become memory bound. Maybe they factor memory bandwidth into it.
I believe it’s slower than when using x128 because of this issue, but I am not sure.
LoopVectorization does not do loop versioning.
Which approach to handling an unknown stride do you prefer:
1. Assume it is not one, and optimize that case.
2. Emit a check; make it fast if the stride is 1, give up otherwise.
I think 1. is obviously the better choice.
Folks shouldn't use views that have an Int as the first index if they don't have to. They should vec or reshape as necessary.
I’d bet in the large majority of calls featuring views with dynamic leading strides, it will in fact be discontiguous – otherwise, they’re making a mistake.
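For a concrete (illustrative) example of such a view: indexing the first dimension with an Int leaves a leading stride equal to size(A, 1), which is essentially never 1:
julia> A = rand(8, 8);

julia> v = view(A, 2, :);   # Int as the first index

julia> strides(v)
(8,)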
I am skeptical of that claim. It struggles with things as basic as pointer offsetting, unless I generate code lining up with what I want the asm to look like.
Okay. No edge effects, we’ll make everything a clean 128:
using BenchmarkTools, LoopVectorization
function mygemm!(C, A, B)
    @inbounds @fastmath for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cmn = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

function mygemmavx!(C, A, B)
    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cmn = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end
M = K = N = 128;
C1 = Matrix{Float64}(undef, M, N); A = randn(M, K); B = randn(K, N);
C2 = similar(C1); C3 = similar(C1);
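Before the timings, a quick sanity check of my own (not from the original post) that the two versions agree:
julia> mygemmavx!(C1, A, B); mygemm!(C2, A, B);

julia> C1 ≈ C2 ≈ A * B
true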
This yields:
julia> @benchmark mygemmavx!($C1, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 34.779 μs (0.00% GC)
median time: 34.910 μs (0.00% GC)
mean time: 34.965 μs (0.00% GC)
maximum time: 100.442 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mygemm!($C2, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.548 ms (0.00% GC)
median time: 1.552 ms (0.00% GC)
mean time: 1.552 ms (0.00% GC)
maximum time: 1.636 ms (0.00% GC)
--------------
samples: 3222
evals/sample: 1
Over 44 times faster.
I was actually a little dishonest here. While LLVM wasn’t suffering from edge effects in this example, LoopVectorization actually was because its modeling picked different unrolling factors (not 32).
But we can help LLVM do a little better by reordering loops ourselves:
julia> function mygemm_better_order!(C, A, B)
           fill!(C, 0)
           @inbounds @fastmath for k ∈ axes(B,1), n ∈ axes(C,2), m ∈ axes(C,1)
               C[m,n] += A[m,k] * B[k,n]
           end
       end
mygemm_better_order! (generic function with 1 method)
julia> @benchmark mygemm_better_order!($C2, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.072 ms (0.00% GC)
median time: 1.096 ms (0.00% GC)
mean time: 1.096 ms (0.00% GC)
maximum time: 1.126 ms (0.00% GC)
--------------
samples: 4562
evals/sample: 1
Now it’s only 30 times slower.
Now, let’s talk a little about memory layouts again.
julia> @benchmark mygemm_better_order!($C2, $A', $B')
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.303 ms (0.00% GC)
median time: 1.346 ms (0.00% GC)
mean time: 1.340 ms (0.00% GC)
maximum time: 1.432 ms (0.00% GC)
--------------
samples: 3732
evals/sample: 1
julia> @benchmark mygemmavx!($C2, $A', $B')
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 41.487 μs (0.00% GC)
median time: 42.472 μs (0.00% GC)
mean time: 42.487 μs (0.00% GC)
maximum time: 103.311 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
LoopVectorization suffered less than a 20% performance degradation from these transposes, vs the roughly 22% degradation of the LLVM version.
Given that LoopVectorization had a lot more to lose here from vectorization failure, but lost less in % performance anyway (stating the obvious: it still vectorized the loops)…
What do you consider to be the immediate term? Do you believe that fixes to these problems will appear in LLVM within a year or two? I think even if LLVM were to improve to a point where this package was totally unnecessary by one year from now, this effort would be far from wasted.
Even if LLVM can produce the same loops, doing so can be tricky and Julia doesn’t always make it easy to communicate exactly what we want to LLVM.
In general, I'm quite skeptical of the idea of relying on LLVM to do these sorts of optimizations well. We have very little control over the development and priorities of the LLVM team, and we have very few people willing and able to contribute directly to LLVM.
Also, integrating into the compiler after all the type inference is done is something very different. I do believe we want more optimization there, mainly because of memory-model differences. However, for all the other ones, I don't see why it is harder to pass information from arbitrary Julia types to the LoopVectorizer than to LLVM. All the information mentioned below is information LLVM is clearly aware of but failed to make the right decision on. And that's what I meant by doing the thing at the right level. At least from the examples currently given, the pointer arithmetic is all very simple, and AFAICT in no case is LLVM generating worse code due to an inability to figure out what the code is doing.
What do you mean by pointer offset?
OK, so yes, this is another case, in addition to the edge handling. I'm not surprised at all that LLVM struggles with strided memory access. I'm pretty disappointed by LLVM's AVX512 performance. I don't have the hardware yet, so I'm not worrying too much. However, again, I don't see how LLVM is missing information here. It can even figure out at runtime whether the stride is 1, so I believe figuring out a different stride isn't going to be a problem.
By wasting effort (or whatever I said above) I did not mean wasting time figuring out the best code to generate for each case. That is the good part of the workaround; in fact, since that is exactly where LLVM's problems are, it is the workaround itself. What I mean is that:
Bending over for the limitations of doing this at the current level shouldn't be encouraged.
The lessons learned here, which AFAICT are all code-generation ones rather than figuring-out-information ones, should be reported to LLVM, since they should be of general interest. Otherwise this is going to be duplicated effort.
Related to the previous one: there are existing efforts on this for sure. I assume Polly has something to say about this, and MLIR would have at least something to say regarding heterogeneous architectures, though I'm not convinced it helps on traditional CPUs fundamentally.
Again, in general I agree, hence I'm not proposing removing optimizations from type inference. For vectorization on simple types, I have not seen such a case yet.
And again, another statement I would agree with in general. However, I strongly believe vectorized loops are an area that many major groups using LLVM are interested in, so I don't really see a conflict of interest here. Contributing directly to LLVM is also not necessary. We have many passes ourselves and we can certainly add more.
Yes, that WILL BE the right level. If this were implemented as a compiler stage AFTER type inference, as I already mentioned above, I'd have very little issue with it, except that many lessons should still be communicated to LLVM. It is exactly the current implementation and the intrinsic limitations one has to work with that I'm talking about (point 1 above).
Let me start by saying that I definitely agree with you (and Mason, and Tim Holy, and everyone else I’ve talked about this with).
The nice thing about the generated function approach is that it does work at the level of a lot of Julia-abstractions we’re all familiar with reasoning about. But I’m sure we can preserve that.
And there’s a lot more than “the nice thing” about doing it at a lower level.
Because of course you're right that there is more information available. It would be easier to support more types than just StridedArray{T} where {T <: Base.HWReal} plus a few extra additions like Adjoint and PermutedDimsArray without bailing out. Just being able to inline / scalarize aggregates as necessary to turn code into operations on native types would be a boon.
Currently, LoopVectorization uses check_args to bail out on (ideally) anything violating those assumptions:
julia> using LoopVectorization
julia> x = rand(Complex{Float64}, 4); y = rand(4);
julia> LoopVectorization.check_args(x), LoopVectorization.check_args(y)
(false, true)
But compile times would also be much better. A 32-bit build of Julia runs LoopVectorization’s test suite several times faster than the 64-bit build, because the 32-bit build only has 8 floating point registers to work with*. That severely limits the amount of unrolling it can do without spilling registers – and that “severely” reduces the amount of work Julia and LLVM have to do to compile.
Inference should return the same result for everything that gets unrolled, and there are obvious patterns in the indices that the compiler shouldn't have to work to unravel…
But I’m not really sure why. It strikes me as weird to see zmm registers used, but only zmm0 through zmm7.
Three examples:
Matrix multiplication
Before I made the changes, the assembly for the loop body here had a large number of addq, leaq, and sometimes even movq instructions to calculate the addresses of the broadcasted elements.
In the code generated now, the rdi register contains the stride of the array B in bytes, while r9 = 3 * rdi.
The loads from B now look like:
But at least it isn’t doing these calculations within the loop anymore.
LLVM 8 and 9 also needed me to actually inttoptr the pointers into i16*, i32*, and i64* and use those as the base for getelementptr operations to get the 2x, 4x, and 8x multipliers to appear. LLVM 10 did this on its own. I don't think I checked LLVM 6 before the fix.
Convolutions
using OffsetArrays, Cthulhu, LoopVectorization
Cthulhu.CONFIG.asm_syntax = :intel
function avxgeneric!(out, A, kern)
    @avx for I in CartesianIndices(out)
        tmp = zero(eltype(out))
        for J in CartesianIndices(kern)
            tmp += A[I+J]*kern[J]
        end
        out[I] = tmp
    end
    out
end
A = rand(100, 100);
kern = OffsetArray(rand(3, 3), -1:1, -1:1);
out = OffsetArray(view(similar(A, size(A) .+ 32), (1:98) .+ 32, (1:98) .+ 32), 1, 1);
@descend avxgeneric!(out, A, kern)
For a while, it was generating code like this (sorry for the AT&T syntax, this is an old copy/paste before I switched; memory addressing is much easier to read IMO with Intel syntax):
In the above, LLVM calculates pointer offsets in a ymm register, and then uses expensive extract instructions to load them:
vpextrq $1, %xmm0, %rbx
After I made a few tweaks, it actually started using zmm registers to calculate even more pointer offsets, making the code even slower.
I did not use SIMD intrinsics for any of the indexing calculations; that was LLVM’s ingenuity.
This example needs more work, because I still have the series of add instructions I managed to get rid of in the matrix-multiplication example, but the add instructions are a major improvement over what it was:
This is about 30% faster in the benchmark I ran than when it used ymm registers, and 50% faster than zmm, even though the amount of unrolling is now actually less suited to the benchmark. In fact, the above loop doesn't get run at all in the benchmark, but rather some cleanup loop I'd have to dig for.
That regression was so severe that I never released a LoopVectorization version generating it.
Repeated (e.g., diagonal) indices: for i in 1:N; A[i,i]; end
function logdettriangle(B::Union{LowerTriangular,UpperTriangular})
    A = parent(B) # using a triangular matrix would fall back to the default loop.
    ld = zero(eltype(A))
    @avx for n ∈ axes(A,1)
        ld += log(A[n,n])
    end
    ld
end
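A quick way to sanity-check it (my addition, not from the original post); both sides just sum the logs of the diagonal:
using LinearAlgebra
U = UpperTriangular(rand(64, 64) + 16I)   # keep the diagonal well away from zero
logdettriangle(U) ≈ logdet(U)             # expected to hold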
This is what it looked like before, and this after I made those changes.
LLVM would calculate the pointer offset from the strides inside the loop.
I now handle this via A[i,i] -> Adiagonalview[i], where stride(Adiagonalview, 1) == stride(A,1) + stride(A,2), which reduces addressing calculations by enough to make a notable difference vs the cost of a SIMD log.
This particular issue may be in part because it has to use vectors for the discontiguous offsets into A, so for A[i,i], we have something like ptr + <0,1,2,3,4,5,6,7> * st + stride(A,2) * <0,1,2,3,4,5,6,7> * st where st = sizeof(eltype(A)), and LLVM might be hesitant to optimize the vectors / realize that
<0,1,2,3,4,5,6,7> * st * i + stride(A,2) * <0,1,2,3,4,5,6,7> * st * i
== (<0,1,2,3,4,5,6,7> * st * (1 + stride(A,2))) * i
and that from there it can avoid the calculations inside the loop, and just keep vpaddq-ing <0,1,2,3,4,5,6,7> * st * (1 + stride(A,2)) onto the pointer vector instead.
LLVM sometimes figured this out, but not always. Flattening repeated indices into one with combined stride made it more reliable.
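To make the combined-stride idea concrete (a pure-Julia sketch, not the package's internal representation): the diagonal of a column-major matrix is itself a strided vector whose stride is stride(A,1) + stride(A,2):
julia> A = rand(8, 8);

julia> dstep = stride(A, 1) + stride(A, 2)
9

julia> Adiagonalview = view(vec(A), 1:dstep:dstep*size(A, 1));

julia> Adiagonalview == [A[i, i] for i in 1:size(A, 1)]
true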
If you like to waste time with micro-optimizations as much as I do, AVX512 is fun because of the extra options it provides and large possible reward when you properly leverage it (that large reward being just as much a function of the compiler’s poor AVX512 performance as you note as it is the larger vectors; the opportunity cost of hand optimizing / autogenerating optimized code is what the compiler would have done).
I used to have Clang-polly in the benchmarks, but I removed it because the performance was bad, it took a long time to compile using Polly, and then once I wiped my hard drive I decided rebuilding Clang with Polly from source instead of just using my distro’s Clang (which wasn’t built with Polly) wasn’t worthwhile. The only benchmark that seemed to improve with Polly was A * B, but it didn’t improve to a level beyond how the Intel compilers performed, and they have vastly better compilation times.
None of the transposed cases were notably better than regular Clang.
I’m looking forward to trying MLIR, whether through Brutus, Flang, or some other convenient frontend.
I think Flang is spearheaded by NVIDIA, so perhaps that is their primary goal in lowering to MLIR.
I should look into it more than I have, but there has at least been discussion of an AVX512 dialect and this paper got a pretty good gemm on AVX2.
Maybe another example I could have added is that LoopVectorization will (under certain conditions) use asm call instead of @llvm.muladd to specifically use the vfmadd231 instruction on CPUs with the FMA3 instruction set. vfmadd231 basically does: c = a * b + c
While @llvm.muladd would often choose to use vfmadd213 instead: b = a * b + c
and then add an extra vmovapd instruction to set c equal to b.
This has now been fixed on LLVM master, after I asked online for help getting rid of those vmovapds. That means that on Julia versions whose LLVM includes that commit, I can quit using asm call. I think that'll be starting with LLVM 11?
So at least one of the optimizations here has already led to an improvement in LLVM.
I am surprised that LLVM left such optimizations on the table, though for most (all?) of them, it again seems that LLVM does know what the offsets are but didn't manage to emit optimal code for them. They can certainly be added to the cases where better codegen is needed, but I don't think this information has ever been hidden from LLVM IR.
I like to do it when it matters for me ;-p… So yes, when I use AVX512-capable hardware as my main computer.
As far as I'm concerned, you are very welcome to submit that patch to be included in our LLVM build. I basically did that for all of the few LLVM performance-related issues that I've fixed/experienced, usually before they were merged upstream. There's no need to wait for a new LLVM release to use them.
And I'll also add that doing this kind of pattern matching with LLVM IR is not hard. In fact, I've always found it to be easier than working with Julia IR. The switch to SSA for Julia IR has certainly helped (though that's not the level this package uses), but LLVM still has WAY more utilities to do all of these analyses.
This presentation lays out the relationship and motivation better than I could: 2020-02-26 - CGO 2020 Talk - Google Slides. MLIR is being pushed by core developers of LLVM, which is what I meant by “raising the white flag” on solving all problems at the level of LLVM.
There’s interest in turning LoopVectorization into a compiler pass, see Add a compiler pass for iterate(::CartesianIndices) by timholy · Pull Request #35074 · JuliaLang/julia · GitHub. I hope to get back to that “real soon” (invalidations and Revise have dominated lately…). But honestly, what Chris is doing at the package level will all be useful once we pave the way for integrating at a deeper level, so there’s every reason to be excited about seeing his work continue to move forward.
Sure, I knew that they want MLIR because the current IR can't solve some problems. I don't see how that applies here, though. There's nothing language-specific here and there's nothing that LLVM isn't already aware of. Moving to a new IR might make existing code easier to (re)write, but that does not seem to be the limitation here. Again, everything new here seems to be emitting better code given information that is already known.
I've already said that working on better code generation isn't what I'm complaining about. Bending over to fit the intrinsic limitations of the current implementation is.
Similarly, if this were implemented on LLVM IR, nothing would be wasted either, and it'd even have a much broader audience. (In that sense, it is wasted effort, since it duplicates the LLVM one and those from other people. It's very unfortunate, but I have less direct issue with that.)
For my own education, since I know very little about the Julia compiler…
I would imagine that Julia takes code and then attempts to rewrite it into a form which LLVM can then optimize, much like a lot of Scheme compilers try to emit very boring C code which is really easy to optimize because it kind of looks like assembly written in C. Does Julia emit some form of intermediate language, and then LLVM works on compiling that secondary thing? I'll make another guess and suppose that the secondary thing is in fact a processor-independent assembly language of some sort.
Loop vectorization, it seems, is a macro which attempts to take your code and, because it's a macro, can actually emit assembly directly, bypassing LLVM entirely?
And so LoopVectorization as a compiler pass would do what? Actually emit assembly, or emit that intermediate code for LLVM (assuming I'm right about that)?
A small addition to @Oscar_Smith’s excellent recommendation (short answer: Julia’s compiler emits LLVM IR and then calls LLVM passes), Base.llvmcall means that the line between “ordinary Julia code” and “emit assembly” (via LLVM) is blurry.
While all the information is available, I think it's easier to make those sorts of optimizations at the same time as you perform loop transformations, as the patterns and relations are clear at that time; e.g. if you're unrolling a loop in which you're loading from a strided array, you know that some of the pointer offsets are 2x, 4x, and possibly 8x larger than a base, and thus you can use the 2, 4, or 8x multipliers available for addressing (and that this is a better use for them than sizeof(eltype(pointer)), as LLVM normally likes to do).
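As a sketch of that point (a hypothetical helper; Julia pointer arithmetic is byte-based): in a 4x unroll along a strided dimension the addresses are p, p+s, p+2s, and p+3s, so [p + s] and [p + 2*s] fit x86 addressing modes directly, and only 3s needs its own register (which is exactly the r9 = 3 * rdi trick mentioned earlier).
# illustrative only; p must be a valid Ptr{Float64}, s the byte stride
function load4(p::Ptr{Float64}, s::Int)
    a0 = unsafe_load(p)         # [p]
    a1 = unsafe_load(p + s)     # [p + s]
    a2 = unsafe_load(p + 2s)    # [p + 2*s]
    a3 = unsafe_load(p + 3s)    # needs 3s precomputed (or an lea)
    (a0, a1, a2, a3)
end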
Some of LLVM’s mistakes are truly bizarre.
How could it possibly not figure this one out?
This happens on LLVM 8 and 9 (and I think 6), but not with LLVM 10 (meaning no use reporting it, as it’s already been fixed):
rbp == r9 + 8.
Every time it uses rbp, it subtracts 8.
Once it’s done with all the loads, before repeating the loop, it sets r9 = rbp, and then adds 8 to rbp.
It seems that LLVM is obviously aware of the relationship here.
Yet, why doesn’t it eliminate rbp?
The most bewildering case of bad code gen I came across is that, if it is possible for a loop to have 0 iterations, LLVM likes to generate code that:
zero initializes all the registers
checks if the loop will iterate at least once. If it does not, it jumps to the appropriate clean up. If the loop will iterate, it continues and…
zero initializes all the registers that have just been zero initialized and not used since said xor-ing to 0.
This looks really bad when the number of registers being xored twice is something like 25 or 27.
As a workaround, LoopVectorization emits non-unrolled loops of the form
cond = true
while cond
    # do stuff
    cond = # check if it should exit; (`cond || break` leads to worse code gen)
end
(It can check to make sure they’re not empty outside if you pass @avx check_empty=true.)
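If I'm reading that option right, the usage would presumably look like this (the exact placement of the keyword argument is my assumption from the description above):
@avx check_empty=true for i ∈ eachindex(a, b)
    s += a[i] * b[i]
end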
But this trick doesn't work when loops are unrolled, as I can't guarantee that any particular one of them will iterate at least once.
I’d still have to carry the code managing asm call until I drop support for Julia versions that don’t carry that patch.
But asm call is definitely a hack that it would be better to be rid of.
I have a fair bit of logic deciding whether or not to use the asm version. On AVX512, LLVM can optimize the @llvm.muladd version by combining it with select statements or with loads/broadcasts.
So I have to check if it seems likely LLVM may want to do that, to avoid missing those opportunities.
I’ll have to look at other PRs adding LLVM patches to Julia to see what it’d take.
If Julia 1.6 is to be the next LTS, it would be worth it for me to get it in there.
Also, regarding performance on multiples of 32, I now have:
This was done by only calculating the mask when we need it. Some of the remaining difference is probably because, after the loop, the LLVM version immediately reduces the vector to a scalar, while mine checks for the appropriate cleanup loop.
At these sizes, the self-dot and sum examples are also helped by only unrolling by 4 instead of 8. As the size range where 8 is better is probably small (it won't be long before they're memory bound), I may limit the maximum unrolling of such reductions to 4.
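The gist of that, as a plain-Julia sketch (the width W and the bit-mask form are my assumptions, not the package's actual code path):
W = 8                      # assumed SIMD width for Float64 with AVX512
n = 128
rem = n & (W - 1)          # remainder iterations after the full-width body
if rem != 0                # only pay for building the mask when it is needed
    mask = (UInt8(1) << rem) - UInt8(1)
end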
I’ll have to think about it.
I’d have to look into that, but it of course makes sense that this would be true given how many passes there are that would want to know about these things.
I know close to 0 about LLVM’s internals.
@purplishrock here, in answer to how things work (note also that LoopVectorization has some developer docs):
The macro generates a call to the generated function _avx_!:
julia> using LoopVectorization
julia> @macroexpand @avx for i ∈ eachindex(a,b)
s += a[i]*b[i]
end
quote
begin
var"##loopi#592" = LoopVectorization.maybestaticrange(eachindex(a, b))
var"##i_loop_lower_bound#593" = LoopVectorization.maybestaticfirst(var"##loopi#592")
var"##i_loop_upper_bound#594" = LoopVectorization.maybestaticlast(var"##loopi#592")
end
if LoopVectorization.check_args(a, b)
var"##vptr##_a" = LoopVectorization.stridedpointer(a)
var"##vptr##_b" = LoopVectorization.stridedpointer(b)
local var"##s_0"
begin
$(Expr(:gc_preserve, :(var"##s_0" = LoopVectorization._avx_!(Val{(0, 0, 0)}(), Tuple{:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x03), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000001, 0x0000000000000000, 0x0000000000010203, LoopVectorization.compute, 0x00, 0x03)}, Tuple{LoopVectorization.ArrayRefStruct{:a,Symbol("##vptr##_a")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:b,Symbol("##vptr##_b")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}, Tuple{0, Tuple{4}, Tuple{3}, Tuple{}, Tuple{}, Tuple{}, Tuple{}}, Tuple{:i}, (var"##i_loop_lower_bound#593":var"##i_loop_upper_bound#594",), var"##vptr##_a", var"##vptr##_b", s)), :a, :b))
end
s = LoopVectorization.reduced_add(var"##s_0", s)
else
$(Expr(:inbounds, true))
local var"#79#val" = for i = eachindex(a, b)
#= REPL[52]:2 =#
s = Base.FastMath.add_fast(s, Base.FastMath.mul_fast(a[i], b[i]))
end
$(Expr(:inbounds, :pop))
var"#79#val"
end
end
The function _avx_! is given everything it needs to know to reconstruct a summary of the loop’s structure, encoded in type information.
The macro LoopVectorization.@avx_debug gives us this summary:
julia> a = rand(147); b = rand(147);
julia> s = 0.0;
julia> ls = LoopVectorization.@avx_debug for i ∈ eachindex(a,b)
           s += a[i]*b[i]
       end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:LOOPCONSTANTINSTRUCTION,LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x03),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000001, 0x0000000000000000, 0x0000000000010203, LoopVectorization.compute, 0x00, 0x03)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:a,Symbol("##vptr##_a")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:b,Symbol("##vptr##_b")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}
AM = Tuple{0,Tuple{4},Tuple{3},Tuple{},Tuple{},Tuple{},Tuple{}}
LPSYM = Tuple{:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f5936f90050, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f5936f905a0, ()), 0.0)
julia> typeof(ls)
LoopVectorization.LoopSet
The LoopSet contains all the operations involved in the loop, and their dependencies on each other and on the individual loops. This is used both for estimating the cost of evaluating the loop set under different loop orders and unrolling factors (the estimate is less sophisticated than MCA, but uses estimates of instruction throughput, register utilization, etc.) and for emitting the chosen version of the loops.
It will only choose to unroll up to 2 loops:
The returned arguments are loop order (in a vector), the first and second unrolled loop (second is Symbol("##undefined##")), which loop is vectorized, and then the unrolling factors for the unrolled loops (-1 is a sentinel value for not unrolled).
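If I remember the internals correctly, the call that produces those values looks something like this (the name and the destructuring are assumptions based on the description above):
order, unrolled1, unrolled2, vectorized, U, T = LoopVectorization.choose_order(ls)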
Then, given these values, it emits a Julia Expr object. You can call LoopVectorization.lower(ls) explicitly, or it will print this as the show method via simply:
julia> ls
This is largely reduced to a sequence of llvmcalls and control flow; the main loop:
If there's more than one loop, the expression probably won't be inlined, so I'd recommend Cthulhu for introspection.
As a compiler pass, LoopVectorization would similarly take nested loops, build its internal representation, analyze it to estimate which method of evaluating the loop will generally yield the best runtime performance (under the assumption that the loops are hot), and then emit the corresponding IR. Ideally, this should be locally opt-in via pragmas/macros/some form of decorator, as it is now.
I think we all agree that this would be better off integrated into a compiler. A better question is whether it should work with LLVM, or with Julia IR.
An incentive for the latter is that it is still my intention to have this support AD, i.e. use the LoopSet’s understanding of relationships between variables to create the reverse passes for loops.
Also meaning if you want the fix, it should be just a bisect away.
I've honestly not seen that one before, and it should certainly be reported. I have seen quite a few cases where it handles the empty case / loop entry better than what I can easily come up with…
As early as possible is always good.
It’s very easy to pick up, at least the LLVM IR part. Machine IR is probably not harder though it is somewhat harder to dump and look at IMHO…
And back to the point of wasted effort. Sadly, most of the time you spent debugging and working around the three issues mentioned here is wasted in the long term, and some even not that long term, since an LLVM patch could be included in the next/previous Julia release. I almost always find it much more productive to report, and even fix, the issue at the right level first before developing a temporary workaround. This is more true the more issues you have to work around, which is the whole point of this package. With so many individual issues (and apparently many already fixed) that you've hit so far, it should be very easy for you to look at fixes for, and debug, the issues in LLVM, and it should get you started on fixing them in LLVM in no time.
After the 27 vxor instructions, if the ja L896 doesn’t happen, it xors them all again.
The jump is also unlikely to be taken (which I attempted to communicate with @llvm.expect.i1(condition)), meaning the typical case is double-xoring.
I also would not call all of this a workaround. LLVM itself seems averse to nested loop optimizations, and (in my tests) polly generally did not help.
The simple single loops primarily benefit either from (a) the "workaround" for LLVM's slow cleanup, or (b) using SIMD special functions, but the nested-loop cases involve loop switching and register blocking.
EDIT:
I wonder what sort of benchmarks could convince LLVM it’s worth handling clean up loops differently?