[ANN] LoopVectorization

I’ve released LoopVectorization v0.8.1 today (and v0.8.0 yesterday).
The 0.8-series introduced a breaking change:

It now assumes that none of the iterables are empty. That means for i in 1:0; end is invalid, but for I in CartesianIndices(()); end is still fine.

julia> isempty(1:0)
true

julia> isempty(CartesianIndices(()))
false

If you want LoopVectorization to emit a check, you can pass check_empty=true, and it will only run the code if the iterables are in fact not empty.

julia> using LoopVectorization, Test

julia> function mysum_checked(x)
           s = zero(eltype(x))
           @avx check_empty=true for i ∈ eachindex(x)
               s += x[i]
           end
           s
       end
mysum_checked (generic function with 1 method)

julia> function mysum_unchecked(x)
           s = zero(eltype(x))
           @avx for i ∈ eachindex(x)
               s += x[i]
           end
           s
       end
mysum_unchecked (generic function with 1 method)

julia> x = fill(9999, 100, 10, 10);

julia> @test mysum_checked(x) == mysum_unchecked(x) == sum(x)
Test Passed

julia> xv = view(x, :, 1:0, :);

julia> @test iszero(mysum_checked(xv))
Test Passed

julia> @test iszero(mysum_unchecked(xv))
Test Failed at REPL[26]:1
  Expression: iszero(mysum_unchecked(xv))
ERROR: There was an error during testing

Maybe checking should be the default, since it takes extremely little time. But a few folks on Zulip thought defaulting to no checks was reasonable.

LoopVectorization 0.8 had a major overhaul of how the resulting expression is emitted, focused on improving both performance and the quality of the generated assembly.
Matrix-multiplication performance is now very good:

[benchmark plot: matrix-multiplication performance of the @avx macrokernel compared against various compilers and libraries]

While you’d still need to add extra loops and pack for a real matrix multiplication function, the raw macrokernel does extremely well on my system with AVX512.
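
For reference, that macrokernel is just a naive triple loop under @avx. A minimal sketch, following the pattern from the README (the name mygemmavx! and the exact loop bounds are illustrative, not necessarily the benchmarked code):

using LoopVectorization

function mygemmavx!(C, A, B)
    # C = A * B; @avx chooses the vectorization and unrolling strategy.
    @avx for m ∈ 1:size(A, 1), n ∈ 1:size(B, 2)
        Cmn = zero(eltype(C))
        for k ∈ 1:size(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end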

[Brief aside: If you have any idea how to make all the colors easier to tell apart, please chime in on this issue or in this thread.]

I may have to tune it a bit for AVX2. I haven’t tested it there, but I think I am currently underestimating a cost for those architectures. Were I to make it more accurate, it may make a different decision that might improve performance there. “May”, “might” – I don’t know until I test.

Improvements in the assembly include eliminating a lot of redundant blocks and unnecessary addressing calculations.
Sample of the generated assembly on Julia master with LLVM 10 (disclaimer: there are still a few unnecessary instructions with LLVM 6, 8, and 9; I didn’t check 7 because Julia skipped from 6 to 8 with 1.3 → 1.4):

L1024:
        prefetcht0      byte ptr [r11 + r14]
        vbroadcastsd    zmm30, qword ptr [r8]
        vmovups zmm29, zmmword ptr [r11]
        vmovups zmm28, zmmword ptr [r11 + 64]
        vmovupd zmm27, zmmword ptr [r11 + 128]
        prefetcht0      byte ptr [r11 + r14 + 64]
        prefetcht0      byte ptr [r11 + r14 + 128]
        vfmadd231pd     zmm26, zmm30, zmm29 # zmm26 = (zmm30 * zmm29) + zmm26
        vfmadd231pd     zmm23, zmm30, zmm28 # zmm23 = (zmm30 * zmm28) + zmm23
        vbroadcastsd    zmm31, qword ptr [r8 + rdi]
        vfmadd231pd     zmm17, zmm30, zmm27 # zmm17 = (zmm30 * zmm27) + zmm17
        vfmadd231pd     zmm25, zmm31, zmm29 # zmm25 = (zmm31 * zmm29) + zmm25
        vfmadd231pd     zmm21, zmm31, zmm28 # zmm21 = (zmm31 * zmm28) + zmm21
        vfmadd231pd     zmm14, zmm31, zmm27 # zmm14 = (zmm31 * zmm27) + zmm14
        vbroadcastsd    zmm30, qword ptr [r8 + 2*rdi]
        vfmadd231pd     zmm24, zmm30, zmm29 # zmm24 = (zmm30 * zmm29) + zmm24
        vfmadd231pd     zmm19, zmm30, zmm28 # zmm19 = (zmm30 * zmm28) + zmm19
        vfmadd231pd     zmm11, zmm30, zmm27 # zmm11 = (zmm30 * zmm27) + zmm11
        vbroadcastsd    zmm30, qword ptr [r8 + r9]
        vfmadd231pd     zmm22, zmm30, zmm29 # zmm22 = (zmm30 * zmm29) + zmm22
        vfmadd231pd     zmm16, zmm30, zmm28 # zmm16 = (zmm30 * zmm28) + zmm16
        vfmadd231pd     zmm8, zmm30, zmm27 # zmm8 = (zmm30 * zmm27) + zmm8
        vbroadcastsd    zmm30, qword ptr [r8 + 4*rdi]
        vfmadd231pd     zmm20, zmm30, zmm29 # zmm20 = (zmm30 * zmm29) + zmm20
        vfmadd231pd     zmm13, zmm30, zmm28 # zmm13 = (zmm30 * zmm28) + zmm13
        vfmadd231pd     zmm6, zmm30, zmm27 # zmm6 = (zmm30 * zmm27) + zmm6
        vbroadcastsd    zmm30, qword ptr [r8 + r15]
        vfmadd231pd     zmm18, zmm30, zmm29 # zmm18 = (zmm30 * zmm29) + zmm18
        vfmadd231pd     zmm10, zmm30, zmm28 # zmm10 = (zmm30 * zmm28) + zmm10
        vfmadd231pd     zmm4, zmm30, zmm27 # zmm4 = (zmm30 * zmm27) + zmm4
        vbroadcastsd    zmm30, qword ptr [r8 + r12]
        vfmadd231pd     zmm15, zmm30, zmm29 # zmm15 = (zmm30 * zmm29) + zmm15
        vfmadd231pd     zmm7, zmm30, zmm28 # zmm7 = (zmm30 * zmm28) + zmm7
        vbroadcastsd    zmm31, qword ptr [r8 + rbp]
        vfmadd231pd     zmm2, zmm30, zmm27 # zmm2 = (zmm30 * zmm27) + zmm2
        vfmadd231pd     zmm12, zmm31, zmm29 # zmm12 = (zmm31 * zmm29) + zmm12
        vfmadd231pd     zmm5, zmm31, zmm28 # zmm5 = (zmm31 * zmm28) + zmm5
        vfmadd231pd     zmm1, zmm31, zmm27 # zmm1 = (zmm31 * zmm27) + zmm1
        vbroadcastsd    zmm30, qword ptr [r8 + 8*rdi]
        vfmadd231pd     zmm9, zmm30, zmm29 # zmm9 = (zmm30 * zmm29) + zmm9
        vfmadd231pd     zmm3, zmm30, zmm28 # zmm3 = (zmm30 * zmm28) + zmm3
        vfmadd231pd     zmm0, zmm30, zmm27 # zmm0 = (zmm30 * zmm27) + zmm0
        add     r11, r10
        add     r8, 8
        cmp     r11, rdx
        jbe     L1024

To see how far this has come, compare the above graphic with the opening post!

20 Likes

I don’t know of any direct way in Julia, but some dude made a tool for Matlab plots some years ago, called “Maximally perceptually distinct colors”. Maybe it could be ported to Julia?
Generate maximally perceptually-distinct colors - File Exchange - MATLAB Central

I agree that the colors are a bit tough to distinguish.

3 Likes

This works really well:
http://juliagraphics.github.io/Colors.jl/stable/colormapsandcolorscales/#Generating-distinguishable-colors-1
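
For instance, a minimal sketch (the dropseed keyword is from my reading of the Colors.jl docs, so double-check it):

using Colors

# Ask for 12 colors that are maximally distinguishable from each other
# and from the white background supplied as the seed:
cols = distinguishable_colors(12, [colorant"white"]; dropseed = true)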

4 Likes

Ah, I would guess that these two are connected :smiley:

I had tried that, but apparently I thought what I have now looked better.

Maybe this is what that looked like?
Now I’m not so sure.

Looks better to me, but still heavy on the blues. Some traces are almost identical (e.g. ifort/ifort-builtin and icc/GFortran, if I read correctly); aggregating them should already help color distinction. You can always provide the detailed traces in a separate plot for those interested in the details.

Thanks, it’s looking great! I think we’re seeing some installation issues with LoopVectorization on weird CPUs/platforms, but instead of just giving up on LoopVectorization, I hope you’re okay with us pinging you each time we identify that an issue could be due to LV. I want to help make sure we can get this stable enough that we can use it in DiffEq and just bail on the weird cases, because this kind of performance leg-up is exactly the kind of thing we always want more of.

I was not going to bring this up, but since it seems that people actually want to use this in real packages, I just want to point out that:

  1. All of the optimizations mentioned in the README are actually handled by LLVM, and some of the descriptions there are fairly misleading. In fact, LLVM’s issue is that it is too aggressive and the fallback isn’t gentle enough.
  2. This is really not the right level at which to do auto-vectorization. LLVM has a much better understanding of what’s going on and can take advantage of it much better. Hand vectorization, like SIMD.jl, is completely different and isn’t a problem (see the sketch below).
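
By “hand vectorization” I mean something like the following SIMD.jl sketch (illustrative only; it assumes length(x) is a multiple of 4), where the programmer picks the vector width explicitly rather than relying on the autovectorizer:

using SIMD

function vsum(x::Vector{Float64})
    # Explicit 4-wide vector accumulator; the compiler isn't guessing.
    acc = Vec{4,Float64}(0.0)
    @inbounds for i in 1:4:length(x)
        acc += vload(Vec{4,Float64}, x, i)
    end
    sum(acc)  # horizontal reduction of the 4 lanes
end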

Now, with the current implementation of LLVM, you are certainly free to bypass its vectorizer to solve any immediate problems. However, don’t expect this to be a long-term solution…

1 Like

Right now, short term solutions are much better than long term solutions.

3 Likes

Another cue that could be used is line thickness.

1 Like

…and/or line styles (dotted, dashed, dash-dotted). You could also use the same colour for the three ifort lines but with different styles; you immediately gain two more colours.
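
For example, an illustrative sketch with made-up data and labels, using Plots.jl:

using Plots

x = 1:100
# One colour, three line styles: the three ifort traces stay grouped
# visually while freeing two colours for other traces.
plot(x, [sin.(x ./ 10) cos.(x ./ 10) sin.(x ./ 5)];
     color = :blue,
     linestyle = [:solid :dash :dot],
     label = ["ifort" "ifort-builtin" "ifort (variant)"])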

1 Like

This kind of comment is quite unfortunate and demoralizing. A lot of people have seen tremendous performance benefits from using LoopVectorization and I sure hope we continue to see this excellent work. If you have some ideas on how to improve upon things, I’m sure constructive criticism will be appreciated.

5 Likes

Yes I do

And I didn’t give a very detailed comment, and even mentioned that it is certainly OK to use it now, exactly because I understand that a workaround for a problem is OK to have.

And that’s, to a certain degree, exactly the point: to discourage people from going too deep down the wrong path.

A workaround is a workaround. The correct level for this to happen is much lower, so it wouldn’t be wise to encourage people to sink more time into it than is necessary to work around existing issues, since the workaround has its intrinsic limitations. (There’s no way it can know as much about memory accesses and tradeoffs as LLVM does without basically reimplementing the whole compiler infrastructure.)

Also, the point is to not let wrong information mislead people into thinking this is something it really isn’t.


And if you do want a more detailed comment:

For this reason, the @avx version is roughly twice as fast. The @inbounds @simd version, however, is not, because it runs into the problem of loop-carried dependencies: to add a[i]*b[i] to s_new = s_old + a[i-j]*b[i-j], we must have first finished calculating s_new, but – while two fma instructions can be initiated per cycle – they each take several clock cycles to complete. For this reason, we need to unroll the operation to run several independent instances concurrently. The @avx macro models this cost to try and pick an optimal unroll factor.
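
The transformation that passage describes looks roughly like this hand-unrolled dot product (an illustrative sketch assuming length(a) is a multiple of 4 and equals length(b)):

function dot4(a, b)
    # Four independent accumulators break the loop-carried dependency:
    # each fma chain waits only on itself, keeping the FMA units busy.
    s1 = s2 = s3 = s4 = zero(eltype(a))
    @inbounds for i in 1:4:length(a)
        s1 = muladd(a[i],   b[i],   s1)
        s2 = muladd(a[i+1], b[i+1], s2)
        s3 = muladd(a[i+2], b[i+2], s3)
        s4 = muladd(a[i+3], b[i+3], s4)
    end
    (s1 + s2) + (s3 + s4)
end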

Nope, LLVM does this too. In fact, from the benchmark above, it’s clear that LLVM only does badly for odd element sizes, and is actually doing somewhat better for sizes that don’t need a scalar loop. That’s because LLVM has a better model, but doesn’t handle the tail well.

GitHub - JuliaSIMD/LoopVectorization.jl at b09005245452de1f72b1f0fca59a2bc05bc9a6a5 appears to be the same edge effect. Again, it’s real, but the gain isn’t for the reason claimed above.

The syntax trick for broadcast is nice, but not particularly tied to vectorization…
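
For context, the broadcast syntax referred to lets you prefix a broadcast expression with @avx. A minimal sketch, as I understand the 0.8-era API (double-check against the README):

using LoopVectorization

a, b = rand(100), rand(100)
c = @avx exp.(a) .+ b  # @avx takes over the fused broadcast loop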

5 Likes

And again, I have no problem with people using this as long as they are aware of its intrinsic limitations, know where the fix would be, and won’t be too surprised in the future. And I believe I’m qualified to say this because I even made GitHub - yuyichao/FunctionWrappers.jl, and its README even says that it is just a proof of principle.

Is the problem because Julia doesn’t expose enough information to LLVM or because LLVM doesn’t use it? If the latter, is there any reason to expect LLVM (which by now is a pretty mature piece of software) will get around to it?

No, it’s because LLVM’s optimizer optimizes deeply for the vectorized loop (apparently better than the package does) but not for the remaining elements that it can’t handle. When each iteration of an optimized vector loop goes through 32 elements, the up-to-31-element unoptimized scalar loop is a huge hit for dimensions as small as 255 (the dot-product example) or 70–80 (the matrix-multiply example). In those cases the scalar loop can easily take as much time as, or longer than, the vectorized loop while processing fewer elements.
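
To make that arithmetic concrete (assuming 32 elements per vectorized iteration, as above):

n, width = 255, 32
vector_iters = n ÷ width   # 7 vector iterations cover 224 elements
tail_elems   = n % width   # leaving a 31-element scalar tail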

= = If it were mature, there wouldn’t be much of an LLVM-vs-GCC difference (I still frequently see cases where GCC compiles much better than Clang, at least as often as the other way around…) or API breakage…

It’s a pure LLVM-internal fix and I do expect it to happen as long as someone reports it. If anyone is interested, a good way to do so would be to reproduce it in clang (fairly easy) and to show what the generated code should be using the proper compiler intrinsics, say by essentially disassembling/reverse-compiling the asm/LLVM IR generated by this package. More tweaking would be needed on top of that for sure, but an example of better codegen should get it started…

But wouldn’t the solution then be unavailable until Julia catches up to the appropriate LLVM version? At this point we seem to be a couple of versions behind the current one…

Can you articulate exactly what the substantial fix would be, and how hard it would be to implement?

No, we carry many patches.

Not really.

It shouldn’t be hard for people who have worked on the LLVM vectorization patches, but I don’t know all the details there. This is also the reason I didn’t want to say anything initially, before people got too excited…

1 Like