Vectorization heuristics are hard! Is there a way to ask/get more information to/from the compiler/LLVM about loop vectorization

In the PR below, we have made a lot of progress speeding up isascii with something as simple as essentially: (which is faster than using UInt64)

function _isascii(cu::AbstractVector{CU}, first, last) where {CU}
    r = zero(CU)
    for n = first:last
        @inbounds r |= cu[n]
    return r < 0x80

isascii(s::AbstractString) = @inline _isascii(codeunits(s),1,ncodeunits(s))

This is blazing fast for large strings doing on average 32 bytes/cycle on a computer with avx2. With LLVM breaking the loop into a SIMD “Main loop” which in my case is blocks of 128 bytes, and then also creates a “epilog loop” which is still got some SIMD in it but takes care of everything that doesn’t fit in the 128 bit blocks.

The implementation gets a bit worse when you realize you need a fast out for short strings, and then would be better off looking at big chunks for larger strings.

With that you end up wanting to know how is that function being vectorized. For example, it would be great in you could ask:

  1. How big are the SIMD blocks in the function?
  2. Can I get just the epilog portion of the vectorization?

These would make it easier to to build the heuristic which switches between the different best strategies.

The inverse to that is you can tell the compiler a loop is going to be exactly a specific size with something like:

function  _isascii(::Val{N},cu::AbstractVector{CU}, first) where {N,CU}
    return @inline _isascii(cu,first,first+N-1)

The when you see N high enough like _isascii(Val(1024)... you can get a function that only gives you the SIMD portion of the function. Though if you set N to something small like 64 you get much worse performance than using the loop that has no idea about size:

julia> cu=codeunits("12345678"^8);

julia> _isascii64(cus,s) = @inline _isascii(Val(64),cus,s)
_isascii64 (generic function with 1 method)

julia> @btime _isascii64($cu,1)
  10.049 ns (0 allocations: 0 bytes)

julia> @btime _isascii($cu,1,64)
  5.658 ns (0 allocations: 0 bytes)

That said there is also no way to tell the compiler a loop is of a size no bigger than a given size. This would be great because maybe we could trick the compiler into only giving us the fast epilog, which would allow us to build better hueristics.

Please note that this is not about help on isascii but more about how these things could be done in general.

1 Like

Getting the non optimized code out of julia via code_llvm with the correct options(dump_module,raw,optimize) and using opt, Working with LLVM · The Julia Language This page has some documentation on how to debug these things. And opt can give you optimization remarks and other things.

I mean more from a programmatic sense, how can a method do these things so it can get the heuristic right? Or how can we be a bit more forceful with LLVM. I understand how to look at the compiled code.