Vectorization heuristics are hard! Is there a way to ask/get more information to/from the compiler/LLVM about loop vectorization

ndinsmore · February 10, 2023, 8:24pm

In the PR below, we have made a lot of progress speeding up isascii with something that is essentially as simple as the following loop (which is faster than the UInt64 approach):

function _isascii(cu::AbstractVector{CU}, first, last) where {CU}
    r = zero(CU)
    for n = first:last
        @inbounds r |= cu[n]
    end
    return r < 0x80
end

isascii(s::AbstractString) = @inline _isascii(codeunits(s),1,ncodeunits(s))

This is blazing fast for large strings, averaging 32 bytes/cycle on a computer with AVX2. LLVM breaks the loop into a SIMD “main loop”, which in my case works on blocks of 128 bytes, and also creates an “epilog loop”, which still has some SIMD in it but takes care of everything that doesn’t fit in the 128-byte blocks.

The implementation gets a bit messier when you realize you need a fast out for short strings, and that larger strings are better off being processed in big chunks.

With that, you end up wanting to know how the function is being vectorized. For example, it would be great if you could ask:

How big are the SIMD blocks in the function?
Can I get just the epilog portion of the vectorization?

These would make it easier to build the heuristic that switches between the best strategies.
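As far as I know there is no dedicated query for this, but a rough way to approximate the first question from inside Julia is to scrape the optimized LLVM IR for vector types. The sketch below is only that: `simd_widths` is a hypothetical helper name, it relies on string-matching the textual IR from `code_llvm`, and it only reports the widths of `<N x i8>` vectors that appear, not the unroll factor that determines the full block size of the main loop.

```julia
using InteractiveUtils  # provides code_llvm; loaded automatically in the REPL

# Same loop as in the post, repeated so the block is self-contained.
function _isascii(cu::AbstractVector{CU}, first, last) where {CU}
    r = zero(CU)
    for n = first:last
        @inbounds r |= cu[n]
    end
    return r < 0x80
end

# Hypothetical helper (not part of Julia): collect the distinct element counts
# of `<N x i8>` vector types found in the optimized IR of `f` for `argtypes`.
function simd_widths(f, argtypes)
    ir = sprint(io -> code_llvm(io, f, argtypes; debuginfo=:none))
    widths = Int[parse(Int, m.captures[1]) for m in eachmatch(r"<(\d+) x i8>", ir)]
    return sort!(unique(widths))
end

cu = codeunits("12345678"^8)
widths = simd_widths(_isascii, (typeof(cu), Int, Int))
println("i8 vector widths seen in the IR: ", widths)
```

The result depends on the CPU and the Julia/LLVM version, and an empty vector just means no `i8` vector types were found in the IR for that signature.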

The inverse of that is that you can tell the compiler a loop is going to be exactly a specific size with something like:

function _isascii(::Val{N}, cu::AbstractVector{CU}, first) where {N,CU}
    return @inline _isascii(cu,first,first+N-1)
end

Then, when N is high enough, as in _isascii(Val(1024)..., you get a function that contains only the SIMD portion. But if you set N to something small like 64, you get much worse performance than with the loop that has no idea about the size:

julia> cu=codeunits("12345678"^8);

julia> _isascii64(cus,s) = @inline _isascii(Val(64),cus,s)
_isascii64 (generic function with 1 method)

julia> @btime _isascii64($cu,1)
  10.049 ns (0 allocations: 0 bytes)
true

julia> @btime _isascii($cu,1,64)
  5.658 ns (0 allocations: 0 bytes)
true

That said, there is also no way to tell the compiler that a loop is no bigger than a given size. This would be great because maybe we could trick the compiler into giving us only the fast epilog, which would allow us to build better heuristics.
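To make the "no bigger than" idea concrete, here is a minimal sketch of what such a hint might look like at the source level: the trip count is a runtime value clamped to a compile-time constant. `_isascii_upto` and its `len` argument are hypothetical names for illustration. Whether LLVM's vectorizer actually uses the bound to drop the main loop is not shown here and would have to be checked in the generated code.

```julia
# Hypothetical variant: `len` is only known at run time, but the trip count is
# clamped to the compile-time constant MAX. Anything past MAX code units is
# ignored, and the caller must make sure first+len-1 is in bounds.
function _isascii_upto(::Val{MAX}, cu::AbstractVector{CU}, first, len) where {MAX,CU}
    r = zero(CU)
    n = min(len, MAX)          # upper bound visible to the compiler
    for i = 0:n-1
        @inbounds r |= cu[first + i]
    end
    return r < 0x80
end

cu = codeunits("12345678"^8)
println(_isascii_upto(Val(64), cu, 1, 40))
```

A caller would reserve this for the tail of a string and use the chunked loop for everything before it.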

Please note that this is not about help on isascii but more about how these things could be done in general.

github.com/JuliaLang/julia

Vectorized isascii using simple loop 25+bytes/cycle for large strings

JuliaLang:master ← ndinsmore:faster_isascii

opened 04:14AM - 07 Feb 23 UTC

ndinsmore

+61 -6

This changes `isascii` to a simple loop that checks the whole string. LLVM is doing a disturbingly good job vectorizing this function, which slightly hurts it with small strings because the overhead of loading the function is higher.

The benchmarking below shows that the loop-based method is 50x faster than the current method.

The funny thing is that I had a fancy `isascii` built to use the `UInt64` trick, and was just doing some final benchmarking when I realized the simple loop gave the best results. This does make me a little worried that the result here is very sensitive to the optimizations that it gets. Another note is that any attempt at checking early whether the string has encountered a non-ASCII character just dramatically slows down the overall function.

**UPDATE:** As per @oscardssmith, `isascii` now looks at chunks. I did a little optimization and 1024 seemed to be the right size. `isascii` is the version now in this PR and was refined by @matthias314.

The benchmark should be an average of the two extremes:
1. All ASCII
2. Non-ASCII first character

Benchmark code:
```julia
using BenchmarkTools
using Statistics  # for mean

function benchmark_isascii(fun)
    for p=1:14
        n = (2 * 2^(p-1))-1
        s='S'^n
        s2 = 'λ' * 'S'^(n-1)
        b = @benchmark $fun($s)&$fun($s2) seconds=1
        cpu_info = Sys.cpu_info()
        cpu_ghz= mean(i.speed for i in Sys.cpu_info()) /1_000
        parse_time_ns = time(median(b))
        GB_per_second= 2*n / parse_time_ns
        bytes_per_cycle = GB_per_second / cpu_ghz
        print("$fun -> $n bytes $GB_per_second GB/second @ $cpu_ghz GHz -> $bytes_per_cycle bytes/cycle\n")
    end
end
##
function isascii_nochunks(s::AbstractString)
    bytes = codeunits(s)
    l = ncodeunits(s)
    r = UInt8(0)
    for n = 1:l
        @inbounds r |= bytes[n]
    end
    return r < 0x80
end

function _isascii(bytes, first, last)
    r = UInt8(0)
    for n = first:last
        @inbounds r |= bytes[n]
    end
    return r < 0x80
end

function isascii(s::AbstractString)
    chunk_size = 1024
    bytes = codeunits(s)
    l = ncodeunits(s)
    start = 1
    fastmin(a,b) = ifelse(a < b, a, b)
    while start <= l
        @inline _isascii(bytes, start, fastmin(l, start + chunk_size)) || return false
        start += chunk_size
    end
    return true
end

isascii_all(c::Char) = bswap(reinterpret(UInt32, c)) < 0x80
isascii_all(s::AbstractString) = all(isascii_all, s)
isascii_all(c::AbstractChar) = UInt32(c) < 0x80
##
benchmark_isascii(isascii_all)
##
benchmark_isascii(isascii_nochunks)
##
benchmark_isascii(isascii)
##
```

Results:
```
isascii_all -> 1 bytes 0.06623796354912871 GB/second @ 2.7 GHz -> 0.024532579092269892 bytes/cycle
isascii_all -> 3 bytes 0.21423568080176733 GB/second @ 2.7 GHz -> 0.079346548445099 bytes/cycle
isascii_all -> 7 bytes 0.3949880668257757 GB/second @ 2.7 GHz -> 0.14629187660213913 bytes/cycle
isascii_all -> 15 bytes 0.7953936797000536 GB/second @ 2.7 GHz -> 0.2945902517407606 bytes/cycle
isascii_all -> 31 bytes 1.3267578400808178 GB/second @ 2.7 GHz -> 0.4913917926225251 bytes/cycle
isascii_all -> 63 bytes 1.792814953381155 GB/second @ 2.7 GHz -> 0.6640055382893166 bytes/cycle
isascii_all -> 127 bytes 2.0539571873077573 GB/second @ 2.7 GHz -> 0.7607248841880582 bytes/cycle
isascii_all -> 255 bytes 2.442238909701151 GB/second @ 2.7 GHz -> 0.9045329295189447 bytes/cycle
isascii_all -> 511 bytes 2.66197056596995 GB/second @ 2.7 GHz -> 0.9859150244333148 bytes/cycle
isascii_all -> 1023 bytes 2.7845331630383012 GB/second @ 2.7 GHz -> 1.0313085789030745 bytes/cycle
isascii_all -> 2047 bytes 2.8371448371448373 GB/second @ 2.7 GHz -> 1.0507943841277174 bytes/cycle
isascii_all -> 4095 bytes 2.8627466210967842 GB/second @ 2.7 GHz -> 1.0602765263321423 bytes/cycle
isascii_all -> 8191 bytes 2.893238748417861 GB/second @ 2.7 GHz -> 1.07156990682143 bytes/cycle
isascii_all -> 16383 bytes 2.8853469531525184 GB/second @ 2.7 GHz -> 1.068647019686118 bytes/cycle
isascii_nochunks -> 1 bytes 0.16941096588015617 GB/second @ 2.7 GHz -> 0.06274480217783561 bytes/cycle
isascii_nochunks -> 3 bytes 0.44490675384501077 GB/second @ 2.7 GHz -> 0.16478027920185584 bytes/cycle
isascii_nochunks -> 7 bytes 0.8653977307954616 GB/second @ 2.7 GHz -> 0.3205176780723932 bytes/cycle
isascii_nochunks -> 15 bytes 1.569987389659521 GB/second @ 2.7 GHz -> 0.5814768109850077 bytes/cycle
isascii_nochunks -> 31 bytes 2.8405009669398655 GB/second @ 2.7 GHz -> 1.0520373951629132 bytes/cycle
isascii_nochunks -> 63 bytes 5.303299492385786 GB/second @ 2.7 GHz -> 1.9641849971799208 bytes/cycle
isascii_nochunks -> 127 bytes 10.465010351966875 GB/second @ 2.7 GHz -> 3.875929759987731 bytes/cycle
isascii_nochunks -> 255 bytes 18.675950486295314 GB/second @ 2.7 GHz -> 6.917018698627894 bytes/cycle
isascii_nochunks -> 511 bytes 34.2668152350081 GB/second @ 2.7 GHz -> 12.691413050003 bytes/cycle
isascii_nochunks -> 1023 bytes 58.472315145922245 GB/second @ 2.7 GHz -> 21.656413017008237 bytes/cycle
isascii_nochunks -> 2047 bytes 76.48904580152671 GB/second @ 2.7 GHz -> 28.32927622278767 bytes/cycle
isascii_nochunks -> 4095 bytes 97.76940343334071 GB/second @ 2.7 GHz -> 36.21089016049656 bytes/cycle
isascii_nochunks -> 8191 bytes 99.35550547582335 GB/second @ 2.7 GHz -> 36.79833536141605 bytes/cycle
isascii_nochunks -> 16383 bytes 98.05788949825984 GB/second @ 2.7 GHz -> 36.31773685120734 bytes/cycle
isascii -> 1 bytes 0.12132643748098569 GB/second @ 2.7 GHz -> 0.044935717585550254 bytes/cycle
isascii -> 3 bytes 0.31225833420420107 GB/second @ 2.7 GHz -> 0.11565123489044483 bytes/cycle
isascii -> 7 bytes 0.6210432456531431 GB/second @ 2.7 GHz -> 0.23001601690857149 bytes/cycle
isascii -> 15 bytes 1.2457223937901678 GB/second @ 2.7 GHz -> 0.4613786643667288 bytes/cycle
isascii -> 31 bytes 2.430222011908987 GB/second @ 2.7 GHz -> 0.900082226632958 bytes/cycle
isascii -> 63 bytes 4.547217078749592 GB/second @ 2.7 GHz -> 1.6841544736109597 bytes/cycle
isascii -> 127 bytes 8.880743635787473 GB/second @ 2.7 GHz -> 3.289164309550916 bytes/cycle
isascii -> 255 bytes 15.564374711582834 GB/second @ 2.7 GHz -> 5.76458322651216 bytes/cycle
isascii -> 511 bytes 30.51297732921594 GB/second @ 2.7 GHz -> 11.301102714524422 bytes/cycle
isascii -> 1023 bytes 42.45936692642148 GB/second @ 2.7 GHz -> 15.725691454230176 bytes/cycle
isascii -> 2047 bytes 79.80871569659996 GB/second @ 2.7 GHz -> 29.558783591333317 bytes/cycle
isascii -> 4095 bytes 113.6582026746599 GB/second @ 2.7 GHz -> 42.09563062024441 bytes/cycle
isascii -> 8191 bytes 136.41434343065026 GB/second @ 2.7 GHz -> 50.52383090024083 bytes/cycle
isascii -> 16383 bytes 156.27125780921037 GB/second @ 2.7 GHz -> 57.878243633040874 bytes/cycle
```

gbaraldi · February 10, 2023, 8:31pm

You can get the non-optimized code out of Julia via code_llvm with the correct options (dump_module, raw, optimize) and then run it through opt. The page "Working with LLVM · The Julia Language" has some documentation on how to debug these things. opt can also give you optimization remarks and other things.
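A minimal sketch of that workflow, assuming the `_isascii` loop from the first post: the three keyword arguments are the ones named above, while the file name is arbitrary. The `opt` invocation in the trailing comment is an assumption: it uses standard LLVM remark flags, needs an `opt` build compatible with Julia's LLVM, and the exact pass pipeline to use is what the docs page covers.

```julia
using InteractiveUtils  # provides code_llvm

# Same loop as in the first post, repeated so the block is self-contained.
function _isascii(cu::AbstractVector{CU}, first, last) where {CU}
    r = zero(CU)
    for n = first:last
        @inbounds r |= cu[n]
    end
    return r < 0x80
end

cu = codeunits("12345678"^8)

# Dump the whole module before LLVM optimization into a .ll file.
path = joinpath(mktempdir(), "isascii.ll")
open(path, "w") do io
    code_llvm(io, _isascii, (typeof(cu), Int, Int);
              raw=true, dump_module=true, optimize=false)
end
println("unoptimized IR written to ", path)

# Assumed follow-up outside Julia, asking the loop vectorizer for remarks:
#   opt -passes='default<O3>' -pass-remarks=loop-vectorize \
#       -pass-remarks-missed=loop-vectorize \
#       -pass-remarks-analysis=loop-vectorize -S isascii.ll -o /dev/null
```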

ndinsmore · February 10, 2023, 10:11pm

I mean more in a programmatic sense: how can a method do these things so it can get the heuristic right? Or how can we be a bit more forceful with LLVM? I understand how to look at the compiled code.
