Odd BenchmarkTools timings using @threads and Octavian

Octavian’s tasks keep spinning for a few milliseconds while they wait for more work. During that window they respond to new work much faster, but they also get in the way of the new tasks created by Threads.@threads.
CheapThreads and LoopVectorization reuse the same tasks as Octavian, so they do not compete with them. Individual timings:

julia> using LinearAlgebra, Octavian, CheapThreads, LoopVectorization

julia> @btime mul!($b,$A,$x);
  9.781 μs (0 allocations: 0 bytes)

julia> @btime matmul!($b,$A,$x);
  2.321 μs (0 allocations: 0 bytes)

julia> function bmap!(f,out,x)
           @batch for i = 1:length(x)
               out[i] = f(x[i])
           end
       end
bmap! (generic function with 1 method)
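
For comparison with bmap!, here is a minimal sketch of what tmap! might look like. The post times tmap! without showing its definition, so this is an assumption: the same loop as bmap!, with Base's Threads.@threads in place of @batch.

```julia
using Base.Threads: @threads

# Assumed definition: `tmap!` is timed in this post but not shown.
# It mirrors `bmap!`, swapping CheapThreads' `@batch` for Base's `@threads`.
function tmap!(f, out, x)
    @threads for i in eachindex(out, x)
        out[i] = f(x[i])
    end
    return out
end

# Small usage example with placeholder inputs.
x = rand(1_000)
out = similar(x)
tmap!(abs2, out, x)
```

Threads.@threads creates fresh tasks on every call, which would be consistent with the allocations reported for tmap! in the timings, while bmap! and vmapt! report none.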

julia> @btime tmap!($f,$b,$x);
  112.966 μs (181 allocations: 16.12 KiB)

julia> @btime bmap!($f,$b,$x);
  53.152 μs (0 allocations: 0 bytes)

julia> @btime vmapt!($f,$b,$x); # LoopVectorization
  5.561 μs (0 allocations: 0 bytes)

Grouped timings:

function time4(b,A,x) 
    matmul!(b,A,x) 
    bmap!(f,b,x)   
end
function time5(b,A,x) 
    matmul!(b,A,x) 
    vmapt!(f,b,x)   
end

Yields:

julia> @btime time1($b,$A,$x); # matmul! tmap!
  10.750 ms (183 allocations: 16.19 KiB)

julia> @btime time2($b,$A,$x); # mul! tmap!
  144.676 μs (181 allocations: 16.12 KiB)

julia> @btime time3($b,$A,$x); # mul! map!
  1.194 ms (0 allocations: 0 bytes)

julia> @btime time4($b,$A,$x); # matmul! bmap!
  58.434 μs (0 allocations: 0 bytes)

julia> @btime time5($b,$A,$x); # matmul! vmapt!
  7.831 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.7.0-DEV.873
Commit ab6652ab9a* (2021-04-08 11:17 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

As expected, time4 and time5 are additive: each takes roughly the matmul! time plus the bmap! or vmapt! time, respectively.
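
The additivity check can be redone from the numbers above. This sketch only copies the @btime results from this post and compares each grouped timing with the sum of its two parts; the variable names are made up for illustration. time3 is left out because map! has no individual timing here.

```julia
# Timings in microseconds, copied from the @btime output in this post.
individual = (mul = 9.781, matmul = 2.321, tmap = 112.966, bmap = 53.152, vmapt = 5.561)
grouped = (time1 = 10_750.0, time2 = 144.676, time4 = 58.434, time5 = 7.831)

# Sum of the two individual timings that each grouped function combines.
expected = (
    time1 = individual.matmul + individual.tmap,   # matmul! then tmap!
    time2 = individual.mul + individual.tmap,      # mul! then tmap!
    time4 = individual.matmul + individual.bmap,   # matmul! then bmap!
    time5 = individual.matmul + individual.vmapt,  # matmul! then vmapt!
)

# Ratio of measured grouped time to the sum of its parts.
for name in keys(grouped)
    ratio = grouped[name] / expected[name]
    println(name, ": measured ", grouped[name], " μs, sum of parts ",
            round(expected[name]; digits = 3), " μs, ratio ", round(ratio; digits = 2))
end
```

With these numbers, time4 and time5 land within a few percent of the sum of their parts and time2 is about 18% above it, while time1 (matmul! followed by tmap!) is roughly 93 times slower than the sum.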
