Octavian’s tasks keep spinning for a few milliseconds while they wait for more work. During that window they respond to new work much faster, but they also get in the way of the new tasks created by Threads.@threads.
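The spin-then-give-up behavior described above can be illustrated with a minimal sketch. This is not Octavian's actual implementation: the helper name `spin_for_work`, the `has_work` flag, and the 2 ms budget are all hypothetical and chosen only for illustration.

```julia
using Base.Threads: Atomic

# Hypothetical helper (not Octavian's API): busy-wait on `has_work` for at most
# `spin_ns` nanoseconds. Returns true as soon as the flag is set, or false once
# the spin budget is used up, at which point a real worker would go to sleep.
function spin_for_work(has_work::Atomic{Bool}, spin_ns::Integer)
    deadline = time_ns() + spin_ns
    while time_ns() < deadline
        has_work[] && return true   # work arrived while spinning: react at once
    end
    return false                    # nothing arrived within the spin window
end

has_work = Atomic{Bool}(false)
idle = spin_for_work(has_work, 2_000_000)    # no work: burns a core for ~2 ms
has_work[] = true
ready = spin_for_work(has_work, 2_000_000)   # work already there: returns right away
```

While a worker is inside the loop it occupies its core, which is why spinning workers can delay unrelated tasks scheduled onto the same threads.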
CheapThreads and LoopVectorization reuse the same tasks as Octavian, so they avoid this conflict. Individual timings:
julia> using LinearAlgebra, Octavian, CheapThreads, LoopVectorization
julia> @btime mul!($b,$A,$x);
9.781 μs (0 allocations: 0 bytes)
julia> @btime matmul!($b,$A,$x);
2.321 μs (0 allocations: 0 bytes)
julia> function bmap!(f,out,x)
           @batch for i = 1:length(x)
               out[i] = f(x[i])
           end
       end
bmap! (generic function with 1 method)
julia> @btime tmap!($f,$b,$x);
112.966 μs (181 allocations: 16.12 KiB)
julia> @btime bmap!($f,$b,$x);
53.152 μs (0 allocations: 0 bytes)
julia> @btime vmapt!($f,$b,$x); # LoopVectorization
5.561 μs (0 allocations: 0 bytes)
Grouped timings:
function time4(b,A,x)
    matmul!(b,A,x)
    bmap!(f,b,x)
end
function time5(b,A,x)
    matmul!(b,A,x)
    vmapt!(f,b,x)
end
Yields:
julia> @btime time1($b,$A,$x); # matmul! tmap!
10.750 ms (183 allocations: 16.19 KiB)
julia> @btime time2($b,$A,$x); # mul! tmap!
144.676 μs (181 allocations: 16.12 KiB)
julia> @btime time3($b,$A,$x); # mul! map!
1.194 ms (0 allocations: 0 bytes)
julia> @btime time4($b,$A,$x); # matmul! bmap!
58.434 μs (0 allocations: 0 bytes)
julia> @btime time5($b,$A,$x); # matmul! vmapt!
7.831 μs (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.7.0-DEV.873
Commit ab6652ab9a* (2021-04-08 11:17 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36
time4 and time5 are additive, as expected: each takes roughly the matmul! time plus the bmap! or vmapt! time, respectively.
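As a quick sanity check of the additivity claim, the sketch below divides each grouped timing by the sum of the corresponding individual timings, using only the numbers reported above. The names `individual`, `grouped`, and `overhead` are introduced here for illustration. time3 is left out because no standalone map! timing appears in this section.

```julia
# Individual timings in microseconds, copied from the @btime output above.
individual = Dict("mul!" => 9.781, "matmul!" => 2.321, "tmap!" => 112.966,
                  "bmap!" => 53.152, "vmapt!" => 5.561)

# Grouped timings: (name, first call, second call, measured time in microseconds).
grouped = [
    ("time1", "matmul!", "tmap!", 10_750.0),  # reported as 10.750 ms
    ("time2", "mul!", "tmap!", 144.676),
    ("time4", "matmul!", "bmap!", 58.434),
    ("time5", "matmul!", "vmapt!", 7.831),
]

# Ratio of the measured grouped time to the sum of the two individual timings.
# A ratio close to 1 means the two calls compose additively.
overhead(a, b, measured) = measured / (individual[a] + individual[b])

ratios = Dict(name => overhead(a, b, t) for (name, a, b, t) in grouped)

for (name, a, b, t) in grouped
    println(name, ": measured / (", a, " + ", b, ") = ", round(ratios[name]; digits = 2))
end
```

The ratios for time4 and time5 stay close to 1, while the ratio for time1 (matmul! followed by tmap!) is far above 1, which matches the interference between Octavian's spinning tasks and the tasks created by Threads.@threads.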