 # Odd BenchmarkTools timings using @threads and Octavian

I’m getting some odd behavior using `@threads` and Octavian. If I time `matmul!` and a simple `tmap!` implementation, I see reasonable speedup.

```julia
using Octavian
using LinearAlgebra
using BenchmarkTools
using Base.Threads: @threads

A = randn(8, 8)
x = randn(8, 10000)
b = similar(x)
f(x) = exp(x + 1) + sin(x)
function tmap!(f, out, x)
    @threads for i in eachindex(x)
        out[i] = f(x[i])
    end
end

@btime mul!($b, $A, $x)    # 14.778 μs (0 allocations: 0 bytes)
@btime matmul!($b, $A, $x) # 4.702 μs (0 allocations: 0 bytes)
@btime tmap!($f, $b, $x)   # 195.632 μs (41 allocations: 3.16 KiB)
@btime map!($f, $b, $x)    # 1.126 ms (0 allocations: 0 bytes)
```

However, if I call these same functions from inside wrapper functions, I get drastically different timings:

```julia
function time1(b, A, x)
    matmul!(b, A, x)
    tmap!(f, b, x)
end
function time2(b, A, x)
    mul!(b, A, x)
    tmap!(f, b, x)
end
function time3(b, A, x)
    mul!(b, A, x)
    map!(f, b, x)
end

@btime time1($b, $A, $x) # 42.934 ms (42 allocations: 3.19 KiB)
@btime time2($b, $A, $x) # 323.995 μs (41 allocations: 3.16 KiB)
@btime time3($b, $A, $x) # 1.404 ms (0 allocations: 0 bytes)
```

For example, running `matmul!` and `tmap!` together in `time1` is about 200x slower than the sum of the individual timings of `matmul!` and `tmap!`.

Can anyone explain what’s happening here?

Octavian’s tasks keep spinning for a few milliseconds while waiting for more work. During that window they get in the way of the new tasks created by `Threads.@threads`, but the spinning also makes them much faster to respond to new work from the same system.
`CheapThreads` and `LoopVectorization` reuse the same tasks as Octavian; hence, the individual timings:

```julia
julia> using LinearAlgebra, Octavian, CheapThreads, LoopVectorization

julia> @btime mul!($b, $A, $x);
  9.781 μs (0 allocations: 0 bytes)

julia> @btime matmul!($b, $A, $x);
  2.321 μs (0 allocations: 0 bytes)

julia> function bmap!(f, out, x)
           @batch for i = 1:length(x)
               out[i] = f(x[i])
           end
       end
bmap! (generic function with 1 method)

julia> @btime tmap!($f, $b, $x);
  112.966 μs (181 allocations: 16.12 KiB)

julia> @btime bmap!($f, $b, $x);
  53.152 μs (0 allocations: 0 bytes)

julia> @btime vmapt!($f, $b, $x); # LoopVectorization
  5.561 μs (0 allocations: 0 bytes)
```

Grouped timings:

```julia
function time4(b, A, x)
    matmul!(b, A, x)
    bmap!(f, b, x)
end
function time5(b, A, x)
    matmul!(b, A, x)
    vmapt!(f, b, x)
end
```

Yields:

```julia
julia> @btime time1($b, $A, $x); # matmul! tmap!
  10.750 ms (183 allocations: 16.19 KiB)

julia> @btime time2($b, $A, $x); # mul! tmap!
  144.676 μs (181 allocations: 16.12 KiB)

julia> @btime time3($b, $A, $x); # mul! map!
  1.194 ms (0 allocations: 0 bytes)

julia> @btime time4($b, $A, $x); # matmul! bmap!
  58.434 μs (0 allocations: 0 bytes)

julia> @btime time5($b, $A, $x); # matmul! vmapt!
  7.831 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.7.0-DEV.873
Commit ab6652ab9a* (2021-04-08 11:17 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
Environment:
```

`time4` and `time5` are additive, as expected: roughly `matmul!` time plus `bmap!` or `vmapt!` time, respectively (2.3 μs + 53.2 μs ≈ 58.4 μs, and 2.3 μs + 5.6 μs ≈ 7.8 μs).
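If you do need to mix Octavian with `Threads.@threads`-based code, one option worth trying (a hedged suggestion, not benchmarked here; `time1_serial` is a made-up name) is Octavian's single-threaded entry point, so no spinning workers are left behind when the `@threads` tasks start:

```julia
using Octavian  # exports matmul_serial! alongside matmul!

# Hypothetical variant of time1: the single-threaded matmul leaves
# no Octavian workers spinning when the @threads-based tmap! starts.
# `tmap!` and `f` are the definitions from earlier in the thread.
function time1_serial(b, A, x)
    matmul_serial!(b, A, x)
    tmap!(f, b, x)
end
```

Whether this wins overall depends on how much of the runtime the matmul itself accounts for.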


Thanks! I guess this means the slowdown wasn’t a BenchmarkTools artifact, but a real slowdown due to conflicting threading approaches.

I’ll try out CheapThreads. Unfortunately, my actual application uses StructArrays, so I think `vmap!`/`vmapt!` fall back to the standard `map!`.
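To illustrate what I mean, here's a minimal sketch (the complex-valued setup is just an example, not my actual application):

```julia
using StructArrays

# A StructArray stores each field in its own contiguous array
# (here: separate `re` and `im` arrays) — exactly the layout SIMD
# code likes, so a specialized vmap! method seems feasible.
s   = StructArray(randn(ComplexF64, 8, 10000))
out = similar(s)  # also a StructArray

g(z) = exp(z + 1) + sin(z)
map!(g, out, s)   # the generic element-by-element path
```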

It would, but theoretically `StructArray`s should work very well with `LoopVectorization.vmap(t)!`, so that’d be a good first PR if someone wants to try (I’d be happy to answer any questions/explain an approach/etc).
The `vmap!` code is very simple and independent from the rest of `LoopVectorization`, making it much more approachable.
Otherwise, you could file an issue and add it to the pile.

> It would, but theoretically `StructArray`s should work very well with `LoopVectorization.vmap(t)!`, so that’d be a good first PR if someone wants to try (I’d be happy to answer any questions/explain an approach/etc).

I’ll probably need some time to familiarize myself more with StructArray and LoopVectorization internals, but I’d like to give it a shot. Maybe in a week once the semester ends? I can ping again around then.
