Odd BenchmarkTools timings using @threads and Octavian

jlchan · April 23, 2021, 1:30am

I’m getting some odd behavior using @threads and Octavian. If I time matmul! and a simple tmap! implementation individually, I see reasonable speedups over mul! and map!:

using BenchmarkTools # provides @btime
using Octavian
using LinearAlgebra

A = randn(8,8)
x = randn(8,10000)
b = similar(x)
f(x) = exp(x+1) + sin(x)
function tmap!(f,out,x)
    Threads.@threads for i = 1:length(x)
        out[i] = f(x[i])
    end
end

@btime mul!($b,$A,$x)    # 14.778 μs (0 allocations: 0 bytes)
@btime matmul!($b,$A,$x) # 4.702 μs (0 allocations: 0 bytes)
@btime tmap!($f,$b,$x)   # 195.632 μs (41 allocations: 3.16 KiB)
@btime map!($f,$b,$x)    # 1.126 ms (0 allocations: 0 bytes)

However, if I call both from a single wrapper function, I get drastically different timings:

function time1(b,A,x) 
    matmul!(b,A,x) 
    tmap!(f,b,x)   
end
function time2(b,A,x) 
    mul!(b,A,x) 
    tmap!(f,b,x) 
end
function time3(b,A,x) 
    mul!(b,A,x) 
    map!(f,b,x) 
end

@btime time1($b,$A,$x) # 42.934 ms (42 allocations: 3.19 KiB)
@btime time2($b,$A,$x) # 323.995 μs (41 allocations: 3.16 KiB)
@btime time3($b,$A,$x) # 1.404 ms (0 allocations: 0 bytes)

That is, running matmul! and tmap! together in time1 is about 200x slower than the sum of the individual matmul! and tmap! timings.
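To make the size of the gap concrete, here is a quick back-of-the-envelope check using only the numbers reported above (the variable names are just for illustration). If the two calls were additive, time1 should cost roughly 200 μs; the measured 42.934 ms is roughly 214x that.

```julia
# Timings copied from the @btime comments above, converted to microseconds.
t_matmul = 4.702        # matmul! alone
t_tmap   = 195.632      # tmap! alone
t_time1  = 42.934e3     # time1 (matmul! followed by tmap!), reported as 42.934 ms

expected = t_matmul + t_tmap    # cost of time1 if the two calls were simply additive
slowdown = t_time1 / expected   # measured cost relative to that sum

println("expected ≈ ", round(expected, digits = 1), " μs")
println("slowdown ≈ ", round(slowdown, digits = 1), "x")
```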

Can anyone explain what’s happening here?

Elrod · April 23, 2021, 2:26am

Octavian’s tasks keep spinning for a few milliseconds while waiting for more work. That gets in the way of the new tasks created by Threads.@threads, but it also makes Octavian’s tasks much faster to respond to new work during that window.
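As a toy model of that spin-then-sleep pattern (not Octavian’s actual implementation; spin_for_work and the 2 ms window are made up for illustration), a worker might poll a flag in a tight loop for a bounded time before giving up. While it polls, it keeps its thread busy, which is why unrelated tasks scheduled onto that thread have to wait.

```julia
using Base.Threads: Atomic

# Hypothetical helper: poll `has_work` for at most `spin_seconds`, then give up.
function spin_for_work(has_work::Atomic{Bool}, spin_seconds::Float64)
    deadline = time_ns() + round(UInt64, spin_seconds * 1e9)
    while time_ns() < deadline
        # Work that arrives inside the window is picked up almost immediately.
        has_work[] && return :got_work
        # Otherwise keep busy-waiting: the thread is occupied but does nothing useful.
    end
    # A real thread pool would block here until it is woken up again.
    return :went_to_sleep
end

flag = Atomic{Bool}(true)
r_ready = spin_for_work(flag, 0.002)   # work is already queued

flag[] = false
t0 = time_ns()
r_idle = spin_for_work(flag, 0.002)    # no work arrives during the window
elapsed_ms = (time_ns() - t0) / 1e6    # time the thread was held while idle

println((r_ready, r_idle, elapsed_ms))
```

In this sketch the idle call holds its thread for the full spin window, which is the cost side of the trade-off; the benefit is that work arriving inside the window is picked up without a wake-up delay.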
CheapThreads and LoopVectorization reuse the same tasks as Octavian, so they do not run into this conflict. Individual timings:

julia> using LinearAlgebra, Octavian, CheapThreads, LoopVectorization

julia> @btime mul!($b,$A,$x);
  9.781 μs (0 allocations: 0 bytes)

julia> @btime matmul!($b,$A,$x);
  2.321 μs (0 allocations: 0 bytes)

julia> function bmap!(f,out,x)
           @batch for i = 1:length(x)
               out[i] = f(x[i])
           end
       end
bmap! (generic function with 1 method)

julia> @btime tmap!($f,$b,$x);
  112.966 μs (181 allocations: 16.12 KiB)

julia> @btime bmap!($f,$b,$x);
  53.152 μs (0 allocations: 0 bytes)

julia> @btime vmapt!($f,$b,$x); # LoopVectorization
  5.561 μs (0 allocations: 0 bytes)

Grouped timings:

function time4(b,A,x) 
    matmul!(b,A,x) 
    bmap!(f,b,x)   
end
function time5(b,A,x) 
    matmul!(b,A,x) 
    vmapt!(f,b,x)   
end

Yields:

julia> @btime time1($b,$A,$x); # matmul! tmap!
  10.750 ms (183 allocations: 16.19 KiB)

julia> @btime time2($b,$A,$x); # mul! tmap!
  144.676 μs (181 allocations: 16.12 KiB)

julia> @btime time3($b,$A,$x); # mul! map!
  1.194 ms (0 allocations: 0 bytes)

julia> @btime time4($b,$A,$x); # matmul! bmap!
  58.434 μs (0 allocations: 0 bytes)

julia> @btime time5($b,$A,$x); # matmul! vmapt!
  7.831 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.7.0-DEV.873
Commit ab6652ab9a* (2021-04-08 11:17 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

time4 and time5 are additive, as expected: each takes roughly the matmul! time plus the bmap! or vmapt! time, respectively.
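A quick way to check the additivity claim against the REPL numbers above is to divide each grouped timing by the sum of its parts (the ratio helper and tuple names are only for illustration). A ratio near 1 means additive; time4 and time5 come out close to 1, while time1 is roughly 93x the sum of its parts.

```julia
# Timings in microseconds, copied from the REPL output above.
individual = (matmul = 2.321, tmap = 112.966, bmap = 53.152, vmapt = 5.561)
grouped    = (time1 = 10_750.0, time4 = 58.434, time5 = 7.831)

# Grouped timing divided by the sum of its parts; a value near 1 means additive.
ratio(total, parts...) = total / sum(parts)

r4 = ratio(grouped.time4, individual.matmul, individual.bmap)   # matmul! + bmap!
r5 = ratio(grouped.time5, individual.matmul, individual.vmapt)  # matmul! + vmapt!
r1 = ratio(grouped.time1, individual.matmul, individual.tmap)   # matmul! + tmap!

println("time4 ratio ≈ ", round(r4, digits = 2))
println("time5 ratio ≈ ", round(r5, digits = 2))
println("time1 ratio ≈ ", round(r1, digits = 1))
```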

jlchan · April 23, 2021, 2:40am

Thanks! I guess this means the slowdown wasn’t a BenchmarkTools artifact, but a real slowdown due to conflicting threading approaches.

I’ll try out CheapThreads. Unfortunately, my actual application uses StructArrays, so I think vmap!/vmapt! use the standard map as a fallback.

Elrod · April 23, 2021, 2:50am

It would, but theoretically StructArrays should work very well with LoopVectorization.vmap(t)!, so that’d be a good first PR if someone wants to try (I’d be happy to answer any questions/explain an approach/etc).
The vmap! code is very simple and independent from the rest of LoopVectorization, making it much more approachable.
Otherwise, you could file an issue and add it to the pile.

jlchan · April 23, 2021, 2:54am

Elrod wrote: “It would, but theoretically StructArrays should work very well with LoopVectorization.vmap(t)!, so that’d be a good first PR if someone wants to try (I’d be happy to answer any questions/explain an approach/etc).”

I’ll probably need some time to familiarize myself more with StructArray and LoopVectorization internals, but I’d like to give it a shot. Maybe in a week once the semester ends? I can ping again around then.
