Argument initialisation affects performance of @threads?

I’m observing a change in performance in a threaded function that depends on how a write-only argument to that function was initialised. Specifically, creating the argument with Matrix{Float64}(undef, …) gives the expected scaling with the number of threads, while creating it with zeros(…) gives no scaling at all. A working example and the output observed across different settings of JULIA_NUM_THREADS are given below. Reordering the two benchmarked calls does not affect the results.
My understanding is that both dst and dst1 should have the same type (and checking with typeof seems to confirm this); the only difference is that dst1 has been zeroed while dst has not been touched after allocation.
I’m new to Julia, so I might be missing something rather obvious? Or, if this is unexpected, can anyone reproduce the behaviour?

using BenchmarkTools

function unpack_threaded(src::AbstractMatrix{UInt8}, dst::AbstractMatrix{T}) where T
    B,M  = size(src)
    tbls = collect(Matrix{T}(undef, 4, 256) for n in 1:Threads.nthreads()) #dummy lookup table
    Threads.@threads for m in 1:M
        tbl = tbls[Threads.threadid()]
        for b in 1:B
            w = src[b,m]+1
            for k in 1:4
                @inbounds dst[(b-1)*4+k, m] = tbl[k, w]
            end
        end
    end
end

function main()
    print("Threads=$(Threads.nthreads()):\n")

    B = 100000
    M = 512

    src = Matrix{UInt8}(undef, B, M)

    dst1 = zeros(B*4,M)
    print("zeros(...)                 :")
    @btime unpack_threaded($src,$dst1)
    
    dst = Matrix{Float64}(undef, B*4, M)
    print("Matrix{Float64}(undef,...) :")
    @btime unpack_threaded($src,$dst)
end

main()

Output:

Threads=1:
zeros(...)                 :  172.220 ms (3 allocations: 8.28 KiB)
Matrix{Float64}(undef,...) :  171.251 ms (3 allocations: 8.28 KiB)
Threads=2:
zeros(...)                 :  184.228 ms (4 allocations: 16.41 KiB)
Matrix{Float64}(undef,...) :  84.084 ms (4 allocations: 16.41 KiB)
Threads=4:
zeros(...)                 :  170.576 ms (6 allocations: 32.67 KiB)
Matrix{Float64}(undef,...) :  44.229 ms (6 allocations: 32.67 KiB)
Threads=8:
zeros(...)                 :  176.688 ms (10 allocations: 65.20 KiB)
Matrix{Float64}(undef,...) :  35.130 ms (10 allocations: 65.20 KiB)
julia> versioninfo()
Julia Version 1.1.1
Commit 55e36cc308 (2019-05-16 04:10 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 8

Unverified hypothesis:
It may be due to thread affinity / memory placement issues.
If you initialise the matrix with zeros from one core, the memory is likely to be placed close to that core (first-touch placement). With undef, the values of the matrix are only written later by the different worker threads, so each part of it ends up close to the core that first wrote it, which gives better performance when those cores later operate on the data they initialised.
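
One way to probe this hypothesis (a sketch only, not from the original post; threaded_zeros is a made-up helper, and src is assumed to be available as in main()): zero the destination from the same kind of @threads loop as unpack_threaded, so each column is first touched by the thread that will later write it, and benchmark that against the plain zeros case.

function threaded_zeros(::Type{T}, nrows, ncols) where T
    A = Matrix{T}(undef, nrows, ncols)
    # Same column-wise partitioning as unpack_threaded, so every page is
    # first touched by the thread that will later write to it.
    Threads.@threads for m in 1:ncols
        @inbounds for i in 1:nrows
            A[i, m] = zero(T)
        end
    end
    return A
end

dst2 = threaded_zeros(Float64, 400_000, 512)   # same size as dst/dst1 above
@btime unpack_threaded($src, $dst2)            # should scale like the undef case if first touch is the cause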

The matrix in question is ~1.6 GB, so it shouldn’t fit in the cache of any one core. Additionally, my understanding is that @btime times repeated calls to the function; if the matrix were sitting in a cache somewhere after the first call, it shouldn’t matter any more, since in both cases parts of the matrix will have been processed by different cores.
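
A quick sanity check in the same spirit (my own sketch, not from the thread): warm the function up once and then time a single second call by hand, to confirm the gap is not just a first-call effect.

# with src, dst and dst1 created as in main()
unpack_threaded(src, dst1)           # warm-up: compilation and first use
@time unpack_threaded(src, dst1)     # zeros-allocated destination
unpack_threaded(src, dst)
@time unpack_threaded(src, dst)      # undef-allocated destination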

I didn’t take the time to read the code, but the way the arrays are initialised can change where the data is placed in RAM. If your hardware has multiple memory (RAM) channels, you may be unable to use all of them because the data stays in one region of RAM.
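
To separate memory placement from the table lookups, one could benchmark a pure write loop over the two destinations (again only a sketch; fill_threaded! is a made-up helper, and dst/dst1 are assumed to be in scope as in main()). If placement or bandwidth is the bottleneck, the same gap should show up here as well:

function fill_threaded!(dst)
    # Same column-wise access pattern as unpack_threaded, but writes only,
    # so any difference between the two destinations comes from memory
    # placement rather than the lookup logic.
    Threads.@threads for m in 1:size(dst, 2)
        @inbounds for i in 1:size(dst, 1)
            dst[i, m] = 1.0
        end
    end
end

@btime fill_threaded!($dst1)   # zeros-allocated
@btime fill_threaded!($dst)    # undef-allocated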