Argument initialisation affects performance of @threads?

I’m observing a change in performance in a threaded function that depends on how a write-only argument to that function was initialised. Specifically, creating the argument with Matrix{Float64}(undef, …) gives the expected scaling with the number of threads, while creating it with zeros(…) gives no scaling at all. A working example and the output observed across different settings of JULIA_NUM_THREADS are given below. Reordering the two benchmarked calls does not affect the results.
My understanding is that both dst and dst1 should have the same type (and checking with typeof seems to confirm this); the only difference is that dst1 has been zeroed while dst has not been touched after allocation.
I’m new to Julia, so I might be missing something rather obvious? Or, if this is unexpected, can anyone reproduce the behaviour?

using BenchmarkTools

function unpack_threaded(src::AbstractMatrix{UInt8}, dst::AbstractMatrix{T}) where T
    B,M  = size(src)
    tbls = collect(Matrix{T}(undef, 4, 256) for n in 1:Threads.nthreads()) #dummy lookup table
    Threads.@threads for m in 1:M
        tbl = tbls[Threads.threadid()]
        for b in 1:B
            w = src[b,m]+1
            for k in 1:4
                @inbounds dst[(b-1)*4+k, m] = tbl[k, w]
            end
        end
    end
end

function main()
    print("Threads=$(Threads.nthreads()):\n")

    B = 100000
    M = 512

    src = Matrix{UInt8}(undef, B, M)

    dst1 = zeros(B*4,M)
    print("zeros(...)                 :")
    @btime unpack_threaded($src,$dst1)
    
    dst = Matrix{Float64}(undef, B*4, M)
    print("Matrix{Float64}(undef,...) :")
    @btime unpack_threaded($src,$dst)
end

main()

Output:

Threads=1:
zeros(...)                 :  172.220 ms (3 allocations: 8.28 KiB)
Matrix{Float64}(undef,...) :  171.251 ms (3 allocations: 8.28 KiB)
Threads=2:
zeros(...)                 :  184.228 ms (4 allocations: 16.41 KiB)
Matrix{Float64}(undef,...) :  84.084 ms (4 allocations: 16.41 KiB)
Threads=4:
zeros(...)                 :  170.576 ms (6 allocations: 32.67 KiB)
Matrix{Float64}(undef,...) :  44.229 ms (6 allocations: 32.67 KiB)
Threads=8:
zeros(...)                 :  176.688 ms (10 allocations: 65.20 KiB)
Matrix{Float64}(undef,...) :  35.130 ms (10 allocations: 65.20 KiB)
julia> versioninfo()
Julia Version 1.1.1
Commit 55e36cc308 (2019-05-16 04:10 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 8

Unverified hypothesis:
It may be due to thread affinity / memory placement issues.
If you initialise the matrix with zeros from one core, the memory is likely to be placed close to that core (first-touch placement). With undef, the values of the matrix are only written later by the different worker threads, so each part of it ends up close to the core that first wrote it, which gives better performance when those cores later operate on the data they initialised.
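
One way to probe this hypothesis (a sketch only, not from the original post; threaded_zeros is a made-up helper, and src is assumed to be available as in main()): zero the destination from the same kind of @threads loop as unpack_threaded, so each column is first touched by the thread that will later write it, and benchmark that against the plain zeros case.

function threaded_zeros(::Type{T}, nrows, ncols) where T
    A = Matrix{T}(undef, nrows, ncols)
    # Same column-wise partitioning as unpack_threaded, so every page is
    # first touched by the thread that will later write to it.
    Threads.@threads for m in 1:ncols
        @inbounds for i in 1:nrows
            A[i, m] = zero(T)
        end
    end
    return A
end

dst2 = threaded_zeros(Float64, 400_000, 512)   # same size as dst/dst1 above
@btime unpack_threaded($src, $dst2)            # should scale like the undef case if first touch is the cause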

The matrix in question is ~1.6 GB, so it shouldn’t fit in the cache of any one core. Additionally, my understanding is that @btime times repeated calls to the function; if the matrix were sitting in a cache somewhere after the first call, it shouldn’t matter any more, since in both cases parts of the matrix will have been processed by different cores.
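
A quick sanity check in the same spirit (my own sketch, not from the thread): warm the function up once and then time a single second call by hand, to confirm the gap is not just a first-call effect.

# with src, dst and dst1 created as in main()
unpack_threaded(src, dst1)           # warm-up: compilation and first use
@time unpack_threaded(src, dst1)     # zeros-allocated destination
unpack_threaded(src, dst)
@time unpack_threaded(src, dst)      # undef-allocated destination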

I didn’t take the time to read the code, but the way the arrays are initialised can change where the data is placed in RAM. If your hardware has multiple memory (RAM) channels, you may be unable to use all of them because the data stays in one region of RAM.
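
To separate memory placement from the table lookups, one could benchmark a pure write loop over the two destinations (again only a sketch; fill_threaded! is a made-up helper, and dst/dst1 are assumed to be in scope as in main()). If placement or bandwidth is the bottleneck, the same gap should show up here as well:

function fill_threaded!(dst)
    # Same column-wise access pattern as unpack_threaded, but writes only,
    # so any difference between the two destinations comes from memory
    # placement rather than the lookup logic.
    Threads.@threads for m in 1:size(dst, 2)
        @inbounds for i in 1:size(dst, 1)
            dst[i, m] = 1.0
        end
    end
end

@btime fill_threaded!($dst1)   # zeros-allocated
@btime fill_threaded!($dst)    # undef-allocated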