It should've been 13. The motivation is to avoid false sharing.
That is, a cacheline is the smallest unit of memory that can be treated independently by many parts of the system. To avoid synchronization between cores, we need each thread to work on separate cachelines.
On x64 CPUs, cachelines are 64 bytes.
On M1 CPUs, they're 128 bytes.
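In case it's unclear what that looks like in practice, here's a minimal sketch of the two layouts (function names and structure are mine, purely for illustration, not the code benchmarked below):

using Base.Threads

# Each thread accumulates into its own Float64 slot, but adjacent 8-byte slots
# share a 64-byte cacheline, so the cores keep invalidating each other's copies.
function accumulate_unpadded(data)
    nt = nthreads()
    acc = zeros(Float64, nt)
    @threads for tid in 1:nt
        lo = (tid - 1) * length(data) ÷ nt + 1
        hi = tid * length(data) ÷ nt
        for i in lo:hi
            acc[tid] += data[i]
        end
    end
    sum(acc)
end

# Same computation, but each thread gets a whole 64-byte column to itself
# (assuming acc happens to start on a cacheline boundary).
function accumulate_padded(data; pad = 8)   # 8 Float64s = 64 bytes
    nt = nthreads()
    acc = zeros(Float64, pad, nt)
    @threads for tid in 1:nt
        lo = (tid - 1) * length(data) ÷ nt + 1
        hi = tid * length(data) ÷ nt
        for i in lo:hi
            acc[1, tid] += data[i]
        end
    end
    sum(@view acc[1, :])
end

The padded version only actually separates the cachelines if acc itself starts on a 64-byte boundary, which is exactly the wrinkle below.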
julia> function bincount_polyester(array, weights, minlength, offset = 13)
           l = (minlength + offset) & -8   # padded rows per column; why offset = 13 is explained below
           nt = Threads.nthreads()
           acc = zeros(Float64, l, nt)     # one padded accumulator column per thread
           N = length(array)
           @batch for tid = 1:nt           # Polyester's threaded loop: one chunk of the input per thread
               lb = ((tid - 1) * N ÷ nt) + 1
               ub = (tid * N) ÷ nt
               bincount!(@view(acc[:, tid]), @view(array[lb:ub]), @view(weights[lb:ub]))
           end
           vec(sum(@view(acc[1:minlength, :]), dims = 2))
       end
bincount_polyester (generic function with 2 methods)
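bincount! isn't shown in this snippet; a minimal version consistent with how it's called above would look something like this (my guess, not necessarily the code that was benchmarked):

# Hypothetical bincount!: accumulate weights[i] into the bin selected by array[i].
# Assumes every value in `array` is a valid 1-based index into `acc`.
function bincount!(acc, array, weights)
    @inbounds for i in eachindex(array, weights)
        acc[array[i]] += weights[i]
    end
    return acc
end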
julia> array = rand(1:100, 10^6); weights = rand(10^6); minlength=101;
julia> @benchmark bincount_polyester($array, $weights, $minlength, 7)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  112.959 μs …  43.887 ms  ┊ GC (min … max): 0.00% … 2.50%
 Time  (median):     265.806 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   290.294 μs ± 758.469 μs  ┊ GC (mean ± σ):  0.11% ± 0.04%

  113 μs          Histogram: frequency by time          516 μs <

 Memory estimate: 15.97 KiB, allocs estimate: 11.
julia> @benchmark bincount_polyester($array, $weights, $minlength, 13)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  67.563 μs …  43.869 ms  ┊ GC (min … max): 0.00% … 2.70%
 Time  (median):     150.542 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   170.990 μs ± 620.456 μs ┊ GC (mean ± σ):  0.14% ± 0.04%

  67.6 μs         Histogram: frequency by time          346 μs <

 Memory estimate: 17.09 KiB, allocs estimate: 11.
Why 13?
Julia arrays are 16-byte aligned by default (not 64-byte), and I'm assuming 64-byte cachelines here.
Given 16-byte alignment, that means our arrays' alignment mod 64 could be 0, 16, 32, or 48.
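You can check where allocations land on your own machine with a quick probe like this (mine, not from the original code):

# Alignment mod 64 of a handful of freshly allocated accumulator matrices.
# With 16-byte default alignment, any of 0, 16, 32, or 48 can show up.
alignments = [Int(UInt(pointer(zeros(Float64, 16, 4))) % 64) for _ in 1:10]
@show alignments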
Dividing everything by 16 so we have 0, 1, 2, or 3…
julia> reshape(0:39, 4, 10)
4×10 reshape(::UnitRange{Int64}, 4, 10) with eltype Int64:
 0  4   8  12  16  20  24  28  32  36
 1  5   9  13  17  21  25  29  33  37
 2  6  10  14  18  22  26  30  34  38
 3  7  11  15  19  23  27  31  35  39
This matrix represents the linear memory addresses divided by the minimum alignment (16), so 0 is byte 0, 1 is byte 16, 2 is byte 32, 3 is byte 48…
Each column represents a separate cacheline.
Thus, for good performance, we can't have any two threads ever writing to the same column.
Imagine the columns of our accumulator matrix snaking their way across the linear indices above. E.g., if you had 12 Float64s per column, each column would be 6 × 16 bytes long (i.e. 12 × 8 bytes). So if the matrix started at 0, the first column would cover units 0:5, the second 6:11, the third 12:17.
Obviously, many of these are sharing the same cacheline.
So we need to add some padding.
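To make the overlap concrete, here is the 12-Float64 example as code (the helper names are mine), assuming the matrix starts at unit 0:

units_per_col = 6                                   # 12 Float64s = 6 sixteen-byte units
col_units(col) = (col - 1) * units_per_col .+ (0:units_per_col - 1)
col_cachelines(col) = unique(col_units(col) .÷ 4)   # 4 sixteen-byte units per 64-byte line

col_cachelines(1)   # [0, 1]  (units 0:5)
col_cachelines(2)   # [1, 2]  (units 6:11): shares cacheline 1 with the first column
col_cachelines(3)   # [3, 4]  (units 12:17)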
If we could guarantee that the matrix starts at 0, (minlength + 7) & -8 is all we'd need, as that is enough to guarantee that each column of our real matrix starts at the same row of the alignment matrix above as the one before it.
But when we actually allocate the matrix, it'll start at 0, 1, 2, or 3. I got 13 by just adding 3 × 16 bytes (i.e. 6 more Float64s) on top of the 7, to push us into the next cacheline column if we start offset.
For example (where a is presumably the padded column length, a(n) = (n + 13) & -8):
julia> hcat(1:16, a.(1:16))'
2×16 adjoint(::Matrix{Int64}) with eltype Int64:
 1  2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
 8  8  16  16  16  16  16  16  16  16  24  24  24  24  24  24
So if we need 3 Float64s, we'll need 16 rows per column. Translating that to the previous matrix, each successive column will start in the same row of the alignment matrix, but 2 columns (8 units) over. So if our first column starts at 3, the part of it we actually write (3 Float64s = 24 bytes) ends inside 4, and the second column starts at 11, so no two threads' writes land on the same cacheline.
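Spelling that out in code (again, helper code of mine, not from the post):

minlength, offset = 3, 13
rows = (minlength + offset) & -8             # 16 rows per column
units_per_col = rows * 8 ÷ 16                # 8 sixteen-byte units per column

# Which 64-byte cachelines does thread `tid` actually write, if the matrix starts
# `start` sixteen-byte units past a cacheline boundary (start in 0:3)? Only the
# first `minlength` Float64s (24 bytes) of each column get written, which spills
# from the column's first unit into its second.
function written_cachelines(start, tid)
    first_unit = start + (tid - 1) * units_per_col
    unique((first_unit:first_unit + 1) .÷ 4)
end

[written_cachelines(3, tid) for tid in 1:3]  # [[0, 1], [2, 3], [4, 5]]: no overlap

The other starting offsets (0, 1, and 2) also keep every thread's writes on disjoint cachelines.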
This is probably excessive padding, as we could probably start at an earlier row… But it'd probably be easier at that point to just over-allocate a vector, create a view that is 64-byte aligned so we can start at 0, and then use (n + 7) & -8.
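For completeness, a sketch of that last idea (my code, not something that was benchmarked): over-allocate by one cacheline and hand out a 64-byte-aligned view, so every column begins on a cacheline boundary and (minlength + 7) & -8 rows of padding suffice.

function aligned_accumulators(minlength, nt)
    rows = (minlength + 7) & -8                           # whole cachelines per column
    buf  = zeros(Float64, rows * nt + 8)                  # 8 extra Float64s = one spare cacheline
    skip = (64 - Int(UInt(pointer(buf)) % 64)) % 64 ÷ 8   # Float64s to skip to reach a boundary
    reshape(@view(buf[skip + 1:skip + rows * nt]), rows, nt)
end

With that, the offset argument (and the guessing about where the allocation landed) goes away.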