I have been reading about multithreading, and the general feeling is that it’s very hard to get good performance out of it. After reading https://github.com/JuliaLang/julia/issues/17395, I still feel hazy about what I need to do.

Below is an MWE where I take a Vector{UInt64} and try to construct 4 histograms, one for each of the four 16-bit blocks that make up a 64-bit integer. It is an MWE in the sense that when I try to create a simpler example, the scaling seems fine (3x performance on 4 cores).

On my machine with 4 threads (I have 4 cores), the single-threaded version is almost always faster. I don’t understand the advice about wrapping the @threads for loop in a function, since my @threads loop is already in a function and can’t be broken down further.

using Base.Threads

function threadedcount(bits)
    hist = zeros(UInt, 65536, 4, nthreads())
    @threads for j = 1:length(bits)
        for i = 0:3
            @inbounds hist[1 + Int((bits[j] >> (i << 4)) & 0xffff), i+1, threadid()] += 1
        end
    end
    @threads for j in 1:4
        for i = 2:nthreads()
            @inbounds hist[:, j, 1] .+= hist[:, j, i]
        end
    end
    hist[:, :, 1]
end

function count(bits)
    hist = zeros(UInt, 65536, 4)
    for x1 in bits
        for i in 0:3
            @inbounds hist[1 + Int((x1 >> (i << 4)) & 0xffff), i+1] += 1
        end
    end
    hist
end
# svec = rand("id".*dec.(1:1_000_000,10), 100_000_000);
svec = rand([randstring(8) for i =1:1_000_000], 100_000_000);
@time a = threadedcount(svec);
@time b = count(svec);
(a .== b) |> all
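The indexing trick is compact enough to be confusing; here is a standalone sketch (mine, not from the thread) of how `(x >> (i << 4)) & 0xffff` peels the four 16-bit blocks off one UInt64:

```julia
x = 0x1111_2222_3333_4444  # one UInt64 made of four 16-bit blocks

# i << 4 produces shift amounts 0, 16, 32, 48; masking with 0xffff then
# isolates one 16-bit block per pass, least-significant block first.
blocks = [UInt16((x >> (i << 4)) & 0xffff) for i in 0:3]
# blocks == UInt16[0x4444, 0x3333, 0x2222, 0x1111]
```

Each block plus 1 is then a valid index into a 65536-row histogram column.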

I can’t really answer your question, but you have an off-by-one that corrupts memory; it should probably be:

hist[1+Int((x1 >> (i << 4)) & 0xffff), i+1] += 1

Further, your code does not actually run on my machine; it throws MethodError: no method matching >>(::String, ::Int64) (but I guess you are happy with UInt64, which should be equivalent to a randstring(8)?)

Also, your single-threaded code already looks really slow, because hist does not fit into cache and you need lots of random access. Maybe consider saving some space?

Example: store only UInt16 counters in the hot histogram, check for overflow on every increment, and on overflow increment a second (out-of-cache) histogram. Possibly even use only a UInt8.

Consider consuming fewer bytes at a time. Example: consume only a single 16-bit block per pass (storing the other bytes somewhere if you need indirect access into bits). Then your hot histogram shrinks down to 2^16 * 1 * 2 bytes = 128 kB, which should fit into L2.

Can you get away with indexing the histogram by fewer than 16 bits? For example, with 12-bit keys you could get down to 2^12 * 1 * 2 bytes = 8 kB, which should fit nicely into L1.
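A hedged sketch of that 12-bit idea (mine, not from the thread; how you would derive 12-bit keys from real data is the open question, so the key extraction below is a placeholder):

```julia
# 12-bit-keyed histogram: the hot UInt16 array is 2^12 * 2 bytes = 8 kB,
# small enough to sit in L1 with room to spare.
function count12(keys::Vector{UInt64})
    hist = zeros(UInt64, 1 << 12)      # cold, full-width counters
    hist_hot = zeros(UInt16, 1 << 12)  # hot, 8 kB
    @inbounds for x in keys
        idx = 1 + Int(x & 0x0fff)      # placeholder: low 12 bits as the key
        hist_hot[idx] += UInt16(1)
        if hist_hot[idx] == 0          # counter wrapped: bank 2^16 counts
            hist[idx] += 1 << 16
        end
    end
    @inbounds for idx in 1:(1 << 12)   # fold the hot remainders back in
        hist[idx] += hist_hot[idx]
    end
    hist
end
```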

Ok, I could not resist giving this a try. Still awfully slow, though.

function count64(bits)
    hist = zeros(UInt, 1<<16, 4)
    for x1 in bits
        for i in 0:3
            @inbounds hist[1 + Int((x1 >> (i << 4)) & 0xffff), i+1] += 1
        end
    end
    hist
end

function count16(bits)
    hist = zeros(UInt64, 1<<16, 4)
    hist_hot = zeros(UInt16, 1<<16)
    @inbounds for i in 0:3
        for x1 in bits
            idx = 1 + Int((x1 >> (i << 4)) & 0xffff)
            hist_hot[idx] += UInt16(1)
            if hist_hot[idx] == 0          # UInt16 counter wrapped around
                hist[idx, i+1] += (1<<16)
            end
        end
        for idx in 1:(1<<16)
            hist[idx, i+1] += hist_hot[idx]
        end
        fill!(hist_hot, zero(UInt16))
    end
    hist
end

function count8(bits)
    hist = zeros(UInt64, 1<<16, 4)
    hist_hot = zeros(UInt8, 1<<16)
    @inbounds for i in 0:3
        for x1 in bits
            idx = 1 + Int((x1 >> (i << 4)) & 0xffff)
            hist_hot[idx] += UInt8(1)
            if hist_hot[idx] == 0          # UInt8 counter wrapped around
                hist[idx, i+1] += (1<<8)
            end
        end
        for idx in 1:(1<<16)
            hist[idx, i+1] += hist_hot[idx]
        end
        fill!(hist_hot, zero(UInt8))
    end
    hist
end

yielding post warm-up:

svec = rand(UInt64, 100_000_000);
a = @time count64(svec);
# 1.666775 seconds (6 allocations: 2.000 MiB)
b = @time count16(svec);
# 1.081748 seconds (8 allocations: 2.125 MiB)
c = @time count8(svec);
# 0.870814 seconds (8 allocations: 2.063 MiB)
((a .== b) |> all) && ((a .== c) |> all)
# true

PS: This is a really, really bad idea if your svec does not fit into main memory. Then you should probably eat the bad cache behavior instead of loading the data multiple times from spinning rust / SSD.

Tried the following on a Windows machine with 8 logical processors and v0.6.2 and here are the surprising results:

using Base.Threads

function threadedcount(bits)
    hist = zeros(UInt, 65536, 4, nthreads())
    @threads for j = 1:length(bits)
        for i = 0:3
            @inbounds hist[1 + Int((bits[j] >> (i << 4)) & 0xffff), i+1, threadid()] += 1
        end
    end
    @threads for j in 1:4
        for i = 2:nthreads()
            @inbounds hist[:, j, 1] .+= hist[:, j, i]
        end
    end
    hist[:, :, 1]
end

function count(bits)
    hist = zeros(UInt, 65536, 4)
    for x1 in bits
        for i in 0:3
            @inbounds hist[1 + Int((x1 >> (i << 4)) & 0xffff), i+1] += 1
        end
    end
    hist
end
svec = rand(UInt64, 100_000_000);
@time a = threadedcount(svec);
@time b = count(svec);
#No threads: 0.9 s
#4 threads: 1.2 s
#8 threads: 1.8 s

Have you tried the smaller versions (count8, count16)?

I am guessing that you run out of L3 for hist.

The following gets at least faster when multithreaded on my machine (but with a pretty limited factor).

function count8t(bits)
    hist = zeros(UInt64, 1<<16, 4)
    hist_hot = zeros(UInt8, 1<<16, 4)
    @threads for i in 0:3
        @inbounds for x1 in bits
            idx = 1 + Int((x1 >> (i << 4)) & 0xffff)
            hist_hot[idx, i+1] += UInt8(1)
            if hist_hot[idx, i+1] == 0
                hist[idx, i+1] += (1<<8)
            end
        end
        @inbounds for idx in 1:(1<<16)
            hist[idx, i+1] += hist_hot[idx, i+1]
        end
    end
    hist
end

Makes sense, because the threaded loop only goes over 4 elements (and even otherwise, you probably only have 4 physical cores, and this is not a function that profits from smt/hyperthreading). Note that your speedup from 1 to 4 threads is almost perfectly linear.

I get only a very sublinear speedup on my laptop; no idea why, but I don’t plan on troubleshooting this. I should probably just buy a better machine for profiling.

L3 cache is shared between cores. If the working set fails to fit into L1/L2 but does fit into L3 when single-threaded, then multithreading will kill performance.

Hence the split into the small UInt8 array, which should barely fit into L2. If you instead multithread over bits and process all four 16-bit blocks simultaneously (iterating over bits only once), then your threads will need to use L3, which is shared and too small to accommodate all threads.

At least that’s my theory for why this performs better even though we iterate over the data 4 times.

Also note that “fits into the cache” is not sizeof(data) < cache_size; if you don’t want to read the Intel manual and be very careful, then you should aim for “fits with lots of spare room”.
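For orientation, the footprints discussed in this thread are easy to tabulate (the cache sizes are my assumptions about a typical desktop part; check your own CPU):

```julia
# Working-set sizes of the histogram variants above.
full_hist  = 2^16 * 4 * sizeof(UInt64)  # count64's hist: 2 MiB
hot16_hist = 2^16 * sizeof(UInt16)      # count16's hot array: 128 KiB
hot8_hist  = 2^16 * sizeof(UInt8)       # count8's hot array: 64 KiB
hot12_hist = 2^12 * sizeof(UInt16)      # hypothetical 12-bit keys: 8 KiB

# Assumed typical sizes: L1d ≈ 32 KiB, L2 ≈ 256 KiB per core, L3 ≈ 8 MiB shared.
# Only the 12-bit variant leaves "lots of spare room" below L1d.
```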

Quite interesting: the speed story changes if I sample from UInt(1):UInt(1_000_000) instead of the whole UInt64 range. The original threadedcount is now the fastest.

using BenchmarkTools
svec = rand(rand(UInt(1):UInt(1_000_000), 1_000_000), 100_000_000);
d = @elapsed threadedcount(svec)
a = @elapsed count64(svec)
b = @elapsed count16(svec)
c1 = @elapsed count8(svec)
c = @elapsed count8t(svec)
using Plots
bar(["orig","count64","count16","count8", "count8t"], [d, a,b,c1, c], label="time (sec)")
savefig("Single vs Multithreaded count.png")

I guess that you need at least a little bit of fine-tuning for your machine and the statistics of the data. Good luck with finding a good compromise! Also consider using UInt32 for the histogram and giving up on inputs that are larger than 32 GB. That might save you some precious cache space.
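A sketch of that UInt32 suggestion (mine; it is just count8 from above with narrower cold counters, safe only while no bucket can exceed typemax(UInt32), i.e. inputs well under 2^32 elements):

```julia
# count8 with UInt32 cold counters: halves the cold histogram to 1 MiB.
function count8_u32(bits)
    hist = zeros(UInt32, 1 << 16, 4)
    hist_hot = zeros(UInt8, 1 << 16)
    @inbounds for i in 0:3
        for x1 in bits
            idx = 1 + Int((x1 >> (i << 4)) & 0xffff)
            hist_hot[idx] += UInt8(1)
            if hist_hot[idx] == 0              # UInt8 wrapped: bank 256 counts
                hist[idx, i+1] += UInt32(1 << 8)
            end
        end
        for idx in 1:(1 << 16)
            hist[idx, i+1] += hist_hot[idx]
        end
        fill!(hist_hot, zero(UInt8))
    end
    hist
end
```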

Thanks for the good advice! I am trying to implement a fast string sort multithreaded algorithm in pure Julia. In particular, I am implementing the PARADIS and RADULS algorithms. Hope something good comes out of this.

May I ask what made you think the first version would run out of cache but the latter ones wouldn’t? And where can one learn more about this? I think this is truly at the boundary between efficient algorithm design and efficient implementation of algorithms, so it is valuable insight for any computational scientist.

I am suspicious of @foobar_lv2’s discussion of cache. The operations in the inner loop are serialized integer ops, which don’t come close to saturating the DRAM bandwidth (at least on the desktops I use). This is quite different from, say, linear algebra with floats where SIMD rules.

Anyway, the compiler doesn’t seem to optimize his loop very well, so let’s do some of its job:

function count8x(bits)
    hists = [zeros(UInt64, 1<<16) for i in 1:4]
    hists_hot = [zeros(UInt8, 1<<16) for i in 1:4]
    @threads for i in 0:3
        hist = hists[i+1]
        hist_hot = hists_hot[i+1]
        i4 = UInt8(i << 4)
        @inbounds for x1 in bits
            idx = 1 + Int((x1 >> i4) & 0xffff)
            hist_hot[idx] += UInt8(1)
            if hist_hot[idx] == 0
                hist[idx] += (1<<8)
            end
        end
        @inbounds for idx in 1:(1<<16)
            hist[idx] += hist_hot[idx]
        end
    end
    hcat(hists...)
end

And you are right to be suspicious! All these uarch questions tend to be really subtle, and I am not a real expert, just using some rule-of-thumb.

However, what makes you think that the integer ops are serialized? Cf. e.g. https://compas.cs.stonybrook.edu/course/cse502-s13/lectures/cse502-L11-commit.pdf; TL;DR: a single core should be perfectly capable of incrementing several hist_hot entries in parallel (on different ports) and rolling back if two increments collide; likewise, it should not need to stall every time a hist_hot entry is out of cache.

The fact that LLVM and I can’t seem to get more than about 1 integer operation per thread per clock in loops like this, even when the hot stuff is all in L1.

I agree that your impression should be right based on the number of ports, ALUs, etc. I tried a version where I forced most of the integer ops into SIMD instructions and unrolled the histogram updates so the processor would have more opportunities to speculate on the cache accesses. All that yields … a 25% loss in performance compared to count8x above. I am foiled by CPU subtleties once again.

Still, count8x seems to overlap most loads and stores with the arithmetic. And performance is indeed about 20% worse for a 16-bit version where hist_hot doesn’t fit in L1. So I should amend my comment about caches and agree with you that having the hot histogram in cache helps to avoid extra latencies.