# Multi-threading doesn't seem to scale well on this counting problem

I have been reading about multithreading, and the general feeling is that it's very hard to get good performance out of it. After reading this issue (https://github.com/JuliaLang/julia/issues/17395), I still feel hazy on what I need to do.

Below is an MWE where I take a `Vector{UInt64}` and try to construct 4 histograms, one for each of the four 16-bit blocks that make up a 64-bit integer. It is an MWE in the sense that when I try to create a simpler example, the scaling seems OK (3x performance on 4 cores).

On my machine with 4 threads (I have 4 cores), the single-threaded version is almost always faster. I don't understand the advice about wrapping the `@threads` for loop in a function, as my `@threads` loop is already inside a function and can't be broken down further.

Help appreciated!

Update: fixed bug pointed out by @foobar_lv2

```julia
using Base.Threads
using Random  # for randstring (needed on Julia ≥ 0.7)

function count_orig(bits)
    hist = zeros(UInt, 65536, 4, nthreads())
    @threads for j in 1:length(bits)
        for i = 0:3
            @inbounds hist[1 + Int((bits[j] >> (i << 4)) & 0xffff), i+1, threadid()] += 1
        end
    end
    for i in 2:nthreads()
        for j in 1:4
            @inbounds hist[:, j, 1] .+= hist[:, j, i]
        end
    end
    hist[:, :, 1]
end

function count(bits)
    hist = zeros(UInt, 65536, 4)
    for x1 in bits
        for i in 0:3
            @inbounds hist[1 + Int((x1 >> (i << 4)) & 0xffff), i+1] += 1
        end
    end
    hist
end

# svec = rand("id".*dec.(1:1_000_000,10), 100_000_000);
svec = rand([randstring(8) for i = 1:1_000_000], 100_000_000);
@time a = count_orig(svec);
@time b = count(svec);
(a .== b) |> all
```

I can't really answer your question, but you have an off-by-one that corrupts memory; it should probably be:

`hist[1+Int((x1 >> (i << 4)) & 0xffff), i+1] += 1`

Further, your code does not actually run on my machine; it throws `MethodError: no method matching >>(::String, ::Int64)`. (But I guess you would be happy with `UInt64`, which should be equivalent to a `randstring(8)`?)
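For reference, here is one way an 8-character ASCII string could be packed into a `UInt64` key (a sketch only; the helper name `str_to_u64` is made up, and the most-significant-byte-first packing is chosen so that unsigned comparison of the keys matches lexicographic comparison of the strings):

```julia
# Hypothetical helper: pack an 8-byte string into a UInt64, most
# significant byte first, so that sorting the keys as unsigned integers
# sorts the original strings lexicographically.
function str_to_u64(s::AbstractString)
    cu = codeunits(s)
    @assert length(cu) == 8 "expects exactly 8 code units"
    u = UInt64(0)
    for b in cu
        u = (u << 8) | b   # shift in one byte at a time
    end
    u
end
```

With keys like this, histogramming the four 16-bit blocks of the `UInt64` is the same as histogramming byte pairs of the original strings.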

Also, your single-threaded code already looks really slow, because `hist` does not fit into cache and you need lots of random accesses. Maybe consider saving some space?

Example: only store `UInt16` counters in the hot histogram, check for overflow on every increment, and on overflow bump a second (out-of-cache) histogram. Possibly even use only a `UInt8`.

Consider consuming fewer bytes at a time. Example: consume only a single 2-byte block at a time (but store the other bytes somewhere if you need indirect access into `bits`). Then your histogram shrinks to `2^16 * 1 * 2 = 128 kB`, which should fit into L2.

Can you get away with fewer than 16 bits of histogram index? For example, if 12 bits sufficed, you could get down to `2^12 * 1 * 2 = 8 kB`, which should fit nicely into L1.
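To illustrate the 12-bit idea, here is a hypothetical sketch (the name `count12` is made up, and it assumes only the low 12 bits of each value matter):

```julia
# Sketch: 12-bit hot histogram of 2^12 UInt16 entries = 8 kB, which fits in L1.
# Overflowing UInt16 counters spill into a full-width cold histogram.
function count12(bits)
    hist = zeros(UInt64, 1 << 12)        # cold, full-width counts
    hist_hot = zeros(UInt16, 1 << 12)    # hot, 8 kB
    @inbounds for x in bits
        idx = 1 + Int(x & 0x0fff)        # assumption: only the low 12 bits matter
        hist_hot[idx] += UInt16(1)
        # on wrap-around to zero, credit 2^16 to the cold histogram
        hist_hot[idx] == 0 && (hist[idx] += UInt64(1) << 16)
    end
    hist .+= hist_hot                    # fold in the residual hot counts
    hist
end
```

The same spill-on-overflow pattern as `count16`/`count8` below, just with a smaller index space.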


Ok, I could not resist giving this a try. Still awfully slow, though.

```julia
function count64(bits)
    hist = zeros(UInt, 1<<16, 4)
    for x1 in bits
        for i in 0:3
            @inbounds hist[1 + Int((x1 >> (i << 4)) & 0xffff), i+1] += 1
        end
    end
    hist
end

function count16(bits)
    hist = zeros(UInt64, 1<<16, 4)
    hist_hot = zeros(UInt16, 1<<16)
    @inbounds for i in 0:3
        for x1 in bits
            idx = 1 + Int((x1 >> (i << 4)) & 0xffff)
            hist_hot[idx] += UInt16(1)
            if hist_hot[idx] == 0
                hist[idx, i+1] += (1<<16)
            end
        end
        for idx in 1:(1<<16)
            hist[idx, i+1] += hist_hot[idx]
        end
        fill!(hist_hot, zero(UInt16))
    end
    hist
end

function count8(bits)
    hist = zeros(UInt64, 1<<16, 4)
    hist_hot = zeros(UInt8, 1<<16)
    @inbounds for i in 0:3
        for x1 in bits
            idx = 1 + Int((x1 >> (i << 4)) & 0xffff)
            hist_hot[idx] += UInt8(1)
            if hist_hot[idx] == 0
                hist[idx, i+1] += (1<<8)
            end
        end
        for idx in 1:(1<<16)
            hist[idx, i+1] += hist_hot[idx]
        end
        fill!(hist_hot, zero(UInt8))
    end
    hist
end
```

yielding post warm-up:

```julia
svec = rand(UInt64, 100_000_000);
a = @time count64(svec);
#  1.666775 seconds (6 allocations: 2.000 MiB)
b = @time count16(svec);
#  1.081748 seconds (8 allocations: 2.125 MiB)
c = @time count8(svec);
#  0.870814 seconds (8 allocations: 2.063 MiB)
((a .== b) |> all) && ((a .== c) |> all)
# true
```

PS: This is a really, really bad idea if your svec does not fit into main memory. Then you should probably eat the bad cache behavior instead of loading the data multiple times from spinning rust / ssd.

Tried the following on a Windows machine with 8 logical processors and v0.6.2, and here are the surprising results:

```julia
using Base.Threads

function count_orig(bits)
    hist = zeros(UInt, 65536, 4, nthreads())
    @threads for j in 1:length(bits)
        for i = 0:3
            @inbounds hist[1 + Int((bits[j] >> (i << 4)) & 0xffff), i+1, threadid()] += 1
        end
    end
    for i in 2:nthreads()
        for j in 1:4
            @inbounds hist[:, j, 1] .+= hist[:, j, i]
        end
    end
    hist[:, :, 1]
end

function count(bits)
    hist = zeros(UInt, 65536, 4)
    for x1 in bits
        for i in 0:3
            @inbounds hist[1 + Int((x1 >> (i << 4)) & 0xffff), i+1] += 1
        end
    end
    hist
end

svec = rand(UInt64, 100_000_000);
@time a = count_orig(svec);
@time b = count(svec);
```

Have you tried the smaller versions (count8, count16)?

I am guessing that you run out of L3 for `hist`.

The following gets at least faster when multithreaded on my machine (but with a pretty limited factor).

```julia
function count8t(bits)
    hist = zeros(UInt64, 1<<16, 4)
    hist_hot = zeros(UInt8, 1<<16, 4)
    @threads for i in 0:3   # one thread per 16-bit block
        @inbounds for x1 in bits
            idx = 1 + Int((x1 >> (i << 4)) & 0xffff)
            hist_hot[idx, i+1] += UInt8(1)
            if hist_hot[idx, i+1] == 0
                hist[idx, i+1] += (1<<8)
            end
        end
        @inbounds for idx in 1:(1<<16)
            hist[idx, i+1] += hist_hot[idx, i+1]
        end
    end
    hist
end
```

This one is better: it speeds up going from 1 thread to 4, and then gets slower with more than 4 threads:

```
1 thread: 0.84 s
```

Makes sense, because the threaded loop only goes over 4 elements (and even otherwise, you probably have only 4 physical cores, and this is not a function that profits from SMT/hyperthreading). Note that your speedup from 1 to 4 threads is almost perfectly linear.

I get only a very sublinear speedup on my laptop; no idea why, but I don't plan on troubleshooting this. I should probably just buy a better machine for profiling.


Why not just loop over the `bits` array?

L3 cache is shared between cores. If a single thread's working set fails to fit into L1/L2 but does fit into L3, then multithreading will kill performance.

Hence the split into the small `UInt8` array, which should barely fit into L2. If you multithread over `bits` and process all four 2-byte blocks simultaneously (iterating over `bits` only once), then your threads will need to use L3, which is shared and too small to accommodate all threads.

At least that’s my theory for why this performs better even though we iterate over the data 4 times.

Also note that “fits into the cache” is not `sizeof(data) < cache_size`; if you don’t want to read the Intel manual and be very careful, then you should aim for “fits with lots of spare room”.


Quite interesting: the speed story changes if I just sample from `UInt(1):UInt(1_000_000)` instead of the whole of `UInt`. The original threaded count (`orig`) is now the fastest.

```julia
using BenchmarkTools
svec = rand(rand(UInt(1):UInt(1_000_000), 1_000_000), 100_000_000);
d = @elapsed count_orig(svec)  # "orig": the threaded count from the first post
a = @elapsed count64(svec)
b = @elapsed count16(svec)
c1 = @elapsed count8(svec)
c = @elapsed count8t(svec)

using Plots
bar(["orig", "count64", "count16", "count8", "count8t"], [d, a, b, c1, c], label = "time (sec)")
```

Sure, but then you don't use much of your histogram (which saves you cache), and the small counters will overflow a lot:

```julia
rand(rand(UInt(1):UInt(1_000_000), 1_000_000), 10)
10-element Array{UInt64,1}:
 0x000000000006f892
 0x0000000000002e17
 0x000000000005fbda
 0x00000000000ea9a7
 0x00000000000a664a
 0x000000000002d211
 0x00000000000b9c4d
 0x0000000000003982
 0x000000000003c3fd
 0x0000000000074a16
```

I guess you need at least a little bit of fine-tuning for your machine and the statistics of your data. Good luck finding a good compromise! Also consider using `UInt32` for the histogram and giving up on inputs larger than 32 GB; that might save you some precious cache space.
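As a sketch of that `UInt32` suggestion (the name `count8_u32` is made up): the cold histogram shrinks to `2^16 * 4 * 4 B = 1 MB`, half the `UInt64` version, and stays correct as long as no single bucket exceeds `2^32` counts.

```julia
# Sketch: same spill-on-overflow scheme as count8, but with a UInt32 cold
# histogram. Overflows only if one bucket sees more than 2^32 elements
# (i.e. inputs well past 32 GB of UInt64s).
function count8_u32(bits)
    hist = zeros(UInt32, 1 << 16, 4)
    hist_hot = zeros(UInt8, 1 << 16)
    @inbounds for i in 0:3
        for x1 in bits
            idx = 1 + Int((x1 >> (i << 4)) & 0xffff)
            hist_hot[idx] += UInt8(1)
            # on UInt8 wrap-around, credit 2^8 to the cold histogram
            hist_hot[idx] == 0 && (hist[idx, i + 1] += UInt32(1) << 8)
        end
        hist[:, i + 1] .+= hist_hot   # fold in residual hot counts
        fill!(hist_hot, zero(UInt8))
    end
    hist
end
```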


Thanks for the good advice! I am trying to implement a fast multithreaded string sort in pure Julia; in particular, I am implementing the PARADIS and RADULS algorithms. Hope something good comes out of this.

May I ask what made you think the first version would run out of cache but the later ones won't? And where can one learn more about this? I think this is truly at the boundary of efficient algorithm design and efficient implementation, so it is valuable insight for any computational scientist.

I am suspicious of @foobar_lv2's discussion of cache. The operations in the inner loop are serialized integer ops, which don't come close to saturating DRAM bandwidth (at least on the desktops I use). This is quite different from, say, linear algebra with floats, where SIMD rules.

Anyway, the compiler doesn’t seem to optimize his loop very well, so let’s do some of its job:

```julia
function count8x(bits)
    hists = [zeros(UInt64, 1<<16) for i in 1:4]
    hists_hot = [zeros(UInt8, 1<<16) for i in 1:4]
    for i in 0:3
        hist = hists[i+1]
        hist_hot = hists_hot[i+1]
        i4 = UInt8(i << 4)
        @inbounds for x1 in bits
            idx = 1 + Int((x1 >> i4) & 0xffff)
            hist_hot[idx] += UInt8(1)
            if hist_hot[idx] == 0
                hist[idx] += (1<<8)
            end
        end
        @inbounds for idx in 1:(1<<16)
            hist[idx] += hist_hot[idx]
        end
    end
    hcat(hists...)
end
```
``````

And you are right to be suspicious! All these uarch questions tend to be really subtle, and I am not a real expert, just using some rule-of-thumb.

However, what makes you think that the integer ops are serialized? Cf. e.g. https://compas.cs.stonybrook.edu/course/cse502-s13/lectures/cse502-L11-commit.pdf; TL;DR: a single core should be perfectly capable of incrementing several `hist_hot` entries in parallel (on different ports) and rolling back if two increments collide; likewise, it should not always need to stall when a `hist_hot` entry is out of cache.

The fact that LLVM and I can't seem to get more than about one integer operation per thread per clock in loops like this, even when the hot data is all in L1.

I agree that your impression should be right based on the number of ports, ALUs, etc. I tried a version where I forced most of the integer ops into SIMD instructions and unrolled the histogram updates so the processor would have more opportunities to speculate on the cache accesses. All that yields … a 25% loss in performance compared to `count8x` above. I am foiled by CPU subtleties once again.

Still, `count8x` seems to overlap most loads and stores with the arithmetic. And performance is indeed about 20% worse for a 16-bit version where `hist_hot` doesn't fit in L1. So I should amend my comment about caches and agree with you that having the hot histogram in cache helps avoid extra latencies.