I’ve been a julia user for a long time, but I’m only just now looking at multithreaded programming in general, and I wanted to play with julia to understand it a bit more. One of the things I wanted to use to learn was some kind of multithreaded counting function:
function mycount(n)
    c = 0
    for i in 1:n
        c += i
    end
    return c
end
From what I understand, since c is a variable incremented frequently by all the threads, using locks and synchronisation to work out who (and by “who” I colloquially mean which thread) can write to it at any one time would make the threads wait so much that you might as well be doing things single-threaded. But I’ve read the alternative is to use “atomic” variables, which I see are even a julia type in Threads. So really what I’d like to know is: in a nutshell, how do atomic variables help solve this problem, and is this better than using locks? And, more julia-specific: can one make the toy counter above safe just by replacing c with an Atomic type and proceeding as normal, or must I also use this function I’ve found in the Threads docs, Threads.atomic_add!?
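To make the question concrete, this is roughly what I imagined the atomic version would look like (an untested sketch just to illustrate what I’m asking about; mycount_atomic is simply my name for it):
function mycount_atomic(n)
    c = Threads.Atomic{Int}(0)        # shared counter as an atomic variable
    Threads.@threads for i in 1:n
        Threads.atomic_add!(c, i)     # atomically add i to the shared counter
    end
    return c[]                        # read out the final value
end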
No, you shouldn’t use atomic variables. They are not meant for this and will essentially serialize everything. The correct way is to use a thread-local counter and combine them together later.
It is not trivial to do all that in julia at this point, and since you have not had previous experience with multithreaded programming, I don’t recommend you start threaded programming in julia just yet.
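That said, just to give you a flavour of the shape of it (a rough, untested sketch, not careful code; mycount_chunked and the manual chunking are my own choices): each chunk keeps its own local sum, and you only combine the results at the very end.
function mycount_chunked(n)
    nt = Threads.nthreads()
    partials = zeros(Int, nt)              # one result slot per chunk
    Threads.@threads for t in 1:nt
        lo = div((t - 1) * n, nt) + 1      # first index of this chunk
        hi = div(t * n, nt)                # last index of this chunk
        s = 0                              # local accumulator, nothing shared
        for i in lo:hi
            s += i
        end
        partials[t] = s                    # a single write per chunk, at the end
    end
    return sum(partials)                   # combine the partial sums
end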
I was talking with a colleague today about threads, and he said it’s probably better to make an array of length = nthreads, where each thread then has a local count in that array that is then reduced by a sum at the end; I can see something similar in the docs, where each thread writes to a different position in an array. This sounds a lot like your thread-local counter? I was also told that if computing the count increment were quite time-consuming, then using atomic variables wouldn’t slow things down too much, so it would be fine; but if the operation is quick anyway (which it is), then indeed using atomics would mean it might as well be serial.
That’ll actually likely be almost equally bad due to false sharing (google will give you a lot of good results).
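If you do want an array of per-thread slots that you write to inside the loop, each slot has to be padded out to its own cache line, roughly like this (just a sketch: it assumes 64-byte cache lines and a julia version that supports the :static schedule, so that threadid() stays fixed within an iteration; mycount_padded is my own name).
function mycount_padded(n)
    pad = 64 ÷ sizeof(Int)                 # Int slots per (assumed 64-byte) cache line
    nt = Threads.nthreads()
    counters = zeros(Int, pad * nt)        # each thread's counter sits on its own cache line
    Threads.@threads :static for i in 1:n
        counters[(Threads.threadid() - 1) * pad + 1] += i
    end
    return sum(counters)                   # the padding slots stay zero
end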
Well, define “time-consuming”. Atomic read-modify-write (RMW) operations are in general really expensive. Just to give you an idea: on my laptop, sin(1.0) takes 6ns and an atomic increment by 1 takes 24ns. One or two libm calls is well above my usual definition of a “time-consuming” loop body, since they are hundreds of times more expensive than simple arithmetic, but that’s not remotely enough to make the atomic increment overhead “fine”.
Also note that the 24ns is a single-threaded measurement; if you are doing atomic increments from multiple threads it’ll be slower still by a few times due to cacheline ping-pong, which is the same reason you shouldn’t use a dense array as the thread-local counters.
And as another example, a SIMD loop adding an array of 400 Float64 or ~640 Int64 also takes 24ns on the same computer, if that gives you a better idea of how much computation you need in the loop body before the atomic increment becomes relatively cheap.
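(If you want to reproduce that kind of measurement yourself, something along these lines with BenchmarkTools gives the single-threaded numbers; the exact figures will of course depend on your machine.)
using BenchmarkTools

y = 1.0
c = Threads.Atomic{Int}(0)
@btime sin($y)                     # cost of a single libm call
@btime Threads.atomic_add!($c, 1)  # cost of a single atomic RMW, with no contention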
I see, so if my googling is correct, the per-thread counters would sit next to each other on the same cache line(s), and multiple threads modifying them would still contend. If this is the case, why do the docs have this zeros example in them?
a = zeros(10)
Threads.@threads for i = 1:10
    a[i] = Threads.threadid()
end
I know that in C11 you can use ‘thread_local’ from threads.h for thread-local variables. You alluded to thread-local variables being possible in julia, but not trivial; where can I read about how it is currently done? Rest assured I’m not asking for a practical reason; I don’t have any desire to seriously use threads in production julia code whilst threading is still experimental, I’m just trying to get a better appreciation of julia.
using BenchmarkTools

function nothread_test!(v)
    for i = 1:length(v)
        @inbounds v[i] = rand()
    end
    sum(v)
end

function thread_test!(v)
    Threads.@threads for i = 1:length(v)
        @inbounds v[i] = rand()
    end
    sum(v)
end

@benchmark nothread_test!(rand(250_000_000))
@benchmark thread_test!(rand(250_000_000))
I just tried to use threads (n = 4) to populate an array and sum it. I expected the threaded version to run faster, but I don’t think I’ve figured out the right way to do it just yet.
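Based on the thread-local-counter advice above, I think my next attempt will be something along these lines (an untested sketch, assuming a julia version that has Threads.@spawn; threaded_sum and the manual chunking are just my guess at the right shape):
function threaded_sum(v)
    nt = Threads.nthreads()
    # one contiguous chunk of indices per task, no shared state inside the loop
    chunks = Iterators.partition(eachindex(v), cld(length(v), nt))
    tasks = map(chunks) do idx
        Threads.@spawn begin
            s = zero(eltype(v))
            @inbounds for i in idx
                s += v[i]              # purely local accumulation
            end
            s
        end
    end
    return sum(fetch, tasks)           # combine the per-task partial sums
end
Though I suspect filling and summing 250 million Float64 is mostly memory-bandwidth bound anyway, so I’m not sure how much of a speed-up I should expect even when I get it right.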