(1) There is some discussion in this old thread about building arrays in a thread-safe way. The solution proposed there is to allocate Threads.nthreads() separate arrays, let each thread freely push! onto its own array, and then append all of the arrays together. The example there, however, uses vectors of simple bits types. I would like to know whether this solution would also work for non-bits types. Basically, I would like to construct the following in a threaded but thread-safe way:
out = [TypeConstructor(X,ii) for ii=1:1_000]
where TypeConstructor produces a struct that takes an unknown amount of memory, rather than a bits type.
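For concreteness, here is a minimal sketch of what I have in mind for the per-thread-array approach (it assumes the TypeConstructor and X from above; the contiguous chunking is just one way to split the work):

function threaded_build(X, n)
    nt = Threads.nthreads()
    parts = [Vector{TypeConstructor}() for _ in 1:nt]  # one private vector per thread
    Threads.@threads for t = 1:nt
        lo = div((t - 1) * n, nt) + 1
        hi = div(t * n, nt)
        for ii = lo:hi
            push!(parts[t], TypeConstructor(X, ii))    # only this task touches parts[t]
        end
    end
    return reduce(vcat, parts)                         # same order as 1:n
end

My understanding is that push! from different threads never races here, since each part is a distinct vector, and a non-bits element type only means the vectors store references; but I would like confirmation.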
(2) Are there best practices for avoiding false sharing in Julia, or is it entirely up to the compiler? I’ve noticed that many of the simple multithreaded algorithms I use (and some that I’ve seen on Discourse) are potentially vulnerable to false sharing. For example, in doing a multithreaded sum, the most natural solution to me would look something like:
function threadsum(x)
    Tsum = zeros(eltype(x), Threads.nthreads())
    Threads.@threads for ii = 1:length(x)
        @inbounds Tsum[Threads.threadid()] += x[ii]
    end
    return sum(Tsum)
end
However, this seems like it would be vulnerable to false sharing, since the elements of Tsum are all close to each other in memory. (And in fact, this function doesn’t outperform either the built-in sum or even a hand-written serial summation by much.) Any advice or guidance on how to avoid false sharing and other potential pitfalls would be greatly appreciated. I do a lot of work on problems that (at least on paper) seem obviously shared-memory-parallelizable, but I have had mixed results getting anything close to linear speedups from Threads.@threads.
Thanks for asking this. I have discussed false sharing before as well, and I do not know the right solution. As far as I know, cores share cache lines, so allocating an array of length nthreads() * cacheline_length and having each thread touch only the start of its own cache line might help. But I am far from an expert here.
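In code, that idea would be something like this (a sketch only, assuming 64-byte cache lines and Float64 elements, so 8 elements per line; the name padded_threadsum is made up):

function padded_threadsum(x::Vector{Float64})
    pad = 8                                   # 64 bytes / sizeof(Float64)
    acc = zeros(Float64, pad * Threads.nthreads())
    Threads.@threads for ii = 1:length(x)
        # each thread writes only the first slot of its own cache line;
        # slots 64 bytes apart can never land on the same line
        @inbounds acc[(Threads.threadid() - 1) * pad + 1] += x[ii]
    end
    return sum(acc)
end

Note that this still does one memory write per iteration; it only removes the sharing, not the traffic.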
One possible solution might be a vector of pointers to different points in memory, so that each thread’s sum is stored far away from the others’. However, I am not sure of the best way to implement this.
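The closest I can come up with is something like this (a sketch only; each Ref is a separate heap allocation, although nothing guarantees the allocator actually spreads them across different cache lines):

function ref_threadsum(x)
    # one separately allocated Ref cell per thread
    accs = [Ref(zero(eltype(x))) for _ in 1:Threads.nthreads()]
    Threads.@threads for ii = 1:length(x)
        @inbounds accs[Threads.threadid()][] += x[ii]
    end
    return sum(r[] for r in accs)
end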
@Tomas_Pevny, interesting. I will look at SIMDArrays. It seems like what we really want is some sort of thread-specific memory. I feel like it would be a huge boon to have even just a few bytes cordoned off for each thread to use as a local counter that is very distant in memory from the other local counters.
I think this PR did something similar to what you suggested, but it’s a bit above my level of comprehension. I tried to implement a poor man’s version of placing a buffer between the used parts of Tsum (e.g. making Tsum a 100_000 × Threads.nthreads() matrix and only actually using its first row) but did not find any performance improvement.
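Concretely, the buffered version I tried looked roughly like this (a sketch of what I described above; x is the input vector as before):

Tsum = zeros(eltype(x), 100_000, Threads.nthreads())
Threads.@threads for ii = 1:length(x)
    # only row 1 is used; Julia is column-major, so consecutive threads'
    # accumulators sit 100_000 elements apart in memory
    @inbounds Tsum[1, Threads.threadid()] += x[ii]
end
total = sum(@view Tsum[1, :])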
This is what I would do (if I understand your problem correctly):

x = Vector{TypeConstructor}(undef, 1_000)
Threads.@threads for i = 1:length(x)
    @inbounds x[i] = TypeConstructor(X, i)
end

This seems to be the simplest approach: the output is preallocated, so each thread writes only to its own slots.
Ad 2. A cache line is 64 bytes on most processors, so what you can do is make sure there is that much separation in your data (sometimes less is enough, but 64 bytes is safe). You can either add “dummy” entries to your array or make the data structure you hold in the array a bit bigger, again with a “dummy” field.
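A minimal sketch of the “dummy” field variant (PaddedAcc is just an illustrative name; the sizes assume Float64 payloads and 64-byte lines, and without guaranteed 64-byte alignment two neighbouring structs can still straddle one shared line, though padding makes contention far less likely):

mutable struct PaddedAcc
    value::Float64
    _pad::NTuple{7,Float64}   # dummy field that pads the struct to 64 bytes
end
PaddedAcc() = PaddedAcc(0.0, ntuple(_ -> 0.0, 7))

function padded_struct_sum(x::Vector{Float64})
    # each mutable struct is its own heap object, padded to a full line
    accs = [PaddedAcc() for _ in 1:Threads.nthreads()]
    Threads.@threads for ii = 1:length(x)
        @inbounds accs[Threads.threadid()].value += x[ii]
    end
    return sum(a.value for a in accs)
end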
However, for general threading you might hit NUMA issues, and in such cases it is better to do thread-local accumulation, something like:
function threadsum(x)
    Tsum = zeros(eltype(x), Threads.nthreads())
    Threads.@threads for ii = 1:Threads.nthreads()
        tmp = zero(eltype(x))
        # sum the contiguous chunk of x assigned to thread ii
        lo = div((ii - 1) * length(x), Threads.nthreads()) + 1
        hi = div(ii * length(x), Threads.nthreads())
        for jj = lo:hi
            @inbounds tmp += x[jj]
        end
        # a single write per thread: tmp lives in a register, so the
        # shared Tsum is touched only once and false sharing is negligible
        Tsum[Threads.threadid()] = tmp
    end
    return sum(Tsum)
end