I’m happy to announce a new package, ConcurrentCollections.jl, which implements several thread-safe collections for Julia 1.7:
- `DualLinkedConcurrentRingQueue`
- `DualLinkedQueue`
- `LinkedConcurrentRingQueue`
- `ConcurrentQueue`
- `ConcurrentStack`
- `WorkStealingDeque`
- `ConcurrentDict`
These collections support non-blocking (“lock-free”) APIs wherever appropriate. As a result, they are useful for low-level optimization of communication-intensive multi-threaded programs.
You can install it by simply typing `using ConcurrentCollections` and pressing Enter twice in Julia 1.7 (thanks to the new automatic package installation).
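For example, a first session looks something like this (a minimal sketch; the exact constructor signature shown here is an assumption, so see the package documentation):

```julia
julia> using ConcurrentCollections

julia> q = DualLinkedConcurrentRingQueue{Int}();  # assumed constructor form

julia> push!(q, 42);

julia> popfirst!(q)  # blocks until an item is available
42
```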
ConcurrentCollections.jl is built on top of the shiny new atomics API in Julia 1.7. For more information on Julia’s new atomics capabilities, see the JuliaCon 2021 talk “Atomic fields: the new primitives on the block” by Jameson Nash (YouTube).
Benchmarks
Dual Linked Concurrent Ring Queue (LCRQ)
ConcurrentCollections.jl implements the “almost non-blocking” dual LCRQ based on Izraelevitz and Scott (2017). This is similar to `Base.Channel`; i.e., a first-in-first-out container where `popfirst!` will wait until an item is available. I implemented a “hot potato” benchmark similar to theirs and compared the result with the performance of `Base.Channel`.
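Roughly speaking, the benchmark has the following shape (a simplified sketch using `Base.Channel`; the dual LCRQ variant is the same loop with `popfirst!`/`push!` on a `DualLinkedConcurrentRingQueue`, and the actual harness and parameters differ):

```julia
using Base.Threads

# Simplified "hot potato" benchmark: every task repeatedly takes an item
# off the shared queue and puts one back, so the measured throughput is
# dominated by the queue's synchronization cost.
function hot_potato!(queue; ntasks = nthreads(), nrepeat = 100_000)
    for _ in 1:ntasks
        put!(queue, 1)  # seed one "potato" per task
    end
    @sync for _ in 1:ntasks
        @spawn for _ in 1:nrepeat
            x = take!(queue)    # dual LCRQ: x = popfirst!(queue)
            put!(queue, x + 1)  # dual LCRQ: push!(queue, x + 1)
        end
    end
end

@time hot_potato!(Channel{Int}(Inf))  # Inf capacity: put! never blocks
```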
As you can see, the dual LCRQ (orange) has a much higher throughput than `Base.Channel` (blue). The dual LCRQ performance is also comparable to a “hardware limit” (red), which simply executes atomic fetch-and-increment (FAI) instructions repeatedly from multiple Julia tasks. Note that the original C++ implementation by Izraelevitz and Scott (2017) exhibits performance much closer to the FAI throughput, so there is still room for improvement [1]. But I’m glad that we can already write concurrent algorithms in Julia that stay within an order of magnitude of the hardware limit.
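For reference, the FAI baseline boils down to something like the following (a sketch using the new atomics API; the actual harness differs):

```julia
using Base.Threads

mutable struct Counter
    @atomic n::Int  # Julia 1.7 atomic field
end

# Tasks hammer a single shared counter with atomic fetch-and-increment;
# the resulting throughput approximates what the hardware can sustain for
# contended atomic read-modify-write operations.
function fai_limit(; ntasks = nthreads(), nrepeat = 100_000)
    counter = Counter(0)
    @sync for _ in 1:ntasks
        @spawn for _ in 1:nrepeat
            @atomic counter.n += 1  # lowers to an atomic fetch-and-add
        end
    end
    return @atomic counter.n
end
```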
Note that this dual LCRQ implementation has a much higher memory footprint than `Base.Channel`. Also, it doesn’t implement the `close` API yet.
ConcurrentDict
ConcurrentCollections.jl also implements a mostly lock-free dictionary, inspired by Maier et al. (2019) and Click (2007). As a benchmark, I implemented a histogram (“`countmap`”) function using `ConcurrentDict`. For comparison, I also implemented it using `Base.Dict`, based on the data-parallel approach that I usually recommend.
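The two implementations look roughly like this (simplified sketches, not the exact benchmark code; `modify!` atomically updates the value stored at a key, and returning `Some(new_value)` from the callback stores it, but check the package documentation for the exact protocol):

```julia
using ConcurrentCollections
using Base.Threads

# countmap with a single shared ConcurrentDict: all tasks update one
# dictionary, using modify! to increment a key's count atomically.
function countmap_concurrent(data; ntasks = nthreads())
    dict = ConcurrentDict{eltype(data),Int}()
    @sync for chunk in Iterators.partition(data, cld(length(data), ntasks))
        @spawn for x in chunk
            modify!(dict, x) do count
                Some(something(count, 0) + 1)  # `count === nothing` for a new key
            end
        end
    end
    return dict
end

# countmap with task-local Base.Dicts: each task counts its own chunk and
# the partial histograms are merged at the end.
function countmap_tasklocal(data; ntasks = nthreads())
    tasks = map(Iterators.partition(data, cld(length(data), ntasks))) do chunk
        @spawn begin
            counts = Dict{eltype(data),Int}()
            for x in chunk
                counts[x] = get(counts, x, 0) + 1
            end
            counts
        end
    end
    return mergewith!(+, fetch.(tasks)...)
end
```

(The actual divide-and-conquer version merges the partial dictionaries pairwise in a tree rather than all at once, but the flat partition-then-merge above shows the idea.)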
As you can see, `ConcurrentDict` (right) performs well across a wide range of access patterns (color). On the other hand, using task-local `Base.Dict`s in a divide-and-conquer fashion (left) performs better when there are many elements per slot (higher “Access density” [2]). This is presumably because the output dictionaries are small and the cost of merging them does not dominate. However, if there is virtually no overlap between the keys processed by different tasks (lower “Access density”), the cost of merging the dictionaries dominates and you can’t expect a speedup. You can also see that `ConcurrentDict` is slower in the single-task case because the concurrent algorithm has extra overhead compared to the sequential algorithm in `Base.Dict`. Overall, I think this benchmark indicates that “lock-free” is not always a win. However, one scenario where it can shine is when you need to update a data structure concurrently and it is too large for each task to have its own working copy.
[1] Quick profiling suggests that a non-negligible amount of the time is spent zeroing out the arrays of boxed values (required for GC). Creating the arrays using `calloc` may be an interesting optimization (ref: Faster zeros with calloc). It may also be worth considering a slightly more lockful approach to reduce the memory footprint and thus the zeroing cost. ↩︎

[2] The “Access density” reflects how many duplicated elements exist in the data. More precisely, it is defined as `datasize / nkeys`, where I generate the data using

```julia
ints = rand(1:nkeys, datasize)
lastkey = string(nkeys)
data = string.(ints; pad = length(lastkey))
```

Since `datasize = 2^19` is fixed across the benchmarks, the access density also dictates the output dictionary size. ↩︎