I’m happy to announce a new package, ConcurrentCollections.jl, which implements several thread-safe collections for Julia 1.7:
- `DualLinkedConcurrentRingQueue`
- `DualLinkedQueue`
- `LinkedConcurrentRingQueue`
- `ConcurrentQueue`
- `ConcurrentStack`
- `WorkStealingDeque`
- `ConcurrentDict`
These collections support non-blocking (“lock-free”) APIs wherever appropriate. As a result, they are useful for low-level optimization of communication-intensive multi-threaded programs.
You can install it by simply typing `using ConcurrentCollections` and pressing Enter twice in Julia 1.7 (thanks to the new automatic package installation).
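For example, a first session looks something like this (a minimal sketch; the exact constructor signature shown here is an assumption, so see the package documentation):

```julia
julia> using ConcurrentCollections

julia> q = DualLinkedConcurrentRingQueue{Int}();  # assumed constructor form

julia> push!(q, 42);

julia> popfirst!(q)  # blocks until an item is available
42
```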
ConcurrentCollections.jl is built on top of the shiny new atomics API in Julia 1.7. For more information on Julia’s new atomics capabilities, see the JuliaCon 2021 talk “Atomic fields: the new primitives on the block” by Jameson Nash (YouTube).
Benchmarks
Dual Linked Concurrent Ring Queue (LCRQ)
ConcurrentCollections.jl implements the “almost non-blocking” dual LCRQ based on Izraelevitz and Scott (2017). This is similar to `Base.Channel`; i.e., a first-in-first-out container where `popfirst!` will wait until an item is available. I implemented a “hot potato” benchmark similar to theirs and compared the result with the performance of `Base.Channel`.
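Roughly speaking, the benchmark has the following shape (a simplified sketch using `Base.Channel`; the dual LCRQ variant is the same loop with `popfirst!`/`push!` on a `DualLinkedConcurrentRingQueue`, and the actual harness and parameters differ):

```julia
using Base.Threads

# Simplified "hot potato" benchmark: every task repeatedly takes an item
# off the shared queue and puts one back, so the measured throughput is
# dominated by the queue's synchronization cost.
function hot_potato!(queue; ntasks = nthreads(), nrepeat = 100_000)
    for _ in 1:ntasks
        put!(queue, 1)  # seed one "potato" per task
    end
    @sync for _ in 1:ntasks
        @spawn for _ in 1:nrepeat
            x = take!(queue)    # dual LCRQ: x = popfirst!(queue)
            put!(queue, x + 1)  # dual LCRQ: push!(queue, x + 1)
        end
    end
end

@time hot_potato!(Channel{Int}(Inf))  # Inf capacity: put! never blocks
```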
As you can see, the dual LCRQ (orange) has a much higher throughput than `Base.Channel` (blue). The dual LCRQ performance is also comparable to a “hardware limit” (red), which simply executes atomic fetch-and-increment (FAI) instructions repeatedly from multiple Julia tasks. Note that the original C++ implementation by Izraelevitz and Scott (2017) exhibits performance much closer to the FAI throughput, so there is still room for improvement [1]. But I’m glad that we can already write concurrent algorithms in Julia that stay within an order of magnitude of the hardware limit.
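For reference, the FAI baseline boils down to something like the following (a sketch using the new atomics API; the actual harness differs):

```julia
using Base.Threads

mutable struct Counter
    @atomic n::Int  # Julia 1.7 atomic field
end

# Tasks hammer a single shared counter with atomic fetch-and-increment;
# the resulting throughput approximates what the hardware can sustain for
# contended atomic read-modify-write operations.
function fai_limit(; ntasks = nthreads(), nrepeat = 100_000)
    counter = Counter(0)
    @sync for _ in 1:ntasks
        @spawn for _ in 1:nrepeat
            @atomic counter.n += 1  # lowers to an atomic fetch-and-add
        end
    end
    return @atomic counter.n
end
```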
Note that this dual LCRQ implementation has a much higher memory footprint than `Base.Channel`. Also, it doesn’t implement the `close` API yet.
ConcurrentDict
ConcurrentCollections.jl also implements a mostly lock-free dictionary, inspired by Maier et al. (2019) and Click (2007). As a benchmark, I implemented a histogram (“`countmap`”) function using `ConcurrentDict`. For comparison, I also implemented it using `Base.Dict`, based on the data-parallel approach that I usually recommend.
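The two implementations look roughly like this (simplified sketches, not the exact benchmark code; `modify!` atomically updates the value stored at a key, and returning `Some(new_value)` from the callback stores it, but check the package documentation for the exact protocol):

```julia
using ConcurrentCollections
using Base.Threads

# countmap with a single shared ConcurrentDict: all tasks update one
# dictionary, using modify! to increment a key's count atomically.
function countmap_concurrent(data; ntasks = nthreads())
    dict = ConcurrentDict{eltype(data),Int}()
    @sync for chunk in Iterators.partition(data, cld(length(data), ntasks))
        @spawn for x in chunk
            modify!(dict, x) do count
                Some(something(count, 0) + 1)  # `count === nothing` for a new key
            end
        end
    end
    return dict
end

# countmap with task-local Base.Dicts: each task counts its own chunk and
# the partial histograms are merged at the end.
function countmap_tasklocal(data; ntasks = nthreads())
    tasks = map(Iterators.partition(data, cld(length(data), ntasks))) do chunk
        @spawn begin
            counts = Dict{eltype(data),Int}()
            for x in chunk
                counts[x] = get(counts, x, 0) + 1
            end
            counts
        end
    end
    return mergewith!(+, fetch.(tasks)...)
end
```

(The actual divide-and-conquer version merges the partial dictionaries pairwise in a tree rather than all at once, but the flat partition-then-merge above shows the idea.)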
As you can see, `ConcurrentDict` (right) performs well across a wide range of access patterns (color). On the other hand, using task-local `Base.Dict`s in a divide-and-conquer fashion (left) performs better when there are many elements per slot (higher “Access density” [2]). This is presumably because the output dictionaries are small and the cost of merging them does not dominate. However, if there is virtually no overlap between the keys processed by different tasks (lower “Access density”), the cost of merging the dictionaries dominates and you can’t expect a speedup. You can also see that `ConcurrentDict` is slower in the single-task case because the concurrent algorithm has extra overhead compared to the sequential algorithm in `Base.Dict`. Overall, I think this benchmark indicates that “lock-free” is not always a win. However, one scenario where it can shine is when you need to update a data structure concurrently and it is too large for each task to have its own working copy.
[1] Quick profiling suggests that a non-negligible amount of the time is spent zeroing out the arrays of boxed values (required for GC). Creating the arrays using `calloc` may be an interesting optimization (ref: Faster zeros with calloc). It may also be worth considering a slightly more lockful approach to reduce the memory footprint and thus the zeroing cost. ↩︎

[2] The “Access density” reflects how many duplicated elements exist in the data. More precisely, it is defined as `datasize / nkeys`, where I generate the data using

```julia
ints = rand(1:nkeys, datasize)
lastkey = string(nkeys)
data = string.(ints; pad = length(lastkey))
```

Since `datasize = 2^19` is fixed across the benchmarks, the access density also dictates the output dictionary size. ↩︎