Readers–writer lock using DataFrame


I have an application where one async task updates a DataFrame with new data every minute, while multiple tasks read the DataFrame at random times. I need to make this thread-safe, possibly by using a lock, to prevent reads during updates.

A ReentrantLock() could work, but it would also block concurrent reads, which DataFrame can handle inherently. I’m considering using semaphores or perhaps a suitable readers-writer lock to address this.

Any suggestions, pointers, or examples to achieve this without compromising the performance of concurrent reads?

Here is an example of concurrent readers and writers with duckdb in case you want to persist the data as well.

I needed a RW lock for AllocArrays.jl and settled on the one in ConcurrentUtilities.jl
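For anyone wondering what such a lock does under the hood, here is a minimal sketch built only on Base's Threads.Condition. It is not the ConcurrentUtilities.jl implementation; the names SimpleRWLock, read_locked and write_locked are made up for illustration, and this version favours readers, so a steady stream of readers can starve the writer.

```julia
# Hypothetical minimal readers-writer lock (reader preference), Base only.
mutable struct SimpleRWLock
    cond::Threads.Condition  # protects the two counters below
    readers::Int             # number of readers currently inside
    writer::Bool             # true while a writer is inside
end
SimpleRWLock() = SimpleRWLock(Threads.Condition(), 0, false)

# Run f() while holding the lock in shared (read) mode.
function read_locked(f, rw::SimpleRWLock)
    lock(rw.cond)
    try
        while rw.writer          # readers only wait for an active writer
            wait(rw.cond)
        end
        rw.readers += 1
    finally
        unlock(rw.cond)
    end
    try
        return f()
    finally
        lock(rw.cond)
        try
            rw.readers -= 1
            rw.readers == 0 && notify(rw.cond)  # wake a waiting writer
        finally
            unlock(rw.cond)
        end
    end
end

# Run f() while holding the lock in exclusive (write) mode.
function write_locked(f, rw::SimpleRWLock)
    lock(rw.cond)
    try
        while rw.writer || rw.readers > 0   # wait until nobody is inside
            wait(rw.cond)
        end
        rw.writer = true
    finally
        unlock(rw.cond)
    end
    try
        return f()
    finally
        lock(rw.cond)
        try
            rw.writer = false
            notify(rw.cond)                 # wake all waiting readers/writers
        finally
            unlock(rw.cond)
        end
    end
end

# Usage: a Dict of columns stands in for the DataFrame.
rw = SimpleRWLock()
table = Dict(:x => [1.0, 2.0])
write_locked(rw) do
    push!(table[:x], 3.0)
end
s = read_locked(rw) do
    sum(table[:x])
end
```

In real code you would probably reach for the package version rather than maintain something like this yourself.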


Thanks for the pointer. That looks very interesting.

Thanks, that looks very interesting. I see that Julia v1.11 may have some of this functionality in Base, but perhaps not the ReadWriteLock() part.

Depending on your specific requirements, also consider copy-on-write.

The idea of copy-on-write is that on updates, you construct a new dataframe (reusing unchanged arrays of the old one), and you have some object like

mutable struct AtomicContainer
    @atomic contents::Any
end

and after your new dataframe is constructed, you insert it into the AtomicContainer. Readers have a long-lived reference to the AtomicContainer, and can then read the contents and dispatch (function barrier to cure the type instability!) with the current immutable snapshot of the data.
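A minimal sketch of that pattern, assuming a single writer task. To keep it dependency-free, a NamedTuple of column vectors stands in for the DataFrame, and publish!, total and read_total are hypothetical helper names:

```julia
mutable struct AtomicContainer
    @atomic contents::Any
end

# Stand-in for a DataFrame (assumption): a NamedTuple of column vectors.
snapshot0 = (t = [1, 2, 3], x = [10.0, 20.0, 30.0])
box = AtomicContainer(snapshot0)

# Writer: build a new snapshot without mutating the old one, then publish it.
# Assumes there is only one writer, so load-then-store is not racy.
function publish!(box::AtomicContainer, new_t, new_x)
    old = @atomic box.contents
    new = (t = vcat(old.t, new_t), x = vcat(old.x, new_x))  # fresh arrays
    @atomic box.contents = new
    return nothing
end

# Function barrier: this method gets compiled for the concrete snapshot type.
total(snap) = sum(snap.x)

# Reader: grab the current snapshot once, then work on it without any lock.
read_total(box::AtomicContainer) = total(@atomic box.contents)

publish!(box, [4], [40.0])
println(read_total(box))
```

A reader that loaded the old snapshot before the publish keeps seeing a consistent (if stale) table, because the old arrays are never modified.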

The relevant considerations for this are:

  1. If you have a reader who wants to do stuff and an update is underway, is it preferable to block until the update is done or is it preferable to do your stuff on a consistent-but-potentially-stale version?
  2. What is your relation between readers and writers, in terms of volume?
  3. Can you afford the additional GC pressure from copy-on-write? Especially, how real-time-ish are your readers?

The big problems with readers-writer locks are that:

  1. Multiple concurrent readers don’t block each other, but the CPU cores involved still need to play tug-of-war over which core owns the cache line holding the lock.
  2. If your update is big, then you either have a long critical section, i.e. long blocking of readers, or your readers can see inconsistent states (because you relinquish the lock in the middle). Whether this is a problem depends on your answer to (1).
  3. There is a big question for your readers: take the lock for a long time (long critical section) or take it often for short times? Taking it for a long time may block the writer; taking it often causes tug-of-war on the cache line between multiple readers. Your critical section of course needs to be long enough to span the required consistency (i.e. the entire “transaction”).

PS. The above example uses Any as type for the contents. This is super defensive programming of me, because consider the following:

mutable struct AtomicContainerYolo{T}
    @atomic contents::T
end

can introduce a lock in AtomicContainerYolo((1,2,3)) in some Julia versions. That is because your hardware only supports atomics of up to 128 bits, and Julia made the (imo extremely ill-considered) design decision to imitate C++ and implement atomic variables that the hardware cannot support with hidden locks instead of boxing / copy-on-write. C++ has the excuse of “no GC in the runtime, cannot CoW”, but Julia doesn’t. Making the contents field abstractly typed forces the compiler to box it, which is exactly what you want for anything larger than 128 bits.
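To see the difference between the two declarations, you can inspect the field types. This sketch only shows what is stored in each field; whether a hidden lock is actually used is a runtime detail that depends on the Julia version and hardware, and it is not checked here.

```julia
mutable struct AtomicContainerYolo{T}
    @atomic contents::T
end

mutable struct AtomicContainer
    @atomic contents::Any
end

a = AtomicContainerYolo((1, 2, 3))
b = AtomicContainer((1, 2, 3))

# Parametric version: the field is a concrete isbits tuple stored inline.
# On a 64-bit machine that is 3 * 8 bytes = 192 bits, more than 128.
yolo_field = fieldtype(typeof(a), :contents)
println(yolo_field, "  isbits: ", isbitstype(yolo_field),
        "  bytes: ", sizeof(yolo_field))

# Abstractly typed version: the field holds a reference to a boxed value,
# so an atomic store only has to swap one pointer.
any_field = fieldtype(typeof(b), :contents)
println(any_field)

# Both are used the same way from the outside.
@atomic b.contents = (4, 5, 6)
println(@atomic b.contents)
```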

PPS. I am not recommending the use of persistent (aka functional, aka non-overwriting) data structures. Instead, exploit the fact that your problem is very specific: your updates/writes come in large batches. So for every update/write you need to identify which parts of your dataframe are modified, and you might want to change your data layout to minimize the modified parts.
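As a rough illustration of reusing the unmodified parts, here is a sketch in which only one column changes in a batch. A NamedTuple of vectors again stands in for the dataframe, and the column names are made up. With DataFrames.jl, the analogous step would presumably be to build the new table from the old column vectors with copycols=false.

```julia
# Old snapshot: three columns.
old = (id = [1, 2, 3], price = [10.0, 11.0, 12.0], name = ["a", "b", "c"])

# This batch only touches `price`, so only that column gets a new array.
new_price = old.price .+ 0.5          # broadcast allocates a fresh vector
new = merge(old, (price = new_price,))

# The untouched columns are shared by reference, not copied.
println(new.id === old.id)
println(new.name === old.name)

# The old snapshot is unchanged, so readers still holding it stay consistent.
println(old.price)
```

The fewer columns (or column chunks) a typical update touches, the less copying and GC pressure each new snapshot costs.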


Yeah, we only added Lockable to Base, but not the ReadWriteLock.