Dictionary for thread-safe, limited memoization

Tamas_Papp · May 11, 2026, 12:23pm

I am looking for a Dict-like type which

allows me to keep only the last N values inserted, for some value of N::Int,
is thread-safe.

Background: memoization, where the resulting objects are large and I am unlikely to reuse more than the last N, even though I am doing these calculations M \gg N times and just using a Dict or similar would be wasting/exhausting memory. I don’t need memorization macros etc, just the data structure.

lilachint · May 11, 2026, 12:48pm

Do we really need thread-safe data structures for memoization? I mean, the operations on data should be idempotent for memoization. I’m not an expert, so tell me if I’m wrong.

ericphanson · May 11, 2026, 12:49pm

Oscar_Smith · May 11, 2026, 12:49pm

Do you need it to store exactly the last N, or can it be probabilistic? that would be a lot easier. Also, what do you mean by thread safe? do you want to support multiple concurrent writers and readers? that will be very hard to do efficiently under heavy contention.

Tamas_Papp · May 11, 2026, 1:18pm

Yes, at least the underlying data structure should be thread-safe, because concurrent writes could corrupt it etc.

Probabilistic would be OK, but since @ericphanson’s solution does exactly what I asked for I am going with that.

No heavy threading or clashing is expected, I just need some basic protections for the possibility of concurrent access.

PS.: I continue to be amazed at Julia’s package ecosystem. 80% of the time when I think of a problem, it turns out that someone has implemented a solution years ago.

Oscar_Smith · May 11, 2026, 1:38pm

This is exactly what I meant by probabilistic. It has capacity N, but doesn’t store the most recent N, it just overwrites old entries in existing slots.

Tamas_Papp · May 11, 2026, 1:41pm

Sorry, I am confused, I thought the LRU implementation was deterministic. “Last” is defined by wall clock time, even if it is threaded. That suits me fine.

Oscar_Smith · May 11, 2026, 1:54pm

Probabilistic may not be quite the right word. LRUs are deterministic, but not in that they store the most recent N elements. That would be fairly expensive to support, so instead, they are just an N element hash table that instead of using separate chaining or probing, will replace the element on insertion.

As such, you end up with N buckets, each of which store the most recent insert into that bucket, which is a reasonable and cheap to maintain approximation of the N most recently stored items, but isn’t exactly the most recent N items. e.g. if you write N items into one, you will only have stored ~0.6*n elements with the rest being duplicates.

Sukera · May 11, 2026, 2:19pm

I think you’re mixing something up. A LRU (Least Recently Used) cache replaces the element that hasn’t been accessed for the longest time when a new element is inserted. There is nothing probabilistic about it, and there is no bucketing going on. It’s not backed by a hash table.

Note that “time” here is not necessarily true wall-clock time, it can just be an age counter that is incremented.

foobar_lv2 · May 11, 2026, 2:20pm

Wall-clock time is problematic: We live in a relativistic world; events can be time-like (on happens before, one after) or space-like; and any notion of “happens before in wall-clock time” for space-like events is subjective.

This may sound like ridiculous nit-picking, but a clock-cycle is ~6 cm, and intuitions from physics underpin the intuitions we use for concurrency, like e.g. Lamport clocks.

More concretely: If you want “Last” in a shared LRU to be well-defined, then all involved cores need to talk to each other in some way, which costs latency / power. Some of that can be speculated around, but it sucks. The situation becomes much worse when you’re not talking about multiple cores on the same CPU to synchronize on some “sequentially consistent” ordering, but are instead thinking about distributed systems (in these cases, your events have good chance to be actually space-like, not just “effectively space-like because of processing latencies”).

So I think the way to go is to use a lock-free variant of LRUCache.jl, accepting that you sometimes compute a value twice due to races and that “least recently” is a probabilistic guarantee.

LRUCache.jl acquires a single central lock on read-only access, which means that all accesses are sequenced and slow.

If this is good enough for you, then yay, no reason to look further. Just keep in mind that this will scale horribly to many-core CPUs.

langestefan · May 11, 2026, 2:23pm

Using something like hash(key) % N? And then the same mechanism to retrieve data from the LRU?

You get the 0.6 from a binomial? I’m not sure what you mean by duplicates. Do you mean collisions?

ericphanson · May 11, 2026, 5:56pm

That’s not how LRUCache.jl works, did you check the code?

Tamas_Papp · May 12, 2026, 8:14am

I am open to suggestions about other packages; if you have something in mind I am wondering why aren’t you just saying what you suggest instead.

foobar_lv2 · May 12, 2026, 9:23am

I am open to suggestions about other packages; if you have something in mind I am wondering why aren’t you just saying what you suggest instead.

I don’t have any suggestions for alternative packages. Frankly, my overview over the julia ecosystem is not comprehensive enough.

I can take a quick glance and comment at the design decisions underlying LRUCache.jl though: Single table-global mutex that is taken on the read-path implies quite limited throughput. But if your expected throughput is < 1 M accesses per second, then this all doesn’t matter performance-wise.

I think I would be able to google something in the java, Rust or C/C++ world, and it is an interesting design challenge.

Tamas_Papp · May 25, 2026, 10:59am

I ended up not using LRUCache.jl for two reasons:

I am running into an issue with it an EnzymeTestUtils.jl,
I realized a simple solution where I put items in a dict then cull to a minimum size min_size when it’s size reaches some threshold max_size is fine for me even if max_size ≫ min_size (eg by a factor of 10) is fine for my purposes, all I need is not growing beyond a certain bound.

As for threading, I benchmarked the code, and for real-life problems, calculating the cached value takes about 10^4–10^6 more time (!) than dictionary access [that’s why I am caching it ], and as long as only dict access locks (haskey, getindex, setindex!), it is unlikely to ever become a problem.

So for now I am using ThreadSafeDicts.jl, which seems to work fine. But I am open to other suggestions.

tecosaur · May 25, 2026, 4:41pm

Just for fun, I’ll leave a breadcrumb here that I’ve got some work in progress on a concurrent LRU cache. The work’s not ready yet, but if you’re interested do get in touch

Tamas_Papp · May 26, 2026, 11:12am

Yes, I am interested. Are you aware of

which has a

Concurrent dictionary. All operations are lock-free except when the dictionary is resized.

I haven’t had time to understand it better yet but it looks interesting.

tecosaur · May 27, 2026, 3:00am

I am indeed, but I’m currently pursuing a custom Dict implementation. There are essentially three motivations for this:

Being able to get a stable index of an entry in the dictionary, and evict by index
- Alternative: store the keys in the LRU list, but that’s rather inefficient
- This also requires no resizing/rehashing ever happens
Being able to use a pre-computed hash
Being able to use an abstract value store (e.g. a disk-based system)

I’m currently most interested in trying a design based on DLHT (https://arxiv.org/pdf/2406.09986), which uses a limited closed addressing approach and cmpxchg16b.

My main concerns with this are single-threaded performance and space efficiency. Currently if I shove 100 random int => short random string pairs into the cache, I find that:

LRUCache.jl takes up ~12.5 KiB
My LRU + custom dict takes up ~5.2 KiB

I suspect a DLHT based custom dict will be a fair bit larger, but with much better concurrent performance.

It’s worth noting that it’s not just a concurrent dict you need, since the LRU needs to be updated each access too. I’m currently thinking about thread-local stored updates which are periodically pushed to the main LRU on a lock (see BP-Wrapper: BP-Wrapper: A System Framework Making Any Replacement Algorithms (Almost) Lock Contention Free | IEEE Conference Publication | IEEE Xplore), is probably the best approach to start with. I believe this is used in a bunch of popular Go caches, such as Ristretto, Theine, and Otter.

foobar_lv2 · May 27, 2026, 9:04am

Cool paper, thanks!

DLHT is very much based on seqlocks. If you make progress on that, I’d be very interested to see that!

I think we need some more language support for seqlocks / ticket-locks for non-bitstype keys/values first?

Seqlocks canonically work like:

function readThing(lock, valueref)
  while true
    first_seqnum = @atomic :acquire lock.seqnum
    if isodd(first_seqnum) 
      Base.pause()
      continue
    end
    value = valueref[]
    second_secnum = @atomic :release lock.seqnum
    second_secnum == first_seqnum && return value
    #value can be llvm-`undef`, i.e. an invalid out-of-thin-air pointer. If the garbage collector sees it, we are fucked.
  end
end

The two issues are:

There is no @atomic :release load
value is undef if a write raced us and we need to restart.

The fact that value is undef or poison doesn’t matter, since we don’t look at it. But the garbage collector can look at it!

And this means that somebody must ensure that we do the right thing: We could ensure that there is no GC safepoint between the speculative load of value and the commit, and putting the loaded value on the shadow/GC stack must be deferred until after we committed and before we actually reach the next safepoint. Or we could try for other invariants that make this safe.

Tamas_Papp · May 27, 2026, 10:06am

My use case is probably something like 20–100 values in the cache, each about 100KiB, so a few KiB of overhead is irrelevant for my purposes.

Caching/memoization gives me a such a significant speedup that it dwarfs every concern about dict implementation etc.

Also, I just realized that if I could ensure that each computation remains on its own thread, I could make my caches thread local.

But I am not sure about the semantics. Suppose that process A is on thread 1, and has local cache CacheA. Similarl, B on 2 has CacheB. Now if I make caches thread local, could A end up with CacheB and B with CacheA? Still no races, or corruption, just cache misses, which I don’t want. Can I prevent this? Or am I worrying about something irrelevant?

Topic		Replies	Views
Can dicts be threadsafe? General Usage multithreading	17	6609	September 22, 2022
Cached data and thread safety New to Julia multithreading	4	477	September 9, 2023
Length of Dict is not thread safe General Usage multithreading	5	749	January 20, 2022
[ANN] Memoization.jl Package Announcements memoize	12	2902	April 16, 2021
Dicts in different @async blocks New to Julia question	8	579	April 23, 2020

Dictionary for thread-safe, limited memoization

Related topics