allows me to keep only the last N values inserted, for some value of N::Int,
is thread-safe.
Background: memoization, where the resulting objects are large and I am unlikely to reuse more than the last N, even though I am doing these calculations M \gg N times and just using a Dict or similar would be wasting/exhausting memory. I don’t need memorization macros etc, just the data structure.
Do we really need thread-safe data structures for memoization? I mean, the operations on data should be idempotent for memoization. I’m not an expert, so tell me if I’m wrong.
Do you need it to store exactly the last N, or can it be probabilistic? that would be a lot easier. Also, what do you mean by thread safe? do you want to support multiple concurrent writers and readers? that will be very hard to do efficiently under heavy contention.
Yes, at least the underlying data structure should be thread-safe, because concurrent writes could corrupt it etc.
Probabilistic would be OK, but since @ericphanson’s solution does exactly what I asked for I am going with that.
No heavy threading or clashing is expected, I just need some basic protections for the possibility of concurrent access.
PS.: I continue to be amazed at Julia’s package ecosystem. 80% of the time when I think of a problem, it turns out that someone has implemented a solution years ago.
This is exactly what I meant by probabilistic. It has capacity N, but doesn’t store the most recent N, it just overwrites old entries in existing slots.
Sorry, I am confused, I thought the LRU implementation was deterministic. “Last” is defined by wall clock time, even if it is threaded. That suits me fine.
Probabilistic may not be quite the right word. LRUs are deterministic, but not in that they store the most recent N elements. That would be fairly expensive to support, so instead, they are just an N element hash table that instead of using separate chaining or probing, will replace the element on insertion.
As such, you end up with N buckets, each of which store the most recent insert into that bucket, which is a reasonable and cheap to maintain approximation of the N most recently stored items, but isn’t exactly the most recent N items. e.g. if you write N items into one, you will only have stored ~0.6*n elements with the rest being duplicates.
I think you’re mixing something up. A LRU (Least Recently Used) cache replaces the element that hasn’t been accessed for the longest time when a new element is inserted. There is nothing probabilistic about it, and there is no bucketing going on. It’s not backed by a hash table.
Note that “time” here is not necessarily true wall-clock time, it can just be an age counter that is incremented.
Wall-clock time is problematic: We live in a relativistic world; events can be time-like (on happens before, one after) or space-like; and any notion of “happens before in wall-clock time” for space-like events is subjective.
This may sound like ridiculous nit-picking, but a clock-cycle is ~6 cm, and intuitions from physics underpin the intuitions we use for concurrency, like e.g. Lamport clocks.
More concretely: If you want “Last” in a shared LRU to be well-defined, then all involved cores need to talk to each other in some way, which costs latency / power. Some of that can be speculated around, but it sucks. The situation becomes much worse when you’re not talking about multiple cores on the same CPU to synchronize on some “sequentially consistent” ordering, but are instead thinking about distributed systems (in these cases, your events have good chance to be actually space-like, not just “effectively space-like because of processing latencies”).
So I think the way to go is to use a lock-free variant of LRUCache.jl, accepting that you sometimes compute a value twice due to races and that “least recently” is a probabilistic guarantee.
I am open to suggestions about other packages; if you have something in mind I am wondering why aren’t you just saying what you suggest instead.
I don’t have any suggestions for alternative packages. Frankly, my overview over the julia ecosystem is not comprehensive enough.
I can take a quick glance and comment at the design decisions underlying LRUCache.jl though: Single table-global mutex that is taken on the read-path implies quite limited throughput. But if your expected throughput is < 1 M accesses per second, then this all doesn’t matter performance-wise.
I think I would be able to google something in the java, Rust or C/C++ world, and it is an interesting design challenge.
I am running into an issue with it an EnzymeTestUtils.jl,
I realized a simple solution where I put items in a dict then cull to a minimum size min_size when it’s size reaches some threshold max_size is fine for me even if max_size ≫ min_size (eg by a factor of 10) is fine for my purposes, all I need is not growing beyond a certain bound.
As for threading, I benchmarked the code, and for real-life problems, calculating the cached value takes about 10^4–10^6 more time (!) than dictionary access [that’s why I am caching it ], and as long as only dict access locks (haskey, getindex, setindex!), it is unlikely to ever become a problem.
So for now I am using ThreadSafeDicts.jl, which seems to work fine. But I am open to other suggestions.
Just for fun, I’ll leave a breadcrumb here that I’ve got some work in progress on a concurrent LRU cache. The work’s not ready yet, but if you’re interested do get in touch
I am indeed, but I’m currently pursuing a custom Dict implementation. There are essentially three motivations for this:
Being able to get a stable index of an entry in the dictionary, and evict by index
Alternative: store the keys in the LRU list, but that’s rather inefficient
This also requires no resizing/rehashing ever happens
Being able to use a pre-computed hash
Being able to use an abstract value store (e.g. a disk-based system)
I’m currently most interested in trying a design based on DLHT (https://arxiv.org/pdf/2406.09986), which uses a limited closed addressing approach and cmpxchg16b.
My main concerns with this are single-threaded performance and space efficiency. Currently if I shove 100 random int => short random string pairs into the cache, I find that:
LRUCache.jl takes up ~12.5 KiB
My LRU + custom dict takes up ~5.2 KiB
I suspect a DLHT based custom dict will be a fair bit larger, but with much better concurrent performance.
DLHT is very much based on seqlocks. If you make progress on that, I’d be very interested to see that!
I think we need some more language support for seqlocks / ticket-locks for non-bitstype keys/values first?
Seqlocks canonically work like:
function readThing(lock, valueref)
while true
first_seqnum = @atomic :acquire lock.seqnum
if isodd(first_seqnum)
Base.pause()
continue
end
value = valueref[]
second_secnum = @atomic :release lock.seqnum
second_secnum == first_seqnum && return value
#value can be llvm-`undef`, i.e. an invalid out-of-thin-air pointer. If the garbage collector sees it, we are fucked.
end
end
The two issues are:
There is no @atomic :release load
value is undef if a write raced us and we need to restart.
The fact that value is undef or poison doesn’t matter, since we don’t look at it. But the garbage collector can look at it!
And this means that somebody must ensure that we do the right thing: We could ensure that there is no GC safepoint between the speculative load of value and the commit, and putting the loaded value on the shadow/GC stack must be deferred until after we committed and before we actually reach the next safepoint. Or we could try for other invariants that make this safe.
My use case is probably something like 20–100 values in the cache, each about 100KiB, so a few KiB of overhead is irrelevant for my purposes.
Caching/memoization gives me a such a significant speedup that it dwarfs every concern about dict implementation etc.
Also, I just realized that if I could ensure that each computation remains on its own thread, I could make my caches thread local.
But I am not sure about the semantics. Suppose that process A is on thread 1, and has local cache CacheA. Similarl, B on 2 has CacheB. Now if I make caches thread local, could A end up with CacheB and B with CacheA? Still no races, or corruption, just cache misses, which I don’t want. Can I prevent this? Or am I worrying about something irrelevant?