A recent Slack thread discussed the purpose of task local storage (TLS) and scoped values. Since these features of Julia are a little obscure, @Krastanov suggested I post an explanation to Discourse for posterity.
@vchuravy already gave a talk on scoped values at JuliaCon 2024, which some people may prefer to a post like this.
TLS has been part of Julia since at least 2013, long before Julia 1.0. In contrast, scoped values are new in Julia 1.11, released about one year ago. Their use cases are distinct, but they exist for the same underlying reason: how to handle global, mutable state.
A background: Shared values
In Julia, nearly all the data we process is passed to functions as arguments, and is therefore part of the local scope of the function processing it:
data = [1, 2, 3]
sm = sum(data) # passed as variable to `sum`
print("The sum of data is: $sm")
This is the most useful pattern for accessing data, and therefore by far the most widespread.
Let’s look at some exceptions to this pattern.
Global constants
Suppose I want to compute the molecular weight of a poly-A tail[1]. For this, I need to access the molecular weight of adenosine monophosphate, water, hydroxyl, and the five-prime cap. For example:
function poly_a_tail_weight(
    nucleotides::Integer,
    amp_weight::Real,
    hydroxyl_weight::Real,
    water_weight::Real,
    five_prime_cap_weight::Real,
)
    return nucleotides * amp_weight -
        (nucleotides - 1) * water_weight +
        hydroxyl_weight +
        five_prime_cap_weight
end
While it works, this signature is kind of silly for two reasons:
- It feels semantically wrong that the molecular weights of these molecules are passed into the function call, because the weights are constant and have nothing to do with this particular instance of computation. In contrast, the number of nucleotides really is local information relevant to precisely this computation. These two kinds of information, constant knowledge and information local to the function, should be separated somehow.
- It’s annoying to pass all these arguments to the function: the arguments must then also be part of the caller’s signature, and that caller’s caller, and so on, all the way up the call chain. You would end up with tonnes and tonnes of arguments in the top-level functions.
I guess there are also minor questions about efficiency: Why should this constant information be stored on the stack through the call chain?
Anyway, the solution is clear here: We store it outside the signature, as global constants:
const AMP_WEIGHT = 347.22
const HYDROXYL_WEIGHT = 17.007
const WATER_WEIGHT = 18.015
const FIVE_PRIME_CAP_WEIGHT = 803.40
function poly_a_tail_weight(nucleotides::Integer)
...
end
Much nicer!
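For completeness, here is a sketch of how the shortened function could look once the body of the first version reads the global constants directly. The weights are the ones listed above, and the arithmetic is copied from the five-argument version; nothing else is assumed.

```julia
const AMP_WEIGHT = 347.22
const HYDROXYL_WEIGHT = 17.007
const WATER_WEIGHT = 18.015
const FIVE_PRIME_CAP_WEIGHT = 803.40

# Same arithmetic as the five-argument version, but the constant
# weights are read from globals instead of being passed in.
function poly_a_tail_weight(nucleotides::Integer)
    return nucleotides * AMP_WEIGHT -
        (nucleotides - 1) * WATER_WEIGHT +
        HYDROXYL_WEIGHT +
        FIVE_PRIME_CAP_WEIGHT
end

weight = poly_a_tail_weight(10)
```

Because the globals are declared const, the compiler knows their types and values, so reading them inside the function avoids the performance cost of non-constant globals.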
Global mutable data?
One problem with const is that it’s… constant. We sometimes want to mutate data.
For example, suppose I have a set of genomes and, for each genome, a set of gene positions. I need to write out a file containing each genome’s instance of a given gene.
The files should be gzip-compressed, and because compression is a bottleneck, I reach for the high performance LibDeflate.jl.
This package requires us to allocate a mutable Compressor struct to handle our compression.
The function gzip_compress! takes the compressor, and mutates it in the process of compressing the input data.
If I was careless, I might do something like the following:
using LibDeflate
# OncePerProcess is new in Julia 1.12. In Julia 1.11,
# I might use `Base.Lockable(Ref{Union{Nothing, Compressor}}(nothing))`
const COMPRESSOR = OncePerProcess{Compressor}(() -> Compressor())
function compress_genes(
    genome::Genome,
    positions::Vector{<:UnitRange},
    path::AbstractString
)
    buffer = IOBuffer()
    for gene_position in positions
        write(buffer, get_gene(genome, gene_position))
    end
    open(path, "w") do io
        # A OncePerProcess object is called, not indexed, to get its value
        write(io, gzip_compress!(COMPRESSOR(), UInt8[], take!(buffer)))
    end
end
This works perfectly fine… until I try to run compress_genes in multiple tasks concurrently.
Compressor is not thread-safe, so if it’s used by multiple threads at once, it will malfunction and most likely crash the process with a segfault.
In fact, even if I never use multiple tasks, it’s still dangerous. Someone else might use my package and spawn multiple tasks that call into compress_genes.
This is the problem that both task-local storage and scoped values are intended to solve. They allow you to access data which is not passed as an argument, while also handling concurrent access better than a simple global mutable variable.
Task-local storage
Let’s revisit the problem above. The issue with using OncePerProcess here is that we must never have more than one thread mutating COMPRESSOR at once.
In Julia, by design, we have little control over threads. Threads are a resource transparently provided by the operating system, kind of like CPU time or memory. We request a thread when we schedule a task, but which thread we get, and when, is a decision made by the Julia runtime, and out of our control. This is analogous to how we can’t (and shouldn’t attempt to) control at which memory address our data is allocated.
Anyway, all this is to say that, in Julia, we shouldn’t attempt to interact with threads directly; we should interact with tasks. Since every Julia task runs on at most one thread at any given time (although a task may jump between threads), if we ensure that only one task accesses our compressor, we also guarantee that at most one thread accesses it at a time.
Therefore, we can solve the thread safety issue by giving a new instance of Compressor to each task. The function task_local_storage accesses, or writes to, a dictionary which is specific to the currently running task.
struct CompressorKey end
function compress_genes( ... )
    # Get compressor from task local storage if we already created it,
    # else create a new one.
    tls = task_local_storage()
    key = CompressorKey()
    compressor = if haskey(tls, key)
        tls[key]
    else
        tls[key] = Compressor()
    end
    [ ... ]
    # Read from the task local storage
    write(io, gzip_compress!(task_local_storage(CompressorKey()), UInt8[], take!(buffer)))
end
Above, we could also have used the new OncePerTask interface instead of task_local_storage. The former is a wrapper around the latter with a nicer API.
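A minimal sketch of that alternative, assuming Julia 1.12 or later. Scratch is a hypothetical stand-in for a non-thread-safe resource such as Compressor, used here only so that the snippet runs without LibDeflate.

```julia
# Hypothetical stand-in for a mutable, non-thread-safe resource.
mutable struct Scratch
    uses::Int
end

# The initializer runs at most once in each task that calls SCRATCH().
const SCRATCH = Base.OncePerTask{Scratch}(() -> Scratch(0))

function use_scratch()
    s = SCRATCH()   # calling the object returns this task's instance
    s.uses += 1
    return s
end

# Two calls in the same task return the same object...
same_task = fetch(Threads.@spawn begin
    use_scratch() === use_scratch()
end)

# ...while different tasks each get their own.
a = fetch(Threads.@spawn begin
    use_scratch()
end)
b = fetch(Threads.@spawn begin
    use_scratch()
end)
```

Compared with the manual haskey check, the lookup-or-create logic lives in one place, and the type parameter documents what the call returns.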
Note that in the case above, if we spawn a task to run compress_genes 100 times, it will only create a Compressor once.
The pattern above is not optimally efficient in this specific case: if we have, say, 8 threads and 100 tasks, we create 100 individual compressors, whereas in reality we only need 8 to avoid threading issues. A better example might have state which truly needs to be tied to the task, and not to the thread. Nonetheless, I hope the example gets the point across.
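To make the once-per-task behaviour concrete, here is a small runnable sketch of the same lookup-or-create pattern. Scratch, ScratchKey, task_scratch and run_many are hypothetical names standing in for Compressor and compress_genes, so the snippet does not depend on LibDeflate. It uses get! on the task-local dictionary, which does the same job as the haskey check.

```julia
# Hypothetical stand-in for a mutable, non-thread-safe resource.
mutable struct Scratch
    uses::Int
end

struct ScratchKey end

# Return this task's Scratch, creating it on first use.
function task_scratch()
    return get!(() -> Scratch(0), task_local_storage(), ScratchKey())::Scratch
end

# Use the task's Scratch n times, then return it.
function run_many(n)
    for _ in 1:n
        s = task_scratch()
        s.uses += 1
    end
    return task_scratch()
end

# Four tasks, 100 calls each: four Scratch objects in total, not 400.
tasks = map(1:4) do _
    Threads.@spawn run_many(100)
end
scratches = fetch.(tasks)
all_distinct = all(scratches[i] !== scratches[j] for i in 1:4 for j in (i + 1):4)
```

Each task only ever touches its own Scratch, so no locking is needed around the mutation.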
Scoped values
Scoped values are used when you:
- Have some global data D
- Want D to take on different values at different points in the program
- Need to access D from multiple different tasks
The canonical example is logging. You have some logger object, and it would be a pain to pass this object as an argument through all your function calls, so this really should be a global variable.
The logger is an IO object, and should probably be protected behind a lock, so thread safety is not a concern.
However, the logger has settings, which can be changed during the program - one part of the program may need one set of settings, and another part another set. Or, your library could be called by two tasks, one which requires one logger setting, another which requires another setting. Your library makes use of multiple tasks itself, so using TLS is not appropriate - one logger state is shared between all the many tasks spawned by your library.
What you want is some kind of ‘task local state’ which is inherited by all child tasks spawned by the current task. And that is what scoped values are.
Here’s how to use it - assuming we have some kind of logger package:
using Base.ScopedValues
const LOGGER = ScopedValue(new_logger(DEFAULT_SETTINGS))
# Call some code using a modified logger
function do_computation(data; logger_settings::LoggerSettings=DEFAULT_SETTINGS)
    with(LOGGER => new_logger(logger_settings)) do
        [ ... ]
    end
end
Inside the with function’s scope, the constant LOGGER will be a ScopedValue set to new_logger(logger_settings), even if I spawn multiple new tasks inside that scope.
Outside the scope, LOGGER will retain its old value with default settings, even if code outside the scope runs multiple tasks, and all these tasks, both outside and inside the scope, run concurrently with each other.
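The logger snippet relies on a logging package that is left unspecified. As a self-contained sketch of the same idea, the code below replaces the logger with a hypothetical LoggerSettings struct holding only a prefix; SETTINGS, log_line and this version of do_computation are illustrative names, not part of any real logging API. It assumes Julia 1.11 or later.

```julia
using Base.ScopedValues

# Hypothetical settings type; a real logging package would define its own.
struct LoggerSettings
    prefix::String
end

const SETTINGS = ScopedValue(LoggerSettings("default"))

# Reads whichever settings are in effect in the current dynamic scope.
log_line(msg) = string("[", SETTINGS[].prefix, "] ", msg)

function do_computation(settings::LoggerSettings)
    with(SETTINGS => settings) do
        # A child task spawned inside the scope inherits the scoped value.
        child = Threads.@spawn log_line("from child task")
        (log_line("from parent task"), fetch(child))
    end
end

inside = do_computation(LoggerSettings("verbose"))
outside = log_line("after the scope")
```

Note that log_line takes no settings argument: it picks up the value from the scope it is called in, which is what removes the need to thread the logger through every signature.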
Here is an example where the same global scoped value has two distinct values at the same time, as accessed by four different tasks.
julia> begin
           using Base.ScopedValues
           const SCOPED = ScopedValue(1)
           with(SCOPED => 2) do
               Threads.@spawn begin
                   sleep(2)
                   println("In scope: ", SCOPED[])
               end
               Threads.@spawn begin
                   sleep(1)
                   println("In scope: ", SCOPED[])
               end
           end
           println("Outside scope: ", SCOPED[])
           task = Threads.@spawn begin
               sleep(1.5)
               println("Outside scope: ", SCOPED[])
               sleep(1)
               println("Outside scope: ", SCOPED[])
           end
           wait(task)
       end
Outside scope: 1
In scope: 2
Outside scope: 1
In scope: 2
Outside scope: 1
The way this works is that the tasks are not really accessing the same, global scoped value. Instead, they are reading a certain kind of task-local storage, which is inherited by child tasks.
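The difference in inheritance can be checked directly. In the sketch below, the parent task sets both an ordinary task-local entry and a scoped value, then asks a child task what it sees. The key :plain_tls and the function name compare_inheritance are arbitrary choices for illustration.

```julia
using Base.ScopedValues

const SV = ScopedValue(:default)

function compare_inheritance()
    # Ordinary task-local storage: set in the parent task only.
    task_local_storage(:plain_tls, :set_by_parent)
    with(SV => :set_by_parent) do
        child = Threads.@spawn begin
            # The child starts with its own, empty task-local storage...
            tls_seen = get(task_local_storage(), :plain_tls, :missing_in_child)
            # ...but it inherits the scope it was spawned in.
            (tls_seen, SV[])
        end
        fetch(child)
    end
end

result = compare_inheritance()
```

The child reports the fallback for the plain TLS key and the parent's value for the scoped value.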
TL;DR:
- Task local storage (TLS) and scoped values are both answers to how to access global mutable data with multiple tasks
- TLS is used when each task needs to operate on a unique piece of data, typically when the data is not threadsafe and therefore can’t be accessed by multiple tasks
- Scoped values are used when you want different parts of your code to use different values for some global variable, and each of these parts may make use of more than one task so TLS cannot be used
[1] This is just an example; the biological details here are questionable.