No more 1st class support for Threads? [thread-local storage]

I am implementing multi-threaded scientific software, and so far it works quite well.

But now I am told that thread-local storage is no longer supported by Julia. Well, it still works, even under Julia 1.12rc1, but I am told not to use it.

What is the suggested alternative?

Shall I replace @threads with @spawn?

I don’t think I can put @spawn in front of a for loop.

So how can I use task local storage in a for loop? Or shall I not use for loops any longer?

Very confused.

Can you elaborate on that?

This was widely advertised already over 2 years ago: PSA: Thread-local state is no longer recommended; Common misconceptions about threadid() and nthreads()

Consider using OhMyThreads.jl: Thread-Safe Storage · OhMyThreads.jl


Well, I tried it and it had very bad performance.

You can’t use an array indexed by threadid(), but there are other mechanisms for task-local storage.

For example, the Base function task_local_storage() gives you a task-local IdDict. You can also use other patterns, e.g. a Channel as described in Pattern for managing thread local storage? - #2 by tkf.
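To make the Channel pattern concrete, here is a hedged Base-only sketch (the buffer size and workload are invented for illustration): a pool of preallocated buffers that tasks check out and return, so no buffer is ever used by two tasks at once.

```julia
using Base.Threads

# Pool of preallocated scratch buffers (made-up size 100)
nbuf = nthreads()
bufpool = Channel{Vector{Float64}}(nbuf)
for _ in 1:nbuf
    put!(bufpool, zeros(100))
end

results = zeros(20)
@threads for i in 1:20
    buf = take!(bufpool)      # check a buffer out; blocks if none is free
    try
        buf .= i              # scratch work using the buffer
        results[i] = sum(buf)
    finally
        put!(bufpool, buf)    # return the buffer for other tasks to reuse
    end
end
```

Because the buffer travels with the task (not the thread), this stays correct even when tasks migrate.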


What did you try? There’s a lot of different suggestions in there for different situations.


This is too complicated for me. I do not have a degree in computer science, only in electrical engineering. Thread-local storage is an easy-to-understand pattern for engineers and researchers.

For example, instead of using a value mydata[threadid()], you could use task_local_storage(mydata), or e.g. get(task_local_storage(), mydata, default) if you want a default value, where mydata is some constant key (the same for all tasks), typically a globally unique symbol (e.g. const mydata = gensym(@__MODULE__)). Ideally, append ::T to tell Julia the type T of the returned value (since the task-local storage is an Any dictionary).
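A minimal sketch of that keyed pattern (the key name and buffer size here are invented for illustration):

```julia
# Unique key shared by all tasks; each task stores its own value under it
const MYDATA = gensym(:mydata)

# Fetch (or lazily create) this task's buffer; the ::Vector{Float64}
# annotation narrows the Any-typed dictionary lookup for the compiler
scratch() = get!(() -> zeros(4), task_local_storage(), MYDATA)::Vector{Float64}

a = scratch()
b = scratch()
# Within one task, repeated calls return the very same buffer: a === b
```

The key being a gensym avoids accidental collisions with other packages' task-local entries.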


I don’t think you can achieve good performance with dictionaries. Arrays are much faster.

Depends on how performance-critical access to the task-local variable is. What is your application where the cost of a dictionary lookup is significant compared to your other calculations in the task? Maybe in your case there is a better abstraction.

This is the function we are talking about: FLORIDyn.jl/src/visualisation/calc_flowfield.jl at 31081b8e695f5b05f0c3282c365d3eb348f48999 · ufechner7/FLORIDyn.jl · GitHub

In other words, you’d be replacing the line

GP = buffers.thread_buffers[tid]

with something like

GP = get!(task_local_storage(), :FLORIDyn_buffer) do
   # create new buffer if it doesn't exist yet for this task
end::WindFarm

Since this is only executed once per iteration, and the rest of the iteration (the rest of your for loop body) seems to do a lot of other work, why would the cost of a dictionary lookup matter?
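Put together in a runnable form (with a plain Vector standing in for the WindFarm buffer, and invented sizes), the loop would look something like:

```julia
using Base.Threads

results = zeros(50)
@threads for i in 1:50
    # Created once per task on first use, then reused every iteration
    buf = get!(task_local_storage(), :FLORIDyn_buffer) do
        zeros(8)
    end::Vector{Float64}
    buf .= i                # scratch computation in the task-local buffer
    results[i] = sum(buf)
end
```

The dictionary lookup happens once per iteration; the allocation of the buffer happens only once per task.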


I think the way to make the fewest modifications to code that previously used threadid is ChunkSplitters.jl: just replace the threaded loop with a threaded loop over chunks of the data:

julia> using ChunkSplitters, Base.Threads

julia> my_arr = rand(10_000);

julia> nchunks = 10
       my_sum = zeros(10)
       @threads for (ichunk, inds) in enumerate(index_chunks(my_arr; n=nchunks))
           my_sum[ichunk] += sum(@view(my_arr[inds]))
       end
       sum(my_sum)
5033.886812176603

# as a replacement for
julia> my_sum = zeros(10)
       @threads for i in eachindex(my_arr)
           my_sum[threadid()] += my_arr[i]
       end
       sum(my_sum)
5033.886812176624


but OhMyThreads.jl is a higher-level alternative and is probably, most times, a better option after some initial small effort to rewrite the structure of the parallel code.
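Under the hood, index_chunks just partitions the index range into nearly equal contiguous pieces; a Base-only sketch of roughly what it computes (my approximation, not ChunkSplitters' actual code):

```julia
# Rough Base-only approximation of ChunkSplitters' index_chunks:
# split 1:length(v) into n nearly equal contiguous ranges
function naive_index_chunks(v, n)
    len = length(v)
    [(1 + div((k - 1) * len, n)):div(k * len, n) for k in 1:n]
end

c = naive_index_chunks(rand(10), 3)
# c == [1:3, 4:6, 7:10] — disjoint ranges covering all indices
```

Each chunk index then plays the role the thread id used to play, but it is tied to the data partition, not to the scheduler.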

ps: In your case you would do:

    buffers = create_thread_buffers(wf, nth)
    # Parallel loop using @threads
    using ChunkSplitters: chunks
    @threads for (tid, iGP_range) in enumerate(chunks(1:length(mx); n=nth))
        # Get thread-local buffers
        GP = buffers.thread_buffers[tid] # tid is now the chunk index
        comp_buffers = buffers.thread_comp_buffers[tid]
        for iGP in iGP_range
            # current calculations using iGP
        end
    end

(Note that nth need not equal nthreads(): setting nth < nthreads() can be useful to limit the number of threads used, while nth >> nthreads() increases the number of tasks, which sometimes improves workload balance.)
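The oversubscription idea (nth >> nthreads()) can also be sketched with plain @spawn, letting the scheduler balance many small tasks across the available threads (the sizes here are invented):

```julia
using Base.Threads

items = 1:100
ntasks = 4 * nthreads()          # more tasks than threads → dynamic balancing
ranges = [(1 + div((k - 1) * length(items), ntasks)):div(k * length(items), ntasks)
          for k in 1:ntasks]
# One task per chunk; the scheduler distributes them over the threads
tasks = [@spawn sum(@view items[r]) for r in ranges]
total = sum(fetch.(tasks))
# total == 5050
```

With many small chunks, a thread that finishes early simply picks up the next pending task, which is exactly the dynamic load balancing that pinned thread-local state would prevent.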


All of these solutions look much more complex than what I have now. Is there any good reason to deprecate a simple, performant, working solution in favor of complex, confusing ones? Why are you making it more and more difficult to write performant code in Julia?

Yes: allowing tasks to migrate between threads allows for much more flexible and performant parallelism, especially for irregular parallelism where you don’t know in advance how to equally divide the work among threads (i.e. where you need dynamic load balancing).

(In OpenMP, the same thing happens if you use an “untied” task, and their documentation warns against using the thread number in this case.)

(Caveat: I’m not involved in the details of Julia’s task scheduler; task migration is just a general principle of composable parallelism in my understanding: when a thread is idle, it might need to “steal” work from another thread.)


So can I conclude that it is discouraged to use @threads in front of a for loop? At least the “good” example at Multi-Threading · The Julia Language no longer uses @threads but @spawn.

On the other hand, you suggested to continue to use @threads together with ChunkSplitters.

And the @threads macro no longer creates threads, but tasks that can move between threads?

Still very confused.

I guess I will try this suggestion and see how it performs.

The thing that’s discouraged is using Threads.threadid() as an index into preallocated storage. Not just discouraged—you may get incorrect results!

Threads.@threads is still a nice and easy solution in many cases, as long as you avoid that pattern.
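For instance, @threads remains perfectly safe when each iteration writes only to its own slot, with no threadid() in sight:

```julia
using Base.Threads

a = zeros(Int, 100)
@threads for i in eachindex(a)
    a[i] = i^2        # each iteration owns index i exclusively → no race
end
# a == (1:100).^2
```

The rule of thumb: index shared storage by the loop variable (or a chunk index), never by the thread a task happens to be running on.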

That said, when your algorithm needs some kind of task local storage, you may actually find it easier to obtain a correct and concise solution using the primitives in OhMyThreads.jl instead.


Wait, if you have @threads :static for iGP in 1:length(mx), it’s fine, right?

No, it’s not explicitly discouraged, but using threadid() is. Note, though, that @threads has always been a very primitive and limited API.
Using it in conjunction with ChunkSplitters.jl makes it a bit more versatile, but OhMyThreads.jl tries to provide a better API so that you don’t have to.

It never created threads. It always created tasks. What has changed is that tasks created by @threads used to be sticky and thus were “pinned” to threads and couldn’t migrate. They now can (since Julia 1.10, IIRC). Hence the one-to-one task-to-thread mapping is generally gone (unless you actively try to restore it).
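For completeness: the :static schedule asked about above pins one task per thread, which is the one case where a threadid()-indexed accumulator remains valid (a hedged sketch):

```julia
using Base.Threads

acc = zeros(Int, nthreads())
@threads :static for i in 1:1000
    # Under :static, each task stays pinned to one thread, and no two
    # iterations on the same thread run concurrently → no race here
    acc[threadid()] += 1
end
sum(acc)   # == 1000
```

The trade-off is that :static gives up composability and dynamic load balancing, which is why the chunk-index patterns above are generally preferred.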
