I think it’s true that to get decent performance from multi-threaded code, it’s a good idea to minimise the amount of allocation done by each task. One way to do this is to call functions that accept a “scratch” argument, so that the same memory can be re-used many times rather than allocated afresh on each call.
So I have code that looks something like this:
using Base.Threads: nthreads, @threads, threadid

function bugged_code(input_matrix::AbstractMatrix)
    output_matrix = similar(input_matrix)
    n = some_large_number
    scratch_args = [fill(1.0, n) for _ in 1:nthreads()]
    @threads for i in eachindex(input_matrix)
        scratch = scratch_args[threadid()]
        output_matrix[i] = some_operator(input_matrix[i], scratch)
    end
    return output_matrix
end

function some_operator(x, scratch::AbstractVector)
    # calculate y without allocation, thanks to the scratch argument
    return y
end
My hope was that each scratch vector would never be in use by more than one task simultaneously. But on reading Thread local state… I see that the code is probably bugged, with a “quickfix, not recommended longterm” being to use the static scheduler, i.e. to replace @threads for with @threads :static for.
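For concreteness, the quickfix amounts to something like this (a sketch, re-using the same some_large_number and some_operator placeholders as above):

using Base.Threads: nthreads, @threads, threadid

function quickfix_code(input_matrix::AbstractMatrix)
    output_matrix = similar(input_matrix)
    n = some_large_number
    scratch_args = [fill(1.0, n) for _ in 1:nthreads()]
    # :static pins each iteration to a fixed thread, so threadid() is a
    # stable index into scratch_args for the duration of that iteration.
    @threads :static for i in eachindex(input_matrix)
        output_matrix[i] = some_operator(input_matrix[i], scratch_args[threadid()])
    end
    return output_matrix
end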
Maybe that’s OK, but my some_operator calls sort! which might well one day become multi-threaded in which case bad stuff (“… even deadlock”) might happen.
What’s the solution? How should I ensure that each executing task has sole access to dedicated scratch space?
using OhMyThreads: TaskLocalValue, tmap

function bugged_code(input_matrix::AbstractMatrix)
    n = some_large_number
    scratch_space = TaskLocalValue{Vector{Float64}}(() -> fill(1.0, n))
    tmap(input_matrix) do x
        some_operator(x, scratch_space[])
    end
end
Thanks to both @Oscar_Smith and @adienes for those replies, which I’m sure are pointing me in the right direction.
I’ve amended the code to use TaskLocalValue with all tests passing, but allocations are rather higher, and performance a bit lower, than my existing “quickfix, not recommended” approach. So I need to experiment some more.
but allocations are rather higher, and performance a bit lower, than my existing “quickfix, not recommended” approach
I think this “shouldn’t” be the case. If you share a more complete example (and benchmark) it may be possible to help look more deeply into the performance difference.
As @adienes said, a MWE would be nice. Could also just be dynamic vs static scheduling.
You could try scheduler=StaticScheduler(), e.g.
using OhMyThreads: TaskLocalValue, StaticScheduler, tmap

function bugged_code(input_matrix::AbstractMatrix)
    n = some_large_number
    scratch_space = TaskLocalValue{Vector{Float64}}(() -> fill(1.0, n))
    tmap(input_matrix; scheduler=StaticScheduler()) do x
        some_operator(x, scratch_space[])
    end
end
Or scheduler=DynamicScheduler(; nchunks=nthreads()).
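For completeness, a sketch of that second variant (this assumes the nchunks keyword spelling shown above matches your installed OhMyThreads version, and re-uses the placeholder some_operator):

using Base.Threads: nthreads
using OhMyThreads: TaskLocalValue, DynamicScheduler, tmap

function dynamic_chunked(input_matrix::AbstractMatrix, n)
    scratch_space = TaskLocalValue{Vector{Float64}}(() -> fill(1.0, n))
    # With nchunks == nthreads(), only about nthreads() tasks are spawned,
    # so only that many scratch vectors ever get allocated.
    tmap(input_matrix; scheduler=DynamicScheduler(; nchunks=nthreads())) do x
        some_operator(x, scratch_space[])
    end
end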
Quick background: I have an (unregistered) package KendallTau for calculating Kendall correlation via a multi-threaded function corkendall, which I hope will replace the single-threaded function of the same name in StatsBase (mostly my code from 2021).
As of yesterday, the main branch of KendallTau uses the “quickfix” (threadid and static scheduler) and the TLV_Experiment branch uses OhMyThreads.TaskLocalValue. The relevant function is _corkendall in src/corkendall.jl. See here.
Here’s the comparison (on an ancient 4-core desktop):
julia> x=randn(1000,1000);
julia> res_qf=corkendall(x);#compile
julia> @time res_qf=corkendall(x);
6.501597 seconds (1.58 k allocations: 15.884 MiB)
#Change branch from main to TLV_Experiment, and restart Julia (not sure if Revise plays ball when branch switching).
julia> using KendallTau
julia> x=randn(1000,1000);
julia> res_tlv=corkendall(x);#compile
julia> @time res_tlv=corkendall(x);
6.595984 seconds (1.52 M allocations: 93.466 MiB, 1.70% gc time, 0.12% compilation time)
So, contrary to what I wrote yesterday, the speed of the two approaches looks about the same, but the number of allocations has gone up by a factor of about 1,000 and the allocated memory by a factor of about 6. OTOH StatsBase.corkendall allocates way more than either of them.
I confess I have yet to fire up Julia with --track-allocation to investigate, but if you did have some insight as to how the code on the TLV_Experiment branch could be improved then that would be wonderful.
Not sure if this would be the source of those allocations, but as you can see in the examples above (as well as in your original code), the scratches are created outside of the threaded loop, whereas here you create a new set with each iteration.
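In other words, something like the first sketch below rather than the second (a rough illustration using the placeholders from your original post):

using OhMyThreads: TaskLocalValue, tmap

# Scratch container constructed once, before the parallel map: each task then
# creates at most one vector for itself, on first use.
function scratch_hoisted(input_matrix, n)
    scratch_space = TaskLocalValue{Vector{Float64}}(() -> fill(1.0, n))
    tmap(x -> some_operator(x, scratch_space[]), input_matrix)
end

# Scratch container constructed inside the loop body: a fresh TaskLocalValue
# (and a fresh vector) per element, which allocates on every iteration.
function scratch_per_iteration(input_matrix, n)
    tmap(input_matrix) do x
        scratch_space = TaskLocalValue{Vector{Float64}}(() -> fill(1.0, n))
        some_operator(x, scratch_space[])
    end
end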
I’ve now figured out the cause of the extra allocations in the code on branch TLV_Experiment. It’s that a TaskLocalValue is an untyped location, so when getting the data out I need to specify the type, as here.
But for my use case I think it would be simpler not to use OhMyThreads, but instead to use Base.task_local_storage (as suggested by @Oscar_Smith), as here.
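For reference, the Base.task_local_storage pattern looks roughly like this (a sketch; the :scratch key and the helper names are illustrative, and the type assertion is there because the underlying store is an untyped IdDict):

using Base.Threads: @threads

# Fetch, or lazily create, this task's own scratch vector.
function task_scratch(n)
    v = get!(task_local_storage(), :scratch) do
        fill(1.0, n)
    end
    return v::Vector{Float64}
end

function with_task_local_storage(input_matrix::AbstractMatrix, n)
    output_matrix = similar(input_matrix)
    @threads for i in eachindex(input_matrix)
        output_matrix[i] = some_operator(input_matrix[i], task_scratch(n))
    end
    return output_matrix
end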
Ah, I suspected that OhMyThreads.TaskLocalValue might be simply a wrapper around Base.task_local_storage though I hadn’t dug in to check - thanks for confirming.
I’m a bit perplexed as to what was going on in my tests of TaskLocalValue yesterday (done at home, today I’m at my office) since I can’t replicate type assertions being necessary, just as you wrote above.
By the way, the documentation for OhMyThreads is excellent!
Great, otherwise I would have been really confused.
Thanks. The package and its documentation are brand new. Please file issues/PRs if you have suggestions for how to improve the documentation or also new features that you would like to see.
In case you are interested, I’ve managed to replicate the circumstance in which allocations do happen (by looking at the git history of what I actually did yesterday!).
Allocations don’t happen with this code, but they do happen with this code, which you could install via:
Thanks to @lmiq for the suggestion that I look at ChunkSplitters.jl. An aspect of the problem that I haven’t paid too much attention to is that when calculating corkendall(x) (as opposed to corkendall(x, y)) the workload is naturally unbalanced, since one calculates only one half of a symmetric matrix. ChunkSplitters helps with unbalanced workloads.
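For anyone landing here later, the ChunkSplitters pattern I’m experimenting with looks roughly like this (a sketch assuming a ChunkSplitters version in which chunks(x; n=...) yields index ranges; keyword spellings differ between versions, and some_operator is the placeholder from earlier):

using Base.Threads: @threads, nthreads
using ChunkSplitters: chunks

function chunked_example(input_matrix::AbstractMatrix, scratch_len)
    output_matrix = similar(input_matrix)
    # One chunk (and hence roughly one task and one scratch vector) per thread.
    @threads for inds in chunks(input_matrix; n = nthreads())
        scratch = fill(1.0, scratch_len)
        for i in inds
            output_matrix[i] = some_operator(input_matrix[i], scratch)
        end
    end
    return output_matrix
end

ChunkSplitters also has an option to hand out chunks in a scatter / round-robin fashion rather than consecutively, which is what helps with the unbalanced corkendall(x) workload, though the exact keyword for that depends on the package version.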