Multi-threaded code requiring scratch space

I think it’s true that to get decent performance from multi-threaded code, it’s a good idea to minimise the amount of allocation done by each task. One way to do this is to call functions that accept a “scratch” argument, so the same memory can be re-used across many calls, i.e. fewer allocations.

So I have code that looks something like this:

using Base.Threads: nthreads, @threads, threadid

function bugged_code(input_matrix::AbstractMatrix)

    output_matrix = similar(input_matrix)
    n = some_large_number

    # one scratch vector per thread, to avoid allocating inside the loop
    scratch_args = [fill(1.0, n) for _ in 1:nthreads()]

    @threads for i in eachindex(input_matrix)
        # suspect: assumes threadid() is stable and unique per concurrent task
        scratch = scratch_args[threadid()]
        output_matrix[i] = some_operator(input_matrix[i], scratch)
    end
    return output_matrix
end

function some_operator(x, scratch::AbstractVector)
    # calculate y without allocating, thanks to the scratch argument
    return y
end

My hope was that each scratch vector would never be in use by more than one task simultaneously. But on reading Thread local state… I see that the code is probably bugged, with a “quickfix, not recommended longterm” being to use the static scheduler, i.e. to replace @threads for with @threads :static for.
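For reference, the quickfix applied to the loop above would look like this (a sketch only; the linked post advises against relying on it long-term):

@threads :static for i in eachindex(input_matrix)
    # :static pins each chunk of iterations to one thread, so threadid()
    # is constant within the loop body and the per-thread indexing is safe
    scratch = scratch_args[threadid()]
    output_matrix[i] = some_operator(input_matrix[i], scratch)
end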

Maybe that’s OK, but my some_operator calls sort!, which might well one day become multi-threaded, in which case bad stuff (“… even deadlock”) might happen.

What’s the solution? How should I ensure that each executing task has sole access to dedicated scratch space?

You can use task_local_storage.
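For example, something like this sketch, reusing the names from the question (the get! call lazily creates one scratch vector per task, so no two tasks can share one):

using Base.Threads: @threads

function task_local_version(input_matrix::AbstractMatrix)
    output_matrix = similar(input_matrix)
    n = some_large_number
    @threads for i in eachindex(input_matrix)
        # task_local_storage() is an IdDict owned by the current task
        scratch = get!(task_local_storage(), :scratch) do
            fill(1.0, n)
        end::Vector{Float64}
        output_matrix[i] = some_operator(input_matrix[i], scratch)
    end
    return output_matrix
end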

This might work:

using OhMyThreads: TaskLocalValue, tmap

function bugged_code(input_matrix::AbstractMatrix)
    n = some_large_number
    scratch_space = TaskLocalValue{Vector{Float64}}(() -> fill(1.0, n))

    tmap(input_matrix) do x
        some_operator(x, scratch_space[])
    end
end

Thanks to both @Oscar_Smith and @adienes for those replies, which I’m sure are pointing me in the right direction.

I’ve amended the code to use TaskLocalValue with all tests passing, but allocations are rather higher, and performance a bit lower, than my existing “quickfix, not recommended” approach. So I need to experiment some more.

but allocations are rather higher, and performance a bit lower, than my existing “quickfix, not recommended” approach

I think this “shouldn’t” be the case. If you share a more complete example (and benchmark), it may be possible to help look more deeply into the performance difference.

You can use ChunkSplitters.jl for that.
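For instance, something along these lines (a sketch assuming the chunks iterator from ChunkSplitters 2.x): one task, and one scratch vector, per chunk, so nothing is shared between tasks:

using ChunkSplitters: chunks
using Base.Threads: nthreads, @spawn

function chunked_version(input_matrix::AbstractMatrix)
    output_matrix = similar(input_matrix)
    n = some_large_number
    @sync for idxs in chunks(eachindex(input_matrix); n=nthreads())
        @spawn begin
            # allocated once per chunk/task, not once per element
            scratch = fill(1.0, n)
            for i in idxs
                output_matrix[i] = some_operator(input_matrix[i], scratch)
            end
        end
    end
    return output_matrix
end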


As @adienes said, an MWE would be nice. It could also just be dynamic vs static scheduling.

You could try scheduler=StaticScheduler(), e.g.

using OhMyThreads: TaskLocalValue, tmap, StaticScheduler

function bugged_code(input_matrix::AbstractMatrix)
    n = some_large_number
    scratch_space = TaskLocalValue{Vector{Float64}}(() -> fill(1.0, n))

    tmap(input_matrix; scheduler=StaticScheduler()) do x
        some_operator(x, scratch_space[])
    end
end

Or scheduler=DynamicScheduler(; nchunks=nthreads()).

Thanks for offering to “look more deeply”!

Quick background: I have an (unregistered) package KendallTau for calculating Kendall correlation via a multi-threaded function corkendall, which I hope will replace the single-threaded function of the same name in StatsBase (mostly my code from 2021).

As of yesterday, the main branch of KendallTau uses the “quickfix” (threadid and static scheduler) and the TLV_Experiment branch uses OhMyThreads.TaskLocalValue. The relevant function is _corkendall in src/corkendall.jl. See here.

Here’s the comparison (on an ancient 4-core desktop):

julia> x=randn(1000,1000);

julia> res_qf=corkendall(x);#compile

julia> @time res_qf=corkendall(x);
  6.501597 seconds (1.58 k allocations: 15.884 MiB)

#Change branch from main to TLV_Experiment, and restart Julia (not sure if Revise plays ball when switching branches).

julia> using KendallTau

julia> x=randn(1000,1000);

julia> res_tlv=corkendall(x);#compile

julia> @time res_tlv=corkendall(x);
  6.595984 seconds (1.52 M allocations: 93.466 MiB, 1.70% gc time, 0.12% compilation time)

So, contrary to what I wrote yesterday, the speed of the two approaches looks about the same, but the number of allocations has gone up by a factor of 1,000 and the allocated memory by a factor of about 6. OTOH StatsBase.corkendall allocates way more:

julia> using StatsBase

julia> x=randn(1000,1000);

julia> res_sb = StatsBase.corkendall(x);#compile

julia> @time res_sb = StatsBase.corkendall(x);
 25.176168 seconds (3.00 M allocations: 17.090 GiB, 2.28% gc time)

I confess I have yet to fire up Julia with --track-allocation to investigate, but if you did have some insight as to how the code on the TLV_Experiment branch could be improved then that would be wonderful.

Not sure if this would be the source of those allocations, but as you can see in the examples above (as well as in your original code), the scratch space is created outside the threaded loop, whereas here you create a new set on each iteration.
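Schematically (not the actual package code, just the shape of the mistake):

# wasteful: constructs a new TaskLocalValue, and hence fresh scratch, per element
tmap(input_matrix) do x
    scratch_space = TaskLocalValue{Vector{Float64}}(() -> fill(1.0, n))
    some_operator(x, scratch_space[])
end

# intended: construct once, before the parallel loop
scratch_space = TaskLocalValue{Vector{Float64}}(() -> fill(1.0, n))
tmap(input_matrix) do x
    some_operator(x, scratch_space[])
end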

That was sloppy - thanks for pointing it out. With the mistake corrected:

julia> res_tlv = KendallTau.corkendall(x);#compile

julia> @time res_tlv = KendallTau.corkendall(x);
  6.973501 seconds (1.50 M allocations: 38.920 MiB)

which continues to compare not so well with the “quickfix” code on the main branch:

julia> res_qf=corkendall(x);#compile

julia> @time res_qf=corkendall(x);
  6.501597 seconds (1.58 k allocations: 15.884 MiB)

I’ve now figured out the cause of the extra allocations in the code on branch TLV_Experiment. It’s that a TaskLocalValue is an untyped location, so when getting the data I need to specify the type, as here.
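That is, something like this (a schematic of the linked change, not the exact code):

scratch = scratch_space[]::Vector{Float64}  # assert the type so downstream code infers it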

But for my use case I think it would be simpler not to use OhMyThreads, and instead use Base.task_local_storage (as suggested by @Oscar_Smith), as here.

Note that TaskLocalValue is a slim wrapper around Base.task_local_storage. See here.

That shouldn’t be necessary. You specify the type of the value explicitly in the constructor and getindex has a type assertion.
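For instance (a minimal illustration, not code from the package):

using OhMyThreads: TaskLocalValue

tlv = TaskLocalValue{Vector{Float64}}(() -> zeros(3))
v = tlv[]  # inferred as Vector{Float64}; no assertion needed at the call site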


Ah, I suspected that OhMyThreads.TaskLocalValue might simply be a wrapper around Base.task_local_storage, though I hadn’t dug in to check. Thanks for confirming.

I’m a bit perplexed as to what was going on in my tests of TaskLocalValue yesterday (done at home; today I’m at my office), since I can’t replicate type assertions being necessary, just as you wrote above.

By the way, the documentation for OhMyThreads is excellent!


Great, otherwise I would have been really confused 🙂

Thanks. The package and its documentation are brand new. Please file issues/PRs if you have suggestions for improving the documentation, or new features that you would like to see.

In case you are interested, I’ve managed to replicate the circumstance in which allocations do happen (by looking at the git history of what I actually did yesterday!).

Allocations don’t happen with this code, but they do happen with this code, which you could install via:

]add https://github.com/PGS62/KendallTau#TLV_with_allocations

I then see this:

julia> using KendallTau
Precompiling KendallTau
  1 dependency successfully precompiled in 2 seconds. 52 already precompiled.

julia> x=randn(1000,1000);res=corkendall(x);

julia> @time corkendall(x);
  1.685487 seconds (1.50 M allocations: 39.410 MiB)

So 1.5 million allocations as opposed to the “good state” of 1.5 thousand.

Thanks to @lmiq for the suggestion that I look at ChunkSplitters.jl. An aspect of the problem that I haven’t paid much attention to is that when calculating corkendall(x) (as opposed to corkendall(x, y)) the workload is naturally unbalanced, since one calculates only one half of a symmetric matrix. ChunkSplitters helps with unbalanced workloads.
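For a triangular workload like this, ChunkSplitters’ split=:scatter option might help: it deals indices out round-robin, so each task gets a mix of cheap and expensive columns. A rough sketch, with process_column! standing in for the real per-column work (both names here are hypothetical):

using ChunkSplitters: chunks
using Base.Threads: nthreads, @spawn

function triangular_work(m::AbstractMatrix)
    ncols = size(m, 2)
    # with split=:scatter and n=2, chunks are e.g. [1,3,5,...] and [2,4,6,...]
    @sync for idxs in chunks(1:ncols; n=nthreads(), split=:scatter)
        @spawn for j in idxs
            # column j pairs only with columns j+1:ncols, so work shrinks with j
            process_column!(m, j)
        end
    end
end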