--track-allocation and @threads

Essentially, I call a function within nested loops. In serial, no heap allocations occur within the function. When run on multiple threads via @threads, allocations are reported within the function. Because the threaded case runs much slower, I suspect this is not only a reporting issue.

I know there are related issues elsewhere, but I haven’t seen anything that would immediately solve this problem. In particular, it would be good to know whether this is something subtle in the code @threads is operating on, or whether it is a more fundamental problem with @threads and I need to switch to some other threading construct. I have tried @batch, which has the same issue.

The following example may fail to be ‘minimal’, but it is complete and demonstrates the issue.

  1 #!/usr/bin/env julia
  2 
  3 
  4 using Profile
  5 
  6 using .Threads
  7 
  8 using StaticArrays
  9 
 10 using BenchmarkTools
 11 
 12 
 13 function nonsense_work(v::SVector{3,F}, i::Int) where {F}
 14     return v.^i ./ SVector{3,F}(1, 2, 3)
 15 end
 16 
 17 
 18 function body()
 19     m, n = 100, 200
 20     some_vectors = [SVector(i, i-1, i+1) for i in 1:n]
 21     result = Array{SVector{3,Float64},2}(undef, m, n)
 22     @threads for i in 1:m
 23         for (j, v) in enumerate(some_vectors)
 24             result[i,j] = nonsense_work(v, i)
 25         end
 26     end
 27 end
 28 
 29 
 30 function main()
 31     body()
 32     Profile.clear_malloc_data()
 33     body()
 34     @time body()
 35     @btime body()
 36 end
 37 
 38 
 39 main()
 40 
 41 
 $ export JULIA_NUM_THREADS=1
 $ julia --track-allocation=user threadallocmwe.jl 
  0.009075 seconds (9 allocations: 474.188 KiB)
  8.823 ms (9 allocations: 474.19 KiB)
 $ export JULIA_NUM_THREADS=4
 $ julia --track-allocation=user threadallocmwe.jl 
  0.040592 seconds (25 allocations: 475.547 KiB)
  29.747 ms (24 allocations: 475.52 KiB)

In the first case:

 13         - function nonsense_work(v::SVector{3,F}, i::Int) where {F}
 14         0     return v.^i ./ SVector{3,F}(1, 2, 3)
 15         - end
 16         -
 17         -
 18         - function body()
 19         -     m, n = 100, 200
 20         0     some_vectors = [SVector(i, i-1, i+1) for i in 1:n]
 21 534809120     result = Array{SVector{3,Float64},2}(undef, m, n)
 22     53472     @threads for i in 1:m
 23         -         for (j, v) in enumerate(some_vectors)
 24         -             result[i,j] = nonsense_work(v, i)
 25         -         end
 26         -     end
 27         - end

and in the second:

 13         - function nonsense_work(v::SVector{3,F}, i::Int) where {F}
 14     15584     return v.^i ./ SVector{3,F}(1, 2, 3)
 15         - end
 16         -
 17         -
 18         - function body()
 19         -     m, n = 100, 200
 20         0     some_vectors = [SVector(i, i-1, i+1) for i in 1:n]
 21 127701280     result = Array{SVector{3,Float64},2}(undef, m, n)
 22     12768     @threads for i in 1:m
 23         -         for (j, v) in enumerate(some_vectors)
 24         -             result[i,j] = nonsense_work(v, i)
 25         -         end
 26         -     end
 27         - end

What @threads effectively does is create an anonymous function (a closure) for the loop body, which captures the variables it uses from the surrounding scope. You can see this if you do @macroexpand. This capturing is tricky and hard for the compiler to optimize, which can often lead to additional allocations.
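As a minimal sketch (the array name `acc` is illustrative), you can inspect the expansion yourself and see the loop body wrapped in a locally defined function:

```julia
using Base.Threads

acc = zeros(Int, 4)

# @macroexpand only expands the macro, it does not run the loop.
# The resulting expression defines a closure around the loop body,
# which captures `acc` from the enclosing scope.
ex = @macroexpand @threads for i in 1:4
    acc[i] = i
end
println(ex)
```

The closure definition inside the expansion is what triggers the capture machinery discussed below.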

One way around that is to wrap the captured variable in a Ref, which you then access via r[] in the loop body to get the original object back. This explicit, typed boxing is much easier on the compiler.
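A minimal sketch of that pattern (the function name is illustrative), assuming the captured variable is a plain matrix:

```julia
using Base.Threads

# Wrap the captured array in a typed Ref and dereference it with r[]
# inside the loop body, so the closure captures a RefValue{Matrix{Float64}}
# instead of an untyped Box.
function fill_threaded!(result::Matrix{Float64})
    r = Ref(result)
    @threads for i in axes(result, 1)
        for j in axes(result, 2)
            r[][i, j] = i + j
        end
    end
    return result
end

out = fill_threaded!(zeros(3, 4))
```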

A different approach would be to ask whether the overhead of inter-thread communication alone already outweighs the benefit of parallelization - but that requires a more careful analysis of your original code.

@Sukera

Can you give an example of what you mean by using a Ref to capture variables to avoid allocations? Here is a nonsense example that doesn’t allocate without Threads.@threads, but does with it:

@views function test!(a, b)
    n = size(a,1)
    Threads.@threads for i = 1:n-1
        a[i] = cos(atan(log10((a[i]^2 + b[i+1]^2)^2)))
    end
end

a = ones(Float32, 100000)
b = ones(Float32, 100000)

@time test!(a,b)

@Sukera, correct me if I’m wrong, but I think you mean this:

 18 function body()
 19     m, n = 100, 200
 20     some_vectors = [SVector(i, i-1, i+1) for i in 1:n]
 21     result = Array{SVector{3,Float64},2}(undef, m, n)
 22     res_ref = Ref(result)
 23     @batch for i in 1:m
 24         for (j, v) in enumerate(some_vectors)
 25             res_ref[][i,j] = nonsense_work(v, i)
 26         end
 27     end
 28 end

I have yet to try this in my larger code, but it works for the MWE. Thank you so much for providing a simple solution to this problem; I spent a long time trying to figure this out, assuming it was some issue with the functions being called.

 $ export JULIA_NUM_THREADS=1
 $ julia --track-allocation=user threadallocmwe.jl 
  0.009055 seconds (3 allocations: 473.641 KiB)
  9.317 ms (3 allocations: 473.64 KiB)
 $ julia -O3 threadallocmwe.jl 
  0.000553 seconds (3 allocations: 473.641 KiB)
  387.229 μs (3 allocations: 473.64 KiB)
 $ export JULIA_NUM_THREADS=4
 $ julia --track-allocation=user threadallocmwe.jl 
  0.056792 seconds (5 allocations: 473.688 KiB)
  54.899 ms (5 allocations: 473.69 KiB)
 $ julia -O3 threadallocmwe.jl 
  0.000384 seconds (5 allocations: 473.688 KiB)
  107.344 μs (5 allocations: 473.69 KiB)

Yes, that is exactly what I meant.

For more background on why that helps: Ref(result) produces a RefValue{Matrix{...}}, i.e. a typed reference. Capturing variables can instead produce a generic, untyped Box (more or less a dereferenceable Any). Accessing that Box means you have to put the result somewhere, while also introducing type instabilities - which notoriously lead to allocations.
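The difference shows up directly in the type of the closure itself. In the sketch below, a captured variable that is only read becomes a concretely typed field, while one common way to get a Core.Box is to reassign the captured variable inside the closure:

```julia
function make_closures()
    a = [1, 2, 3]
    clean = () -> sum(a)                 # `a` never reassigned: typed capture
    b = [1, 2, 3]
    boxed = () -> (b = b .+ 1; sum(b))   # `b` reassigned: captured as Core.Box
    return clean, boxed
end

clean, boxed = make_closures()
println(fieldtypes(typeof(clean)))  # (Vector{Int64},)
println(fieldtypes(typeof(boxed)))  # (Core.Box,)
```

Every access through the Core.Box field is a dynamic lookup, which is where the spurious allocations come from.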

I don’t know if I understand why that would be the case. I think what you’re saying is that typeof(val_ref[]), where val_ref = Ref(val), is known a priori, whereas the type of val outside this context could change. But since the type of val could change, I am not sure how this leads to any optimization.

Though this might be a bit outside my abilities, is there a reason @threads doesn’t do this automatically for every variable not declared within the loop? In other words, is there a reason

@threads for i in range
    dothings(a, b, i)
end

is not automatically turned into the equivalent of

a_ref, b_ref = Ref(a), Ref(b)
@threads for i in range
    dothings(a_ref[], b_ref[], i)
end

?

PS - The fix you gave doesn’t fix the problem for my use case, but hopefully something like it could work.

No, I’m not talking about changing types of anything. The objects still stay the same type. I’m saying that the type of val inside the function created by @threads is a Box (accessing which will insert dynamic lookups everywhere) in the bad case, and a RefValue{Matrix{...}} in the good case. That’s the difference.

How should a macro reach outside of its captured expression and know which variables are captured and which are not? Some may even be global variables, which would probably be stable anyway if they’re marked const. A macro is just missing a lot of extra information (and it can’t reach outside of its domain to change surrounding code).
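To illustrate that limitation: a macro receives only the syntax tree of its argument, not any binding information. A toy sketch (the macro name @show_expr is made up here):

```julia
# A macro sees only the expression. From `a + 1` alone it cannot tell
# whether `a` is a local captured from an enclosing function, a global,
# or something else entirely - so it cannot safely Ref-wrap "every
# variable not declared within the loop".
macro show_expr(ex)
    println("macro sees: ", ex)
    return esc(ex)
end

a = 10
result = @show_expr a + 1
```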

You mean in your real code? That’s unfortunate, but without knowing what that code is, it’s kind of impossible to diagnose…

The canonical issue for this is “performance of captured variables in closures” (JuliaLang/julia#15276 on GitHub).

I see, thanks.

I understand it can’t reach outside of the block it is operating on, but you make a good point that doing this to everything not defined within the block could be a bad idea.

It wouldn’t be reasonable to ask anyone to look through it even if I could share it.

I saw this issue, but wasn’t able to turn that information into a solution. I also assumed anything from before ~2020 might no longer apply.

Thank you again for your help.


Earlier I mentioned this didn’t work for my more general code, and I have found what I think is the difference between the MWE I gave and my larger code. In my larger code the data structure passed to nonsense_work is complex and created outside of the function; it looks more like the following (where the output of --track-allocation=user has been included):

        - function nonsense_work(::Type{F}, v::V, i::Int) where {V,F}
       32     return v[].^i ./ SVector{3,F}(1, 2, 3)
        - end
        - 
        - 
        - function body(m::Int, n::Int, some_vectors::Vector{Base.RefValue{SVector{3,Int}}})
   480080     result = Array{SVector{3,Float64},2}(undef, m, n)
   480080     result2 = Array{SVector{3,Float64},2}(undef, m, n)
       16     res_ref = Ref(result)
       48     @threads for i in 1:m
        -         for (j, v) in enumerate(some_vectors)
        -             res_ref[][i,j] = nonsense_work(Int, v, i)
        -         end
        -     end
        0     for i in 1:m
        0         for (j, v) in enumerate(some_vectors)
        0             result2[i,j] = nonsense_work(Int, v, i)
        -         end
        -     end
     7024     @assert all(result .== result2)
        - end
        - 
        - 
        - function main()
        -     m, n = 100, 200
        0     some_vectors = [Ref(SVector(i, i-1, i+1)) for i in 1:n]
        0     body(m, n, some_vectors)
        0     Profile.clear_malloc_data()
        0     @time body(m, n, some_vectors)
        - end
        - 
        - 

Wildly speculating that the issue has something to do with these added function barriers, marking the functions with @inline seems to largely (though not entirely) solve the problem.
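For reference, the change amounts to annotating the worker function, a sketch assuming the Ref-of-SVector signature from the output above:

```julia
using StaticArrays

# @inline asks the compiler to inline the call into the threaded loop
# body, which can avoid allocations at the function-call boundary.
@inline function nonsense_work(::Type{F}, v, i::Int) where {F}
    return v[] .^ i ./ SVector{3,F}(1, 2, 3)
end

v = Ref(SVector(1, 2, 3))
res = nonsense_work(Int, v, 2)  # SVector(1.0, 2.0, 3.0)
```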

This doesn’t seem like the best solution, but it seems to partially solve my problem. Hopefully this is useful to someone else.

I am entirely open to better solutions to this problem.

Is your current question about why @inline seems to help? How much does it help?

Also, you refer to a large structure. I suspect it may not be concretely typed, in which case you would be paying significant overhead for dynamic dispatch.

Have you tried running @code_warntype to look for type-stability and other type-inference issues?
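For example, checked against the hot function from the original MWE (assuming the StaticArrays setup from earlier in the thread):

```julia
using InteractiveUtils  # provides @code_warntype outside the REPL
using StaticArrays

nonsense_work(v::SVector{3,F}, i::Int) where {F} = v .^ i ./ SVector{3,F}(1, 2, 3)

# Abstract or Union-typed intermediates are highlighted (red/yellow in a
# color terminal); a concrete inferred return type means this function
# itself is type-stable for these argument types.
@code_warntype nonsense_work(SVector(1.0, 2.0, 3.0), 2)
```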

@inline reduces allocations; I am not sure if it has made a huge timing difference.

My goal is for the large structure to be concretely typed; according to isconcretetype it is.

There are a couple of Unions with Nothing (yellow) from iterators, but nothing red.

I may post it publicly at some point, but it is currently not on github.