Assignment outside a threaded for loop degrades performance by a huge amount

I created an MWE for this question. In practice, the only difference between the `f` functions is the scope in which `T = Int` is introduced, yet the first two versions are very slow:

using Base.Threads: @threads, nthreads

function makechunks(X::AbstractVector, n::Integer)
    c = length(X) ÷ n
    return [X[1+c*k:(k == n-1 ? end : c*k+c)] for k = 0:n-1]
end

function f1(vcs) # very slow
    T = Int
    smap = Dict("a" => [1:5000;], "b" => [1:5000;])
    @threads for chunk in makechunks(vcs, nthreads())
        c = 0
        for v in chunk
            for s in v
                for idx in smap[s]
                    c += one(T)
                end
            end
        end
    end
    return
end

function f2(vcs) # very slow
    T = Int
    smap = Dict("a" => T[1:5000;], "b" => T[1:5000;])
    @threads for chunk in makechunks(vcs, nthreads())
        T = Int
        c = 0
        for v in chunk
            for s in v
                for idx in smap[s]
                    c += one(T)
                end
            end
        end
    end
    return
end

function f3(vcs) # fast
    Q = Int
    smap = Dict("a" => Q[1:5000;], "b" => Q[1:5000;])
    @threads for chunk in makechunks(vcs, nthreads())
        T = Int
        c = 0
        for v in chunk
            for s in v
                for idx in smap[s]
                    c += one(T)
                end
            end
        end
    end
    return
end

timings:

f1(fill(["a", "b"], 5000))
@time f1(fill(["a", "b"], 5000)) # 4.124463 seconds (49.99 M allocations: 763.007 MiB, 4.31% gc time)

f2(fill(["a", "b"], 5000))
@time f2(fill(["a", "b"], 5000)) # 3.106983 seconds (49.99 M allocations: 763.007 MiB, 2.36% gc time)

f3(fill(["a", "b"], 5000))
@time f3(fill(["a", "b"], 5000)) # 0.000160 seconds (89 allocations: 164.969 KiB)

Is this expected? I'm not well versed enough in multithreading to be sure, but it seems bad to me.


Mmmh, maybe it is an instance of the problem described in this open issue: performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub


Yes, with nuance. @threads does put the expression into a closure that captures T = Int in f1 and f2, but that closure issue has improved since it was opened, and captured variables that are never reassigned are inferred fine. Here's a demonstration:

julia> let
        x = Int
        onex() = one(x)
       end
(::var"#onex#3"{DataType}) (generic function with 1 method)

julia> let
        x = Int
        onex() = (x = Float64; one(x))
       end
(::var"#onex#9") (generic function with 1 method)

In the 1st case, like f1, the closure type has a type parameter DataType for the type of the captured variable x. In the 2nd case, like f2, where the closure reassigns x, that parameter is absent because the compiler fell back to not inferring it: the variable's value is stored in a Core.Box, which is like a Ref{Any}.
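One way to see that difference without reading the printed closure type (a small sketch, not from the thread; the names g_typed and g_boxed are made up): inspect the closure's field types with fieldtypes. An un-reassigned capture gets a typed field (here DataType), while a reassigned one is stored in a Core.Box:

```julia
# Capture that is never reassigned: the field is typed DataType.
g_typed = let x = Int
    () -> one(x)
end

# Capture that the closure reassigns: lowering falls back to Core.Box.
g_boxed = let x = Int
    () -> (x = Float64; one(x))
end

fieldtypes(typeof(g_typed))  # (DataType,)
fieldtypes(typeof(g_boxed))  # (Core.Box,)
```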

So why is the well-inferred f1 just as slow as f2? Because one(::DataType) is not type-stable in general, e.g. one(Int) and one(Float64) have different types. f3 evades this issue because T is not captured, so when the closure is compiled, T can be recognized as a constant and constant folding can fix one(T) at compile time. Unfortunately, captured variables' inferrability is determined at parse time, when their types are still completely unknown, so closures can't do better than struct ...{...S...} ... T::S ... end and finding out later that S = DataType. It'd be nice if we could have T::Type{S2} so that S2 = Int would allow constant folding.
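One standard way around this is a function barrier (a sketch; the kernel name count_chunk is made up, not from the thread): hoist the loop body into a function whose signature carries ::Type{T}, so T becomes a method parameter, the kernel specializes on it, and one(T) is a compile-time constant:

```julia
# Hypothetical kernel behind a function barrier: T is a method (static)
# parameter, so the compiler specializes count_chunk on it and one(T)
# constant-folds.
function count_chunk(chunk, smap, ::Type{T}) where {T}
    c = zero(T)
    for v in chunk, s in v, idx in smap[s]
        c += one(T)
    end
    return c
end

# Inside f1, the @threads body would then reduce to:
#     @threads for chunk in makechunks(vcs, nthreads())
#         count_chunk(chunk, smap, Int)
#     end
```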


What if you make it a global const?

const T = Int
function f4(vcs) # fast
    smap = Dict("a" => [1:5000;], "b" => [1:5000;])
    @threads for chunk in makechunks(vcs, nthreads())
        c = 0
        for v in chunk
            for s in v
                for idx in smap[s]
                    c += one(T)
                end
            end
        end
    end
    return
end

f4(fill(["a", "b"], 5000))
@time f4(fill(["a", "b"], 5000)) # 0.000157 seconds (21 allocations: 157.688 KiB)

It is fast again.

I think another workaround is to interpolate the variable, like in:

c += one($T)

(You should use c = zero($T) as well, to avoid a promotion there)
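To spell out the promotion concern (a tiny sketch): starting the accumulator at the literal 0 (an Int) and then adding a one of a different type changes the accumulator's type mid-loop, which is itself a type instability; starting from zero of the right type avoids it:

```julia
c = 0                      # c starts as an Int
c += one(Float64)          # promotion: c is now a Float64
typeof(c)                  # Float64

c = zero(Float64)          # starting from zero(T) keeps the type stable
typeof(c + one(Float64))   # Float64, no mid-loop type change
```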

julia> function f1(vcs) # very slow
           T = Int
           smap = Dict("a" => [1:5000;], "b" => [1:5000;])
           @threads for chunk in makechunks(vcs, nthreads())
               c = 0
               for v in chunk
                   for s in v
                       for idx in smap[s]
                           c += one($T)
                       end
                   end
               end
           end
           return
       end
ERROR: syntax: "$" expression outside quote around REPL[4]:9
Stacktrace:
 [1] top-level scope
   @ REPL[4]:1

If you mean this, it throws a syntax error

Actually the interpolation works with @spawn, not with @threads:

julia> function f1_2(vcs) # very slow
           T = Int
           smap = Dict("a" => [1:500;], "b" => [1:500;])
           @sync for chunk in makechunks(vcs, nthreads())
               @spawn begin
                   c = zero($T)
                   for v in $chunk
                       for s in v
                           for idx in $smap[s]
                               c += one($T)
                           end
                       end
                   end
               end
           end
           return
       end
f1_2 (generic function with 1 method)

julia> @time f1_2(fill(["a","b"], 500))
  0.045161 seconds (538.60 k allocations: 10.318 MiB, 4.56% gc time, 417.59% compilation time)

julia> @time f1_2(fill(["a","b"], 500))
  0.021519 seconds (495.98 k allocations: 7.588 MiB, 8.42% gc time)

Which I used as a solution before (see Type-instability because of @threads boxing variables - #21 by Elrod). It should be equivalent to creating a let block, but that doesn’t seem to be doing the job here:

julia> function f1_2(vcs) # very slow
           T = Int
           smap = Dict("a" => [1:500;], "b" => [1:500;])
           @threads for chunk in makechunks(vcs, nthreads())
               let T = T, chunk = chunk, smap = smap
                   c = zero(T)
                   for v in chunk
                       for s in v
                           for idx in smap[s]
                               c += one(T)
                           end
                       end
                   end
               end
           end
           return
       end
f1_2 (generic function with 1 method)

julia> @time f1_2(fill(["a","b"], 500))
  0.048589 seconds (508.51 k allocations: 8.427 MiB, 486.98% compilation time)

julia> @time f1_2(fill(["a","b"], 500))
  0.021447 seconds (495.97 k allocations: 7.588 MiB, 12.07% gc time)

Probably there is something I’m missing.


@spawn's interpolation of T doesn't make the instance Int visible and constant to the compiler the way a const or a wholly local variable could; it can't, because it's expanded at parse time, when T is just a symbol in f1_2. It doesn't capture the variable either; it passes the instance along like a hidden argument. The most it can automatically infer is DataType.

Making let blocks helps the inferrability of captured variables, but again, inferring T::DataType doesn't make one(T) inferrable; constant propagation is needed for that, and we can't add method parameters to hidden closures to help with that.


Oh, that's interesting. But that suggests another pattern: instead of T = Int, use T = 1, with zero(T) meaning zero(1) and one(T) meaning one(1). That solves the issue with the captured variable:

julia> function f1_2(vcs) # fast
           T = 1
           smap = Dict("a" => [1:500;], "b" => [1:500;])
           @threads for chunk in makechunks(vcs, nthreads())
                   c = zero(T)
                   for v in chunk
                       for s in v
                           for idx in smap[s]
                               c += one(T)
                           end
                       end
                   end
           end
           return
       end
f1_2 (generic function with 1 method)

julia> @time f1_2(fill(["a","b"], 500))
  0.022976 seconds (12.30 k allocations: 864.859 KiB, 99.68% compilation time)

julia> @time f1_2(fill(["a","b"], 500))
  0.000104 seconds (16 allocations: 17.453 KiB)

and it works if one obtains the value of T from an input variable:

julia> function f1_2(vcs, val) # fast
           T = one(val)
           smap = Dict("a" => [1:500;], "b" => [1:500;])
           @threads for chunk in makechunks(vcs, nthreads())
               #let T = T, chunk = chunk, smap = smap
                   c = zero(T)
                   for v in chunk
                       for s in v
                           for idx in smap[s]
                               c += one(T)
                           end
                       end
                   end
               #end
           end
           return
       end
f1_2 (generic function with 2 methods)

julia> @time f1_2(fill(["a","b"], 500), 1)
  0.023463 seconds (12.23 k allocations: 854.250 KiB, 99.69% compilation time)

julia> @time f1_2(fill(["a","b"], 500), 1)
  0.000054 seconds (16 allocations: 17.453 KiB)

You could also pass Int in as a method parameter T at that point too.
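A sketch of that suggestion (f6 is a hypothetical name; timings not re-run here): introduce T as a static parameter with where, so it is part of the method's compile-time environment rather than a captured runtime DataType value:

```julia
using Base.Threads: @threads, nthreads

# makechunks as defined in the original question
function makechunks(X::AbstractVector, n::Integer)
    c = length(X) ÷ n
    return [X[1+c*k:(k == n-1 ? end : c*k+c)] for k = 0:n-1]
end

# Hypothetical f6: T arrives as a method (static) parameter, so zero(T)
# and one(T) are resolved against a compile-time-known type.
function f6(vcs, ::Type{T}) where {T}
    smap = Dict("a" => T[1:5000;], "b" => T[1:5000;])
    @threads for chunk in makechunks(vcs, nthreads())
        c = zero(T)
        for v in chunk, s in v, idx in smap[s]
            c += one(T)
        end
    end
    return
end

f6(fill(["a", "b"], 5000), Int)
```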
