Poor performance while multithreading (Julia 1.0)

This is kinda single-threaded, isn’t it?

You will get better answers if you boil your problem down to a self-contained MWE that people can run, without loading any outside code, simply by copy-pasting it into the REPL.

Many questions also answer themselves during this process (you try to remove one part of the calculation for simplicity, and suddenly it is fast and the problem becomes obvious in retrospect).


This is kinda single-threaded, isn’t it?

I am not sure. If I run this with more than a single thread, performance decreases, but I guess that’s just the overhead from locking and unlocking then.

I’ll try to come up with a minimal example.

I think this should work

# necessary for benchmarking
using BenchmarkTools 

# definition of MyStruct
struct MyStruct 
    Kernel1 :: Vector{Float64}
    Kernel2 :: Matrix{Float64}
end

# generate MyStruct
function init(L1 :: Int64, L2 :: Int64) :: MyStruct 
    return MyStruct(rand(Float64, L1),  rand(Float64, L2, L1))
end

function Factor(v :: Float64, Input :: MyStruct) :: Float64 
    Val :: Float64 = v * first(Input.Kernel1)
    return Val 
end

function integrate(Kernel :: Function, LowerBound :: Float64, UpperBound :: Float64, Dim :: Int64) :: Vector{Float64}

    Integral :: Vector{Float64} = zeros(Float64, Dim)
    Nodes    :: Vector{Float64} = Float64[LowerBound + i * (UpperBound - LowerBound) / 10.0 for i in 1 : 10]

    for i in 1 : 9
        Integral .+= (Nodes[i + 1] - Nodes[i]) .* Kernel(Nodes[i])
    end 

    return Integral 

end

function computeKernel1(T :: Float64, Iterator :: Int64, Input :: MyStruct) :: Float64
    Val :: Float64 = Input.Kernel1[Iterator] * Input.Kernel2[Iterator] / T
    return Val 
end

function computeKernel2(T :: Float64, Iterator :: Int64, Input1 :: MyStruct, Input2 :: MyStruct) :: Vector{Float64}

    Buffer :: Vector{Float64} = zeros(Float64, size(Input1.Kernel2, 1))

    function IntegrationKernel1(v :: Float64) :: Vector{Float64}

        for i in 1 : size(Input1.Kernel2, 1)
            Buffer[i] = v * i 
        end 

        return Factor(v, Input1) .* Buffer

    end
    function IntegrationKernel2(v :: Float64) :: Vector{Float64}

        for i in 1 : size(Input1.Kernel2, 1)
            Buffer[i] = v * i 
        end 

        return Factor(v, Input2) * Buffer

    end

    Result :: Vector{Float64}  = IntegrationKernel1(T)
    Result .+= integrate(IntegrationKernel2, -T, T, size(Input1.Kernel2, 1))

    return Result 

end

# the threaded loop
function computeStep(T :: Float64, Input1 :: MyStruct, Input2 :: MyStruct) :: MyStruct 

    Threads.@threads for Iterator in 1 : length(Input1.Kernel1)
        Input2.Kernel1[Iterator] = computeKernel1(T, Iterator, Input1)
    end 

    Threads.@threads for Iterator in 1 : size(Input1.Kernel2, 2)
        Input2.Kernel2[:, Iterator] = computeKernel2(T, Iterator, Input1, Input2)
    end 

    return Input2 

end 

function updateStep(dT :: Float64, Input1 :: MyStruct, Input2 :: MyStruct) :: MyStruct 

    Input1.Kernel1 .+= dT * Input2.Kernel1
    Input1.Kernel2 .+= dT * Input2.Kernel2

    return Input1 

end

# running the MWE
function MWE() :: Nothing

    L1 :: Int64 = 100
    L2 :: Int64 = 100

    Ti :: Float64 = 10.0 
    Tf :: Float64 = 0.01 
    b  :: Float64 = 0.98 

    Input1 :: MyStruct = init(L1, L2)
    Input2 :: MyStruct = init(L1, L2)

    while Ti > Tf 
        Input2 = computeStep(Ti, Input1, Input2)
        Input1 = updateStep((b - 1) * Ti, Input1, Input2)
        Ti     = b * Ti
    end 

    GC.gc()

end

This is a minimal example of what my code basically does. The mathematical operations have been simplified, but the overall structure is identical.

Using @btime I get

1 Thread

143.877 ms (548918 allocations: 413.81 MiB)

4 Threads

309.731 ms (442965 allocations: 342.10 MiB)

Still, with @code_warntype I cannot find any type instabilities.
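
For reference, a rough sketch of the kind of check I ran (the specific arguments are just placeholders against the definitions above):

using InteractiveUtils   # provides @code_warntype outside the REPL

Input1 = init(100, 100)
Input2 = init(100, 100)

# look for red / Any entries in the output of these calls
@code_warntype computeKernel1(10.0, 1, Input1)
@code_warntype computeKernel2(10.0, 1, Input1, Input2)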

Your issue is not multithreading, it is that your code seems somewhat convoluted (no offense intended). You are very new to Julia, right?

Rewrite your MWE into straight-loop code (no multithreading) in a single function. You have no recursion, so this is possible (just inline everything by hand). Then we will all have an easier time spotting optimization possibilities. In principle, your main loop in MWE() should not cause a single allocation.

The cool, fancy features of Julia that permit high programmer productivity, very fast code, and genericity are only good once you understand your code well. This means writing it once in Fortran style, in your case in a single function (inline everything!) without custom types.

Type annotations are mostly unneeded in Julia. Sometimes you need them, because inference fails to figure out types; but most of the time it is better to understand why inference fails and fix that, instead of papering over the issue by just asserting types. By placing type assertions you lose the very useful hints that type instabilities give by pointing you at problematic code sections.
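
As an illustration (a toy example, not taken from the code in this thread), compare hiding an instability behind an annotation with fixing its source:

# the untyped field makes inference fail; the return-type annotation on f only
# inserts a convert at the end, @code_warntype still shows the Any from u.x
struct Unstable
    x
end

f(u :: Unstable) :: Float64 = u.x + 1.0

# fixing the source: a parametric field lets inference know the type, so no
# annotation is needed
struct Stable{T <: Real}
    x :: T
end

g(s :: Stable) = s.x + 1.0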


Yes, I come mainly from a C/Python background and started working with Julia roughly a month ago, so I really don’t take this as an offense. However, I was told that wrapping as much as possible in functions would be a good idea in Julia… Nevertheless, I’ll restructure the code and post again after that.

I don’t actually think this is good advice for everyone. Types and functions are the tools we have to organize thoughts into actual code, and throwing that away doesn’t really make sense to me. In particular, the fact that @dkiese has separated the code into small functions should make it easier to figure out what is slow and where the problem lies.

This seems, frankly, inappropriate and unwelcoming to a beginner.


You’re right; I apologize and have fixed my tone.

I don’t actually think this is good advice for everyone.

Depends. By manually inlining you see all the places where you can allocate a temporary outside of the loop, or lift a computation out of it altogether. The OP’s code has several such opportunities.
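
For example, this is the kind of temporary hoisting that manual inlining exposes (a generic sketch, unrelated to the specific code in this thread):

# before: a fresh buffer is allocated in every iteration of the loop
function accumulate_naive(n)
    s = 0.0
    for i in 1:1000
        buf = zeros(n)      # allocation inside the loop
        buf .= i
        s += sum(buf)
    end
    return s
end

# after: the buffer is allocated once and reused; the loop itself allocates nothing
function accumulate_hoisted(n)
    s   = 0.0
    buf = zeros(n)          # temporary hoisted out of the loop
    for i in 1:1000
        buf .= i
        s += sum(buf)
    end
    return s
end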

struct MyStruct
    Kernel1 :: Array{Float64, 1}
    Kernel2 :: Array{Float64, 2}
end

function init(L1 :: Int64, L2 :: Int64) :: MyStruct
    return MyStruct(ones(Float64, L1), ones(Float64, L2, L1))
end

################################################################################

# functions needed for calculation
function Factor(v :: Float64, Input :: MyStruct) :: Float64
    return v * first(Input.Kernel1)
end

function integrate!(LowerBound :: Float64, UpperBound :: Float64, Iterator :: Int64, Input1 :: MyStruct, Input2 :: MyStruct) :: Nothing
    for i in 1 : 9
        v = LowerBound + i * (UpperBound - LowerBound) / 10.0
        F = (UpperBound - LowerBound) / 10.0 * Factor(v, Input2) * v
        for j in 1 : size(Input1.Kernel2, 1)
            Input2.Kernel2[j, Iterator] += F * j
        end
    end
    return nothing
end

function computeKernel1!(T :: Float64, Iterator :: Int64, Input1 :: MyStruct, Input2 :: MyStruct) :: Nothing
    Input2.Kernel1[Iterator] = Input1.Kernel2[Iterator] / T
    return nothing
end

function computeKernel2!(T :: Float64, Iterator :: Int64, Input1 :: MyStruct, Input2 :: MyStruct) :: Nothing

    F  = Factor(T, Input1)
    for i in 1 : size(Input1.Kernel2, 1)
        Input2.Kernel2[i, Iterator] = F * T * i
    end

    integrate!(-T, T, Iterator, Input1, Input2)

    return nothing
end

function updateStep!(dT :: Float64, Input1 :: MyStruct, Input2 :: MyStruct) :: Nothing

    Input1.Kernel1 .+= dT * Input2.Kernel1
    Input1.Kernel2 .+= dT * Input2.Kernel2

    return nothing
end
################################################################################

# putting the MWE to work
function MWE() :: Nothing

    L1 = 10000
    L2 = 10000

    Ti = 10.0
    Tf = 0.1
    b  = 0.8

    Input1 = init(L1, L2)
    Input2 = init(L1, L2)

    while Ti > Tf

        println("Integration step @  $(Ti).")

        Threads.@threads for Iterator in 1 : L1
            computeKernel1!(Ti, Iterator, Input1, Input2)
        end

        Threads.@threads for Iterator in 1 : L1
            computeKernel2!(Ti, Iterator, Input1, Input2)
        end

        updateStep!((b - 1) * Ti, Input1, Input2)
        Ti = b * Ti
    end

    println(Input1.Kernel1)

end

I have restructured the code a bit (although it’s not Fortran style, I hope this doesn’t bother you too much @foobar_lv2, but I can think better if the code is split into small pieces). This version is much faster and also shows scaling with threads:

36.647 s (3373 allocations: 17.14 GiB)  (Threads.nthreads() == 1)
12.024 s (328 allocations: 17.14 GiB)   (Threads.nthreads() == 4)

It also yields the same result for Input1.Kernel1 (so more threads do not modify the outcome).
Can somebody, however, explain why there are far fewer allocations in the threaded version?

Applying this structure to the full code as well speeds up the multithreaded version quite a bit, but it still does not outpace the single-threaded version for larger systems. Considering the allocation issue apparent here, I assume this is because I need to compute some interpolation weights in every loop iteration, which are then used multiple times (this is why I allocate some space for them). Since I don’t know these weights before runtime, I currently don’t know how to proceed from here. So I am hoping there is still some issue in the MWE() above that causes a severe performance loss and that people can point me to it.
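
In case it helps clarify what I mean, here is a rough sketch of the pattern I have in mind (all names here are made up, not from my real code): the weights are computed into a preallocated per-thread buffer instead of allocating a new vector in every iteration.

# hypothetical sketch: per-thread scratch buffers for the "interpolation weights"
function computeWeights!(Output :: Matrix{Float64}, T :: Float64) :: Nothing

    NW      = size(Output, 1)
    Buffers = [zeros(Float64, NW) for _ in 1 : Threads.nthreads()]

    Threads.@threads for Iterator in 1 : size(Output, 2)
        W = Buffers[Threads.threadid()]   # scratch space owned by this thread

        # placeholder computation standing in for the real weights
        for i in 1 : NW
            W[i] = T * i / Iterator
        end

        # the weights are reused here without any further allocation
        Output[:, Iterator] .= W
    end

    return nothing
end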

Very good! This improves readability a lot. However, you still do not understand your own algorithm, and the remaining convolutedness of your code deprives you of the opportunity to understand it. To wit: inline everything. First, your original code, with minor changes to printing and without multithreading:

function MWE() :: Nothing

    L1 = 10000
    L2 = 10000

    Ti = 10.0
    Tf = 0.1
    b  = 0.8

    Input1 = init(L1, L2)
    Input2 = init(L1, L2)

    while Ti > Tf

        #println("Integration step @  $(Ti).")

        for Iterator in 1 : L1
            computeKernel1!(Ti, Iterator, Input1, Input2)
        end

        for Iterator in 1 : L1
            computeKernel2!(Ti, Iterator, Input1, Input2)
        end

        updateStep!((b - 1) * Ti, Input1, Input2)
        Ti = b * Ti
    end

    println(Input1.Kernel1[1:10])
end
julia> @time MWE()
[2.60948e6, 5.21896e6, 7.82844e6, 1.04379e7, 1.30474e7, 1.56569e7, 1.82664e7, 2.08759e7, 2.34853e7, 2.60948e7]
 48.036136 seconds (48.50 k allocations: 17.141 GiB, 2.87% gc time)

Inline everything:

function MWE2() :: Nothing

    L1 = 10000
    L2 = 10000

    Ti = 10.0
    Tf = 0.1
    b  = 0.8

    Input1 = MyStruct(ones(Float64, L1), ones(Float64, L2, L1))
    Input2 = MyStruct(ones(Float64, L1), ones(Float64, L2, L1))

    while Ti > Tf

       # println("Integration step @  $(Ti).")

        for Iterator in 1 : L1
            let T = Ti 
                Input2.Kernel1[Iterator] = Input1.Kernel2[Iterator] / T
            end
        end

        for Iterator in 1 : L1
            let T = Ti 
            #computeKernel2!(Ti, Iterator, Input1, Input2)
                F  = T * first(Input1.Kernel1)
                for i in 1 : size(Input1.Kernel2, 1)
                    Input2.Kernel2[i, Iterator] = F * T * i
                end

                #integrate!(-T, T, Iterator, Input1, Input2)
                let LowerBound = -T, UpperBound = T
                    for i in 1 : 9
                        v = LowerBound + i * (UpperBound - LowerBound) / 10.0
                        F = (UpperBound - LowerBound) / 10.0 * ( v * first(Input2.Kernel1)) * v
                        for j in 1 : size(Input1.Kernel2, 1)
                            Input2.Kernel2[j, Iterator] += F * j
                        end
                    end
                end

            end
        end
        let dT = (b - 1) * Ti
            Input1.Kernel1 .+= dT * Input2.Kernel1
            Input1.Kernel2 .+= dT * Input2.Kernel2
        end
        Ti = b * Ti
    end

    println(Input1.Kernel1[1:10])

end
julia> @time MWE2()
[2.60948e6, 5.21896e6, 7.82844e6, 1.04379e7, 1.30474e7, 1.56569e7, 1.82664e7, 2.08759e7, 2.34853e7, 2.60948e7]
 46.487035 seconds (142.37 k allocations: 17.145 GiB, 2.16% gc time)

Now you can refactor your code: remove MyStruct, fix missing dots, possibly replace loops with one-line broadcasts, hoist the integrate loop, and remove unneeded temporaries. Note that Kernel2[i, j] == Kernel2[i, k] for every j, k, so you can cut the work by a factor of 10000; I assume this is a bug in your code? Don’t think about @threads until you understand your algorithm, and you don’t understand your algorithm until you know every loop and every dependency. By inlining, loops and dependencies become obvious.
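
For instance, "fix missing dots" and "replace loops by one-line broadcasts" would look roughly like this (a sketch, not a drop-in replacement):

# updateStep!: fusing the broadcast avoids allocating the right-hand side
Input1.Kernel1 .+= dT .* Input2.Kernel1
Input1.Kernel2 .+= dT .* Input2.Kernel2

# computeKernel2!: the inner fill loop as a single broadcast over the row index
Input2.Kernel2[:, Iterator] .= (F * T) .* (1 : size(Input2.Kernel2, 1))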