# Poor performance while multithreading (Julia 1.0)

#21

This is kinda single-threaded, isn’t it?

You will get better answers if you boil your problem down to a self-contained MWE that people can run without loading any outside code by just copy-pasting into the REPL.

Many questions also answer themselves during this process (you try to remove one part of the calculation for simplicity, and suddenly it is fast and the problem becomes obvious in retrospect).

#22

This is kinda single-threaded, isn’t it?

I am not sure; if I run this with more than a single thread, performance decreases, but I guess that's just the overhead from locking and unlocking then.

I’ll try to come up with a minimal example.

#23

I think this should work

```julia
# necessary for benchmarking
using BenchmarkTools

# definition of MyStruct
struct MyStruct
    Kernel1 :: Vector{Float64}
    Kernel2 :: Matrix{Float64}
end

# generate MyStruct
function init(L1 :: Int64, L2 :: Int64) :: MyStruct
    return MyStruct(rand(Float64, L1), rand(Float64, L2, L1))
end

function Factor(v :: Float64, Input :: MyStruct) :: Float64
    Val :: Float64 = v * first(Input.Kernel1)
    return Val
end

function integrate(Kernel :: Function, LowerBound :: Float64, UpperBound :: Float64, Dim :: Int64) :: Vector{Float64}

    Integral :: Vector{Float64} = zeros(Float64, Dim)
    Nodes    :: Vector{Float64} = Float64[LowerBound + i * (UpperBound - LowerBound) / 10.0 for i in 1 : 10]

    for i in 1 : 9
        Integral .+= (Nodes[i + 1] - Nodes[i]) .* Kernel(Nodes[i])
    end

    return Integral
end

function computeKernel1(T :: Float64, Iterator :: Int64, Input :: MyStruct) :: Float64
    Val :: Float64 = Input.Kernel1[Iterator] * Input.Kernel2[Iterator] / T
    return Val
end

function computeKernel2(T :: Float64, Iterator :: Int64, Input1 :: MyStruct, Input2 :: MyStruct) :: Vector{Float64}

    Buffer :: Vector{Float64} = zeros(Float64, size(Input1.Kernel2, 1))

    function IntegrationKernel1(v :: Float64) :: Vector{Float64}

        for i in 1 : size(Input1.Kernel2, 1)
            Buffer[i] = v * i
        end

        return Factor(v, Input1) .* Buffer
    end

    function IntegrationKernel2(v :: Float64) :: Vector{Float64}

        for i in 1 : size(Input1.Kernel2, 1)
            Buffer[i] = v * i
        end

        return Factor(v, Input2) * Buffer
    end

    Result :: Vector{Float64} = IntegrationKernel1(T)
    Result .+= integrate(IntegrationKernel2, -T, T, size(Input1.Kernel2, 1))

    return Result
end

function computeStep(T :: Float64, Input1 :: MyStruct, Input2 :: MyStruct) :: MyStruct

    for Iterator in 1 : length(Input2.Kernel1)
        Input2.Kernel1[Iterator] = computeKernel1(T, Iterator, Input1)
    end

    for Iterator in 1 : length(Input2.Kernel1)
        Input2.Kernel2[:, Iterator] = computeKernel2(T, Iterator, Input1, Input2)
    end

    return Input2
end

function updateStep(dT :: Float64, Input1 :: MyStruct, Input2 :: MyStruct) :: MyStruct

    Input1.Kernel1 .+= dT * Input2.Kernel1
    Input1.Kernel2 .+= dT * Input2.Kernel2

    return Input1
end

# running the MWE
function MWE() :: Nothing

    L1 :: Int64 = 100
    L2 :: Int64 = 100

    Ti :: Float64 = 10.0
    Tf :: Float64 = 0.01
    b  :: Float64 = 0.98

    Input1 :: MyStruct = init(L1, L2)
    Input2 :: MyStruct = init(L1, L2)

    while Ti > Tf
        Input2 = computeStep(Ti, Input1, Input2)
        Input1 = updateStep((b - 1) * Ti, Input1, Input2)
        Ti     = b * Ti
    end

    GC.gc()
end
```

This is a minimal example of what my code basically does. The mathematical operations involved were simplified, but the structure as a whole is identical.

Using `@btime` I get

```
143.877 ms (548918 allocations: 413.81 MiB)
```

```
309.731 ms (442965 allocations: 342.10 MiB)
```

Still, with `@code_warntype` I cannot find any type instabilities.

#24

Your issue is not multithreading; it is that your code seems somewhat convoluted (no offense intended). You are very new to Julia, right?

Rewrite your MWE into straight-loop code (no multithreading) in a single function. You have no recursion, so this is possible (just inline everything by hand). Then we will all have an easier time spotting optimization possibilities. In principle, your main loop in `MWE()` should not cause a single allocation.
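The zero-allocation claim can be checked with `@allocated`; a minimal sketch (toy code, not the OP's algorithm, with names invented for illustration):

```julia
# Minimal sketch of an allocation-free main loop: all arrays are
# preallocated and only mutated in place inside the loop body.
function step!(a, b, dT)
    for i in eachindex(a)
        a[i] += dT * b[i]
    end
    return nothing
end

a = ones(100)
b = ones(100)
step!(a, b, 0.5)                  # warm-up call (triggers compilation)
n = @allocated step!(a, b, 0.5)   # typically reports 0 after warm-up
```

If `n` is nonzero here, something inside the loop is still allocating and is worth hunting down before thinking about threads.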

The cool fancy features of Julia that permit high programmer productivity, very fast code, and genericity are only good once you understand your code well. This means writing it once in Fortran style, in your case in a single function (inline everything!) without custom types.

Type annotations are mostly unneeded in Julia. Sometimes you need them, because inference fails to figure out types; but most of the time it is better to understand why inference fails and fix that, instead of papering over the issue by just asserting types. By placing type asserts you lose the very useful hints that type instabilities give you about problematic code sections.
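A minimal sketch of this point, with illustrative names (`coeff`, `scaled`, etc. are made up for the example):

```julia
# A non-const global makes inference fail inside any function that uses it.
coeff = 0.5   # type unknown to the compiler at the use site

# Papering over it with a type assert "works", but hides the real problem:
scaled_assert(x) = (x * coeff) :: Float64

# Better: fix the root cause, e.g. make the global const (or pass it as an
# argument), so inference succeeds without any annotation.
const COEFF = 0.5
scaled(x) = x * COEFF
```

Running `@code_warntype scaled_assert(2.0)` would still flag the unstable global lookup, whereas `scaled` infers cleanly with no annotation at all.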

#25

Yes, I come mainly from a C/Python background and started working with Julia roughly a month ago, so I really don't take this as an offense. However, I was told that wrapping as much as possible in functions would be a good idea in Julia… Nevertheless, I'll restructure the code and post again after that.

#26

I don’t actually think this is good advice for everyone. Types and functions are the tools we have to organize thoughts into actual code, and throwing that away doesn’t really make sense to me. In particular, the fact that @dkiese has separated the code out into small functions should make it easier to figure out what is slow and where the problem lies.

This seems, frankly, inappropriate and unwelcoming to a beginner.

#27

You’re right; I apologize and have fixed my tone.

I don’t actually think this is good advice for everyone.

Depends. By manually inlining, you see all the places where you can allocate a temporary outside of the loop, or lift a computation out of it altogether. The OP's code has several such opportunities.
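A toy sketch (names invented for illustration, not the OP's code) of the kind of per-iteration temporary that manual inlining exposes:

```julia
# Naive version: a fresh temporary vector is allocated on every iteration.
function scaled_colsums_naive(A)
    out = zeros(size(A, 2))
    for j in 1:size(A, 2)
        tmp = 2.0 .* A[:, j]          # new allocation each pass
        out[j] = sum(tmp)
    end
    return out
end

# Hoisted version: one buffer is allocated outside the loop and refilled
# in place, so the loop body itself allocates nothing.
function scaled_colsums_hoisted(A)
    out = zeros(size(A, 2))
    tmp = zeros(size(A, 1))           # single reusable buffer
    for j in 1:size(A, 2)
        tmp .= 2.0 .* view(A, :, j)   # fill in place, no new allocation
        out[j] = sum(tmp)
    end
    return out
end
```

Both return the same result; only the allocation pattern differs, and that difference is invisible until the helper calls are inlined by hand.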

#28
```julia
struct MyStruct
    Kernel1 :: Array{Float64, 1}
    Kernel2 :: Array{Float64, 2}
end

function init(L1 :: Int64, L2 :: Int64) :: MyStruct
    return MyStruct(ones(Float64, L1), ones(Float64, L2, L1))
end

################################################################################

# functions needed for calculation
function Factor(v :: Float64, Input :: MyStruct) :: Float64
    return v * first(Input.Kernel1)
end

function integrate!(LowerBound :: Float64, UpperBound :: Float64, Iterator :: Int64, Input1 :: MyStruct, Input2 :: MyStruct) :: Nothing
    for i in 1 : 9
        v = LowerBound + i * (UpperBound - LowerBound) / 10.0
        F = (UpperBound - LowerBound) / 10.0 * Factor(v, Input2) * v
        for j in 1 : size(Input1.Kernel2, 1)
            Input2.Kernel2[j, Iterator] += F * j
        end
    end
    return nothing
end

function computeKernel1!(T :: Float64, Iterator :: Int64, Input1 :: MyStruct, Input2 :: MyStruct) :: Nothing
    Input2.Kernel1[Iterator] = Input1.Kernel2[Iterator] / T
    return nothing
end

function computeKernel2!(T :: Float64, Iterator :: Int64, Input1 :: MyStruct, Input2 :: MyStruct) :: Nothing

    F = Factor(T, Input1)
    for i in 1 : size(Input1.Kernel2, 1)
        Input2.Kernel2[i, Iterator] = F * T * i
    end

    integrate!(-T, T, Iterator, Input1, Input2)

    return nothing
end

function updateStep!(dT :: Float64, Input1 :: MyStruct, Input2 :: MyStruct) :: Nothing

    Input1.Kernel1 .+= dT * Input2.Kernel1
    Input1.Kernel2 .+= dT * Input2.Kernel2

    return nothing
end

################################################################################

# putting the MWE to work
function MWE() :: Nothing

    L1 = 10000
    L2 = 10000

    Ti = 10.0
    Tf = 0.1
    b  = 0.8

    Input1 = init(L1, L2)
    Input2 = init(L1, L2)

    while Ti > Tf

        println("Integration step @ $(Ti).")

        for Iterator in 1 : L1
            computeKernel1!(Ti, Iterator, Input1, Input2)
        end

        for Iterator in 1 : L1
            computeKernel2!(Ti, Iterator, Input1, Input2)
        end

        updateStep!((b - 1) * Ti, Input1, Input2)
        Ti = b * Ti
    end

    println(Input1.Kernel1)
end
```

I have restructured the code a bit (although it's not Fortran style; I hope this doesn't bother you too much, @foobar_lv2, but I can think better if the code is split into small pieces). This version is much faster and also shows scaling with threads:

```
36.647 s (3373 allocations: 17.14 GiB) (Threads.nthreads = 1)
```

It also yields the same result for `Input1.Kernel1` (so more threads do not modify the outcome).
Can somebody, however, explain why there are far fewer allocations for the threaded version?

Applying this structure to the full code also speeds up the multithreaded version quite a bit, but it still does not outpace the single-threaded version for larger systems. Considering the issue that allocations presented here, I assume this is because I need to compute some interpolation weights in every loop iteration, which are then used multiple times (this is why I allocate some space for them). Since I don't know these weights before runtime … I currently don't know how to proceed from here. So I am hoping there is still some issue in the above `MWE()` that causes a severe performance loss that people can point me to.
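For the interpolation-weight situation, a common pattern is to allocate the weight buffer once, outside the loop, and refill it in place each iteration. A hedged sketch, with made-up names and a placeholder formula standing in for the real weight computation:

```julia
# Sketch: `weights` is allocated once by the caller and reused by every
# iteration, so the per-iteration weight computation allocates nothing.
function run!(weights::Vector{Float64}, Ti::Float64, Tf::Float64, b::Float64)
    acc = 0.0
    while Ti > Tf
        # recompute the weights in place (placeholder formula)
        for i in eachindex(weights)
            weights[i] = 1.0 / (Ti + i)
        end
        acc += sum(weights)   # weights reused here, possibly many times
        Ti *= b
    end
    return acc
end

weights = zeros(16)   # allocated once, up front
run!(weights, 10.0, 0.1, 0.8)
```

Not knowing the weights before runtime is not a problem; what matters is that their *size* is known, so the storage can be preallocated even though the values are recomputed every iteration.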

#29

Very good! This improves readability a lot. However, you still do not understand your own algorithm, and the remaining convolutedness of your code deprives you of the opportunity to understand it. To wit: inline everything. First, your original code, with minor changes to printing and without multithreading:

```julia
function MWE() :: Nothing

    L1 = 10000
    L2 = 10000

    Ti = 10.0
    Tf = 0.1
    b  = 0.8

    Input1 = init(L1, L2)
    Input2 = init(L1, L2)

    while Ti > Tf

        # println("Integration step @ $(Ti).")

        for Iterator in 1 : L1
            computeKernel1!(Ti, Iterator, Input1, Input2)
        end

        for Iterator in 1 : L1
            computeKernel2!(Ti, Iterator, Input1, Input2)
        end

        updateStep!((b - 1) * Ti, Input1, Input2)
        Ti = b * Ti
    end

    println(Input1.Kernel1[1:10])
end
```

```
julia> @time MWE()
[2.60948e6, 5.21896e6, 7.82844e6, 1.04379e7, 1.30474e7, 1.56569e7, 1.82664e7, 2.08759e7, 2.34853e7, 2.60948e7]
 48.036136 seconds (48.50 k allocations: 17.141 GiB, 2.87% gc time)
```

Inline everything:

```julia
function MWE2() :: Nothing

    L1 = 10000
    L2 = 10000

    Ti = 10.0
    Tf = 0.1
    b  = 0.8

    Input1 = MyStruct(ones(Float64, L1), ones(Float64, L2, L1))
    Input2 = MyStruct(ones(Float64, L1), ones(Float64, L2, L1))

    while Ti > Tf

        # println("Integration step @ $(Ti).")

        for Iterator in 1 : L1
            let T = Ti
                Input2.Kernel1[Iterator] = Input1.Kernel2[Iterator] / T
            end
        end

        for Iterator in 1 : L1
            let T = Ti
                # computeKernel2!(Ti, Iterator, Input1, Input2)
                F = T * first(Input1.Kernel1)
                for i in 1 : size(Input1.Kernel2, 1)
                    Input2.Kernel2[i, Iterator] = F * T * i
                end

                # integrate!(-T, T, Iterator, Input1, Input2)
                let LowerBound = -T, UpperBound = T
                    for i in 1 : 9
                        v = LowerBound + i * (UpperBound - LowerBound) / 10.0
                        F = (UpperBound - LowerBound) / 10.0 * (v * first(Input2.Kernel1)) * v
                        for j in 1 : size(Input1.Kernel2, 1)
                            Input2.Kernel2[j, Iterator] += F * j
                        end
                    end
                end
            end
        end

        let dT = (b - 1) * Ti
            Input1.Kernel1 .+= dT * Input2.Kernel1
            Input1.Kernel2 .+= dT * Input2.Kernel2
        end
        Ti = b * Ti
    end

    println(Input1.Kernel1[1:10])
end
```

```
julia> @time MWE2()
[2.60948e6, 5.21896e6, 7.82844e6, 1.04379e7, 1.30474e7, 1.56569e7, 1.82664e7, 2.08759e7, 2.34853e7, 2.60948e7]
 46.487035 seconds (142.37 k allocations: 17.145 GiB, 2.16% gc time)
```

Now you can refactor your code: remove `MyStruct`, fix missing dots, possibly replace loops by one-line broadcasts, hoist the integrate loop, and remove unneeded temporaries. Note that `Kernel2[i, j] = Kernel2[i, k]` for every `j, k`, so you can cut the work by a factor of 10000. I am assuming that this is a bug in your code? Don't think about `@threads` until you understand your algorithm, and you don't understand your algorithm until you know every loop and every dependency. By inlining, loops and dependencies become obvious.
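As a concrete instance of the "missing dots" point, a small illustrative snippet (arrays made up for the example):

```julia
a  = ones(5)
b  = ones(5)
dT = 0.5

# Half-dotted: `dT * b` materializes a temporary vector before the .+=
a .+= dT * b

# Fully dotted: the whole expression fuses into one allocation-free loop
a .+= dT .* b

# Equivalent explicit loop, which the fused broadcast compiles down to:
for i in eachindex(a)
    a[i] += dT * b[i]
end
```

The `updateStep!` lines above (`Input1.Kernel1 .+= dT * Input2.Kernel1`) are the half-dotted form and allocate a full temporary array on every call; adding the remaining dot removes that allocation.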