Thanks for your advice! First I had a look at ThreadPooling.jl to restrict the number of available threads for the loop, but your solution is more useful to me, especially since it is easy to adapt for use with Distributed.jl. I haven't done that yet, but I will definitely give it a go and compare the result to thread parallelisation. For me that is a good way to quantify general statements like "parallelisation type x works well if the problem size is big enough", and so on.
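Something like this is what I have in mind for the Distributed.jl version (an untested sketch; the worker count and the one-problem-per-pmap-task granularity are just assumptions):

using Distributed
addprocs(4)  # assumed worker count, not tuned
# the single-problem solver must be defined on every worker
@everywhere function calcpotential(ax)
    potential = 0.0
    for i in eachindex(ax), j in 1:i-1
        potential -= 1/abs(ax[i] - ax[j])
    end
    return potential
end
x = [rand(Float64, 500) for _ in 1:1000]
minpot = minimum(pmap(calcpotential, x))  # pmap hands one problem to each worker at a time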
Skeptical person that I am, I implemented this splitter method and ran it with different numbers of threads. It gave consistent results: e.g. with 1 thread, chunking does not improve the timing; with 2 threads, 2 chunks gave the most speed-up; and so on. I could also see in htop that only as many threads are active as there are chunks.
I then implemented a bigger computation to see how thread parallelisation scales with problem size. I expected the biggest problem to show the biggest speed-up, but that was not the case. Any idea why?
Here is the code to reproduce it (I ran it with 16 threads; note that execution takes about a minute):
using BenchmarkTools
using Plots
using DataFrames
using Colors, ColorSchemes
nrofproblems = 1000
nrofparticleslist = [10, 500, 5000]
function splitter(n, nchunks, ichunk)
    n_per_chunk = div(n, nchunks)  # only exact when n is a multiple of nchunks; trailing elements are dropped otherwise
    first = (ichunk - 1) * n_per_chunk + 1
    last = ichunk * n_per_chunk
    return first:last
end
function generalsplitter(n::Int64, nchunks::Int64, ichunk::Int64)
    return (floor(Int, (ichunk - 1) * (n / nchunks)) + 1):floor(Int, ichunk * n / nchunks)
end
# function that solves all nrofproblems problems, chunked across threads
function calcminimumpotential(x; nchunks=Threads.nthreads())
    resultperchunk = fill(Inf, nchunks)  # per-chunk minima; Inf so min() works from the first problem
    Threads.@threads for ichunk in 1:nchunks
        #for i in splitter(length(x), nchunks, ichunk)
        for i in generalsplitter(length(x), nchunks, ichunk)
            # keep the minimum over the chunk; a plain assignment would only keep the last problem's result
            @inbounds resultperchunk[ichunk] = min(resultperchunk[ichunk], calcpotential(x[i]))
        end
    end
    return minimum(resultperchunk)
end
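As a quick sanity check (an addition on top of the benchmark code, not part of it), the threaded result can be compared against a serial reference over the same problems:

xcheck = [rand(Float64, 100) for i in 1:64]
@assert calcminimumpotential(xcheck) == minimum(calcpotential.(xcheck))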
# function that solves one problem -> the unit of work to be distributed
function calcpotential(ax::AbstractVector{Float64})
    elements = length(ax)
    potential = 0.0  # Float64 literal keeps the accumulator type-stable
    for i in 1:elements
        for j in 1:(i-1)
            potential -= 1/abs(ax[i] - ax[j])
        end
    end
    return potential
end
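For reference, a worked value on a tiny input, easy to check by hand (the three pairs contribute -1/0.5, -1/1.0, and -1/0.5):

julia> calcpotential([0.0, 0.5, 1.0])
-5.0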
# DataFrame to store results
df = DataFrame()
# loop through parameter points
for nrofparticles in nrofparticleslist
    # create the list of problems
    global x = [rand(Float64, nrofparticles) for i in 1:nrofproblems]
    # benchmark the solution of a single problem ($ interpolation avoids measuring global-variable access)
    bmres = @benchmark calcpotential($(x[1]))
    bmsingleproblem = minimum(bmres.times)
    # list of chunk counts to analyse -> the problem list will be chunked up into this many sublists
    thenchunks = [1, 2, 4, 6, 10, 16, 24]
    # minimum benchmark time per chunk count (Float64, since BenchmarkTools reports times in ns as Float64)
    result = zeros(length(thenchunks))
    for (i, nchunks) in enumerate(thenchunks)
        print("/ chunks= ", nchunks, "...")
        bmres = @benchmark calcminimumpotential($x; nchunks=$nchunks)
        result[i] = minimum(bmres.times)
    end
    # create tmpdf with results and append to df
    tmpdf = DataFrame(nthreads = [Threads.nthreads() for i in thenchunks],
                      nrofparticles = [nrofparticles for i in thenchunks],
                      nrofproblems = [nrofproblems for i in thenchunks],
                      bmsingleproblem = [bmsingleproblem for i in thenchunks],
                      nchunks = thenchunks,
                      speedup = result[1] ./ result)
    append!(df, tmpdf)
end
uniquenrofparticles = unique(df.nrofparticles)  # unique particle counts, one curve each
mycolor = ColorScheme(range(colorant"red", colorant"black", length=length(uniquenrofparticles)))
p = plot(xlabel="nchunks", ylabel="speedup",
         title="speed-up in N-particle interactions (nrth=" * string(Threads.nthreads()) * ")",
         xlim=[1, maximum(df.nchunks)], ylim=[1, 1.5*maximum(df.speedup)], legend=:topleft)
for (i, nrofparticle) in enumerate(uniquenrofparticles)
    subdf = filter(:nrofparticles => ==(nrofparticle), df)
    if i == 1
        plot!(p, subdf.nchunks, subdf.nchunks, label="perfect scaling", line=(:dash, 3), color="black")
    end
    # benchmark times are in ns, so /10^6 gives milliseconds
    plot!(p, subdf.nchunks, subdf.speedup,
          label="t(single problem)/ms=" * string(subdf.bmsingleproblem[1] / 10^6),
          linewidth=3, color=mycolor[i])
end
display(p)
Something else that surprised me: when I execute the code above several times I get slightly different results; the speed-up factor for the same parameter point can change from, e.g., 5 to 6.
But I guess that could be taken care of by tuning BenchmarkTools to the size of the problem.
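For example, the sample count and time budget can be pinned per benchmark so each parameter point is measured for a comparable amount of time (the values below are placeholder guesses, not tuned ones):

bmres = @benchmark calcminimumpotential($x; nchunks=$nchunks) samples=50 seconds=30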
Anyway, the thing that puzzles me the most is why the "biggest problem" did not speed up the most under parallelisation!