Multithreading of a simple loop

Hi, I am trying to use multi-threading to parallelise a simple loop. The actual code is quite complex, so I have given a simplified example below.

The multithreaded version runs slower than the single-threaded version. I have probably not used the @threads macro correctly (and I am not sure whether I need to introduce any locks, since outputs are stored to shared arrays).

How can I improve performance of the multi-threaded version? Is this the right way of using multi-threading?

using LinearAlgebra, CSV, DataFrames, BenchmarkTools

function generate_data(m)
    Values = rand(20.0:140.0, m)
    return Values
end

function summation(Values)
    A = cumsum(Values, dims = 1)
    return A
end

function some_thing(Values)
    B = sum(Values, dims = 1)
    return B
end

function run_singlethread(m, n)
    V_id = Array{Float64,2}(undef, m, n)
    A_id = Array{Float64,2}(undef, m, n)
    B_id = Vector{Float64}(undef, n)
    for i in 1:n
        Values = generate_data(m)
        A = summation(Values)
        B = some_thing(Values)
        V_id[:, i] = Values
        A_id[:, i] = A
        B_id[i] = B[1]
    end
    df1 = DataFrame(V_id)
    df2 = DataFrame(A_id)
    df3 = DataFrame(ID = 1:n, some_thing = B_id)
    CSV.write("DataFrame1.csv", df1)
    CSV.write("DataFrame2.csv", df2)
    CSV.write("DataFrame3.csv", df3)
    return A_id, V_id, B_id
end

function multithread_run(m, n)
    V_id = Array{Float64,2}(undef, m, n)
    A_id = Array{Float64,2}(undef, m, n)
    B_id = Vector{Float64}(undef, n)
    Threads.@threads for i in 1:n
        Values = generate_data(m)
        A = summation(Values)
        B = some_thing(Values)
        V_id[:, i] = Values
        A_id[:, i] = A
        B_id[i] = B[1]
    end
    df1 = DataFrame(V_id)
    df2 = DataFrame(A_id)
    df3 = DataFrame(ID = 1:n, some_thing = B_id)
    CSV.write("DataFrame1mthreads.csv", df1)
    CSV.write("DataFrame2mthreads.csv", df2)
    CSV.write("DataFrame3mthreads.csv", df3)
    return A_id, V_id, B_id
end

Run-time code

@btime run_singlethread(3,10000)      #48.490 ms
@btime multithread_run(3,10000)       #49.847 ms

Did you remember to set JULIA_NUM_THREADS before launching Julia? What is Threads.nthreads()?
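For reference, you can check from within Julia itself (note that the thread count must be set before launching the session, e.g. via the environment variable or the -t flag; it cannot be changed afterwards):

```julia
# Check how many threads this Julia session was started with.
# Set JULIA_NUM_THREADS=4 (or start with `julia -t 4`) *before* launching.
println(Threads.nthreads())  # 1 means the loop cannot actually run in parallel
```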

Don’t allocate arrays in your inner loop if you can help it (pre-allocate arrays before running performance-critical code). (Both rand and cumsum allocate new arrays.)
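A minimal sketch of the pre-allocation idea applied to your loop (using rand! and cumsum! to fill existing buffers in place; the function name run_preallocated is just illustrative):

```julia
using Random

function run_preallocated(m, n)
    V_id = Matrix{Float64}(undef, m, n)
    A_id = Matrix{Float64}(undef, m, n)
    B_id = Vector{Float64}(undef, n)
    for i in 1:n
        col = @view V_id[:, i]            # write random data directly into V_id
        rand!(col, 20.0:140.0)
        cumsum!(@view(A_id[:, i]), col)   # in-place cumulative sum, no temporary
        B_id[i] = sum(col)                # sum of a vector is a scalar, no allocation
    end
    return A_id, V_id, B_id
end
```

The same pattern works inside Threads.@threads, since each iteration still writes to disjoint columns.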

(If you are doing lots of calculations on 3-component arrays as in your example here, you should strongly consider using StaticArrays.jl instead. E.g., V should be a Vector{SVector{3,Float64}}(undef, 10^4) rather than a 3 × 10^4 matrix.)
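Assuming StaticArrays.jl is installed, that layout change might look like the following sketch (the rescaling to the 20–140 range is just to mimic the example's data):

```julia
using StaticArrays

n = 10^4
V = Vector{SVector{3,Float64}}(undef, n)
A = Vector{SVector{3,Float64}}(undef, n)
for i in 1:n
    v = @SVector rand(3)          # one stack-allocated 3-vector, no heap allocation
    V[i] = 20.0 .+ 120.0 .* v     # rescale roughly into the 20–140 range
    A[i] = cumsum(V[i])           # cumsum of an SVector returns an SVector
end
```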

(I would typically also only try to parallelize code that is expensive enough to run for at least several seconds.)


Thanks. I am using Juno, which I believe starts Julia with the number of threads equal to the number of cores.
Threads.nthreads() is equal to 4. This was just an example so I used rand to generate some data. In the actual code, I am running functions which I need to call in a loop on different sets of data, so generating random data for a MWE seemed a good choice to me. The actual code has a large number of iterations in for loop, so parallelising it makes sense.

Do I need to introduce locks in this example? If so, what would be a good choice?

You don’t need locks since different loop iterations are writing to disjoint elements of the shared arrays.
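For contrast, a lock would be needed if iterations updated shared state, e.g. accumulating into a single scalar. A sketch using ReentrantLock (illustrative only — not needed for your original code):

```julia
function locked_sum(n)
    total = 0.0
    lk = ReentrantLock()
    Threads.@threads for i in 1:n
        x = sum(rand(3))      # per-iteration work, thread-local
        lock(lk) do           # serialize only the update to the shared accumulator
            total += x
        end
    end
    return total
end
```

A common faster alternative is to accumulate into per-thread partial sums and combine them after the loop, avoiding the lock entirely.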

The problem is the CSV.write calls. They take much longer than the actual data generation and are bound by I/O. By removing all lines after the loop (from df1 = DataFrame(V_id) up to the return) I get the following numbers:

@btime run_singlethread(3, 10000)
# 1.847 ms (30006 allocations: 3.59 MiB)
@btime multithread_run(3, 10000)
# 814.549 μs (40051 allocations: 4.05 MiB)
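One way to see this yourself is to split the computation from the I/O and benchmark each part separately. A sketch, where compute is a hypothetical variant of multithread_run with the CSV.write calls removed:

```julia
function compute(m, n)
    V_id = Matrix{Float64}(undef, m, n)
    A_id = Matrix{Float64}(undef, m, n)
    B_id = Vector{Float64}(undef, n)
    Threads.@threads for i in 1:n
        Values = rand(20.0:140.0, m)
        V_id[:, i] = Values
        A_id[:, i] = cumsum(Values)
        B_id[i] = sum(Values)
    end
    return A_id, V_id, B_id
end

# @btime compute(3, 10_000)   # measures only the parallel work;
# the DataFrame construction and CSV.write can then run once, outside the benchmark
```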

Thanks @stevengj and @Skoffer
I am running into a strange error with multi-threading when I use it with JuMP in a similar manner. I will post that as a separate question.

Link to the question can be found here