I am trying to efficiently parallelize code that relies on a big data structure (say, matrices of size 10^5 x 10^3). A small example is provided below. I would like to avoid creating/copying the same data structure on each core; instead, it would be convenient to share the same structure across the parallel workers.
The iterative code is the following:

# define the data structure
struct complex_mvts
    raw_data::Matrix{Float64}
    num_ROI::Int
    T::Int
end
# constructor function
function create_data_structure(data::Matrix{Float64})
    N, T = size(data)
    return complex_mvts(data, N, T)
end
# function operating on one column
function launch_one_t(TS::complex_mvts, t::Int64)
    a = sum(TS.raw_data[:, t])
    println(a)
end
# main function
function main()
    # create the data structure locally
    data = rand(100000, 1000)
    TS = create_data_structure(data)
    # launch the function iteratively
    for t in 1:1000
        launch_one_t(TS, t)
    end
end
# call main function
main()

I have tried using Distributed, but it doesn’t seem to provide any kind of speedup (it is actually far slower). Do you have any suggestions?

# import Distributed library
using Distributed
# add worker processes
addprocs(4)
# create remote data structure
@everywhere struct complex_mvts
    raw_data::Matrix{Float64}
    num_ROI::Int
    T::Int
end
# create remote function
@everywhere function create_data_structure(data::Matrix{Float64})
    N, T = size(data)
    return complex_mvts(data, N, T)
end
# define launch function
@everywhere function launch_one_t(TS::complex_mvts, t::Int64)
    s = sum(TS.raw_data[:, t])
    print(s, " ")
end
# main function
function main()
    # create local data structure
    data = rand(100000, 1000)
    TS = create_data_structure(data)
    # launch function in parallel
    @sync @distributed for t in 1:100
        launch_one_t(TS, t)
    end
end
# call main function
main()

The main difference between flavors of parallel computing is that multiprocessing (with Distributed) relies on distributed memory, whereas multithreading (with Base.Threads) relies on shared memory: see the documentation page. So in many cases, the data transfer alone will make Distributed much slower. Have you tried plain multithreading?
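For example, with shared memory all threads can read columns of the same matrix in place, with nothing copied to workers. A minimal sketch (assumes Julia was started with multiple threads, e.g. `julia --threads 4`; function and variable names are just for illustration):

```julia
using Base.Threads

# Each thread sums columns of the same matrix in place;
# no data is serialized to worker processes as with Distributed.
function column_sums(data::Matrix{Float64})
    sums = Vector{Float64}(undef, size(data, 2))
    @threads for t in axes(data, 2)
        sums[t] = sum(@view data[:, t])
    end
    return sums
end

data = rand(1_000, 100)
s = column_sums(data)
```

The same code also runs correctly with a single thread; `@threads` simply splits the iteration range across however many threads are available.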

Thanks a lot for the info! I’ve had a look at the documentation, but there seems to be no significant speed-up even with plain multithreading. I tried wrapping the for loop (with Base.Threads) as follows:

@threads for t in 1:1000
    launch_one_t(TS, t)
end

I come from Python, where in the past I used global variables for this kind of problem, but I know that in Julia global variables should be avoided when possible.
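(For what it’s worth, if a global really is needed in Julia, declaring it `const` avoids the type-instability penalty of plain globals. A minimal sketch, with a hypothetical `DATA` matrix:)

```julia
# Non-const globals are type-unstable in Julia, which makes hot loops slow.
# Declaring the global const lets the compiler know its type.
const DATA = rand(1_000, 100)

sum_col_global(t) = sum(@view DATA[:, t])

# Idiomatic alternative: pass the data as an argument instead of using a global.
sum_col(data, t) = sum(@view data[:, t])
```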

What’s your Threads.nthreads()? Also, your slice TS.raw_data[:,t] allocates a temporary copy, which could be slowing things down. A column view like @view(TS.raw_data[:,t]) is contiguous, so it shouldn’t slow the sum down, but it’s worth checking.
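The difference is easy to check with `@allocated` (a quick sketch; sizes and names are arbitrary):

```julia
A = rand(10_000, 100)

f_slice(A, t) = sum(A[:, t])           # A[:, t] copies the column
f_view(A, t)  = sum(@view A[:, t])     # @view reads it in place

f_slice(A, 1); f_view(A, 1)            # run once first so compilation isn't counted
alloc_slice = @allocated f_slice(A, 1)  # copies ~80 KB per call here
alloc_view  = @allocated f_view(A, 1)   # typically 0 bytes
```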

I was using Jupyter to launch the code, so I had only 1 thread available. I ran some tests using 12 threads (both in Jupyter and from the shell with julia --threads 12) and there is a slight speed-up. Using @view(TS.raw_data[:,t]) also helped. Here is the code that I am running, for reference:

using Base.Threads
using BenchmarkTools
# define the data structure
struct complex_mvts
    raw_data::Matrix{Float64}
    num_ROI::Int
    T::Int
end
# constructor function
function create_data_structure(data::Matrix{Float64})
    N, T = size(data)
    return complex_mvts(data, N, T)
end
# function operating on one column
function launch_one_t(TS::complex_mvts, t::Int64)
    s = sum(@view(TS.raw_data[:, t]))
    return s  # return the value so the sum isn't optimized away
    # print(s, " ")
end
# iterative function
function iterative_func()
    # create local data structure
    data = rand(10000, 2500)
    TS = create_data_structure(data)
    # launch function iteratively
    for t in 1:2500
        launch_one_t(TS, t)
    end
end
# parallel function
function func_parallel()
    # create local data structure
    data = rand(10000, 2500)
    TS = create_data_structure(data)
    # launch function in parallel
    @threads for t in 1:2500
        launch_one_t(TS, t)
    end
end

With the iterative version, using @benchmark iterative_func() I get this:

A problem with the benchmark is that it includes the construction of the input data, so that allocation and GC overhead are added to the timing. Ideally you’d isolate the different parts (e.g. the for loop) into their own functions and benchmark those.
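Concretely, something like this (a sketch with illustrative names; the `$` interpolation tells BenchmarkTools to treat the data as a local rather than a global):

```julia
using Base.Threads, BenchmarkTools

# The loop bodies live in their own functions and take the data
# as an argument, so @btime measures only the loops themselves.
function serial_sums(data)
    for t in axes(data, 2)
        sum(@view data[:, t])
    end
end

function threaded_sums(data)
    @threads for t in axes(data, 2)
        sum(@view data[:, t])
    end
end

data = rand(10_000, 2_500)
@btime serial_sums($data)      # $data: no global-variable overhead in the timing
@btime threaded_sums($data)
```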

It’s also limiting the possible improvement. rand(10000, 2500) takes ~7x longer than 2500 (serial) sums of 10000 values (on v1.8.5); in other words, the serial loop was ~1/8 of the runtime. By Amdahl’s law, a full 12x speedup of the loop (actually less, due to multithreading overhead) would give a 1/((1-1/8)+1/8/12) = 1.1294x speedup overall, approaching 1/(1-1/8) = 1.1429x at infinite cores.
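Amdahl’s law as a one-liner, reproducing the numbers above (p = parallelizable fraction of the runtime, n = speedup factor of that fraction):

```julia
# Amdahl's law: overall speedup when a fraction p of the runtime
# is sped up by a factor of n.
amdahl(p, n) = 1 / ((1 - p) + p / n)

amdahl(1/8, 12)   # ≈ 1.1294 (12 cores)
amdahl(1/8, Inf)  # ≈ 1.1429 (limit at infinite cores)
```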

Thanks a lot for the explanation. I benchmarked the functions with the different parts of the loop isolated, and it is now much more evident that there is a speed-up. Here are the results:

@btime iterative_func(data)
9.473 ms (0 allocations: 0 bytes)
@btime func_parallel(data)
6.161 ms (73 allocations: 6.52 KiB)