Pmap: does it copy or share arguments across processes?

#1
arg_array = []  # assumed initialization; not shown in the original post
for (name, security) in dataset.securities
    push!(arg_array, (name, data[name]))
end
tuples = pmap(extract_variables, arg_array)

In the above code, data[name] is a Dict{Int64, Float32} that contains a lot of data.

This document https://docs.julialang.org/en/stable/manual/parallel-computing is a very hard read, and it makes me think that arg_array will always be copied to a new process that is running extract_variables unless I resort to something such as Shared Arrays.

If that is true, how can I make data[name] shared across all processes in pmap, so that it is not copied at a cost to efficiency?


#2

Yes, you read it correctly: data is not shared by pmap or by @parallel for loops. You also found one of the answers: SharedArrays. Another is DistributedArrays.jl. Finally, you can try Threads.@threads, which does share data.
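To make the threading option concrete, here is a minimal sketch, assuming a recent Julia 1.x release. The names `data`, `names_list`, and `total_of` are hypothetical stand-ins for the poster's data and extract_variables. Every thread reads the same Dict in place, and nothing is copied:

```julia
# Hypothetical stand-in for the large per-security dictionaries in the question.
data = Dict(name => Dict{Int64, Float32}(i => Float32(i) for i in 1:1000)
            for name in ["A", "B", "C"])
names_list = collect(keys(data))
results = Vector{Float32}(undef, length(names_list))

# Hypothetical stand-in for extract_variables: it only reads its argument.
total_of(d) = sum(values(d))

Threads.@threads for i in eachindex(names_list)
    # Each iteration reads the shared `data` and writes only to its own slot.
    results[i] = total_of(data[names_list[i]])
end
```

This is only safe because no thread modifies `data` while the loop runs. Concurrent writes to a Dict would need a lock.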


#3

Thank you. No wonder pmap was never faster than map in any of my benchmarks.
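The copying can be observed directly. The following is a small sketch, assuming a recent Julia 1.x release with the Distributed standard library; `bump!` and `original` are hypothetical names. The worker mutates the Dict it receives, but the Dict on the calling process stays unchanged, because the worker only ever saw a serialized copy:

```julia
using Distributed
addprocs(1)  # one worker process, so pmap runs the function remotely

# Hypothetical function that mutates its argument and reports the new value.
@everywhere function bump!(d)
    d[1] = 99f0
    return d[1]
end

original = Dict{Int64, Float32}(1 => 1f0)
returned = pmap(bump!, [original])

# `returned` holds the value seen on the worker; `original` was not touched.
println("worker saw: ", returned[1], ", caller still has: ", original[1])
```

For a large Dict, that serialization and transfer happens for every element of the argument array, which can easily outweigh the work done per element.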


#4

And unfortunately neither SharedArrays nor DistributedArrays.jl will work with Dict.
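One possible workaround, sketched here under the assumption that the keys and values are plain bits types (as with Dict{Int64, Float32}) and that all workers run on the same machine, is to flatten the Dict into two SharedArrays. The names `ks`, `vs`, `chunk_sum`, and `shared_sum` are hypothetical, and the code assumes a recent Julia 1.x release:

```julia
using Distributed
addprocs(2)
@everywhere using SharedArrays

d = Dict{Int64, Float32}(i => Float32(2i) for i in 1:100)

# Copy keys and values once into shared memory; workers map the same segment.
ks = SharedArray{Int64}(length(d))
vs = SharedArray{Float32}(length(d))
for (i, (k, v)) in enumerate(d)
    ks[i] = k
    vs[i] = v
end

# Hypothetical per-chunk work: sum one index range of the shared values.
@everywhere chunk_sum(vals, r) = sum(vals[r])

# Only the SharedArray handle and a small range travel to each worker.
shared_sum(vals, ranges) = pmap(r -> chunk_sum(vals, r), ranges)

partials = shared_sum(vs, [1:50, 51:100])
total = sum(partials)
```

The trade-off is that lookup by key is lost: a worker has to search `ks` (or rely on a known ordering) to find the position of a given key.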


#5

DistributedArrays can now hold any type of data with Julia>v0.6.


#6

After reading the documentation, it appears that Threads.@threads is what I am looking for as it does not spawn a new process and can share memory.

len = length(arg_array)
secs = Array{Data.Security, 1}(len)
Threads.@threads for i = 1:len
    secs[i] = read_csv_and_init(arg_array[i])
end

I am getting Bus error: 10 from the code above.

It is interesting because without Threads.@threads this code runs fine.

read_csv_and_init has no side effects either: it neither modifies its input arguments nor accesses global variables.

I am on macOS High Sierra.


#7

Reading files in parallel is probably not the best idea. That is usually limited by disk access, not CPU. I don’t know whether it could also cause the error.


#8

This code gets Bus error: 10:

function test(filename)
    readdlm(string("data/yahoo/", filename), ',')
end

Threads.@threads for filename = readdir("data/yahoo/")
    test(filename)
end

This code, by contrast, works fine:

function test(filename)
    readdlm(string("data/yahoo/", "2S.csv"), ',')
end

Threads.@threads for filename in readdir("data/yahoo/")
    test(filename)
end

It appears that the error occurs when readdlm is given the loop variable filename, whereas the string literal “2S.csv” works fine.


#9

I don’t believe IO is thread-safe yet (ever?). Printing in threads will do the same.
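One way to sidestep the question is to keep all file I/O on a single thread and parallelize only the parsing. The sketch below assumes a recent Julia 1.x release. The temporary files and the `parse_sum` helper are made up for illustration and stand in for the CSV files and readdlm call above:

```julia
# Create a few small hypothetical CSV files in a temporary directory.
dir = mktempdir()
for (i, name) in enumerate(["a.csv", "b.csv", "c.csv"])
    write(joinpath(dir, name), "1,2,3\n4,5,$i\n")
end
files = sort(readdir(dir))

# Step 1: serial I/O. Every file is read on the main thread.
raw = [read(joinpath(dir, f), String) for f in files]

# Step 2: threaded, CPU-only work on the in-memory strings.
# Hypothetical parser: adds up every integer field in the text.
parse_sum(text) = sum(parse(Int, s) for line in split(strip(text), '\n')
                                    for s in split(line, ','))

sums = Vector{Int}(undef, length(raw))
Threads.@threads for i in eachindex(raw)
    sums[i] = parse_sum(raw[i])  # no I/O inside the threaded loop
end
```

Since the reads are likely disk-bound anyway, doing them serially should cost little, and the threaded loop then touches only memory that each iteration owns.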


#10

For printing, Core.println works.