Julia pmap: how to have each worker write into a separate index with parallel computing

I have a function that runs a simulation exercise. The function is called "raposo" and its output is a 1×9 vector. I need to run it with parallel computing because it currently takes about 3 months to produce the output.
Without parallel computing the code runs fine (I show that output in my replies below).

When I use:

using addprocs(8)
praposo(N) = pmap(S1->raposo(S1), [N/nworkers() for i=1:nworkers()]);
praposo(50)

I get an error saying that one of the workers is trying to write into the wrong index.

I thought I could solve the problem by telling addprocs(8) to produce an array instead of a vector, but I do not know how to do that.

The function raposo takes a long time because of a loop that runs n simulations. In MATLAB I would just put "parfor" on that loop.

Welcome! What’s the exact error message? Your first line of code there (using addprocs(8)) isn’t valid Julia.

It’s unclear exactly what you want, but this trivial example might start a more helpful discussion:

using Distributed
addprocs(8)

@everywhere begin
    function raposo(S)
        println("Running raposo($S)")
        sleep(3)
        return S
    end
end

praposo(N) = pmap(S1 -> raposo(S1), 1:N)
#praposo(N) = pmap(raposo, 1:N)  # if you have only one argument, you don't need a closure

praposo(50)

Dear Matt,

Thank you so much for your quick reply.

That was my typo. I do not actually write "using addprocs(8)"; I type addprocs(8) directly at the prompt so as not to increase the number of workers too much. By the way, I also cannot control the actual number of workers: if I want to reduce it, I exit Julia and start again. That says it all about my knowledge of Julia, right?

The error that I get is shown at the end of this post. The output when I do not try to do parallel computing is the following (from 03-build_it.jl, which I attach to show that the output is already a vector, which I think makes it complicated for me to use pmap; the other alternatives seem even more complicated):

coverage = [0.96, 0.98, 0.98, 0.98, 1.0, 1.0, 0.99, 0.99, 0.99]
width = [23.04731271400833, 6.461724751556204, 4.06406475792698, 3.041395719635732, 2.661069159405758, 2.5227358264222484, 2.4306187249891464, 2.405408363074599, 2.4047765311097633]

9-element Vector{Float64}:
 23.04731271400833
 6.461724751556204
 4.06406475792698
 3.041395719635732
 2.661069159405758
 2.5227358264222484
 2.4306187249891464
 2.405408363074599
 2.4047765311097633

The idea is to do a simulation exercise, and for this I have three functions (one of them is raposo, which is my way of wrapping the simulations in a single function so that I could parallelize it).

I attach the dataset and my .jl program with my attempt at parallel computing (03-build_itae.jl). As you can see, I have many other alternatives in the code, but for now I will wait for your feedback.

dataset:

my program with the mistake:

This is the error I get:

julia> praposo(100)
From worker 6: s = 1.0
From worker 9: s = 1.0
From worker 5: s = 1.0
From worker 8: s = 1.0
From worker 7: s = 1.0
From worker 3: s = 1.0
From worker 4: s = 1.0
From worker 2: s = 1.0
ERROR: On worker 6:
ArgumentError: invalid index: 1.0 of type Float64
Stacktrace:
 [1] to_index
   @ ./indices.jl:300
 [2] to_index
   @ ./indices.jl:277
 [3] to_indices

Pedro

Dear Greg,

Thank you so much for your quick reply. Your solution works on its own, but my error comes from my function having more than one output; the error message and the non-parallel output are the same as in my reply above.


Pedro

Dear Greg,

When I use your solution, basically writing:

praposo(N) = pmap(S1 -> raposo(S1), 1:N)
#praposo(N) = pmap(raposo, 1:N)  # if you have only one argument, you don't need a closure

praposo(50)

I get this error (should each worker handle one loop iteration? I was expecting each worker to take one of the simulations. My function is basically a loop; it is easier to check the attached file, of course):

julia> include("/Volumes/Promise Pegasus/CLSBE Dropbox/pedro raposo/CLSBEPessoal/Chris/ScarringSatisfaction/Muris/code_new/03-build_itae.jl")
N = 100
From worker 4: N = 100
From worker 7: N = 100
From worker 8: N = 100
From worker 5: N = 100
From worker 6: N = 100
From worker 3: N = 100
From worker 9: N = 100
From worker 2: N = 100
From worker 11: N = 100
From worker 17: N = 100
From worker 15: N = 100
From worker 14: N = 100
From worker 16: N = 100
From worker 13: N = 100
From worker 10: N = 100
From worker 12: N = 100
0.000058 seconds (38 allocations: 2.640 KiB)
From worker 7: s = 1
From worker 9: s = 1
From worker 3: s = 1
From worker 5: s = 1
From worker 6: s = 1
From worker 8: s = 1
From worker 2: s = 1
From worker 4: s = 1
[… many similar "From worker n: s = …" lines omitted: every worker counts s upward independently, and several workers later start again from s = 1 …]
From worker 6: s = 11
From worker 13: s = 7
From worker 10: s = 5
From worker 7: s = 9
From worker 15: s = 10
From worker 14: s = 2
From worker 17: s = 10
ERROR: LoadError: On worker 6:
BoundsError: attempt to access 10×9 Matrix{Float64} at index [11, 1]

Hi Pedro — please check out the post "Please read: make it easier to help you" for some guidance on how to format and present your posts in a more readable manner. Importantly, it can be very useful to use "code fences" (blocks of code surrounded by triple backticks ```) to make your code more readable.

Your initial error —

ERROR: On worker 6:

ArgumentError: invalid index: 1.0 of type Float64
Stacktrace:
[1] to_index
@ ./indices.jl:300
[2] to_index
@ ./indices.jl:277
[3] to_indices

is not that a worker is trying to write into a "wrong" index, but rather that you're doing some floating-point arithmetic to compute an index and haven't converted the result to an integer. Note that N/nworkers() returns a floating-point number — I'm betting you want the integer division N ÷ nworkers() (i.e. div(N, nworkers())) instead.
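For illustration, here is a minimal sketch of that fix with a placeholder raposo (the real one is in the attached script), assuming each worker should run its own chunk of simulations:

using Distributed
addprocs(8)

@everywhere function raposo(nsim)   # placeholder: runs nsim simulations and returns one value each
    return [rand() for _ in 1:nsim]
end

# N / nworkers() is a Float64 (50 / 8 == 6.25) and cannot be used as a loop bound or index;
# N ÷ nworkers() is an Int (50 ÷ 8 == 6), though note it drops the remainder (8 × 6 = 48, not 50).
praposo(N) = pmap(raposo, [N ÷ nworkers() for _ in 1:nworkers()])

praposo(50)   # returns a vector with one result per worker chunk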

Thank you so much. I am totally new to these forums and I am very excited about the help I am getting, but I agree that I need to be clearer in how I post.

When I follow Greg’s suggestion

...
@time praposo(N) = pmap(S1 -> raposo(S1), 1:N)
praposo(50)
...

I get this error:

ERROR: LoadError: On worker 6:
BoundsError: attempt to access 10×9 Matrix{Float64} at index [11, 1]

Every worker seems to go through the same loop stages (the loop over 1:50), and I think that this might be the problem. Or am I wrong?

A second question: could I reduce the number of workers without exiting Julia? I would expect a command to set the total number of workers instead of only a command to add more.
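(For reference, a minimal sketch of shrinking the worker pool without restarting Julia, using the Distributed functions rmprocs and workers:)

using Distributed

addprocs(8)
nworkers()           # 8

rmprocs(workers())   # remove all current worker processes
addprocs(4)          # add a fresh set of 4 workers
nworkers()           # 4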

@Pedro_Raposo it is worth reading:

Having looked at your code in a bit more detail, I have to say I still don't fully get it; it is structured in a pretty confusing way. It does seem, though, like you are leaving a lot of performance on the table by not following the performance tips: things like accessing global variables from inside your functions, redefining functions in your loops (maybe? again, I don't fully follow the structure), excessive slicing of matrices without using views, etc.

I think your time might be better spent profiling/benchmarking and making sure your single-threaded, non-distributed code is actually efficient, rather than working out how to throw more compute at it.
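(To illustrate the point about globals, a small sketch, not taken from the attached script, contrasting a function that reads an untyped global with one that takes the data as an argument and uses views when slicing:)

data = randn(10_000, 9)

# Slow pattern: reads the untyped global `data`, so Julia cannot infer its type.
function colmeans_global()
    return [sum(data[:, j]) / size(data, 1) for j in 1:size(data, 2)]
end

# Faster pattern: the data is an argument, and @view avoids copying each column.
function colmeans(X::AbstractMatrix)
    return [sum(@view(X[:, j])) / size(X, 1) for j in 1:size(X, 2)]
end

colmeans(data)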


Indeed. You could in some cases get 10-1000x speedup by cleaning up your single-threaded code. And probably also get better parallel scaling on top of that. If your code is highly inefficient, it is possible that you won’t get much benefit from parallelism at all.


Nils

I am going to read your suggested pages/documents. When I was playing with the global variables it did not seem to help in terms of timing, but the document you shared is very detailed and seems helpful. I hope this helps… I will keep you posted.

That said, when I was timing the code, the main problem seemed to be the loop over the 500 simulations I want to run. In MATLAB this would be solved with parfor instead of for, but I cannot make the Julia equivalent work, which is why I asked for help with parallel computing in the first place; it has been frustrating, and any help would be great. With the suggestions above I am still not able to get the final output.
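(A minimal sketch of the parfor-style pattern being discussed here, distributing the simulation loop with pmap and stacking the per-simulation results; one_simulation is a placeholder, not the code in 03-build_itae.jl:)

using Distributed, Statistics
addprocs(8)

@everywhere function one_simulation(s)
    # placeholder for one simulation draw; the real version would estimate
    # the model and return its 9 statistics for draw s
    return s .* ones(9)
end

function run_simulations(nsim)
    rows = pmap(one_simulation, 1:nsim)       # one task per simulation, like parfor
    return reduce(vcat, permutedims.(rows))   # nsim×9 matrix, one row per draw
end

results = run_simulations(500)
summary_stats = vec(mean(results, dims = 1))  # 9-element vector of column means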

Thanks Julio. I will read this and Nils's documents below, and I will give you feedback.

Just running

using BenchmarkTools
@btime θ̂_mrv($Y₀, $Y₁, $Y₂, $Y₃, $D₀, $D₁, $D₂, $D₃, $X₁, $X₂, $X₃, $Xd₂, $Xd₃, $Xc₂, $Xc₃);

I get

330.067 ms (1418404 allocations: 940.62 MiB)

I haven't profiled this, and you've got an optimization in there with autodiff, so I assume you won't get some allocation-free version of this, but 1.4 million allocations tell you there is likely a lot of room for improvement.
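(As an illustration of where allocation counts like that often come from, a small sketch, unrelated to the actual θ̂_mrv code, comparing a copying slice with a view:)

using BenchmarkTools

X = randn(1_000, 9)

sumcols_slice(A) = sum(sum(A[:, j]) for j in 1:size(A, 2))          # A[:, j] copies each column
sumcols_view(A)  = sum(sum(view(A, :, j)) for j in 1:size(A, 2))    # view avoids the copies

@btime sumcols_slice($X)   # allocates one temporary array per column
@btime sumcols_view($X)    # essentially allocation-free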

Thank you so much for running this and giving this feedback, but this might be too advanced for my programming knowledge. What is your hint to reduce those 1.4 million allocations? Do you think I could use sparse matrices? Do you have any specific hint?


It's exactly what I said before: make sure you don't access untyped globals, don't slice arrays without views, and generally profile your function to work out where time is spent.

Writing fast code in Julia is no black magic, but it does require a bit of upfront investment into understanding the basics of the language. The performance tips I linked above are a great starting point, and people on here are generally super helpful when it comes to fixing specific performance issues you might get stuck with.
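(A minimal sketch of that profiling workflow with the standard Profile library and a placeholder function:)

using Profile

function work()               # placeholder for the function being tuned
    X = randn(2_000, 2_000)
    return sum(X * X)
end

work()                        # run once so compilation time is not profiled
@profile work()               # collect samples while the function runs
Profile.print(format = :flat) # report where the samples landed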