Slow parallel for loop


Hi everybody,

I have a problem with parallelizing a for loop. The parallel version is extremely slow, even slower than the serial one. The loop is as follows:

  N = 20
  d_u = 0.5*rand(N)
  d_v = 0.5*rand(N)
  cost_update = SharedArray(Float64, N)

@parallel for k in eachindex(d_u)
    a = p_1     # dynamic lower bound
    b = p_2     # dynamic upper bound
    x_1 = a + (1-gr)*(b-a)
    x_2 = a + gr*(b-a)

    # run golden section search
    cost = zeros(2,1)
    while norm(b-a) > tol

      # compute cost for upper and lower bounds
      lambda_12 = [norm(x_1-p_1) / l_0;
                   norm(x_2-p_1) / l_0]
      theta_gnd = [atan2(x_1[2]-p_i[2], x_1[1]-p_i[1]);   # backward propagation
                   atan2(x_2[2]-p_i[2], x_2[1]-p_i[1])]
      x_gnd = [x_1[1]; x_2[1]] - vertex_min[1] + 1
      y_gnd = [x_1[2]; x_2[2]] - vertex_min[2] + 1
      cost  = [norm(x_1-p_i); norm(x_2-p_i)] .*
              cost_profile_wind(x_wf,  y_wf,  u_wf,  v_wf, d_u[k], d_v[k],
                                x_gnd, y_gnd, V_gnd, theta_gnd) +
              lambda_12.*u_2 + (1-lambda_12).*u_1

      # update upper or lower bounds as necessary
      if cost[1] < cost[2]
        b = x_2
        x_2 = x_1
        x_1 = a + (1-gr)*(b-a)
      else
        a = x_1
        x_1 = x_2
        x_2 = a + gr*(b-a)
      end
    end

    # update cost to return
    cost_update[k] = mean(cost)
end


The variables x_wf, y_wf, u_wf, v_wf define a vector field of wind. Inside the for loop there is a while loop, which calls a function cost_profile_wind() that computes a cost due to wind.

Can anybody help me get it to run faster? If more code is needed, I can provide it.



The manual (under “Parallel Computing”) says:

“Any variables used inside the parallel loop will be copied and broadcast to each process.”

This would seem to apply to the (presumably large) arrays x_wf, y_wf, u_wf and v_wf. Putting them in shared arrays may help, especially if your worker processes share memory.
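
A minimal sketch of the shared-array idea, under some assumptions: it uses the current Julia 1.x spelling (`using Distributed`, `@distributed`, `SharedArray{Float64}(dims...)`) rather than the older `@parallel` syntax in the question, it starts two local workers, and the array size, the helper `fill_costs!` and the loop body are placeholders, not the real `cost_profile_wind` computation. The point is only that `u_wf` lives in shared memory, so workers on the same machine map it instead of receiving a full copy.

```julia
using Distributed
addprocs(2)                       # assumption: two workers on the same machine
@everywhere using SharedArrays

# Hypothetical stand-in for one component of the wind field (e.g. u_wf).
u_wf = SharedArray{Float64}(500, 500)
fill!(u_wf, 1.0)

N = 20
d_u = 0.5 .* rand(N)
cost_update = SharedArray{Float64}(N)

# Placeholder loop body; the real code would run the golden section search
# and call cost_profile_wind here.
function fill_costs!(cost_update, d_u, u_wf)
    # @sync waits until every worker has finished its share of the loop.
    @sync @distributed for k in eachindex(d_u)
        cost_update[k] = d_u[k] * sum(u_wf)
    end
    return cost_update
end

fill_costs!(cost_update, d_u, u_wf)
```

Small inputs such as `d_u` are still copied to each worker, which should be cheap; only the large field arrays are worth moving into shared memory.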
Could someone tell us (or better yet, point to a practical way of determining) how often the copy/broadcast operation occurs? It may be once per while iteration; after all, the compiler doesn’t know whether the arrays are modified by the cost_profile_wind function.
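
One rough way to probe this, sketched under assumptions (Julia 1.x with `@distributed`, two local workers, and a hypothetical helper `copy_ids` that is not part of any library): have every iteration report the worker it ran on and the `objectid` of the array it sees. A fresh copy arriving on a worker is a new object, so the number of distinct ids per worker gives an indication of how many copies that worker received. Treat the result as a hint only, since an id could in principle be reused after garbage collection.

```julia
using Distributed
addprocs(2)                       # assumption: two local workers

# Hypothetical helper: for every iteration, return the worker id and the
# identity of that worker's copy of `field`, gathered with vcat.
function copy_ids(field, N)
    @distributed (vcat) for k in 1:N
        [(myid(), objectid(field))]
    end
end

field = rand(1000, 1000)          # stand-in for x_wf, u_wf, ...
ids = copy_ids(field, 20)

# Number of distinct copies of `field` seen by each worker.
copies_per_worker = Dict(
    w => length(unique(id for (p, id) in ids if p == w))
    for w in unique(first.(ids))
)
println(copies_per_worker)
```

If each worker reports far fewer distinct ids than iterations, the array is not being re-sent on every iteration, and the slowdown is more likely in the initial transfer or in the loop body itself.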