Slow parallel for loop


#1

Hi everybody,

I have a problem with a parallelized for loop: it runs extremely slowly, even slower than the serial version. The loop is as follows:

  N = 20
  d_u = 0.5*rand(N)
  d_v = 0.5*rand(N)
  cost_update = SharedArray(Float64, N)

  @parallel for k in eachindex(d_u)
    a = p_1     # dynamic lower bound
    b = p_2     # dynamic upper bound
    x_1 = a + (1-gr)*(b-a)
    x_2 = a + gr*(b-a)

    # run golden section search
    cost = zeros(2,1)
    while norm(b-a) > tol

      # compute cost for upper and lower bounds
      lambda_12 = [norm(x_1-p_1) / l_0;
                   norm(x_2-p_1) / l_0]
      theta_gnd = [atan2(x_1[2]-p_i[2], x_1[1]-p_i[1]);   # backward propagation
                   atan2(x_2[2]-p_i[2], x_2[1]-p_i[1])]
      x_gnd = [x_1[1]; x_2[1]] - vertex_min[1] + 1
      y_gnd = [x_1[2]; x_2[2]] - vertex_min[2] + 1
      cost  = [norm(x_1-p_i); norm(x_2-p_i)] .*
              cost_profile_wind(x_wf,  y_wf,  u_wf,  v_wf, d_u[k], d_v[k],
                                x_gnd, y_gnd, V_gnd, theta_gnd) +
              lambda_12.*u_2 + (1-lambda_12).*u_1

      # update upper or lower bounds as necessary
      if cost[1] < cost[2]
        b = x_2
        x_2 = x_1
        x_1 = a + (1-gr)*(b-a)
      else
        a = x_1
        x_1 = x_2
        x_2 = a + gr*(b-a)
      end

    end

    # update cost to return
    cost_update[k] = mean(cost)

  end

The variables x_wf, y_wf, u_wf, v_wf define a wind vector field. Inside the for loop there is a while loop, which calls a function cost_profile_wind() that computes the cost due to wind.

Can anybody help me get it to run faster? I can provide more code if needed.

Thanks!


#2

The manual (under “Parallel Computing”) says:

Any variables used inside the parallel loop will be copied and broadcast to each process.

This would seem to apply to the (presumably large) arrays x_wf, y_wf, u_wf and v_wf. Putting them in shared arrays may help, especially if your worker processes share memory.
Could someone tell us (or better yet, point to a practical way of determining) how often the copy/broadcast operation occurs? It may be once per iteration of the while loop, since the compiler cannot know whether cost_profile_wind modifies them.
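
To make the shared-array suggestion concrete, here is a minimal sketch rather than a drop-in fix. It is written for Julia 1.x, where `@parallel` is spelled `@distributed` and `SharedArray(Float64, N)` is spelled `SharedArray{Float64}(N)`. The function `wind_cost`, the 200x200 field size and the worker count are all made up for illustration; `wind_cost` only stands in for cost_profile_wind, whose code was not posted.

```julia
using Distributed, SharedArrays

# Start local workers if none exist yet (the count of 2 is arbitrary).
nprocs() == 1 && addprocs(2)

# Hypothetical stand-in for cost_profile_wind(): it only reads the wind field.
@everywhere function wind_cost(u_wf, v_wf, du, dv)
    return sum(x -> abs2(x + du), u_wf) + sum(x -> abs2(x + dv), v_wf)
end

N = 20
d_u = 0.5 * rand(N)
d_v = 0.5 * rand(N)

# Large read-only inputs go into SharedArrays, so local workers map the same
# memory instead of each receiving a serialized copy of the data.
u_wf = SharedArray{Float64}(200, 200)
v_wf = SharedArray{Float64}(200, 200)
u_wf .= rand(200, 200)
v_wf .= rand(200, 200)

# Output array that every worker can write into.
cost_update = SharedArray{Float64}(N)

# @sync blocks until all workers have finished. Without it the loop returns
# immediately and cost_update may still be incomplete when it is read.
@sync @distributed for k in 1:N
    cost_update[k] = wind_cost(u_wf, v_wf, d_u[k], d_v[k])
end
```

Note that with only N = 20 iterations, the fixed cost of dispatching work to the workers may outweigh any gain unless each iteration is expensive, so it is worth timing both the serial and the parallel version with `@time` on the real problem.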