Reduced performance for parallel loops in larger code?

I’ve prepared some code that works pretty well in isolation. A minimal working example is below; I’m basically copying values from a set of input arrays into another set of output arrays.

Edit: I’m going to simplify this question to: is there anything I should be doing or thinking about with regard to parallelization in this example? Are repeated calls to the parallel segment something that should generate overhead, and if so, is there anything to do about it? (A sketch of one alternative appears after the benchmarks below.)

Here’s the MWE

using Random, BenchmarkTools

totM = 10 #number of original arrays
xsize = 2000 #size of input arrays
ysize = 2000

a = [rand(1:xsize) for i = 1:totM] #random dimensions for the original arrays
b = [rand(1:ysize) for i = 1:totM]

input = [rand(a[i],b[i]) for i = 1:totM]

sa = shuffle(a) #shuffling them for output arrays
sb = shuffle(b)

output = [zeros(sa[i],sb[i]) for i = 1:totM]

destinations = [[[w,i,j] for i = 1:sa[w] for j = 1:sb[w]] for w = 1:totM]#could also be a vector
coords = shuffle(vcat(destinations...)) #rearranges coordinates to copy "input" to "output"

totEl = [sum(w->length(output[w]),1:length(output))] #total elements

storeC = Array{Array{Array{Int64,1},1},1}(undef,totM)
for g = 1:totM
  offset = 0
  for w = 1:g-1
    offset += length(input[w])
  end
  start = max(offset,1)
  endpt = offset + length(input[g])
  stop = min(endpt,totEl[1])
  storeC[g] = coords[start:stop] #generates a list of where data should be transferred to
end

maxM = totM - sum([length(storeC[g]) > 0 for g = 1:length(storeC)])

function serial(totM::Int64,storeC::Array{Array{Array{Int64,1},1},1},output::Array{Array{Float64,2},1},input::Array{Array{Float64,2},1})
  #=Threads.@threads=# for g = 1:totM
    dims = size(input[g])
    for y = 1:dims[2]
      for x = 1:dims[1]

        z = x + dims[1] * (y-1)

        if z < length(storeC[g])
          finalz = storeC[g][z][1]
          finalx = storeC[g][z][2]
          finaly = storeC[g][z][3]

          output[finalz][finalx,finaly] = input[g][x,y]
        end
      end
    end
  end
  nothing
end

function parallely(totM::Int64,storeC::Array{Array{Array{Int64,1},1},1},output::Array{Array{Float64,2},1},input::Array{Array{Float64,2},1})
  for g = 1:totM
    dims = size(input[g])
    Threads.@threads for y = 1:dims[2]
      for x = 1:dims[1]

        z = x + dims[1] * (y-1)

        if z < length(storeC[g])
          finalz = storeC[g][z][1]
          finalx = storeC[g][z][2]
          finaly = storeC[g][z][3]

          output[finalz][finalx,finaly] = input[g][x,y]
        end
      end
    end
  end
  nothing
end

@btime serial(totM,storeC,output,input)   # 2.382 s (0 allocations: 0 bytes)
@btime parallely(totM,storeC,output,input) # 666.258 ms (213 allocations: 16.66 KiB)
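
A sketch of one way to avoid re-entering the threaded loop for every g (not benchmarked; taskparallel is just an illustrative name): spawn one task per block with Threads.@spawn and wait for them with @sync, so the parallel region is only entered once per call.

function taskparallel(totM,storeC,output,input)
  @sync for g = 1:totM
    Threads.@spawn begin
      dims = size(input[g])
      for y = 1:dims[2]
        for x = 1:dims[1]
          z = x + dims[1] * (y-1)
          if z < length(storeC[g])
            finalz = storeC[g][z][1]
            finalx = storeC[g][z][2]
            finaly = storeC[g][z][3]
            output[finalz][finalx,finaly] = input[g][x,y]
          end
        end
      end
    end
  end
  nothing
end

This only balances well if the blocks are of roughly comparable size, since each block becomes a single task.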

Before I ask some other questions, how many threads are you starting Julia with? Because you got an almost perfect 4x speedup.

Yes, I was very happy with that. I have exactly 4.
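
For reference, the thread count can be checked from inside the session; Julia is started with julia -t 4 (or with the JULIA_NUM_THREADS environment variable set before launch):

Threads.nthreads() # returns 4 when Julia was started with four threads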

However, in the full example for the real code, I don’t get a factor of 4 speedup with 4 cores. It’s more like a factor of 2, but the input blocks aren’t of uniform size and there’s a lot more processing before getting to this last step. I also query a few more vectors to get the information here.

Well, if your problem is embarrassingly parallel like your MWE and you still don’t get a speedup near 4x, you might be hitting some other bottleneck, e.g. memory bandwidth. Does your code allocate a lot? GC can also be a bottleneck.

I took great care with that. The individual function in my largest example uses about 1k allocations. I figured this was pretty low, considering that some earlier implementations created a new array at each step and cost millions of allocations.

So, I think I’ve minimized the allocations here. The only thing I can think of is that several threads pulling data from a single vector might cause a bottleneck?

I would then profile your code to see where your bottlenecks are and see if there is some extra performance you can find somewhere.
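
For example, a minimal pass with the Profile standard library (run the function once first so compilation time is not profiled):

using Profile

parallely(totM,storeC,output,input)          # warm up / compile
@profile parallely(totM,storeC,output,input) # collect samples
Profile.print()                              # text report; ProfileView.jl can render a flame graph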

Are those expected? Ideally, in performance-critical code you should not allocate anything, or only allocate when it is completely deliberate.

Great point. I believe that I am only allocating as many items as necessary. I have to create a few arrays. Storing vectors for reuse has always proven less efficient.
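
To double-check, the allocation of a single call can be measured directly, e.g. (after a warm-up call so compilation is not counted):

parallely(totM,storeC,output,input)            # warm up / compile
@allocated parallely(totM,storeC,output,input) # bytes allocated by one call

Starting Julia with --track-allocation=user also gives per-line allocation counts in .mem files.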

If your actual problem has these arrays of small vectors, like here, you will probably do much better with StaticArrays (not necessarily related to the parallelization issue, which might be addressed with other parallelization strategies, like that of FLoops.jl). In general it is hard to offer much advice without a working example of the problem.
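
For the coordinate triples in the MWE that could look something like this (a sketch against the vector-of-vectors storeC above; storeC_static is just an illustrative name):

using StaticArrays

# each length-3 coordinate vector becomes an SVector, removing one heap allocation
# and one pointer indirection per triple
storeC_static = [SVector{3,Int}.(storeC[g]) for g = 1:totM]

storeC_static[1][1] # an SVector{3,Int64}; indexes and destructures like the original vectors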

You can also use ThreadPools.jl (https://github.com/tro3/ThreadPools.jl, docs at https://tro3.github.io/ThreadPools.jl) if you have an uneven load; it can balance the workload per thread dynamically.
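
A sketch of what that could look like for the MWE, assuming ThreadPools' @qthreads macro (queued scheduling; qparallely is just an illustrative name, and this is not benchmarked):

using ThreadPools

function qparallely(totM,storeC,output,input)
  ThreadPools.@qthreads for g = 1:totM
    dims = size(input[g])
    for y = 1:dims[2]
      for x = 1:dims[1]
        z = x + dims[1] * (y-1)
        if z < length(storeC[g])
          finalz = storeC[g][z][1]
          finalx = storeC[g][z][2]
          finaly = storeC[g][z][3]
          output[finalz][finalx,finaly] = input[g][x,y]
        end
      end
    end
  end
  nothing
end

Here a slow block no longer pins one thread, because iterations are handed out from a queue as threads become free.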

I posted the MWE above, but I had a few thoughts after posting about what could be going on. In the full code, I stored the data as a series of matrices instead of vectors. This is better for allocations (there are fewer of them), and it turns out to be faster too.

The storeC-building loop is replaced with the following (serial and parallely are adapted to index the matrices; only parallely is shown):

storeC = Array{Array{Int64,2},1}(undef,totM)
for g = 1:totM
  offset = 0
  for w = 1:g-1
    offset += length(input[w])
  end
  start = max(offset,1)
  endpt = offset + length(input[g])
  stop = min(endpt,totEl[1])
  temp = coords[start:stop] #generates a list of where data should be transferred to
  arraylocs = Array{Int64,2}(undef,3,length(temp))
  for p = 1:length(temp)
    for r = 1:3
      arraylocs[r,p] = temp[p][r]
    end
  end
  storeC[g] = arraylocs
end

function parallely(totM::Int64,storeC::Array{Array{Int64,2},1},output::Array{Array{Float64,2},1},input::Array{Array{Float64,2},1})
  for g = 1:totM
    dims = size(input[g])
    Threads.@threads for y = 1:dims[2]
      storez = dims[1] * (y-1)
      for x = 1:dims[1]

        z = x + storez
        finalz = storeC[g][1,z]
        finalx = storeC[g][2,z]
        finaly = storeC[g][3,z]
    
        output[finalz][finalx,finaly] = input[g][x,y]
      end
    end
  end
  nothing
end

@btime serial(totM,storeC,output,input) #  96.642 ms (0 allocations: 0 bytes)
@btime parallely(totM,storeC,output,input) #  67.932 ms (216 allocations: 16.75 KiB)

So, the time goes down, but the parallel version is not as large an improvement over the serial computation when the information is stored in matrices instead of a series of vectors. I was expecting the parallel computation to still be roughly a factor of 4 faster, but this makes sense.

However, that doesn’t solve the original problem. I did try the StaticArrays package and also looked at the ThreadPools package (nice!). I think I’ll give the “solution” tag to StaticArrays, since it probably fixes the issue I was facing: I can just pre-allocate the necessary arrays. Thank you to everyone who answered! I really appreciate it!