I am trying to write a parallel image cross-correlation algorithm, and I am not sure why pmap does not give me the speedup I expected.
Basically, I have two images (im1 and im2). I take smaller slices (21x21) of both at the same location and compare them by cross-correlation; each pair of slices yields a cross-correlation matrix.
This is my code:
@everywhere using ImageFiltering  # provides padarray and Fill

# takes both images and a tuple with the coordinates where the slices are taken
@everywhere function crossCorr_par(im1::SharedArray{Float64},
                                   im2::SharedArray{Float64},
                                   t::Tuple{Int64, Int64})
    a, b = t[1], t[2]  # the tuple distributed by pmap: coordinates of the slices in the large images
    s = 20             # slice extent, so each slice is (s+1)x(s+1) = 21x21
    # create local (per-worker) arrays "image" and "template" (of Array type) as slices from the bigger shared arrays
    image    = Array(im1[a:a+s, b:b+s])
    template = Array(im2[a:a+s, b:b+s])
    # zero-pad "image" for the cross-correlation and create a local "out" array
    # to hold the result of the cross-correlation function (the cross-correlation matrix)
    image_pad = padarray(image, Fill(0.0, (size(template)[1]-1, size(template)[2]-1)))
    image_pad = image_pad[indices(image_pad)[1], indices(image_pad)[2]]::Array{Float64, 2}
    outsize = tuple([size(image)...] + [size(template)...] - 1...)
    out = zeros(Float64, outsize)
    # the actual calculation happens here; all arguments are local to the respective worker:
    # slide the template over the image and compute the cross-correlation coefficient for each displacement
    e = CrossCorrelation(out, image_pad, template, outsize)
    return e
end
@everywhere function CrossCorrelation(out::Array{Float64, 2}, image::Array{Float64, 2},
                                      template::Array{Float64, 2}, outsize::Tuple{Int64, Int64})
    @inbounds begin
        for i2 = 1:outsize[2], i1 = 1:outsize[1]  # looping through the rows of a column first!
            s = 0.0
            for y = 1:size(template)[2], x = 1:size(template)[1]  # a lot of looping, then accessing individual items of the two arrays
                @fastmath s += template[x, y] * image[i1+x-1, i2+y-1]  # the actual calculation
            end
            out[i1, i2] += s  # writes to the local array on the respective worker
        end
    end
    return out
end
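To make the size contract explicit (my own toy check, not part of the original code): with a 2x2 template and a 3x3 outsize, the pre-padded image must have size outsize .+ size(template) .- 1 = 4x4.

template = ones(Float64, 2, 2)
image = zeros(Float64, 4, 4)
image[2:3, 2:3] .= 1.0  # a single bright 2x2 blob
out = CrossCorrelation(zeros(3, 3), image, template, (3, 3))
out[2, 2]  # 4.0: the template exactly overlaps the blob at this displacement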
n = 100 # the image size
im1 = rand(1.0:2.0, n, n) # this is supposed to be the original image
im2 = copy(im1); # This is the template
im1_shared = SharedArray(im1); # these are for the parallel version
im2_shared = SharedArray(im2);
# A collection of tuples which contain the coordinates of the small slices of the images im1 and im2
iterations = [(1, 1)  (1, 22)  (1, 43)  (1, 64)
              (22, 1) (22, 22) (22, 43) (22, 64)
              (43, 1) (43, 22) (43, 43) (43, 64)
              (64, 1) (64, 22) (64, 43) (64, 64)]  # this is for the n = 100 image size
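For larger images I would generate these coordinates rather than hardcode them; a comprehension like the following (slice_coords is my own hypothetical helper, assuming 21x21 slices on a 21-pixel grid) reproduces the matrix above for n = 100:

# hypothetical helper: slice origins on a grid with stride s+1,
# so that every slice (a:a+s, b:b+s) stays inside the n-by-n image
slice_coords(n, s=20) = [(a, b) for a in 1:s+1:n-s, b in 1:s+1:n-s]
iterations = slice_coords(100)  # 4x4 Matrix of (row, col) tuples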
@time c = map(x -> crossCorr(im1, im2, x), iterations);
@time c_par = pmap(x -> crossCorr_par(im1_shared, im2_shared, x), iterations);
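A side note on the timing itself: if these are first calls, @time also measures JIT compilation. A sketch of how the comparison could be done with BenchmarkTools instead (not what I originally ran):

using BenchmarkTools
@btime map(x -> crossCorr(im1, im2, x), iterations);
@btime pmap(x -> crossCorr_par(im1_shared, im2_shared, x), iterations);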
This is the output with image size n = 100:
0.092365 seconds (13.11 k allocations: 5.595 MiB, 4.79% gc time)
0.046134 seconds (19.15 k allocations: 1.699 MiB)
pmap is almost twice as fast as map. Now, if I take a larger image/array (n = 200):
0.374411 seconds (32.14 k allocations: 53.488 MiB, 2.05% gc time)
1.361117 seconds (101.23 k allocations: 12.442 MiB, 0.23% gc time)
or n = 500:
1.030963 seconds (71.29 k allocations: 152.065 MiB, 13.02% gc time)
9.571784 seconds (269.64 k allocations: 34.470 MiB, 0.07% gc time)
Note: the serial version crossCorr() is written for Arrays instead of SharedArrays; everything else is identical.
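In other words, the serial version looks like this (a reconstructed sketch, mirroring crossCorr_par as just described):

function crossCorr(im1::Array{Float64},
                   im2::Array{Float64},
                   t::Tuple{Int64, Int64})
    a, b = t[1], t[2]
    s = 20
    image    = im1[a:a+s, b:b+s]  # plain-Array slices, no SharedArrays involved
    template = im2[a:a+s, b:b+s]
    image_pad = padarray(image, Fill(0.0, (size(template)[1]-1, size(template)[2]-1)))
    image_pad = image_pad[indices(image_pad)[1], indices(image_pad)[2]]::Array{Float64, 2}
    outsize = tuple([size(image)...] + [size(template)...] - 1...)
    out = zeros(Float64, outsize)
    return CrossCorrelation(out, image_pad, template, outsize)
end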
So the larger the images, the more time the 4 workers need to process the slices, compared to a single core dealing with all chunks sequentially. Some thoughts on this:
- I have read about the overhead problem: if I understand it correctly, the task on each worker has to be fairly large, otherwise managing and distributing the tasks adds a lot to the total completion time. This might be the problem here, but how large or computationally expensive does a task have to be before pmap shows a speedup? The actual calculations in my code are just sums and multiplications… (see the quick check after this list)
- The images im1 and im2 are of SharedArray type in the parallel version, so they are available to all workers and there is no transfer of data between workers, at least none obvious to me. The functions are also available everywhere, thanks to @everywhere.
- I hope the code is type stable. I do convert SharedArray slices to Arrays, but I hope that this is not the bottleneck. The CrossCorrelation function is written for the Array type.
- Since Julia stores arrays column-major, I wrote the code to access one column, loop over all row entries in it, and then proceed to the next column, though I am not sure this is optimal for two-dimensional arrays.
- Each worker takes slices from the shared arrays and processes them independently. This should be a perfect example of an algorithm that can be executed in parallel, right?
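To test the overhead point from the first bullet, a quick check (my own sketch, not part of the algorithm) is to pmap a do-nothing function over the same collection; whatever time that takes is pure scheduling and communication cost:

@everywhere donothing(t::Tuple{Int64, Int64}) = t  # no work at all
@time pmap(donothing, iterations);  # measures only pmap's task-management overhead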
Could you please help me find the error in my understanding of writing parallel code, if there is one? Thanks…