Parallel merge in Cilk+ vs Julia

As my first Julia program, I had some fun debugging a Julia implementation of a parallel merge that I had already developed in Cilk+. I thought I’d make my first post by pointing out the following, as a warning to others developing divide-and-conquer algorithms with @spawn.

Here’s my Cilk+ function:

// parallel merge of sorted sub-arrays i ∈ [low1..up1) and i ∈ [low2..up2) of a into b
// starting at index start
void parMerge(int *a, int low1, int up1, int low2, int up2, int *b, int start){
    int k1 = up1 - low1;
    int k2 = up2 - low2;
    int mid1, mid2;
    // sequential cutoff
    if(k1+k2 < 100000)
        // usual sequential non-recursive merge
        sequentialMerge(a, low1, up1, low2, up2, b, start);
    else{
        if(k1 >= k2){
            mid1 = (low1+up1-1)/2;
            mid2 = binarySearch(a, low2, up2, mid1);
        } else{
            mid2 = (low2+up2-1)/2;
            mid1 = binarySearch(a, low1, up1, mid2)-1;
            mid2++;
        }
        cilk_spawn parMerge(a, low1, mid1+1, low2, mid2, b, start);
        start = start + mid1 - low1 + 1 + mid2 - low2;
        parMerge(a, mid1+1, up1, mid2, up2, b, start);
        cilk_sync;
    }
}

Note that I’ve updated start after the first recursive call to parMerge. This works fine in Cilk+ because cilk_spawn evaluates the arguments of the spawned call before execution continues, so the old value of start has already been consumed by that recursive call.

Here’s my Julia version, modified to deal with 1-based indexing:

using Base.Threads: @spawn

# Merge a[lo1:hi1] and a[lo2:hi2] into b starting at start
function parMerge!(b, a, lo1, hi1, lo2, hi2, start)
    k1 = hi1 - lo1 + 1
    k2 = hi2 - lo2 + 1
    mid1, mid2 = 0, 0
    # sequential cutoff
    if k1+k2 < 100000
        sequentialMerge!(b, a, lo1, hi1, lo2, hi2, start)
        return
    end
    if k1 >= k2
        mid1 = (lo1+hi1) >>> 1
        mid2 = binarySearch(a, lo2, hi2, mid1)
    else
        mid2 = (lo2+hi2) >>> 1
        mid1 = binarySearch(a, lo1, hi1, mid2) - 1
        mid2 += 1
    end
    first = @spawn parMerge!(b, a, lo1, mid1, lo2, mid2-1, start)
    parMerge!(b, a, mid1+1, hi1, mid2, hi2, start + mid1 - lo1 + 1 + mid2 - lo2)
    wait(first)
end

Note that here I didn’t update start before the second recursive call; instead I passed the expression directly as the last argument.

I had initially updated start before the second recursive call, as in the Cilk+ program. What happened was that the first (spawned) recursive call used the value of start that was updated on the following line, leading to out-of-bounds array accesses. From my limited understanding of Julia, I take it that this is because the expression in the @spawn macro is “escaped” (esc), so its variables refer back to the scope where the macro is called.
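
Here is a minimal illustration of the trap (a sketch of my own, not from the programs above; it assumes Julia is started with more than one thread):

using Base.Threads: @spawn

function capture_demo()
    start = 1
    t = @spawn println("task sees start = ", start)   # the closure captures the variable `start` itself
    start = 2        # races with the task: the closure reads `start` whenever it runs
    wait(t)          # typically prints 2, not 1

    start = 1
    t = @spawn println("task sees start = ", $start)  # $ interpolation snapshots the current value
    start = 2
    wait(t)          # always prints 1
end

capture_demo()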


@spawn builds a closure for the thunk that it passes to the task. This can indeed make variable references tricky.
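
Roughly speaking (a simplified sketch, not the actual macro expansion; spawn_sketch is just an illustrative name), Threads.@spawn expr behaves like:

function spawn_sketch(f)
    t = Task(f)       # allocates a Task object wrapping the closure
    t.sticky = false  # let the scheduler run it on any thread
    schedule(t)
    return t
end

# so `t = @spawn parMerge!(b, a, ...)` acts like
# `t = spawn_sketch(() -> parMerge!(b, a, ...))`,
# where the anonymous function captures the variables themselves, not their values.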

Can this allocation overhead be avoided? This makes divide-and-conquer algorithms allocate way too much…

You can interpolate values into @spawn.

a = b + c
a += 1
task = @spawn myfun($a)  # $a copies the current value of a into the task's closure

That has no effect, I’m afraid.

I agree that divide-and-conquer implementations can allocate too much memory, but I don’t see how that applies in this example. Could you elaborate?

Take for instance this sorting algorithm (credit @pitsianis):

        - const SEQ_THRESH = 1 << 9
        -
        - @inline function partition!(A, pivot, left, right)
        -   @inbounds while left <= right
        -     while A[left] < pivot
        -       left += 1
        -     end
        -     while A[right] > pivot
        -       right -= 1
        -     end
        -     if left <= right
        -       A[left], A[right] = A[right], A[left]
        -       left += 1
        -       right -= 1
        -     end
        -   end
        -
        -   return (left, right)
        - end
        -
        - function quicksort!(A, i=1, j=length(A))
        0   if j > i
        0     left, right = partition!(A, A[(j+i) >>> 1], i, j)
        0     quicksort!(A, i, right)
        0     quicksort!(A, left, j)
        -   end
        0   return
        - end
        -
        - function parallel_quicksort!(A, i=1, j=length(A))
        0   if j-i <= SEQ_THRESH
        0     quicksort!(A, i, j)
        0     return
        -   end
        0   left, right = partition!(A, A[(j+i) >>> 1], i, j)
 12327200   t = Threads.@spawn parallel_quicksort!(A, i, right)
        0   parallel_quicksort!(A, left, j)
 18542720   wait(t)
        -
        0   return
        - end
        -
        - function test(n)
  8388656   b = rand(n)
        -
        0   for i = 1:5
      240     print("Parallel  : ")
 41943280     a1 = copy(b); @time parallel_quicksort!(a1);
      240     print("Sequential: ")
 41943280     a2 = copy(b); @time quicksort!(a2);
      720     println("")
 41943280     @assert a1 == sort(a1)
        0     @assert a1 == a2
        0   end
        -
        - end
        -
        - test(2^20)

Notice the huge allocations where the task is spawned and where it is waited on.
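
One way to gauge the fixed per-task cost directly, without any instrumentation, is a micro-benchmark along these lines (a sketch; the exact byte count depends on the Julia version):

using Base.Threads: @spawn

wait(@spawn nothing)             # warm up: compile the task machinery first
@allocated wait(@spawn nothing)  # bytes allocated by one spawn/wait pair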

That is ugly. So even if the implementation is good in theory, these allocations can’t be avoided. BTW, how is this profile generated? I’ve only used the totals given by @time and @btime.

julia --track-allocation=user

Sorry, I think it was =all.
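
For reference, the usual --track-allocation workflow looks like this (a sketch; qsort.jl is a hypothetical file name standing in for wherever the code above lives):

$ julia --track-allocation=all
julia> include("qsort.jl")          # run once so compilation happens now
julia> using Profile
julia> Profile.clear_malloc_data()  # zero the counters, discarding compilation noise
julia> test(2^20)                   # the run you actually want to measure
julia> exit()                       # per-line counts are written to qsort.jl.<pid>.mem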

--track-allocation introduces allocations that wouldn’t exist otherwise. What happens if you use the allocation profiler instead?
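
A minimal way to drive it (a sketch assuming Julia 1.8+ with the PProf package installed):

using Profile, PProf

Profile.Allocs.clear()
Profile.Allocs.@profile test(2^20)  # by default only a small fraction of allocations is sampled
PProf.Allocs.pprof()                # open the samples in the pprof web UI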

That sounds promising. Alas, the output is anything but easy to interpret. I’ve been looking at it for at least ten minutes without being able to figure out whether my code allocated anything or not.
This is a sample of the graph:

[screenshot of the allocation profile graph]
Is there a trick to it?

That looks like a lot of allocations from compilation. One thing to be aware of is that you probably need to set the sample_rate higher. It defaults to capturing only 1/1000 allocations (I want to change that default but haven’t yet).
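
For example (continuing the hypothetical session sketched above):

Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1.0 test(2^20)  # capture every allocation
PProf.Allocs.pprof()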

Thanks! With ten thousand samples, there is a mention of my own code:

[screenshot: graph node for parallel_quicksort! labeled with the number 8316]
Does this mean it does not allocate?

No. That 8316 means that it recorded 8316 allocations there.

I do long for the old memory-allocation style of display: simply show the source code with the numbers on the left. Is that possible?

The problem with that display is that it forces the compiler to lie to you. In order to get that display you have to turn off most inlining (since once code is inlined it’s very hard to attribute it to a line number), and a number of other unfortunate things happen that reduce the compiler’s ability to optimize your code. The fundamental problem is that Julia’s compiler works on an “as if” basis: it is only allowed to make optimizations if it can prove that you can’t tell the difference. This has the unfortunate effect that when you try to attribute allocations to line numbers, you inherently prevent the compiler from optimizing your code.


How do I find out where the memory was allocated?

The boxes give you the function names. Following the arrows down gives you the call stack that leads to them.