Parallel merge in Cilk+ vs Julia

As my first Julia program, I had some fun debugging a Julia implementation of a parallel merge that I had already developed in Cilk+. I thought I’d make my first post by pointing out the following, as a warning to others developing divide-and-conquer algorithms with @spawn.

Here’s my Cilk+ function:

// parallel merge of sorted sub-arrays i ∈ [low1..up1) and i ∈ [low2..up2) of a into b
// starting at index start
void parMerge(int *a, int low1, int up1, int low2, int up2, int *b, int start){
    int k1 = up1 - low1;
    int k2 = up2 - low2;
    int mid1, mid2;
    // sequential cutoff
    if(k1+k2 < 100000)
        // usual sequential non-recursive merge
        sequentialMerge(a, low1, up1, low2, up2, b, start);
    else{
        if(k1 >= k2){
            mid1 = (low1+up1-1)/2;
            mid2 = binarySearch(a, low2, up2, mid1);
        } else{
            mid2 = (low2+up2-1)/2;
            mid1 = binarySearch(a, low1, up1, mid2)-1;
            mid2++;
        }
        cilk_spawn parMerge(a, low1, mid1+1, low2, mid2, b, start);
        start = start + mid1 - low1 + 1 + mid2 - low2;
        parMerge(a, mid1+1, up1, mid2, up2, b, start);
        cilk_sync;
    }
}

Note that I’ve updated start after the first recursive call to parMerge. This works fine in Cilk+ because cilk_spawn evaluates the arguments of the spawned call before execution continues, so the old value of start has already been consumed by that recursive call.

Here’s my Julia version, modified to deal with 1-based indexing:

using Base.Threads: @spawn

# Merge a[lo1:hi1] and a[lo2:hi2] into b starting at start
function parMerge!(b, a, lo1, hi1, lo2, hi2, start)
    k1 = hi1 - lo1 + 1
    k2 = hi2 - lo2 + 1
    mid1, mid2 = 0, 0
    # sequential cutoff
    if k1+k2 < 100000
        sequentialMerge!(b, a, lo1, hi1, lo2, hi2, start)
        return
    end
    if k1 >= k2
        mid1 = (lo1+hi1) >>> 1
        mid2 = binarySearch(a, lo2, hi2, mid1)
    else
        mid2 = (lo2+hi2) >>> 1
        mid1 = binarySearch(a, lo1, hi1, mid2) - 1
        mid2 += 1
    end
    first = @spawn parMerge!(b, a, lo1, mid1, lo2, mid2-1, start)
    parMerge!(b, a, mid1+1, hi1, mid2, hi2, start + mid1 - lo1 + 1 + mid2 - lo2)
    wait(first)
end

Note that here I didn’t update start before the second recursive call; instead I passed the expression directly as the last argument.

I had initially updated start before the second recursive call, as in the Cilk+ program. What happened was that the first (spawned) recursive call used the value of start that was updated on the following line, leading to out-of-bounds array accesses. From my limited understanding of Julia, I take it that this is because the expression in the @spawn macro is “escaped” (esc), so its variables refer back to the scope where the macro is called.
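
Here is a minimal illustration of the trap (a sketch of my own, not from the programs above; it assumes Julia is started with more than one thread):

using Base.Threads: @spawn

function capture_demo()
    start = 1
    t = @spawn println("task sees start = ", start)   # the closure captures the variable `start` itself
    start = 2        # races with the task: the closure reads `start` whenever it runs
    wait(t)          # typically prints 2, not 1

    start = 1
    t = @spawn println("task sees start = ", $start)  # $ interpolation snapshots the current value
    start = 2
    wait(t)          # always prints 1
end

capture_demo()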


@spawn builds a closure for the thunk that it passes to the task. This can indeed make variable references tricky.
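
Roughly speaking (a simplified sketch, not the actual macro expansion; spawn_sketch is just an illustrative name), Threads.@spawn expr behaves like:

function spawn_sketch(f)
    t = Task(f)       # allocates a Task object wrapping the closure
    t.sticky = false  # let the scheduler run it on any thread
    schedule(t)
    return t
end

# so `t = @spawn parMerge!(b, a, ...)` acts like
# `t = spawn_sketch(() -> parMerge!(b, a, ...))`,
# where the anonymous function captures the variables themselves, not their values.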

Can this allocation overhead be avoided? This makes divide-and-conquer algorithms allocate way too much…

You can interpolate values into @spawn.

a = b + c
a += 1
task = @spawn myfun($a)  # $a copies the current value of a into the task's closure

That has no effect, I’m afraid.

I agree that divide-and-conquer implementations can allocate too much memory, but I don’t see how that applies in this example. Could you elaborate?

Take for instance this sorting algorithm (credit @pitsianis):

        - const SEQ_THRESH = 1 << 9
        -
        - @inline function partition!(A, pivot, left, right)
        -   @inbounds while left <= right
        -     while A[left] < pivot
        -       left += 1
        -     end
        -     while A[right] > pivot
        -       right -= 1
        -     end
        -     if left <= right
        -       A[left], A[right] = A[right], A[left]
        -       left += 1
        -       right -= 1
        -     end
        -   end
        -
        -   return (left, right)
        - end
        -
        - function quicksort!(A, i=1, j=length(A))
        0   if j > i
        0     left, right = partition!(A, A[(j+i) >>> 1], i, j)
        0     quicksort!(A, i, right)
        0     quicksort!(A, left, j)
        -   end
        0   return
        - end
        -
        - function parallel_quicksort!(A, i=1, j=length(A))
        0   if j-i <= SEQ_THRESH
        0     quicksort!(A, i, j)
        0     return
        -   end
        0   left, right = partition!(A, A[(j+i) >>> 1], i, j)
 12327200   t = Threads.@spawn parallel_quicksort!(A, i, right)
        0   parallel_quicksort!(A, left, j)
 18542720   wait(t)
        -
        0   return
        - end
        -
        - function test(n)
  8388656   b = rand(n)
        -
        0   for i = 1:5
      240     print("Parallel  : ")
 41943280     a1 = copy(b); @time parallel_quicksort!(a1);
      240     print("Sequential: ")
 41943280     a2 = copy(b); @time quicksort!(a2);
      720     println("")
 41943280     @assert a1 == sort(a1)
        0     @assert a1 == a2
        0   end
        -
        - end
        -
        - test(2^20)

Notice the huge allocations where the task is spawned and where it is waited on.
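
One way to gauge the fixed per-task cost directly, without any instrumentation, is a micro-benchmark along these lines (a sketch; the exact byte count depends on the Julia version):

using Base.Threads: @spawn

wait(@spawn nothing)             # warm up: compile the task machinery first
@allocated wait(@spawn nothing)  # bytes allocated by one spawn/wait pair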

That is ugly. So even if the implementation is good in theory, these allocations can’t be avoided. BTW, how is this profile generated? I’ve only used the totals given by @time and @btime.

julia --track-allocation=user

Sorry, I think it was =all.
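
For reference, the usual --track-allocation workflow looks like this (a sketch; qsort.jl is a hypothetical file name standing in for wherever the code above lives):

$ julia --track-allocation=all
julia> include("qsort.jl")          # run once so compilation happens now
julia> using Profile
julia> Profile.clear_malloc_data()  # zero the counters, discarding compilation noise
julia> test(2^20)                   # the run you actually want to measure
julia> exit()                       # per-line counts are written to qsort.jl.<pid>.mem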

--track-allocation introduces allocations that wouldn’t exist otherwise. What happens if you use the allocation profiler instead?
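
A minimal way to drive it (a sketch assuming Julia 1.8+ with the PProf package installed):

using Profile, PProf

Profile.Allocs.clear()
Profile.Allocs.@profile test(2^20)  # by default only a small fraction of allocations is sampled
PProf.Allocs.pprof()                # open the samples in the pprof web UI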

That sounds promising. Alas, the output is anything but easy to interpret. I’ve been looking at it for at least ten minutes without being able to figure out whether my code allocated anything or not.
This is a sample of the graph:

[screenshot of the allocation profile graph]
Is there a trick to it?

That looks like a lot of allocations from compilation. One thing to be aware of is that you probably need to set the sample_rate higher. It defaults to capturing only 1/1000 allocations (I want to change that default but haven’t yet).
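
For example (continuing the hypothetical session sketched above):

Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1.0 test(2^20)  # capture every allocation
PProf.Allocs.pprof()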

Thanks! With ten thousand samples, there is a mention of my own code:

[screenshot: graph node for parallel_quicksort! labeled with the number 8316]
Does this mean it does not allocate?

No. That 8316 means that it recorded 8316 allocations there.

I do long for the old memory-allocation style of display: simply show the source code with the numbers on the left. Is that possible?

The problem with that display is that it forces the compiler to lie to you. In order to get that display you have to turn off most inlining (since once code is inlined it’s very hard to attribute it to a line number), and a number of other unfortunate things happen that reduce the compiler’s ability to optimize your code. The fundamental problem is that Julia’s compiler works on an “as if” basis: it is only allowed to make optimizations if it can prove that you can’t tell the difference. This has the unfortunate effect that when you try to attribute allocations to line numbers, you inherently prevent the compiler from optimizing your code.


How do I find out where the memory was allocated?

The boxes give you the function names. Following the arrows down gives you the call stack that leads to them.