# Parallel quicksort with OpenMP

But you’re looking for a performance speedup, in which case you should not be spawning a task for every single element of the array, which is what happens if you recurse down to a base case of size 1.

If you are benchmarking, then you have to pay at least some attention to performance issues, otherwise the results are pretty meaningless.


Actually the recursion is stopped in @gitboy16’s implementation.


I was reading this blog, and `m` is not defined in:

``````temp = temps[Threads.threadid()]
length(temp) < m-lo+1 && resize!(temp, m-lo+1)
copyto!(temp, 1, v, lo, m-lo+1)
``````

You may think that you are not optimizing, but to make an O(n log n) expected-complexity algorithm run faster in parallel than sequentially, you have to. If it were an O(n) algorithm, like `quickselect()`, which is like `quicksort()` but only recurses into one of the two ‘halves’, it would be even harder!

You may have to parallelize the `partition` computation, because in the earlier stages (especially the first call) you will be partitioning the data sequentially, and it will take a few calls to ramp up to use all your cores. An in-place parallel `partition` is non-trivial. It is a typical Google interview question.
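To illustrate the idea (this is my own toy sketch, not code from this thread, and it works out-of-place, which is the easy variant, not the in-place interview version): each thread counts the elements below the pivot in its chunk, exclusive prefix sums turn those counts into private write offsets, and a second parallel pass scatters into an output buffer without locks. The name `parallel_partition` is mine.

``````
# Hedged sketch: out-of-place parallel partition in two passes.
# Pass 1 counts "< pivot" per chunk; prefix sums give each chunk
# a private write offset; pass 2 scatters without synchronization.
function parallel_partition(v::AbstractVector, pivot)
    n  = length(v)
    nt = Threads.nthreads()
    bounds = round.(Int, range(0, n; length = nt + 1))  # chunk boundaries
    nlt = zeros(Int, nt)                                # "< pivot" count per chunk
    Threads.@threads for t in 1:nt
        c = 0
        for i in bounds[t]+1:bounds[t+1]
            c += v[i] < pivot
        end
        nlt[t] = c
    end
    total_lt = sum(nlt)
    lt_off = cumsum(nlt) .- nlt                         # exclusive prefix sums
    nge    = [bounds[t+1] - bounds[t] - nlt[t] for t in 1:nt]
    ge_off = total_lt .+ cumsum(nge) .- nge
    out = similar(v)
    Threads.@threads for t in 1:nt
        li, gi = lt_off[t], ge_off[t]
        for i in bounds[t]+1:bounds[t+1]
            x = v[i]
            if x < pivot
                li += 1
                out[li] = x
            else
                gi += 1
                out[gi] = x
            end
        end
    end
    return out, total_lt   # out[1:total_lt] < pivot <= out[total_lt+1:end]
end
``````

The extra buffer is exactly what the in-place version avoids, which is what makes the interview question hard.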


Please look at the C link I sent at the beginning. Look at solution 2, not solution 1 (which is wrong). The partition is not parallelised there. If you wish, leave the partition aside…how do you parallelise the rest?

``````
const SEQ_THRESH = 1 << 9

@inline function partition!(A, pivot, left, right)
    @inbounds while left <= right
        while A[left] < pivot
            left += 1
        end
        while A[right] > pivot
            right -= 1
        end
        if left <= right
            A[left], A[right] = A[right], A[left]
            left += 1
            right -= 1
        end
    end
    return (left, right)
end

function quicksort!(A, i=1, j=length(A))
    if j > i
        left, right = partition!(A, A[(j+i) >>> 1], i, j)
        quicksort!(A, i, right)
        quicksort!(A, left, j)
    end
    return
end

function parallel_quicksort!(A, i=1, j=length(A))
    if j - i <= SEQ_THRESH
        quicksort!(A, i, j)
        return
    end
    left, right = partition!(A, A[(j+i) >>> 1], i, j)
    t = Threads.@spawn parallel_quicksort!(A, i, right)  # sort left part in a task
    parallel_quicksort!(A, left, j)                      # sort right part here
    wait(t)
    return
end

function test(n)
    b = rand(n)

    for i = 1:5
        print("Parallel  : ")
        a1 = copy(b); @time parallel_quicksort!(a1)
        print("Sequential: ")
        a2 = copy(b); @time quicksort!(a2)
        println()
        @assert a1 == sort(a1)
        @assert a1 == a2
    end
end

test(2^20)
``````

And I do get about 2.33x speedup on a 2-core 2013 MacBookPro with 16GB RAM.

``````~/p/quicksort.jl> julia -tauto --project=. test_qs.jl
Parallel  :   0.057559 seconds (14.31 k allocations: 1.012 MiB, 7.69% compilation time)
Sequential:   0.114027 seconds

Parallel  :   0.047487 seconds (13.52 k allocations: 992.812 KiB)
Sequential:   0.106023 seconds

Parallel  :   0.047373 seconds (13.52 k allocations: 992.906 KiB)
Sequential:   0.106490 seconds

Parallel  :   0.046449 seconds (13.48 k allocations: 991.656 KiB)
Sequential:   0.109551 seconds

Parallel  :   0.045878 seconds (13.47 k allocations: 991.375 KiB)
Sequential:   0.107051 seconds
``````

It would be very interesting to undo my changes one by one and find out which aspect of your initial code caused the slowdown (probably some type instability). Could somebody wiser help here? @Elrod @stevengj perhaps.

CHANGES:

1. the parallel and sequential `quicksort!()` now return `nothing` instead of `A`
2. abstracted out `partition!()` and marked it `@inline`
3. marked the outer `while` loop `@inbounds`
4. timed from a `test` function, not the REPL

You may try to use a different random array for each timing.

It is intentional to test and time the same problem.
The first run is to have the code compiled. The rest to see the variation.

Successive runs of `test()` generate different problems (of the same size).

Thank you @pitsianis , it works on my side as well. I am struggling to understand what is wrong with my code. When I use `code_warntype` it does not show any type instability. It seems that it is your change 2 that fixes it, but why that is, I have no idea. It would be great to have the view of a senior developer on the subject. It might be a good way to learn what to do and not to do when programming in Julia.
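One plausible explanation (my own guess, not something established in this thread) is Julia’s documented “performance of captured variables” issue: a variable that is reassigned inside a function and also captured by a closure, such as the one `Threads.@spawn` creates, gets stored in a `Core.Box`, and every later use of it becomes type-unstable. Factoring the mutating loops into `partition!` means no mutated variable is captured. A toy illustration with hypothetical function names:

``````
# `i` is reassigned in the loop AND captured by the @spawn closure,
# so Julia wraps it in a Core.Box; uses of `i` become type-unstable.
function scan_then_spawn(A)
    i = 1
    while i <= length(A) && A[i] < 0
        i += 1
    end
    t = Threads.@spawn sum(@view A[i:end])   # captures the boxed `i`
    return fetch(t)
end

# Function barrier: the mutation happens inside a helper, so the value
# captured by the closure is assigned exactly once and stays concrete.
first_nonneg(A) = something(findfirst(>=(0), A), length(A) + 1)

function scan_with_barrier(A)
    i = first_nonneg(A)                      # single assignment, no Box
    t = Threads.@spawn sum(@view A[i:end])
    return fetch(t)
end
``````

In `@code_warntype` output this shows up as `Core.Box`, which is easy to overlook because the reported return type may still be concrete.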

Ok. In this case one may randomize the order in which the two versions are run, to make sure that branch prediction plays no role.

Likely `m` is “mid”.
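If so (and this is an assumption about the blog’s surrounding code, not something shown in it), the mergesort around the snippet would define `m` as the midpoint of the current range `[lo, hi]`, something like:

``````
# Assumed context for the blog snippet; `v`, `lo`, `hi`, `temps` stand in
# for the blog's variables, with demo values here.
v = rand(16)
lo, hi = 1, length(v)
temps = [eltype(v)[] for _ in 1:Threads.nthreads()]  # per-thread scratch buffers

m = (lo + hi) >>> 1                                  # presumed definition of `m`
temp = temps[Threads.threadid()]
length(temp) < m - lo + 1 && resize!(temp, m - lo + 1)
copyto!(temp, 1, v, lo, m - lo + 1)                  # copy left half v[lo:m] into temp
``````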

When you make the changes described in that section, does it run OK for you? It does not for me: I get large memory allocations, and the code is actually slower. I am probably doing something wrong. Showing the full code in that blog would maybe have been better for new people like me who are trying to learn and reproduce their “experiment”.

Hello @gitboy16,

To chime in on your issue that your code provides no speedup compared to the code provided as a solution.

I found this thread to be helpful: For loops run SUPER slow in Julia when outside a function - Stack Overflow
and I believe it provides hints as to why you did not see the expected results in your code.

Hi @GeoKou , thank you for the link but I don’t get it. All my loops are inside functions.

It seems, after testing it myself, that the difference that enables the desired behavior is having the “partition” part of the code in its own function.

I think it is not exactly the issue in the thread I sent, but a similar one: with the while loops inlined in the function that is called on each thread, Julia must be putting up “safety” mechanics to prevent any issues (even though the code is written in a way that no such issues should arise), and this slowdown is removed by moving the while loops (the partition) into their own function.

Ok, let’s assume you are right. Then if you look at the blog sent by @blackeneth, you can see that the parallel mergesort has a loop in the function body, and the code runs perfectly fine in parallel. So how do you explain that?

Also, interestingly, if we move the two while loops in mergesort into a function and call that function inside the parallel mergesort, then it slows down. So we have here the opposite effect.

I spawned a new thread, with the two codes in one, here:

@gitboy16 or anybody else, please post there the mergesort with the opposite behavior.


This is a nice piece of code. I hope I have your permission to borrow it? Thanks.

Absolutely!

I have been teaching a Parallel Computing class to advanced engineering students for 25 years. I’ve been toying with the idea of using Julia as a common “algorithmic notation” to describe parallel algorithms and ideas: “executable pseudocode”, as I call it.

Since Julia 1.8, I think I got all I need.
