Parallel quicksort (OpenMP style)

Hi,

I was wondering how to parallelise the quicksort algorithm below (from Rosetta Code) in Julia.

In C with OpenMP one would use #pragma omp task. Example:

But what is the equivalent in Julia?

Thank you

You probably want multithreading, using something like the Threads.@threads macro. This is just the plain default kind of parallelisation, but there are other packages that provide similar functionality.

As a start, using the built-in threads library is easy:

Threads.@threads for i in 1:100
# do something in parallel
end
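Note that Julia has to be started with more than one thread for this to run in parallel, e.g. julia --threads=4; you can check the count with Threads.nthreads().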

You will find this blog useful, as it shows an example of how to multithread a mergesort.
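The rough idea looks something like this (a minimal sketch of my own, not the blog's code; the serial cutoff of 1 << 12 is just an illustrative value):

function pmergesort!(v, lo=1, hi=length(v))
  if hi - lo < 1 << 12               # small chunk: fall back to a serial sort
    sort!(view(v, lo:hi))
    return v
  end
  mid = (lo + hi) >>> 1
  t = Threads.@spawn pmergesort!(v, lo, mid)   # sort the left half in a task
  pmergesort!(v, mid + 1, hi)                  # sort the right half here
  wait(t)
  # merge the two sorted halves through a temporary buffer
  tmp = similar(v, hi - lo + 1)
  i, j, k = lo, mid + 1, 1
  @inbounds while i <= mid && j <= hi
    if v[j] < v[i]
      tmp[k] = v[j]; j += 1
    else
      tmp[k] = v[i]; i += 1
    end
    k += 1
  end
  @inbounds while i <= mid
    tmp[k] = v[i]; i += 1; k += 1
  end
  @inbounds while j <= hi
    tmp[k] = v[j]; j += 1; k += 1
  end
  copyto!(v, lo, tmp, 1, length(tmp))
  return v
end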


I have tried the following, but it makes the runtime worse.

function quicksort!(A, i=1, j=length(A))
  if j > i
    pivot = A[(j+i) >>> 1]      # middle element as pivot
    left, right = i, j
    # Hoare-style partition: afterwards A[i:right] <= pivot <= A[left:j]
    while left <= right
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end
    quicksort!(A, i, right)
    quicksort!(A, left, j)
  end
  return A
end

function parallel_quicksort!(A, i=1, j=length(A))
  if j > i
    if j - i < 100_000          # below this size, sort sequentially
      quicksort!(A, i, j)
      return A
    end
    pivot = A[(j+i) >>> 1]
    left, right = i, j
    while left <= right
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end
    t = Threads.@spawn parallel_quicksort!(A, i, right)  # left half in a task
    parallel_quicksort!(A, left, j)                      # right half here
    wait(t)
  end
  return A
end

You may check ThreadsX.jl (GitHub - tkf/ThreadsX.jl: Parallelized Base functions), which has a parallel sort implementation by @tkf.

Wow! That is super complicated. I am not there yet in terms of understanding the Julia language :grin:

Maybe the base threads are not “light enough” for your recursive implementation. You may cut off the parallel recursion once the chunks are small enough, or try lighter threads like Polyester.jl (GitHub - JuliaSIMD/Polyester.jl: The cheapest threads you can find!).


Yes, it is complicated, but optimal parallel sorting is not an easy problem anyway. There is a discussion of where to stop the recursion in the file ThreadsX.jl/quicksort.jl at master · tkf/ThreadsX.jl · GitHub:

Base.@kwdef struct ParallelQuickSortAlg{Alg,SmallSize,BaseSize} <: ParallelSortAlgorithm
    smallsort::Alg = Base.Sort.DEFAULT_UNSTABLE
    smallsize::SmallSize = nothing  # lazily determined
    basesize::BaseSize = 10_000
end
# `basesize` is tuned using `Float64`.  Make it `eltype`-aware?
#
# Note: Small `smallsize` (e.g., 32) is beneficial on random data. However, it
# is slower on sorted data with small number of worker threads.
# TODO: Implement some heuristics for `smallsize`.
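For plain usage you do not need those internals, though. Assuming the package is installed, the entry point mirrors Base's API:

using ThreadsX

x = rand(10^6)
ThreadsX.sort!(x)     # parallel counterpart of Base.sort!
@assert issorted(x)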

I am not looking for something optimal at all; I am more at the stage of figuring out how Julia works and how to do things, like translating a C OpenMP quicksort implementation to Julia. The implementation that you refer to is probably well optimised and very advanced. I am looking for something simple, like what is done in C.

Even in the serial case, you really don’t want to recurse down to a length-1 base case if you care about performance.

Mathematically, recursing down to a trivial base case is the most elegant approach, but in practice you typically want a larger base case for performance reasons.

(Although this is well known, unfortunately it’s not typically taught to computer-science majors as far as I know. Of course, the fact that most undergraduate computer-science programming seems to be taught in Python these days makes it harder to meaningfully discuss performance optimization.)
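For example, a common pattern is to switch to insertion sort below a cutoff (a minimal sketch of my own, not code from this thread; the cutoff of 32 is just a ballpark to tune):

const CUTOFF = 32   # illustrative cutoff; tune for your machine and eltype

function insertionsort!(A, lo, hi)
  for i in lo+1:hi
    x = A[i]
    j = i - 1
    while j >= lo && A[j] > x
      A[j+1] = A[j]
      j -= 1
    end
    A[j+1] = x
  end
  return A
end

function qsort!(A, lo=1, hi=length(A))
  if hi - lo < CUTOFF               # larger base case: insertion sort
    return insertionsort!(A, lo, hi)
  end
  pivot = A[(lo+hi) >>> 1]
  left, right = lo, hi
  while left <= right
    while A[left] < pivot
      left += 1
    end
    while A[right] > pivot
      right -= 1
    end
    if left <= right
      A[left], A[right] = A[right], A[left]
      left += 1
      right -= 1
    end
  end
  qsort!(A, lo, right)
  qsort!(A, left, hi)
  return A
end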


Thank you for your input. I am not looking to optimise anything. Maybe I did not express myself correctly; sorry for that. I am just trying to figure out how to do a simple parallel quicksort in Julia, like it is done in C and can be found all over the internet. I am new to Julia, so the first step is really managing to make things work before starting optimisation.

Also, I am not sure why you are referring to computer science undergraduates. If you believe I am one, you are wrong. I have studied economics.

But you’re looking for a performance speedup, in which case you should not be spawning a task for every single element of the array, which is what happens if you recurse down to a base case of size 1.

If you are benchmarking, then you have to pay at least some attention to performance issues, otherwise the results are pretty meaningless.
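For example, BenchmarkTools.jl gives more reliable timings than a bare @time (assuming the package is installed); something like:

using BenchmarkTools

x = rand(2^20)
# sort a fresh copy in every sample so each run does the same work
@btime parallel_quicksort!(a) setup=(a = copy($x)) evals=1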


Actually, the parallel recursion is already cut off (below 100_000 elements) in @gitboy16’s implementation.


I was reading this blog, and m is not defined in:

temp = temps[Threads.threadid()]
length(temp) < m-lo+1 && resize!(temp, m-lo+1)
copyto!(temp, 1, v, lo, m-lo+1)
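(From context, m looks like the midpoint of the lo:hi range being merged, something like m = (lo + hi) >>> 1, but that is a guess; the snippet does not define it.)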

You may think that you are not optimizing, but in order to make an implementation of an O(n \log n) expected-complexity algorithm run faster in parallel than sequentially, you have to. If it were an O(n) algorithm, like quickselect(), which is like quicksort() but only recurses into one of the two ‘halves’, it would be even harder!
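As an aside, here is a sketch of quickselect in the same style as the quicksort above (my illustration, not code from this thread):

function quickselect!(A, k, lo=1, hi=length(A))
  while lo < hi
    pivot = A[(lo+hi) >>> 1]
    left, right = lo, hi
    while left <= right                # same Hoare-style partition as quicksort
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end
    if k <= right                      # k-th smallest is in the left part
      hi = right
    elseif k >= left                   # k-th smallest is in the right part
      lo = left
    else
      break                            # right < k < left: A[k] == pivot, done
    end
  end
  return A[k]
end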

You may have to parallelize the partition computation, because in the earlier stages (especially the first call) you will be partitioning the data sequentially, and it will take a few calls to ramp up to using all your cores. A parallel in-place partition is non-trivial; it is a typical Google interview question. :wink:


Please look at the C link I sent at the beginning, and look at solution 2, not solution 1 (which is wrong). The partition is not parallelised there. If you wish, leave the partition aside: how do you parallelise the rest?

I have made a few minor changes in your code here:


const SEQ_THRESH = 1 << 9   # below 512 elements, sort sequentially

@inline function partition!(A, pivot, left, right)
  @inbounds while left <= right
    while A[left] < pivot
      left += 1
    end
    while A[right] > pivot
      right -= 1
    end
    if left <= right
      A[left], A[right] = A[right], A[left]
      left += 1
      right -= 1
    end
  end
  return (left, right)
end

function quicksort!(A, i=1, j=length(A))
  if j > i
    left, right = partition!(A, A[(j+i) >>> 1], i, j)
    quicksort!(A, i, right)
    quicksort!(A, left, j)
  end
  return
end

function parallel_quicksort!(A, i=1, j=length(A))
  if j - i <= SEQ_THRESH
    quicksort!(A, i, j)
    return
  end
  left, right = partition!(A, A[(j+i) >>> 1], i, j)
  t = Threads.@spawn parallel_quicksort!(A, i, right)
  parallel_quicksort!(A, left, j)
  wait(t)
  return
end

function test(n)
  b = rand(n)

  for i = 1:5
    print("Parallel  : ")
    a1 = copy(b); @time parallel_quicksort!(a1);
    print("Sequential: ")
    a2 = copy(b); @time quicksort!(a2);
    println("")
    @assert a1 == sort(a1)
    @assert a1 == a2
  end

end

test(2^20)

And I do get about a 2.33x speedup on a 2-core 2013 MacBook Pro with 16 GB RAM.

~/p/quicksort.jl> julia -tauto --project=. test_qs.jl
Parallel  :   0.057559 seconds (14.31 k allocations: 1.012 MiB, 7.69% compilation time)
Sequential:   0.114027 seconds

Parallel  :   0.047487 seconds (13.52 k allocations: 992.812 KiB)
Sequential:   0.106023 seconds

Parallel  :   0.047373 seconds (13.52 k allocations: 992.906 KiB)
Sequential:   0.106490 seconds

Parallel  :   0.046449 seconds (13.48 k allocations: 991.656 KiB)
Sequential:   0.109551 seconds

Parallel  :   0.045878 seconds (13.47 k allocations: 991.375 KiB)
Sequential:   0.107051 seconds

Perhaps if you were to undo my changes one by one, it would be very interesting to find out which aspect of your initial code caused the slowdown (probably some type instability). Could somebody wiser help here? @Elrod, @stevengj perhaps.

CHANGES:

  1. parallel and sequential quicksort!() return nothing instead of A
  2. abstracted out partition!() and marked it @inline
  3. wrapped the outer while loop of the partition in @inbounds
  4. timed from a test function, not the REPL

You may try to use a different random array for each timing.


It is intentional to test and time the same problem: the first run is there to get the code compiled, and the rest show the variation.

Successive runs of test() generate different problems (of the same size).

Thank you @pitsianis, it works on my side as well. I am struggling to understand what is wrong with my code. When I use @code_warntype, it does not show any type instability. It seems that it is your change 2 that fixes it, but why that is, I have no idea. It would be great to have the view of a senior developer on the subject. It might be a good way to learn what to do and what not to do when programming in Julia.
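One plausible explanation, offered as a guess rather than something verified against your exact code: in your original parallel_quicksort!, left and right are reassigned inside the partition loop and are also captured by the closure that Threads.@spawn creates. Julia tends to box captured variables that are assigned more than once (see the “performance of captured variables” section of the Performance Tips), and a boxed variable is type-unstable for everything that reads it afterwards. After change 2, left and right are assigned exactly once from the return value of partition!(), so nothing needs to be boxed. A tiny illustration of the pattern, with toy functions of my own:

function boxed(A)
  i = 1
  while i < length(A) && A[i] < 0.5
    i += 1                                 # i is reassigned here...
  end
  t = Threads.@spawn sum(@view A[1:i])     # ...and captured here => Core.Box
  return fetch(t)
end

function unboxed(A)
  i = something(findfirst(>=(0.5), A), length(A))  # assigned exactly once
  t = Threads.@spawn sum(@view A[1:i])             # captured but never reassigned
  return fetch(t)
end

@code_warntype boxed(rand(100)) should report a Core.Box for i, while unboxed should have none; the box is easy to miss, since it shows up in the variable listing rather than as a red Any in the return type.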