Parallel quicksort openmp

gitboy16 · October 9, 2022, 9:11pm

Hi,

I was wondering how it is possible to parallelise a quicksort algorithm below (from rosetta code) in Julia?

In c with openmp one would use #pragma omp task. Example :

But what is the equivalent in Julia?

Thank you

jmair · October 9, 2022, 11:20pm

You probably want multithreading, using something like a Threads.@threads macro. This is just the plain default type of parallelisation but there are other packages that provide similar functionality.

As a start, using the inbuilt threads library is easy:

Threads.@threads for i in 1:100
# do something in parallel
end

blackeneth · October 10, 2022, 2:09am

You will find this blog useful – as it shows an example of how to multithread mergesort:

gitboy16 · October 10, 2022, 9:23am

I have tried the following but it makes runtime worst.

function quicksort!(A,i=1,j=length(A))
  if j > i
    pivot = A[(j+i) >>> 1]
    left, right = i, j
    while left <= right
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end
    quicksort!(A,i,right)
    quicksort!(A,left,j)
  end
  return A
end

function parallel_quicksort!(A,i=1,j=length(A))
  if j > i
    if j-i < 100_000
      quicksort!(A, i, j)
      return A
    end
    pivot = A[(j+i) >>> 1]
    left, right = i, j
    while left <= right
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end
    t = Threads.@spawn parallel_quicksort!(A,i,right)
    parallel_quicksort!(A,left,j)
    wait(t)
  end
  return A
end

LaurentPlagne · October 10, 2022, 9:56am

You may check GitHub - tkf/ThreadsX.jl: Parallelized Base functions sort implementation by @tkf.

gitboy16 · October 10, 2022, 10:26am

Wow! That is super complicated. I am not there yet in term of understanding the Julia language

LaurentPlagne · October 10, 2022, 10:32am

Maybe the base Threads are not “light enough” for your recursive implementation. You may stop the recursion for large enough chunks or try lighter threads like GitHub - JuliaSIMD/Polyester.jl: The cheapest threads you can find!

LaurentPlagne · October 10, 2022, 11:40am

Yes it is complicated but optimal parallel sort is not an easy problem anyway. You have a discussion on where to stop the recursion in the file ThreadsX.jl/quicksort.jl at master · tkf/ThreadsX.jl · GitHub

Base.@kwdef struct ParallelQuickSortAlg{Alg,SmallSize,BaseSize} <: ParallelSortAlgorithm
    smallsort::Alg = Base.Sort.DEFAULT_UNSTABLE
    smallsize::SmallSize = nothing  # lazily determined
    basesize::BaseSize = 10_000
end
# `basesize` is tuned using `Float64`.  Make it `eltype`-aware?
#
# Note: Small `smallsize` (e.g., 32) is beneficial on random data. However, it
# is slower on sorted data with small number of worker threads.
# TODO: Implement some heuristics for `smallsize`.

gitboy16 · October 10, 2022, 12:21pm

I am not looking for something optimal at all, I am more at the stage of figuring out how Julia works and how to do things…like translating a c openmp quicksort implementation to Julia. The implementation that you refer to is probably well optimised and very advanced. I am more looking for something simple like done in C.

stevengj · October 10, 2022, 1:45pm

Even in the serial case, you really don’t want to recurse down to a length-1 base case if you care about performance.

Mathematically, recursing down to a trivial base case is the most elegant approach, but in practice you typically want a larger base case for performance reasons.

(Although this is well known, unfortunately it’s not typically taught to computer-science majors as far as I know. Of course, the fact that most undergraduate computer-science programming seems to be taught in Python these days makes it harder to meaningfully discuss performance optimization.)

gitboy16 · October 10, 2022, 2:05pm

Thank you for your input. I am not looking to optimise anything. Maybe I did not express myself correctly. Sorry for that. I am just trying figure out how to do a simple parallel quicksort in Julia like it is done in C and can be found all over internet. I am new to Julia so first step is really managing to make things work before starting optimisation.

Also, I am not sure why you are referring to computer science undergraduates. If you believe I am one, you are wrong. I have studied economics.

stevengj · October 10, 2022, 3:09pm

But you’re looking for a performance speedup. In which case you should not be spawning a task for every single element of the array, which is what happens if you recurse down to a base case of size = 1.

If you are benchmarking, then you have to pay at least some attention to performance issues, otherwise the results are pretty meaningless.

LaurentPlagne · October 10, 2022, 3:14pm

Actually the recursion is stopped in @gitboy16’s implementation.

gitboy16 · October 10, 2022, 4:21pm

Was reading this blog and m is not defined in:

temp = temps[Threads.threadid()]
length(temp) < m-lo+1 && resize!(temp, m-lo+1)
copyto!(temp, 1, v, lo, m-lo+1)

pitsianis · October 10, 2022, 4:28pm

You may think that your are not optimizing, but in order to make an O(n \log n) expected complexity algorithm implementation to work faster in parallel than sequentially, you have to. If it were for an O(n) algorithm, like for quickselect() that is like quicksort() but only recursing into one of the two ‘halfs’, it would be even harder!

You may have to parallelize the partition computation, because in the earlier stages (especially the first call), you will be partitioning the data sequentially, and it will take a few calls to ramp up to use all your cores. A parallel partition in-place, is non-trivial. It is a typical Google interview question.

gitboy16 · October 10, 2022, 4:47pm

Please look at the C link I sent at the beginning. Look at solution 2 not 1 (because it is wrong). The partition is not parallelised. If you wish, leave the partition on the side…how do you parallelise the rest?

pitsianis · October 11, 2022, 10:39am

I have made a few minor changes in your code here:


const SEQ_THRESH = 1 << 9

@inline function partition!(A, pivot, left,right)
    @inbounds while left <= right
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end

  return (left,right)
end

function quicksort!(A, i=1, j=length(A))
  if j > i
    left, right = partition!(A, A[(j+i) >>> 1], i, j)
    quicksort!(A,i,right)
    quicksort!(A,left,j)
  end
  return
end

function parallel_quicksort!(A,i=1,j=length(A))
  if j-i <= SEQ_THRESH
    quicksort!(A, i, j)
    return
  end
  left, right = partition!(A, A[(j+i) >>> 1], i, j)
  t = Threads.@spawn parallel_quicksort!(A,i,right)
  parallel_quicksort!(A,left,j)
  wait(t)

  return
end

function test(n)
  b = rand(n)

  for i = 1:5
    print("Parallel  : ")
    a1 = copy(b); @time parallel_quicksort!(a1);
    print("Sequential: ")
    a2 = copy(b); @time quicksort!(a2);
    println("")
    @assert a1 == sort(a1)
    @assert a1 == a2
  end

end

test(2^20)

And I do get about 2.33x speedup on a 2-core 2013 MacBookPro with 16GB RAM.

~/p/quicksort.jl> julia -tauto --project=. test_qs.jl
Parallel  :   0.057559 seconds (14.31 k allocations: 1.012 MiB, 7.69% compilation time)
Sequential:   0.114027 seconds

Parallel  :   0.047487 seconds (13.52 k allocations: 992.812 KiB)
Sequential:   0.106023 seconds

Parallel  :   0.047373 seconds (13.52 k allocations: 992.906 KiB)
Sequential:   0.106490 seconds

Parallel  :   0.046449 seconds (13.48 k allocations: 991.656 KiB)
Sequential:   0.109551 seconds

Parallel  :   0.045878 seconds (13.47 k allocations: 991.375 KiB)
Sequential:   0.107051 seconds

Perhaps if you were to undo my changes one by one, it would be very interesting to find out which aspect of your initial code caused the slowdown, (probably some type instability). Somebody wiser to help here? @Elrod @stevengj perhaps.

CHANGES:

parallel and sequential quicksort() return nothing instead of A
abstracted out partition!() and made it inline
asserted external while block as inbounds
timing from a test function, not the REPL

LaurentPlagne · October 11, 2022, 11:03am

You may try to use a different random array for each timing.

pitsianis · October 11, 2022, 4:02pm

It is intentional to test and time the same problem.
The first run is to have the code compiled. The rest to see the variation.

Successive runs of test() generate different problems (of the same size).

gitboy16 · October 11, 2022, 4:10pm

Thank you @pitsianis , it works in my side as well. I am struggling to understand what is wrong with my code. When I use code_warntype it does not show any type instability. It seems that it is your changes 2/ that fixes it, but why is that, I have no idea. It would be great to have the view of a senior developer on the subject. It might be a good way to learn what to do and not to do when programming in Julia.

Topic		Replies	Views
@spawn large memory allocation reduced when some code abstracted out in a function Performance multithreading , memory-allocation	6	875	October 15, 2022
Threading issues General Usage	0	469	December 17, 2018
Is any implementation of sortperm running on multiple threads available? General Usage multithreading , sortperm	0	198	March 20, 2024
Questions about getting started with parallel computing Julia at Scale	18	5742	June 22, 2019
What is the difference between these two implementations? New to Julia question , quicksort	4	141	September 15, 2024

Parallel quicksort openmp

Related topics