@spawn large memory allocation reduced when some code is abstracted out into a function

In another discussion, bizarre behavior came up that requires further attention and explanation.

A recursive textbook implementation of quicksort makes no memory allocations, since it sorts its input vector in place. However, the straightforward parallelization, @spawning the first recursive call, results in a large number and volume of memory allocations, and the parallel code ends up slower than the sequential one. These two functions are referred to as v0_quicksort!() and v0_p_quicksort!() in the attached source code.

But if we extract the partition!() code into a separate function, nothing changes for the sequential code, while the parallel version makes substantially fewer memory allocations and runs about 2.3x faster than the sequential code on a two-core i5. These are v1_quicksort!() and v1_p_quicksort!(), respectively.

Semantically, as far as I can tell, the v0 and v1 codes are identical. So why does the parallel v0 version allocate a few GiB of memory while the parallel v1 version allocates only a few MiB for the same input?

Here is the source:


const SEQ_THRESH = 1 << 9

@inline function partition!(A, pivot, left, right)
  @inbounds while left <= right
    while A[left] < pivot
      left += 1
    end
    while A[right] > pivot
      right -= 1
    end
    if left <= right
      A[left], A[right] = A[right], A[left]
      left += 1
      right -= 1
    end
  end
  return (left, right)
end

function v0_quicksort!(A, i=1, j=length(A))
  if j > i
    pivot, left, right = A[(j+i) >>> 1], i, j
    @inbounds while left <= right
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end
    v0_quicksort!(A,i,right)
    v0_quicksort!(A,left,j)
  end
  return
end

function v0_p_quicksort!(A, i=1, j=length(A))
  if j-i <= SEQ_THRESH
    v0_quicksort!(A, i, j)
  else
    pivot, left, right = A[(j+i) >>> 1], i, j
    @inbounds while left <= right
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end
    t = Threads.@spawn v0_p_quicksort!(A, i,    right)
                       v0_p_quicksort!(A, left, j)
    wait(t)
  end
  return
end

function v1_quicksort!(A, i=1, j=length(A))
  if j > i
    left, right = partition!(A, A[(j+i) >>> 1], i, j)
    v1_quicksort!(A,i,right)
    v1_quicksort!(A,left,j)
  end
  return
end

function v1_p_quicksort!(A, i=1, j=length(A))
  if j-i <= SEQ_THRESH
    v1_quicksort!(A, i, j)
  else
    left, right = partition!(A, A[(j+i) >>> 1], i, j)
    t = Threads.@spawn v1_p_quicksort!(A, i,    right)
                       v1_p_quicksort!(A, left, j)
    wait(t)
  end
  return
end


function test(n)
  b = rand(n)
  g = sort(b)

  for i = 1:5
    print("v0 Parallel  : ")
    a1 = copy(b); @time v0_p_quicksort!(a1);
    @assert a1 == g
    print("v0 Sequential: ")
    a2 = copy(b); @time v0_quicksort!(a2);
    @assert a2 == g

    print("v1 Parallel  : ")
    a1 = copy(b); @time v1_p_quicksort!(a1);
    @assert a1 == g
    print("v1 Sequential: ")
    a2 = copy(b); @time v1_quicksort!(a2);
    @assert a2 == g
    println("")
  end

end

test(2^22)

And here is how I run it and the output I get:

~/p/quicksort.jl> julia -tauto --project=. test_qs.jl
v0 Parallel  :   3.924779 seconds (176.85 M allocations: 2.642 GiB, 19.06% gc time, 0.68% compilation time)
v0 Sequential:   0.500592 seconds
v1 Parallel  :   0.217315 seconds (109.48 k allocations: 7.841 MiB, 2.58% compilation time)
v1 Sequential:   0.491915 seconds

v0 Parallel  :   3.943327 seconds (176.83 M allocations: 2.641 GiB, 20.12% gc time)
v0 Sequential:   0.489989 seconds
v1 Parallel  :   0.216695 seconds (108.72 k allocations: 7.800 MiB)
v1 Sequential:   0.490239 seconds

v0 Parallel  :   3.732871 seconds (176.83 M allocations: 2.641 GiB, 15.50% gc time)
v0 Sequential:   0.497291 seconds
v1 Parallel  :   0.214354 seconds (108.73 k allocations: 7.800 MiB)
v1 Sequential:   0.490226 seconds

v0 Parallel  :   4.121204 seconds (176.83 M allocations: 2.641 GiB, 21.79% gc time)
v0 Sequential:   0.500073 seconds
v1 Parallel  :   0.254187 seconds (108.78 k allocations: 7.802 MiB)
v1 Sequential:   0.733544 seconds

v0 Parallel  :   4.161715 seconds (176.83 M allocations: 2.641 GiB, 15.43% gc time)
v0 Sequential:   0.507738 seconds
v1 Parallel  :   0.277322 seconds (108.72 k allocations: 7.800 MiB)
v1 Sequential:   0.659662 seconds

That may be due to the poor performance of captured variables: right and left are mutated (assigned repeatedly) in v0 but assigned only once in v1, and @spawn creates a closure to execute as a new task.
What if you write v0_p_quicksort! with

...
iright, ileft = right, left
t = Threads.@spawn v0_p_quicksort!(A, i,    iright)
                   v0_p_quicksort!(A, ileft, j)

?
That, if my understanding is correct, should make the performance of both versions equal.
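
To see the boxing in isolation, here is a minimal sketch (the function names are mine, not from the original code). As far as I understand the lowering, a captured local that is assigned more than once ends up in a Core.Box even when all assignments happen before the closure is created, while a local assigned exactly once is captured as a plain value.

function boxed_capture()
  x = 1
  x += 1                    # x is assigned twice, so the @spawn closure captures it as a Core.Box
  t = Threads.@spawn x + 1  # x has an unknown type inside the task
  return fetch(t)
end

function unboxed_capture()
  x = 1
  x += 1
  y = x                     # y is assigned exactly once before the closure is created
  t = Threads.@spawn y + 1  # captured as a plain Int, no boxing
  return fetch(t)
end

The boxing is visible with @code_lowered boxed_capture() (look for Core.Box), or with JET as shown further below. It also explains the numbers above: because the @spawn closure in v0_p_quicksort! captures left and right, which are reassigned on every iteration of the partitioning loop, they are stored in a Core.Box for the whole function body, so the comparisons and increments in the loop become runtime dispatches and allocate on every iteration.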


Indeed it does solve the problem of the massive memory allocations!

This reassignment seems so counterintuitive. Shouldn’t there be some syntactic sugar to do this automatically and avoid writing a seemingly useless and redundant list of assignments?


Thanks @Vasily_Pisarev for the explanation. I agree with @pitsianis that it is counterintuitive.
I believe the Julia issue linked to this discussion is:

Bravo! That would have taken me ages to spot. This is quite an ugly (and old!) issue, so easy to miss.

Yes, that’s a GitHub issue on that.

@pitsianis The docs page mentions the FastClosures.jl package as a partial solution, although in the case at hand the closures are constructed implicitly by a macro, so… I don’t know a non-ugly solution other than separating the partitioning into its own function.
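
For completeness, if I remember that docs section correctly, it also suggests a let block that rebinds the captured variables. A sketch of just the spawn site in v0_p_quicksort!, assuming everything else stays the same:

let left = left, right = right
  # The let header creates fresh, assigned-once bindings, so the @spawn
  # closure captures these instead of the mutated (and therefore boxed)
  # outer left and right.
  t = Threads.@spawn v0_p_quicksort!(A, i, right)
  v0_p_quicksort!(A, left, j)
  wait(t)
end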

Just in case: if you want to capture a variable and need it to be mutable, you can wrap it in a mutable container (usually Base.RefValue{T}), e.g.:

function make_counter(init::Integer)
    cnt = Ref(Int(init))
    function count()
        val = cnt[]
        cnt[] += 1
        return val
    end
    return count
end
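
For example:

counter = make_counter(10)
counter()   # returns 10
counter()   # returns 11

Note that cnt itself is bound only once (to the Ref), so it is not boxed; all mutation goes through the Ref’s contents.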

JET.jl is a wonderful tool for detecting these issues; it also reports the resulting runtime dispatches:

using JET
@report_opt test(2^10)

│││┌ @ threadingconstructs.jl:258 v0_p_quicksort!(%1, %2, %8)
││││ runtime dispatch detected: v0_p_quicksort!(%1::Vector{Float64}, %2::Int64, %8::Any)::Nothing
│││└──────────────────────────────
││┌ @ sort.jl:44 right = Core.Box()
│││ captured variable `right` detected
││└────────────────────────────────────────────────
││┌ @ sort.jl:48 %16 <= %21
│││ runtime dispatch detected: (%16::Int64 <= %21::Any)::Any
││└────────────────────────────────────────────────
││┌ @ sort.jl:52 A[%34]
│││ runtime dispatch detected: (A::Vector{Float64})[%34::Any]::Any
││└────────────────────────────────────────────────
││┌ @ sort.jl:52 %35 > %14
│││ runtime dispatch detected: (%35::Any > %14::Float64)::Any
││└────────────────────────────────────────────────
││┌ @ sort.jl:53 %42 - 1
│││ runtime dispatch detected: (%42::Any - 1)::Any
││└────────────────────────────────────────────────
││┌ @ sort.jl:55 %24 <= %50
│││ runtime dispatch detected: (%24::Int64 <= %50::Any)::Any
││└────────────────────────────────────────────────
││┌ @ sort.jl:56 A[%57]
│││ runtime dispatch detected: (A::Vector{Float64})[%57::Any]::Any
││└────────────────────────────────────────────────
││┌ @ sort.jl:56 A[%24] = %58
│││ runtime dispatch detected: (A::Vector{Float64})[%24::Int64] = %58::Any::Any
││└────────────────────────────────────────────────
││┌ @ sort.jl:56 A[%65] = %59
│││ runtime dispatch detected: (A::Vector{Float64})[%65::Any] = %59::Float64::Any
││└────────────────────────────────────────────────
││┌ @ sort.jl:58 %78 - 1
│││ runtime dispatch detected: (%78::Any - 1)::Any
││└────────────────────────────────────────────────