@spawn: large memory allocations reduced when some code is abstracted out into a function

In another discussion, some bizarre behavior came up that requires further attention and explanation.

A recursive textbook implementation of quicksort makes no memory allocations, as it sorts its input vector in place. However, the straightforward parallelization of @spawning the first recursive call results in a large number and volume of memory allocations. As a result, the parallel code is slower than the sequential one. These two functions are referred to as v0_quicksort!() and v0_p_quicksort!() in the attached source code.

But if we factor the partition!() code out into a separate function, nothing changes for the sequential code. The parallel version, however, now makes substantially fewer memory allocations and is 2.3x faster than the sequential code on a two-core i5. These are v1_quicksort!() and v1_p_quicksort!(), respectively.

Semantically, as far as I can tell, the v0 and v1 codes are identical. So why does the parallel v0 version allocate a few GiB of memory while the parallel v1 version allocates only a few MiB for the same input?

Here is the source:


const SEQ_THRESH = 1 << 9

@inline function partition!(A, pivot, left, right)
  @inbounds while left <= right
    while A[left] < pivot
      left += 1
    end
    while A[right] > pivot
      right -= 1
    end
    if left <= right
      A[left], A[right] = A[right], A[left]
      left += 1
      right -= 1
    end
  end
  return (left, right)
end

function v0_quicksort!(A, i=1, j=length(A))
  if j > i
    pivot, left, right = A[(j+i) >>> 1], i, j
    @inbounds while left <= right
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end
    v0_quicksort!(A,i,right)
    v0_quicksort!(A,left,j)
  end
  return
end

function v0_p_quicksort!(A, i=1, j=length(A))
  if j-i <= SEQ_THRESH
    v0_quicksort!(A, i, j)
  else
    pivot, left, right = A[(j+i) >>> 1], i, j
    @inbounds while left <= right
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end
    t = Threads.@spawn v0_p_quicksort!(A, i,    right)
                       v0_p_quicksort!(A, left, j)
    wait(t)
  end
  return
end

function v1_quicksort!(A, i=1, j=length(A))
  if j > i
    left, right = partition!(A, A[(j+i) >>> 1], i, j)
    v1_quicksort!(A,i,right)
    v1_quicksort!(A,left,j)
  end
  return
end

function v1_p_quicksort!(A, i=1, j=length(A))
  if j-i <= SEQ_THRESH
    v1_quicksort!(A, i, j)
  else
    left, right = partition!(A, A[(j+i) >>> 1], i, j)
    t = Threads.@spawn v1_p_quicksort!(A, i,    right)
                       v1_p_quicksort!(A, left, j)
    wait(t)
  end
  return
end


function test(n)
  b = rand(n)
  g = sort(b)

  for i = 1:5
    print("v0 Parallel  : ")
    a1 = copy(b); @time v0_p_quicksort!(a1);
    @assert a1 == g
    print("v0 Sequential: ")
    a2 = copy(b); @time v0_quicksort!(a2);
    @assert a2 == g

    print("v1 Parallel  : ")
    a1 = copy(b); @time v1_p_quicksort!(a1);
    @assert a1 == g
    print("v1 Sequential: ")
    a2 = copy(b); @time v1_quicksort!(a2);
    @assert a2 == g
    println("")
  end

end

test(2^22)

And here is how I run it and the output I get:

~/p/quicksort.jl> julia -tauto --project=. test_qs.jl
v0 Parallel  :   3.924779 seconds (176.85 M allocations: 2.642 GiB, 19.06% gc time, 0.68% compilation time)
v0 Sequential:   0.500592 seconds
v1 Parallel  :   0.217315 seconds (109.48 k allocations: 7.841 MiB, 2.58% compilation time)
v1 Sequential:   0.491915 seconds

v0 Parallel  :   3.943327 seconds (176.83 M allocations: 2.641 GiB, 20.12% gc time)
v0 Sequential:   0.489989 seconds
v1 Parallel  :   0.216695 seconds (108.72 k allocations: 7.800 MiB)
v1 Sequential:   0.490239 seconds

v0 Parallel  :   3.732871 seconds (176.83 M allocations: 2.641 GiB, 15.50% gc time)
v0 Sequential:   0.497291 seconds
v1 Parallel  :   0.214354 seconds (108.73 k allocations: 7.800 MiB)
v1 Sequential:   0.490226 seconds

v0 Parallel  :   4.121204 seconds (176.83 M allocations: 2.641 GiB, 21.79% gc time)
v0 Sequential:   0.500073 seconds
v1 Parallel  :   0.254187 seconds (108.78 k allocations: 7.802 MiB)
v1 Sequential:   0.733544 seconds

v0 Parallel  :   4.161715 seconds (176.83 M allocations: 2.641 GiB, 15.43% gc time)
v0 Sequential:   0.507738 seconds
v1 Parallel  :   0.277322 seconds (108.72 k allocations: 7.800 MiB)
v1 Sequential:   0.659662 seconds

That may be due to the bad performance of captured variables: right and left are mutated in v0 but are effectively constant in v1, and @spawn creates a closure to execute as a new task.
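The boxing can be observed directly in the closure's field types. A minimal sketch (the functions boxed and unboxed below are hypothetical illustrations, not part of the benchmark): a captured variable that is assigned more than once is lowered to a Core.Box, while a single-assignment copy is captured with a concrete type.

```julia
# Hypothetical helpers illustrating closure-variable boxing.
function boxed()
    x = 1
    f = () -> x   # closure captures `x`
    x = 2         # a second assignment forces `x` into a Core.Box
    return f
end

function unboxed()
    x = 1
    y = x         # copy into a variable that is assigned only once
    f = () -> y   # closure captures the single-assignment `y`
    x = 2
    return f
end

# The boxed closure stores a Core.Box; the other stores a plain Int.
println(fieldtype(typeof(boxed()), :x))    # Core.Box
println(fieldtype(typeof(unboxed()), :y))  # Int64 (on a 64-bit machine)
```

Every use of the boxed variable then goes through an untyped container, which is what produces the runtime dispatches and allocations in v0.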
What if you write v0_p_quicksort with

...
iright, ileft = right, left
t = Threads.@spawn v0_p_quicksort!(A, i,    iright)
                   v0_p_quicksort!(A, ileft, j)

?
That, if my understanding is correct, should make the performance of both versions equal.

Indeed it does solve the problem of the massive memory allocations!

This reassignment seems so counterintuitive. Shouldn’t there be some syntactic sugar to do this automatically and avoid writing a seemingly useless and redundant list of assignments?
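There is, in fact, a piece of sugar for exactly this (assuming Julia ≥ 1.4): Threads.@spawn supports $-interpolation, which splices the current value of a variable into the task rather than capturing the variable itself. A minimal sketch (demo is a hypothetical name):

```julia
using Base.Threads

# `$x` copies the value of `x` at spawn time into the task, so the later
# reassignment neither affects the task nor forces boxing of `x`.
function demo()
    results = Int[]
    x = 1
    t = Threads.@spawn push!(results, $x)
    x = 2
    wait(t)
    return results  # [1]: the task saw the value at spawn time
end
```

With that, the spawned recursive call could presumably be written as Threads.@spawn v0_p_quicksort!(A, i, $right), avoiding the intermediate iright/ileft assignments in the same way as the manual copies.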

Thanks @Vasily_Pisarev for the explanation. I agree with @pitsianis that it is counterintuitive.
I believe the Julia issue linked to this discussion is:

Bravo! That would have taken me ages to spot. This is quite an ugly (and old!) issue, so easy to miss.

Yes, that’s the GitHub issue on that.

@pitsianis The docs page mentions the FastClosures.jl package as a partial solution, although in this case the closures themselves are constructed implicitly by a macro, so… I don’t know a non-ugly solution other than separating the partitioning into its own function.

Just in case: if you want to capture a variable and need it to be mutable, you can wrap it in a mutable container (usually a Base.RefValue{T}), e.g.:

function make_counter(init::Integer)
    cnt = Ref(Int(init))
    function count()
        val = cnt[]
        cnt[] += 1
        return val
    end
    return count
end
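A quick usage sketch of the counter above. Since cnt is assigned only once, the closure captures a concretely typed RefValue{Int} rather than a Core.Box, while the *contents* of the Ref remain mutable:

```julia
# Definition repeated from above so the snippet is self-contained.
function make_counter(init::Integer)
    cnt = Ref(Int(init))
    function count()
        val = cnt[]
        cnt[] += 1   # mutate the Ref's contents, not the binding `cnt`
        return val
    end
    return count
end

c = make_counter(10)
println(c())  # 10
println(c())  # 11
println(c())  # 12
```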

JET.jl is a wonderful tool for detecting these issues. It also reports the resulting runtime dispatches:

using JET
@report_opt test(2^10)

β”‚β”‚β”‚β”Œ @ threadingconstructs.jl:258 v0_p_quicksort!(%1, %2, %8)
β”‚β”‚β”‚β”‚ runtime dispatch detected: v0_p_quicksort!(%1::Vector{Float64}, %2::Int64, %8::Any)::Nothing
│││└──────────────────────────────
β”‚β”‚β”Œ @ sort.jl:44 right = Core.Box()
β”‚β”‚β”‚ captured variable `right` detected
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:48 %16 <= %21
β”‚β”‚β”‚ runtime dispatch detected: (%16::Int64 <= %21::Any)::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:52 A[%34]
β”‚β”‚β”‚ runtime dispatch detected: (A::Vector{Float64})[%34::Any]::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:52 %35 > %14
β”‚β”‚β”‚ runtime dispatch detected: (%35::Any > %14::Float64)::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:53 %42 - 1
β”‚β”‚β”‚ runtime dispatch detected: (%42::Any - 1)::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:55 %24 <= %50
β”‚β”‚β”‚ runtime dispatch detected: (%24::Int64 <= %50::Any)::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:56 A[%57]
β”‚β”‚β”‚ runtime dispatch detected: (A::Vector{Float64})[%57::Any]::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:56 A[%24] = %58
β”‚β”‚β”‚ runtime dispatch detected: (A::Vector{Float64})[%24::Int64] = %58::Any::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:56 A[%65] = %59
β”‚β”‚β”‚ runtime dispatch detected: (A::Vector{Float64})[%65::Any] = %59::Float64::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:58 %78 - 1
β”‚β”‚β”‚ runtime dispatch detected: (%78::Any - 1)::Any
││└────────────────────────────────────────────────