@spawn large memory allocation reduced when some code abstracted out in a function

In another discussion,

bizarre behavior that requires further attention and explanation came up.

A recursive textbook implementation of quicksort makes no memory allocations as it modifies its argument input vector. However, the straightforward parallelization by @spawning the first recursive call results in a large number and volume of memory allocations. As a result, the parallel code is slower than the sequential. The above two functions are referred to as v0_quicksort!() and v0_p_quicksort!() in the attached source code.

But if we remove as a separate function the partition!() code, nothing changes with the sequential code. But now the parallel version makes substantially fewer memory allocations, and it is faster 2.3x times on a two-core i5 than the sequential code. These are the v1_quicksort!() and v1_p_quicksort!(), respectively.

Semantically as far as I can tell, the v0 and the v1 codes are identical. So why does the parallel version v0 allocate a few GiB of memory while the parallel v1 version only allocates a few MiB of memory for the same input?

Here is the source:

const SEQ_THRESH = 1 << 9

@inline function partition!(A, pivot, left, right)
@inbounds while left <= right
while A[left] < pivot
left += 1
end
while A[right] > pivot
right -= 1
end
if left <= right
A[left], A[right] = A[right], A[left]
left += 1
right -= 1
end
end

return (left,right)
end

function v0_quicksort!(A, i=1, j=length(A))
if j > i
pivot, left, right = A[(j+i) >>> 1], i, j
@inbounds while left <= right
while A[left] < pivot
left += 1
end
while A[right] > pivot
right -= 1
end
if left <= right
A[left], A[right] = A[right], A[left]
left += 1
right -= 1
end
end
v0_quicksort!(A,i,right)
v0_quicksort!(A,left,j)
end
return
end

function v0_p_quicksort!(A, i=1, j=length(A))
if j-i <= SEQ_THRESH
v0_quicksort!(A, i, j)
else
pivot, left, right = A[(j+i) >>> 1], i, j
@inbounds while left <= right
while A[left] < pivot
left += 1
end
while A[right] > pivot
right -= 1
end
if left <= right
A[left], A[right] = A[right], A[left]
left += 1
right -= 1
end
end
t = Threads.@spawn v0_p_quicksort!(A, i,    right)
v0_p_quicksort!(A, left, j)
wait(t)
end
return
end

function v1_quicksort!(A, i=1, j=length(A))
if j > i
left, right = partition!(A, A[(j+i) >>> 1], i, j)
v1_quicksort!(A,i,right)
v1_quicksort!(A,left,j)
end
return
end

function v1_p_quicksort!(A, i=1, j=length(A))
if j-i <= SEQ_THRESH
v1_quicksort!(A, i, j)
else
left, right = partition!(A, A[(j+i) >>> 1], i, j)
t = Threads.@spawn v1_p_quicksort!(A, i,    right)
v1_p_quicksort!(A, left, j)
wait(t)
end
return
end

function test(n)
b = rand(n)
g = sort(b)

for i = 1:5
print("v0 Parallel  : ")
a1 = copy(b); @time v0_p_quicksort!(a1);
@assert a1 == g
print("v0 Sequential: ")
a2 = copy(b); @time v0_quicksort!(a2);
@assert a2 == g

print("v1 Parallel  : ")
a1 = copy(b); @time v1_p_quicksort!(a1);
@assert a1 == g
print("v1 Sequential: ")
a2 = copy(b); @time v1_quicksort!(a2);
@assert a2 == g
println("")
end

end

test(2^22)

And here is how I run it and the output I get:

~/p/quicksort.jl> julia -tauto --project=. test_qs.jl
v0 Parallel  :   3.924779 seconds (176.85 M allocations: 2.642 GiB, 19.06% gc time, 0.68% compilation time)
v0 Sequential:   0.500592 seconds
v1 Parallel  :   0.217315 seconds (109.48 k allocations: 7.841 MiB, 2.58% compilation time)
v1 Sequential:   0.491915 seconds

v0 Parallel  :   3.943327 seconds (176.83 M allocations: 2.641 GiB, 20.12% gc time)
v0 Sequential:   0.489989 seconds
v1 Parallel  :   0.216695 seconds (108.72 k allocations: 7.800 MiB)
v1 Sequential:   0.490239 seconds

v0 Parallel  :   3.732871 seconds (176.83 M allocations: 2.641 GiB, 15.50% gc time)
v0 Sequential:   0.497291 seconds
v1 Parallel  :   0.214354 seconds (108.73 k allocations: 7.800 MiB)
v1 Sequential:   0.490226 seconds

v0 Parallel  :   4.121204 seconds (176.83 M allocations: 2.641 GiB, 21.79% gc time)
v0 Sequential:   0.500073 seconds
v1 Parallel  :   0.254187 seconds (108.78 k allocations: 7.802 MiB)
v1 Sequential:   0.733544 seconds

v0 Parallel  :   4.161715 seconds (176.83 M allocations: 2.641 GiB, 15.43% gc time)
v0 Sequential:   0.507738 seconds
v1 Parallel  :   0.277322 seconds (108.72 k allocations: 7.800 MiB)
v1 Sequential:   0.659662 seconds
2 Likes

That may be due to bad performance of captured variables, because right and left are mutated in v0 and are constant in v1, and @spawn creates a closure to execute as a new task.
What if you write v0_p_quicksort with

...
iright, ileft = right, left
t = Threads.@spawn v0_p_quicksort!(A, i,    iright)
v0_p_quicksort!(A, ileft, j)

?
That, if my understanding is correct, should make the performance of both versions equal.

5 Likes

Indeed it does solve the problem of the massive memory allocations!

This reassignment seems so counterintuitive. Shouldnβt there be some syntactic sugar to do this automatically and avoid entering a seemingly useless and redundant list of assignments?

2 Likes

Thanks @Vasily_Pisarev for the explanation. I agree with @pitsianis it is counterintuitive.
I believe the Julia issue linked to this discussion is :

Bravo! That would have taken me ages to spot. This is quite an ugly (and old!) issue, so easy to miss.

Yes, thatβs a Github issue on that.

@pitsianis The docs page mentions FastClosures.jl package as a partial solution, although in the mentioned case the closures themselves are constructed implicitly via a macro, soβ¦ I donβt know a non-ugly solution, other than to separate partitioning into its own function.

Just in case: if you want to capture a variable and need it to be mutable, you can wrap it into a mutable container (usually it is Base.RefValue{T}), i.e.:

function make_counter(init::Integer)
cnt = Ref(Int(init))
function count()
val = cnt[]
cnt[] += 1
return val
end
return count
end

JET.jl is a wonderful tool to detect these issues.
It also detects runtime dispatches

using JET
@report_opt test(2^10)

ββββ @ threadingconstructs.jl:258 v0_p_quicksort!(%1, %2, %8)
ββββ runtime dispatch detected: v0_p_quicksort!(%1::Vector{Float64}, %2::Int64, %8::Any)::Nothing
ββββββββββββββββββββββββββββββββββ
βββ @ sort.jl:44 right = Core.Box()
βββ captured variable `right` detected
βββββββββββββββββββββββββββββββββββββββββββββββββββ
βββ @ sort.jl:48 %16 <= %21
βββ runtime dispatch detected: (%16::Int64 <= %21::Any)::Any
βββββββββββββββββββββββββββββββββββββββββββββββββββ
βββ @ sort.jl:52 A[%34]
βββ runtime dispatch detected: (A::Vector{Float64})[%34::Any]::Any
βββββββββββββββββββββββββββββββββββββββββββββββββββ
βββ @ sort.jl:52 %35 > %14
βββ runtime dispatch detected: (%35::Any > %14::Float64)::Any
βββββββββββββββββββββββββββββββββββββββββββββββββββ
βββ @ sort.jl:53 %42 - 1
βββ runtime dispatch detected: (%42::Any - 1)::Any
βββββββββββββββββββββββββββββββββββββββββββββββββββ
βββ @ sort.jl:55 %24 <= %50
βββ runtime dispatch detected: (%24::Int64 <= %50::Any)::Any
βββββββββββββββββββββββββββββββββββββββββββββββββββ
βββ @ sort.jl:56 A[%57]
βββ runtime dispatch detected: (A::Vector{Float64})[%57::Any]::Any
βββββββββββββββββββββββββββββββββββββββββββββββββββ
βββ @ sort.jl:56 A[%24] = %58
βββ runtime dispatch detected: (A::Vector{Float64})[%24::Int64] = %58::Any::Any
βββββββββββββββββββββββββββββββββββββββββββββββββββ
βββ @ sort.jl:56 A[%65] = %59
βββ runtime dispatch detected: (A::Vector{Float64})[%65::Any] = %59::Float64::Any
βββββββββββββββββββββββββββββββββββββββββββββββββββ
βββ @ sort.jl:58 %78 - 1
βββ runtime dispatch detected: (%78::Any - 1)::Any
βββββββββββββββββββββββββββββββββββββββββββββββββββ```
2 Likes