@spawn: large memory allocations reduced when some code is abstracted out into a function

In another discussion, some bizarre behavior came up that requires further attention and explanation.

A recursive textbook implementation of quicksort makes no memory allocations, as it sorts its input vector in place. However, the straightforward parallelization of @spawning the first recursive call results in a large number and volume of memory allocations. As a result, the parallel code is slower than the sequential one. These two functions are referred to as v0_quicksort!() and v0_p_quicksort!() in the attached source code.

But if we factor the partition!() code out into a separate function, nothing changes for the sequential code. The parallel version, however, now makes substantially fewer memory allocations and is 2.3x faster than the sequential code on a two-core i5. These are v1_quicksort!() and v1_p_quicksort!(), respectively.

Semantically, as far as I can tell, the v0 and v1 codes are identical. So why does the parallel v0 version allocate a few GiB of memory while the parallel v1 version allocates only a few MiB for the same input?

Here is the source:


const SEQ_THRESH = 1 << 9

@inline function partition!(A, pivot, left, right)
  @inbounds while left <= right
    while A[left] < pivot
      left += 1
    end
    while A[right] > pivot
      right -= 1
    end
    if left <= right
      A[left], A[right] = A[right], A[left]
      left += 1
      right -= 1
    end
  end
  return (left, right)
end

function v0_quicksort!(A, i=1, j=length(A))
  if j > i
    pivot, left, right = A[(j+i) >>> 1], i, j
    @inbounds while left <= right
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end
    v0_quicksort!(A,i,right)
    v0_quicksort!(A,left,j)
  end
  return
end

function v0_p_quicksort!(A, i=1, j=length(A))
  if j-i <= SEQ_THRESH
    v0_quicksort!(A, i, j)
  else
    pivot, left, right = A[(j+i) >>> 1], i, j
    @inbounds while left <= right
      while A[left] < pivot
        left += 1
      end
      while A[right] > pivot
        right -= 1
      end
      if left <= right
        A[left], A[right] = A[right], A[left]
        left += 1
        right -= 1
      end
    end
    t = Threads.@spawn v0_p_quicksort!(A, i,    right)
                       v0_p_quicksort!(A, left, j)
    wait(t)
  end
  return
end

function v1_quicksort!(A, i=1, j=length(A))
  if j > i
    left, right = partition!(A, A[(j+i) >>> 1], i, j)
    v1_quicksort!(A,i,right)
    v1_quicksort!(A,left,j)
  end
  return
end

function v1_p_quicksort!(A, i=1, j=length(A))
  if j-i <= SEQ_THRESH
    v1_quicksort!(A, i, j)
  else
    left, right = partition!(A, A[(j+i) >>> 1], i, j)
    t = Threads.@spawn v1_p_quicksort!(A, i,    right)
                       v1_p_quicksort!(A, left, j)
    wait(t)
  end
  return
end


function test(n)
  b = rand(n)
  g = sort(b)

  for i = 1:5
    print("v0 Parallel  : ")
    a1 = copy(b); @time v0_p_quicksort!(a1);
    @assert a1 == g
    print("v0 Sequential: ")
    a2 = copy(b); @time v0_quicksort!(a2);
    @assert a2 == g

    print("v1 Parallel  : ")
    a1 = copy(b); @time v1_p_quicksort!(a1);
    @assert a1 == g
    print("v1 Sequential: ")
    a2 = copy(b); @time v1_quicksort!(a2);
    @assert a2 == g
    println("")
  end

end

test(2^22)

And here is how I run it and the output I get:

~/p/quicksort.jl> julia -tauto --project=. test_qs.jl
v0 Parallel  :   3.924779 seconds (176.85 M allocations: 2.642 GiB, 19.06% gc time, 0.68% compilation time)
v0 Sequential:   0.500592 seconds
v1 Parallel  :   0.217315 seconds (109.48 k allocations: 7.841 MiB, 2.58% compilation time)
v1 Sequential:   0.491915 seconds

v0 Parallel  :   3.943327 seconds (176.83 M allocations: 2.641 GiB, 20.12% gc time)
v0 Sequential:   0.489989 seconds
v1 Parallel  :   0.216695 seconds (108.72 k allocations: 7.800 MiB)
v1 Sequential:   0.490239 seconds

v0 Parallel  :   3.732871 seconds (176.83 M allocations: 2.641 GiB, 15.50% gc time)
v0 Sequential:   0.497291 seconds
v1 Parallel  :   0.214354 seconds (108.73 k allocations: 7.800 MiB)
v1 Sequential:   0.490226 seconds

v0 Parallel  :   4.121204 seconds (176.83 M allocations: 2.641 GiB, 21.79% gc time)
v0 Sequential:   0.500073 seconds
v1 Parallel  :   0.254187 seconds (108.78 k allocations: 7.802 MiB)
v1 Sequential:   0.733544 seconds

v0 Parallel  :   4.161715 seconds (176.83 M allocations: 2.641 GiB, 15.43% gc time)
v0 Sequential:   0.507738 seconds
v1 Parallel  :   0.277322 seconds (108.72 k allocations: 7.800 MiB)
v1 Sequential:   0.659662 seconds

That may be due to the bad performance of captured variables: right and left are mutated in v0 but are effectively constant in v1, and @spawn creates a closure to execute as a new task.
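The boxing can be observed directly in the closure's field types. A minimal sketch (the functions boxed and unboxed below are hypothetical illustrations, not part of the benchmark): a captured variable that is assigned more than once is lowered to a Core.Box, while a single-assignment copy is captured with a concrete type.

```julia
# Hypothetical helpers illustrating closure-variable boxing.
function boxed()
    x = 1
    f = () -> x   # closure captures `x`
    x = 2         # a second assignment forces `x` into a Core.Box
    return f
end

function unboxed()
    x = 1
    y = x         # copy into a variable that is assigned only once
    f = () -> y   # closure captures the single-assignment `y`
    x = 2
    return f
end

# The boxed closure stores a Core.Box; the other stores a plain Int.
println(fieldtype(typeof(boxed()), :x))    # Core.Box
println(fieldtype(typeof(unboxed()), :y))  # Int64 (on a 64-bit machine)
```

Every use of the boxed variable then goes through an untyped container, which is what produces the runtime dispatches and allocations in v0.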
What if you write v0_p_quicksort with

...
iright, ileft = right, left
t = Threads.@spawn v0_p_quicksort!(A, i,    iright)
                   v0_p_quicksort!(A, ileft, j)

?
That, if my understanding is correct, should make the performance of both versions equal.

Indeed it does solve the problem of the massive memory allocations!

This reassignment seems so counterintuitive. Shouldn’t there be some syntactic sugar to do this automatically and avoid writing a seemingly useless and redundant list of assignments?
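There is, in fact, a piece of sugar for exactly this (assuming Julia ≥ 1.4): Threads.@spawn supports $-interpolation, which splices the current value of a variable into the task rather than capturing the variable itself. A minimal sketch (demo is a hypothetical name):

```julia
using Base.Threads

# `$x` copies the value of `x` at spawn time into the task, so the later
# reassignment neither affects the task nor forces boxing of `x`.
function demo()
    results = Int[]
    x = 1
    t = Threads.@spawn push!(results, $x)
    x = 2
    wait(t)
    return results  # [1]: the task saw the value at spawn time
end
```

With that, the spawned recursive call could presumably be written as Threads.@spawn v0_p_quicksort!(A, i, $right), avoiding the intermediate iright/ileft assignments in the same way as the manual copies.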

Thanks @Vasily_Pisarev for the explanation. I agree with @pitsianis that it is counterintuitive.
I believe the Julia issue linked to this discussion is:

Bravo! That would have taken me ages to spot. This is quite an ugly (and old!) issue, so easy to miss.

Yes, that’s the GitHub issue on that.

@pitsianis The docs page mentions the FastClosures.jl package as a partial solution, although in this case the closures themselves are constructed implicitly by a macro, so… I don’t know a non-ugly solution other than separating the partitioning into its own function.

Just in case: if you want to capture a variable and need it to be mutable, you can wrap it in a mutable container (usually a Base.RefValue{T}), e.g.:

function make_counter(init::Integer)
    cnt = Ref(Int(init))
    function count()
        val = cnt[]
        cnt[] += 1
        return val
    end
    return count
end
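A quick usage sketch of the counter above. Since cnt is assigned only once, the closure captures a concretely typed RefValue{Int} rather than a Core.Box, while the *contents* of the Ref remain mutable:

```julia
# Definition repeated from above so the snippet is self-contained.
function make_counter(init::Integer)
    cnt = Ref(Int(init))
    function count()
        val = cnt[]
        cnt[] += 1   # mutate the Ref's contents, not the binding `cnt`
        return val
    end
    return count
end

c = make_counter(10)
println(c())  # 10
println(c())  # 11
println(c())  # 12
```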

JET.jl is a wonderful tool for detecting these issues. It also reports the resulting runtime dispatches:

using JET
@report_opt test(2^10)

β”‚β”‚β”‚β”Œ @ threadingconstructs.jl:258 v0_p_quicksort!(%1, %2, %8)
β”‚β”‚β”‚β”‚ runtime dispatch detected: v0_p_quicksort!(%1::Vector{Float64}, %2::Int64, %8::Any)::Nothing
│││└──────────────────────────────
β”‚β”‚β”Œ @ sort.jl:44 right = Core.Box()
β”‚β”‚β”‚ captured variable `right` detected
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:48 %16 <= %21
β”‚β”‚β”‚ runtime dispatch detected: (%16::Int64 <= %21::Any)::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:52 A[%34]
β”‚β”‚β”‚ runtime dispatch detected: (A::Vector{Float64})[%34::Any]::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:52 %35 > %14
β”‚β”‚β”‚ runtime dispatch detected: (%35::Any > %14::Float64)::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:53 %42 - 1
β”‚β”‚β”‚ runtime dispatch detected: (%42::Any - 1)::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:55 %24 <= %50
β”‚β”‚β”‚ runtime dispatch detected: (%24::Int64 <= %50::Any)::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:56 A[%57]
β”‚β”‚β”‚ runtime dispatch detected: (A::Vector{Float64})[%57::Any]::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:56 A[%24] = %58
β”‚β”‚β”‚ runtime dispatch detected: (A::Vector{Float64})[%24::Int64] = %58::Any::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:56 A[%65] = %59
β”‚β”‚β”‚ runtime dispatch detected: (A::Vector{Float64})[%65::Any] = %59::Float64::Any
││└────────────────────────────────────────────────
β”‚β”‚β”Œ @ sort.jl:58 %78 - 1
β”‚β”‚β”‚ runtime dispatch detected: (%78::Any - 1)::Any
││└────────────────────────────────────────────────