Can the output of --track-allocation be trusted with multi-threading enabled?

Below is the output of --track-allocation: first for a run on a single thread, then for a run where the work is divided and the code is called from 8 threads. Can this output be trusted? If so, where do all the extra allocations come from? Watching the Windows memory monitor, I do see wild variations in memory consumption, which suggests there might be something to these profiling outputs…
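For completeness, this is the standard recipe for producing such per-line counts (the script name and the warm-up call below are placeholders; the exact invocation used here is not shown):

julia --track-allocation=user -t 8 script.jl

and inside the script, to keep compilation allocations out of the counts:

using Profile

run_assembly()                # hypothetical warm-up call; compiles everything first
Profile.clear_malloc_data()   # discard the allocation counts accumulated so far
run_assembly()                # measured run; per-line counts land in *.mem files on exit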

single-thread

        - function assemblechunk_body!(biop,
        -         test_shapes, test_elements, test_assembly_data,
        -         trial_shapes, trial_elements, trial_assembly_data,
        -         qd, zlocal, store; quadstrat)
        - 
        0     myid = Threads.threadid()
       48     myid == 1 && print("dots out of 10: ")
        0     todo, done, pctg = length(test_elements), 0, 0
        0     for (p,tcell) in enumerate(test_elements)
        0         for (q,bcell) in enumerate(trial_elements)
        - 
        0         fill!(zlocal, 0)
        0         qrule = quadrule(biop, test_shapes, trial_shapes, p, tcell, q, bcell, qd, quadstrat)
 52569600         momintegrals!(biop, test_shapes, trial_shapes, tcell, bcell, zlocal, qrule)
        0         I = length(test_assembly_data[p])
        0         J = length(trial_assembly_data[q])
        0         for j in 1 : J, i in 1 : I
        0             zij = zlocal[i,j]
        0             for (n,b) in trial_assembly_data[q][j]
        0                 zb = zij*b
        0                 for (m,a) in test_assembly_data[p][i]
        0                     store(a*zb, m, n)
        -         end end end end
        - 
        0         done += 1
        0         new_pctg = round(Int, done / todo * 100)
        0         if new_pctg > pctg + 9
      480             myid == 1 && print(".")
        0             pctg = new_pctg
        -     end end
      144     myid == 1 && println("")
        - end

multi-threading

        - function assemblechunk_body!(biop,
        -         test_shapes, test_elements, test_assembly_data,
        -         trial_shapes, trial_elements, trial_assembly_data,
        -         qd, zlocal, store; quadstrat)
        - 
      192     myid = Threads.threadid()
        0     myid == 1 && print("dots out of 10: ")
        0     todo, done, pctg = length(test_elements), 0, 0
      496     for (p,tcell) in enumerate(test_elements)
    43136         for (q,bcell) in enumerate(trial_elements)
        - 
  4491664         fill!(zlocal, 0)
    47040         qrule = quadrule(biop, test_shapes, trial_shapes, p, tcell, q, bcell, qd, quadstrat)
 23514928         momintegrals!(biop, test_shapes, trial_shapes, tcell, bcell, zlocal, qrule)
    23392         I = length(test_assembly_data[p])
    26208         J = length(trial_assembly_data[q])
   678336         for j in 1 : J, i in 1 : I
   105616             zij = zlocal[i,j]
  2133792             for (n,b) in trial_assembly_data[q][j]
   407968                 zb = zij*b
  1984048                 for (m,a) in test_assembly_data[p][i]
  3907536                     store(a*zb, m, n)
        -         end end end end
        - 
        0         done += 1
        0         new_pctg = round(Int, done / todo * 100)
        0         if new_pctg > pctg + 9
        0             myid == 1 && print(".")
      832             pctg = new_pctg
        -     end end
        0     myid == 1 && println("")
        - end

PS: the many allocations on the line with momintegrals! are to be expected. I explicitly rely on dynamic dispatch to choose the appropriate implementation of this function, based on the type of qrule.
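To illustrate why such a call shows up with allocations, here is a toy sketch (not the BEAST code): once the compiler cannot devirtualize a call, the dynamically dispatched call boxes its result on every iteration.

using BenchmarkTools

abstract type Rule end
struct RuleA <: Rule end
struct RuleB <: Rule end
struct RuleC <: Rule end
struct RuleD <: Rule end

integrate(::RuleA, x) = 2x
integrate(::RuleB, x) = 3x
integrate(::RuleC, x) = 4x
integrate(::RuleD, x) = 5x

function accumulate_rules(rules::Vector{Rule}, x)
    s = zero(x)
    for r in rules
        s += integrate(r, x)   # runtime dispatch on typeof(r); the boxed result allocates
    end
    return s
end

rules = Rule[RuleA(), RuleB(), RuleC(), RuleD()]
@btime accumulate_rules($rules, 1.0)   # reports allocations, unlike a concretely typed loop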

Could this be performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub?

It would probably be helpful to see @btime results for a small example, and the main threading code.
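For example (the assemble! signature is assumed from the code posted further down; run the same line once with a single thread and once with julia -t 8):

using BenchmarkTools

# assumed entry point, based on the assemble! method shown later in this thread
@btime assemble!($operator, $test_functions, $trial_functions, $store)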

This is what I got from @btime, run at the level where the workload is split over the available threads:

single:

312.831 ms (714049 allocations: 75.27 MiB)

multi:

82.234 ms (1084562 allocations: 133.65 MiB)

Some per-thread allocation is expected at that level, so this seems to suggest the --track-allocation results above are spurious.

I notice that the allocations reported for momintegrals! in the multi-threading case are actually lower than in the single-threading case. Could it be that --track-allocation has smeared the counts out over the neighbouring lines of code?
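One way to cross-check the per-line counts against the @btime totals would be to sum the bytes recorded in all the *.mem files; a quick sketch (untested, using the BEAST source path from the traces below as an example):

# sums the per-line byte counts written by --track-allocation,
# for comparison with the MiB totals reported by @btime
function total_mem_bytes(dir)
    total = 0
    for (root, _, files) in walkdir(dir), f in files
        endswith(f, ".mem") || continue
        for line in eachline(joinpath(root, f))
            tok = split(strip(line))
            isempty(tok) && continue
            n = tryparse(Int, tok[1])   # leading count; "-" on unexecuted lines parses to nothing
            n === nothing || (total += n)
        end
    end
    return total
end

total_mem_bytes(raw"C:\Users\krcools\.julia\dev\BEAST\src")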


That indeed does not look too terrible, but 300_000 extra allocations for the Threads.@threads path could still be a bit much.
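The fixed cost of the threading machinery itself can be checked in isolation with an empty loop (a quick sanity check, not a measurement from this thread):

using BenchmarkTools

# spawns one task per thread and does nothing; typically only a few
# allocations per spawned task, nowhere near 300_000
function empty_threads()
    Threads.@threads for i in 1:Threads.nthreads()
        nothing
    end
end

@btime empty_threads()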

This is core functionality, so I would be surprised by such behavior. One more idea: is this on 1.7.2?

The results above were on 1.6.2. On Julia 1.7.2 I get:

single-threading:

266.927 ms (690355 allocations: 73.10 MiB)

multi-threading:

75.139 ms (970369 allocations: 119.17 MiB)

So again, as you mentioned, about 300_000 extra allocations.


Thanks for checking. Sorry, I was unclear: I asked because we recently saw this unrelated problem in 1.7.2.

Other ideas:

  • you could double-check with a profiler and search for irregularities (a rough sketch follows after this list)
  • depending on your program structure you could try Polyester.jl’s @batch (if this works and behaves differently, please let us know: AFAIK that would be an indicator for the mentioned problem)
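A minimal sketch of the profiler check from the first bullet, reusing the argument names from the assemble! call shown below (illustrative only):

using Profile

Profile.clear()
@profile assemble!(operator, test_functions, trial_functions, store)   # names assumed from the code below
Profile.print(format=:flat, sortedby=:count)   # look for lines that dominate unexpectedly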

Using Polyester results in a crash:

function assemble!(operator::Operator, test_functions::Space, trial_functions::Space, store,
    threading::Type{Threading{:multi}} = Threading{:multi};
    quadstrat=defaultquadstrat(operator, test_functions, trial_functions))

    P = Threads.nthreads()
    numchunks = P

    splits = [round(Int,s) for s in range(0, stop=numfunctions(test_functions), length=numchunks+1)]

    # Threads.@threads for i in 1:P
    @batch for i in 1:P
        lo, hi = splits[i]+1, splits[i+1]
        lo <= hi || continue
        test_functions_p = subset(test_functions, lo:hi)
        store1 = _OffsetStore(store, lo-1, 0)
        assemblechunk!(operator, test_functions_p, trial_functions, store1, quadstrat=quadstrat)
    end

end

This crashes all the way to the command line:

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x60193d71 -- size at .\array.jl:151 [inlined]
axes at .\abstractarray.jl:95 [inlined]
axes1 at .\abstractarray.jl:116 [inlined]
eachindex at .\abstractarray.jl:335 [inlined]
checkbounds at .\abstractarray.jl:641 [inlined]
checkbounds at .\abstractarray.jl:656 [inlined]
getindex at .\array.jl:867 [inlined]
subset at C:\Users\krcools\.julia\dev\BEAST\src\bases\rtspace.jl:13
in expression starting at C:\Users\krcools\.julia\dev\BEAST\examples\pmchwt.jl:34
size at .\array.jl:151 [inlined]
axes at .\abstractarray.jl:95 [inlined]
axes1 at .\abstractarray.jl:116 [inlined]
eachindex at .\abstractarray.jl:335 [inlined]
checkbounds at .\abstractarray.jl:641 [inlined]
checkbounds at .\abstractarray.jl:656 [inlined]
getindex at .\array.jl:867 [inlined]
subset at C:\Users\krcools\.julia\dev\BEAST\src\bases\rtspace.jl:13
macro expansion at C:\Users\krcools\.julia\dev\BEAST\src\operator.jl:179 [inlined]
#136 at C:\Users\krcools\.julia\packages\Polyester\4Q0NM\src\closure.jl:242 [inlined]
BatchClosure at C:\Users\krcools\.julia\packages\Polyester\4Q0NM\src\batch.jl:12
unknown function (ip: 00000000601950c2)
_call at C:\Users\krcools\.julia\packages\ThreadingUtilities\k3fsO\src\threadtasks.jl:11 [inlined]
ThreadTask at C:\Users\krcools\.julia\packages\ThreadingUtilities\k3fsO\src\threadtasks.jl:29
unknown function (ip: 00000000601970e3)
start_task at /cygdrive/c/buildbot/worker/package_win64/build/src\task.c:877
Allocations: 39598028 (Pool: 39587367; Big: 10661); GC: 37

Ah, I am not allowed to use continue here, am I…
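For reference, replacing the early continue with an if block is one way to make the body acceptable to @batch (a sketch of the fix; the exact edit is not shown):

@batch for i in 1:P
    lo, hi = splits[i]+1, splits[i+1]
    if lo <= hi                     # guard instead of `lo <= hi || continue`
        test_functions_p = subset(test_functions, lo:hi)
        store1 = _OffsetStore(store, lo-1, 0)
        assemblechunk!(operator, test_functions_p, trial_functions, store1, quadstrat=quadstrat)
    end
end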

The code runs to completion now. Will check the benchmarks…


I get essentially the same results with @batch:

single

270.386 ms (690342 allocations: 70.96 MiB)

multi

76.493 ms (970272 allocations: 119.16 MiB)

Then everything should be fine (i.e. I don’t have an idea anymore).

Then this might indeed be a problem. If, after all these variations, you still see the original problem on 1.7.2, it might make sense to file an issue here?