Can the output of --track-allocation be trusted with multi-threading enabled?

krcools · March 30, 2022, 2:19pm

Below the output of --track-allocation. The first time when the code is ran with a single thread, the second time when the work is divided and the code is called from 8 threads. Is this output to be trusted? If so where do all the extra allocations come from? In looking at the Windows memory monitor, wild variations in memory consumption can be seen, suggesting there might be something to these profiling outputs…

single-thread

        - function assemblechunk_body!(biop,
        -         test_shapes, test_elements, test_assembly_data,
        -         trial_shapes, trial_elements, trial_assembly_data,
        -         qd, zlocal, store; quadstrat)
        - 
        0     myid = Threads.threadid()
       48     myid == 1 && print("dots out of 10: ")
        0     todo, done, pctg = length(test_elements), 0, 0
        0     for (p,tcell) in enumerate(test_elements)
        0         for (q,bcell) in enumerate(trial_elements)
        - 
        0         fill!(zlocal, 0)
        0         qrule = quadrule(biop, test_shapes, trial_shapes, p, tcell, q, bcell, qd, quadstrat)
 52569600         momintegrals!(biop, test_shapes, trial_shapes, tcell, bcell, zlocal, qrule)
        0         I = length(test_assembly_data[p])
        0         J = length(trial_assembly_data[q])
        0         for j in 1 : J, i in 1 : I
        0             zij = zlocal[i,j]
        0             for (n,b) in trial_assembly_data[q][j]
        0                 zb = zij*b
        0                 for (m,a) in test_assembly_data[p][i]
        0                     store(a*zb, m, n)
        -         end end end end
        - 
        0         done += 1
        0         new_pctg = round(Int, done / todo * 100)
        0         if new_pctg > pctg + 9
      480             myid == 1 && print(".")
        0             pctg = new_pctg
        -     end end
      144     myid == 1 && println("")
        - end

multi-threading

        - function assemblechunk_body!(biop,
        -         test_shapes, test_elements, test_assembly_data,
        -         trial_shapes, trial_elements, trial_assembly_data,
        -         qd, zlocal, store; quadstrat)
        - 
      192     myid = Threads.threadid()
        0     myid == 1 && print("dots out of 10: ")
        0     todo, done, pctg = length(test_elements), 0, 0
      496     for (p,tcell) in enumerate(test_elements)
    43136         for (q,bcell) in enumerate(trial_elements)
        - 
  4491664         fill!(zlocal, 0)
    47040         qrule = quadrule(biop, test_shapes, trial_shapes, p, tcell, q, bcell, qd, quadstrat)
 23514928         momintegrals!(biop, test_shapes, trial_shapes, tcell, bcell, zlocal, qrule)
    23392         I = length(test_assembly_data[p])
    26208         J = length(trial_assembly_data[q])
   678336         for j in 1 : J, i in 1 : I
   105616             zij = zlocal[i,j]
  2133792             for (n,b) in trial_assembly_data[q][j]
   407968                 zb = zij*b
  1984048                 for (m,a) in test_assembly_data[p][i]
  3907536                     store(a*zb, m, n)
        -         end end end end
        - 
        0         done += 1
        0         new_pctg = round(Int, done / todo * 100)
        0         if new_pctg > pctg + 9
        0             myid == 1 && print(".")
      832             pctg = new_pctg
        -     end end
        0     myid == 1 && println("")
        - end

PS: the many allocations on the line with momintegrals! are to be expected. I explicitly rely on dynamic dispatch to choose the appropriate implementation of this function, based on the type of qrule.

goerch · March 30, 2022, 2:23pm

performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub?

It would be probably helpful to see @btimes for a small example and the main threading code?

krcools · March 30, 2022, 2:45pm

This is what I got from @btime , run on the level where the workload is split over the available threads:

single:

312.831 ms (714049 allocations: 75.27 MiB)

multi:

82.234 ms (1084562 allocations: 133.65 MiB)

Some per thread allocation happens on that level, so this seems to suggest the above results are spurious.

I notice that the allocations for momintegrals! in the multi-threading case are lower than in the single-threading case. Could it be possible that --track-allocation has smeared out the count over the neighboring lines of code?

goerch · March 30, 2022, 2:54pm

That indeed does not look that terrible, but 300000 allocations for Threads.@threads could still be a bit too much.

This is kernel functionality so I would be surprised by such a behavior. One more idea: is this 1.7.2?

krcools · March 30, 2022, 3:16pm

The results were on 1.6.2. On julia-1.7.2 I get:

single-threading:

266.927 ms (690355 allocations: 73.10 MiB)

multi-threading:

75.139 ms (970369 allocations: 119.17 MiB)

So again like you mentioned 300_000 allocations extra.

goerch · March 30, 2022, 3:30pm

Thanks for checking. Sorry I was unclear, I asked because we recently saw this unrelated problem in 1.7.2.

Other ideas:

you could double check with a profiler and search for irregularities
depending on your program structure you could try Polyester.jl’s @batch (if this works and behaves differently, please let us now: AFAIK that would be a indicator for the mentioned problem)

krcools · March 30, 2022, 4:10pm

Using Polyester results in a crash:

function assemble!(operator::Operator, test_functions::Space, trial_functions::Space, store,
    threading::Type{Threading{:multi}} = Threading{:multi};
    quadstrat=defaultquadstrat(operator, test_functions, trial_functions))

    P = Threads.nthreads()
    numchunks = P

    splits = [round(Int,s) for s in range(0, stop=numfunctions(test_functions), length=numchunks+1)]

    # Threads.@threads for i in 1:P
    @batch for i in 1:P
        lo, hi = splits[i]+1, splits[i+1]
        lo <= hi || continue
        test_functions_p = subset(test_functions, lo:hi)
        store1 = _OffsetStore(store, lo-1, 0)
        assemblechunk!(operator, test_functions_p, trial_functions, store1, quadstrat=quadstrat)
    end

end

This crashes all the way to the command line:

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x60193d71 -- size at .\array.jl:151 [inlined]
axes at .\abstractarray.jl:95 [inlined]
axes1 at .\abstractarray.jl:116 [inlined]
eachindex at .\abstractarray.jl:335 [inlined]
checkbounds at .\abstractarray.jl:641 [inlined]
checkbounds at .\abstractarray.jl:656 [inlined]
getindex at .\array.jl:867 [inlined]
subset at C:\Users\krcools\.julia\dev\BEAST\src\bases\rtspace.jl:13
in expression starting at C:\Users\krcools\.julia\dev\BEAST\examples\pmchwt.jl:34
size at .\array.jl:151 [inlined]
axes at .\abstractarray.jl:95 [inlined]
axes1 at .\abstractarray.jl:116 [inlined]
eachindex at .\abstractarray.jl:335 [inlined]
checkbounds at .\abstractarray.jl:641 [inlined]
checkbounds at .\abstractarray.jl:656 [inlined]
getindex at .\array.jl:867 [inlined]
subset at C:\Users\krcools\.julia\dev\BEAST\src\bases\rtspace.jl:13
macro expansion at C:\Users\krcools\.julia\dev\BEAST\src\operator.jl:179 [inlined]
#136 at C:\Users\krcools\.julia\packages\Polyester\4Q0NM\src\closure.jl:242 [inlined]
BatchClosure at C:\Users\krcools\.julia\packages\Polyester\4Q0NM\src\batch.jl:12
unknown function (ip: 00000000601950c2)
_call at C:\Users\krcools\.julia\packages\ThreadingUtilities\k3fsO\src\threadtasks.jl:11 [inlined]
ThreadTask at C:\Users\krcools\.julia\packages\ThreadingUtilities\k3fsO\src\threadtasks.jl:29
unknown function (ip: 00000000601970e3)
 reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x60193d71 -- OLATION000601970e3)
 at 0x60193d71 -- size at .\array.jl:151 [inlined]
axes at .\abstractarray.jl:95 [inlined]
axes1 at .\abstractarray.jl:116 [inlined]
eachindex at .\abstractarray.jl:335 [inlined]
checkbounds at .\abstractarray.jl:641 [inlined]
checkbounds at .\abstractarray.jl:656 [inlined]
getindex at .\array.jl:867 [inlined]
subset at C:\Users\krcools\.julia\dev\BEAST\src\bases\rtspace.jl:13
in expression starting at C:\Users\krcools\.julia\dev\BEAST\examples\pmchwt.jl:34
size at .\array.jl:151 [inlined]
axes at .\abstractarray.jl:95 [inlined]
axes1 at .\abstractarray.jl:116 [inlined]
eachindex at .\abstractarray.jl:335 [inlined]
checkbounds at .\abstractarray.jl:641 [inlined]
checkbounds at .\abstractarray.jl:656 [inlined]
getindex at .\array.jl:867 [inlined]
subset at C:\Users\krcools\.julia\dev\BEAST\src\bases\rtspace.jl:13
macro expansion at C:\Users\krcools\.julia\dev\BEAST\src\operator.jl:179 [inlined]
#136 at C:\Users\krcools\.julia\packages\Polyester\4Q0NM\src\closure.jl:242 [inlined]
BatchClosure at C:\Users\krcools\.julia\packages\Polyester\4Q0NM\src\batch.jl:12
unknown function (ip: 00000000601950c2)
ackage_win64/build/src\julia.h:1788 [inlined]
start_task at /cygdrive/c/buildbot/worker/package_win64/build/src\task.c:877
Allocations: 39598028 (Pool: 39587367; BAllocations: 39598028 (Pool: 39587367; Big: 10661); GC: 37
_call at C:\Users\krcools\.julia\packages\ThreadingUtilities\k3fsO\src\threadtasks.jl:11 [inlined]
ThreadTask at C:\Users\krcools\.julia\packages\ThreadingUtilities\k3fsO\src\threadtasks.jl:29
unknown function (ip: 00000000601970e3)

krcools · March 30, 2022, 4:22pm

Ah, I am not allowed to use continue here am I…

The code runs to completion now. Will check the benchmarks…

krcools · March 30, 2022, 4:26pm

I get essentially the name results with @batch:

single

270.386 ms (690342 allocations: 70.96 MiB)

multi

76.493 ms (970272 allocations: 119.16 MiB)

goerch · March 30, 2022, 6:07pm

Then everything should be fine (i.e. I don’t have an idea anymore).

Then this might indeed be a problem. If after all these variations you still see the original problem with 1.7.2 it might make sense to file an issue here?

Topic		Replies	Views
--track-allocation and @threads General Usage question , multithreading	10	765	August 1, 2022
Type-instability because of @threads boxing variables Performance parallel	23	1519	March 25, 2022
Increased allocations when using threads Performance question	20	284	July 11, 2024
Large memory allocation when a loop is threaded, none when run single-threaded General Usage multithreading , memory-allocation	12	447	October 11, 2021
Scaling of @threads for "embarrassingly parallel" problem Performance threads	29	1965	January 20, 2023

Can the output of --track-allocation be trusted with multi-threading enabled?

Related topics