Container broadcasting memory/performance hit in 0.6?

marius311 · March 13, 2017, 11:09pm

I don’t understand why the latter two calls below use twice the memory and are much slower than the first call. I had thought this type of broadcasting should work efficiently now, but maybe I misunderstood? I’ve checked on both v0.6.0-pre.alpha and master.

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _  |  |
  | | |_| | | | (_| |  |  Version 0.6.0-pre.alpha.137 (2017-03-13 14:27 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit a132ae2* (0 days old master)
|__/                   |  x86_64-linux-gnu

julia> a = b = rand(100,100);

julia> using BenchmarkTools

julia> @benchmark a .+ b .+ a
BenchmarkTools.Trial: 
  memory estimate:  79.33 KiB
  allocs estimate:  27
  --------------
  minimum time:     23.414 μs (0.00% GC)
  median time:      27.117 μs (0.00% GC)
  mean time:        31.874 μs (8.69% GC)
  maximum time:     1.079 ms (92.62% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark (a,) .* (b,) .* (a,)
BenchmarkTools.Trial: 
  memory estimate:  156.45 KiB
  allocs estimate:  7
  --------------
  minimum time:     75.679 μs (0.00% GC)
  median time:      84.202 μs (0.00% GC)
  mean time:        179.888 μs (3.84% GC)
  maximum time:     23.779 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark [a] .* [b] .* [a]
BenchmarkTools.Trial: 
  memory estimate:  158.34 KiB
  allocs estimate:  56
  --------------
  minimum time:     124.935 μs (0.00% GC)
  median time:      134.950 μs (0.00% GC)
  mean time:        151.803 μs (4.55% GC)
  maximum time:     6.196 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

yuyichao · March 13, 2017, 11:19pm

You are not benchmarking what you think you are benchmarking. You are benchmarking global variable access and dynamic dispatch.

julia> @benchmark $a .+ $b .+ $a
BenchmarkTools.Trial:
  memory estimate:  78.73 KiB
  allocs estimate:  18
  --------------
  minimum time:     10.818 μs (0.00% GC)
  median time:      11.459 μs (0.00% GC)
  mean time:        14.529 μs (10.52% GC)
  maximum time:     671.217 μs (90.69% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark ($a,) .+ ($b,) .+ ($a,)
BenchmarkTools.Trial:
  memory estimate:  156.45 KiB
  allocs estimate:  7
  --------------
  minimum time:     9.232 μs (0.00% GC)
  median time:      11.862 μs (0.00% GC)
  mean time:        17.192 μs (18.78% GC)
  maximum time:     792.954 μs (92.01% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark [$a] .+ [$b] .+ [$a]
BenchmarkTools.Trial:
  memory estimate:  156.78 KiB
  allocs estimate:  8
  --------------
  minimum time:     9.348 μs (0.00% GC)
  median time:      12.399 μs (0.00% GC)
  mean time:        19.361 μs (21.76% GC)
  maximum time:     1.003 ms (93.81% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

It’s unclear what you are comparing. The second and third benchmarks don’t use fusion at all (I mean not for the actual 100x100 addition).

marius311 · March 13, 2017, 11:25pm

> You are not benchmarking what you think you are benchmarking. You are benchmarking global variable access and dynamic dispatch.

Ah I see, thanks, I had thought @benchmark took care of this.
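For anyone landing here later, a minimal sketch of the two usual ways to keep global variable access out of the measurement, assuming BenchmarkTools is installed. The helper name `add3` is made up for illustration, and `@belapsed` is used only because it returns a single number (the minimum time in seconds). Timings will vary by machine and Julia version.

```julia
using BenchmarkTools

a = b = rand(100, 100)   # non-const globals, as in the session above

# Option 1: interpolate with `$` so the benchmarked expression sees
# concrete values instead of looking up untyped globals.
t_interp = @belapsed $a .+ $b .+ $a

# Option 2: put the work in a function and interpolate its arguments.
add3(x, y) = x .+ y .+ x   # hypothetical helper, not from the thread
t_func = @belapsed add3($a, $b)

# For comparison: un-interpolated globals, which also time the
# global lookup and dynamic dispatch.
t_global = @belapsed a .+ b .+ a

println((t_global = t_global, t_interp = t_interp, t_func = t_func))
```

The interpolated and function-wrapped forms should report similar times, since both measure only the fused broadcast itself.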

> It’s unclear what you are comparing. The second and third benchmarks don’t use fusion at all (I mean not for the actual 100x100 addition).

Ok, I was under the impression broadcasting over tuples “forwarded” the fusion to the contents as well, but I see now that it just does normal addition over the contents, hence the creation of the temporary arrays that double the memory.
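To make that concrete, here is a rough sketch (not something proposed in the replies above) that contrasts broadcasting only over the containers with broadcasting a function whose body is itself a fused dot expression. The names `outer_only`, `fused_kernel`, and `inner_fused` are hypothetical, and the exact byte counts depend on the Julia version.

```julia
a = b = rand(100, 100)

# Outer broadcast only: each tuple element is combined with ordinary `+`,
# so `a + b` is materialized as a temporary before `+ a` runs.
outer_only(x, y) = (x,) .+ (y,) .+ (x,)

# Possible workaround: broadcast a function whose body is a fused dot
# expression, so the 100x100 work happens in a single loop.
fused_kernel(p, q, r) = p .+ q .+ r
inner_fused(x, y) = broadcast(fused_kernel, (x,), (y,), (x,))

r1 = outer_only(a, b)    # 1-tuple holding a 100x100 matrix
r2 = inner_fused(a, b)   # same shape, computed without the temporary

# Both calls above already compiled the functions, so these measure
# only the allocations of a single call.
bytes_outer = @allocated outer_only(a, b)
bytes_inner = @allocated inner_fused(a, b)

println("outer only:  ", bytes_outer, " bytes")
println("inner fused: ", bytes_inner, " bytes")
```

Under these assumptions the first form should allocate roughly two result-sized arrays and the second roughly one, which matches the doubling seen in the benchmarks.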
