Container broadcasting memory/performance hit in 0.6?

I don’t understand why the latter two calls below use twice the memory and are much slower than the first call. I had thought this type of broadcasting should work efficiently now, but maybe I misunderstood? I’ve checked on both v0.6.0-pre.alpha and master.

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _  |  |
  | | |_| | | | (_| |  |  Version 0.6.0-pre.alpha.137 (2017-03-13 14:27 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit a132ae2* (0 days old master)
|__/                   |  x86_64-linux-gnu

julia> a = b = rand(100,100);

julia> using BenchmarkTools

julia> @benchmark a .+ b .+ a
BenchmarkTools.Trial: 
  memory estimate:  79.33 KiB
  allocs estimate:  27
  --------------
  minimum time:     23.414 μs (0.00% GC)
  median time:      27.117 μs (0.00% GC)
  mean time:        31.874 μs (8.69% GC)
  maximum time:     1.079 ms (92.62% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark (a,) .* (b,) .* (a,)
BenchmarkTools.Trial: 
  memory estimate:  156.45 KiB
  allocs estimate:  7
  --------------
  minimum time:     75.679 μs (0.00% GC)
  median time:      84.202 μs (0.00% GC)
  mean time:        179.888 μs (3.84% GC)
  maximum time:     23.779 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark [a] .* [b] .* [a]
BenchmarkTools.Trial: 
  memory estimate:  158.34 KiB
  allocs estimate:  56
  --------------
  minimum time:     124.935 μs (0.00% GC)
  median time:      134.950 μs (0.00% GC)
  mean time:        151.803 μs (4.55% GC)
  maximum time:     6.196 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  1. You are not benchmarking what you think you are benchmarking. You are benchmarking global variable access and dynamic dispatch.

    julia> @benchmark $a .+ $b .+ $a
    BenchmarkTools.Trial:
      memory estimate:  78.73 KiB
      allocs estimate:  18
      --------------
      minimum time:     10.818 μs (0.00% GC)
      median time:      11.459 μs (0.00% GC)
      mean time:        14.529 μs (10.52% GC)
      maximum time:     671.217 μs (90.69% GC)
      --------------
      samples:          10000
      evals/sample:     1
      time tolerance:   5.00%
      memory tolerance: 1.00%
    
    julia> @benchmark ($a,) .+ ($b,) .+ ($a,)
    BenchmarkTools.Trial:
      memory estimate:  156.45 KiB
      allocs estimate:  7
      --------------
      minimum time:     9.232 μs (0.00% GC)
      median time:      11.862 μs (0.00% GC)
      mean time:        17.192 μs (18.78% GC)
      maximum time:     792.954 μs (92.01% GC)
      --------------
      samples:          10000
      evals/sample:     1
      time tolerance:   5.00%
      memory tolerance: 1.00%
    
    julia> @benchmark [$a] .+ [$b] .+ [$a]
    BenchmarkTools.Trial:
      memory estimate:  156.78 KiB
      allocs estimate:  8
      --------------
      minimum time:     9.348 μs (0.00% GC)
      median time:      12.399 μs (0.00% GC)
      mean time:        19.361 μs (21.76% GC)
      maximum time:     1.003 ms (93.81% GC)
      --------------
      samples:          10000
      evals/sample:     1
      time tolerance:   5.00%
      memory tolerance: 1.00%
    
  2. It’s unclear what you are comparing. The second and third benchmarks don’t use fusion at all (I mean not for the actual 100x100 addition).
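
To illustrate point 1, here is a minimal sketch of the difference between benchmarking with non-constant globals and with interpolated values. It assumes the BenchmarkTools package is installed and uses its `@belapsed` macro, which returns the minimum time in seconds. The names `t_global` and `t_interp` are made up for this example, and the timings will vary by machine and Julia version.

```julia
using BenchmarkTools

a = b = rand(100, 100)

# `a` and `b` are non-constant globals here, so the benchmarked expression
# has to look them up and dispatch on their types at run time.
t_global = @belapsed a .+ b .+ a

# `$` interpolates the current values into the benchmark expression,
# so they are treated like local variables of known type.
t_interp = @belapsed $a .+ $b .+ $a

println("global:       ", t_global, " s")
println("interpolated: ", t_interp, " s")
```

The interpolated form is the one that measures the broadcast itself rather than the cost of accessing untyped globals.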

You are not benchmarking what you think you are benchmarking. You are benchmarking global variable access and dynamic dispatch.

Ah I see, thanks, I had thought @benchmark took care of this.

It’s unclear what you are comparing. The second and third benchmarks don’t use fusion at all (I mean not for the actual 100x100 addition).

OK, I was under the impression that broadcasting over tuples “forwarded” the fusion to the contents as well, but I see now that it just does normal addition on the contents, which creates the temporary arrays that double the memory.
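
The behaviour described above can be sketched roughly as follows, using only Base. The helper names `fused`, `wrapped`, and `unfused` are hypothetical and chosen for this example. `unfused` is a hand-written guess at what the tuple version effectively computes, not the actual lowering, and the exact byte counts depend on the Julia version.

```julia
a = b = rand(100, 100)

# Fused broadcast over the arrays: one loop, one output array.
fused(a, b) = a .+ b .+ a

# Broadcast over 1-tuples: the fused kernel is applied once, to the arrays
# themselves, so it evaluates (a + b) + a with ordinary array addition.
wrapped(a, b) = (a,) .+ (b,) .+ (a,)

# Hand-written equivalent of the tuple version (illustrative assumption).
function unfused(a, b)
    tmp = a + b        # temporary 100x100 array
    return (tmp + a,)  # second 100x100 array, wrapped in a tuple
end

# Call each function once so compilation is not counted below.
fused(a, b); wrapped(a, b); unfused(a, b)

bytes_fused   = @allocated fused(a, b)
bytes_wrapped = @allocated wrapped(a, b)

@show bytes_fused bytes_wrapped
```

If the tuple version allocates roughly twice as much as the fused one, that is consistent with the extra temporary array from `a + b`.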