In-place/fused broadcasting

robsmith11 · February 22, 2019, 7:37am

I often write something like

exp.(rand(10^6))

which is convienent, but inefficient as it allocates 2 arrays.

What I really want is

x = rand(10^6)
map!(exp, x, x)

Is there a single-line/idiomatic way to write this operation without binding to a variable and still get only the 1 allocation?

improbable22 · February 22, 2019, 7:45am

I was going to suggest this, as rand.() is often a neat trick for avoiding some allocations:

exp.(rand.()) .+ 0 .* (1:10^6)

BUT it turns out this is an error, ArgumentError: step cannot be zero.

Tamas_Papp · February 22, 2019, 7:53am

[exp(rand()) for _ in 1:10^6]

They key is that you are not really broadcasting over anything, so forcing this into the broadcasting framework is not idiomatic. You are generating a vector elementwise.

mauro3 · February 22, 2019, 8:55am

I came up with this

exp.(rand() for i=1:10^6)

but it allocate also two arrays as OP’s. Any idea why? (Also note that although it has two allocations it’s as fast as Tamas’, and as fast as OP’s)

Tamas_Papp · February 22, 2019, 11:07am

I did not profile, but I would imagine that the cost of rand() would dominate.

bennedich · February 22, 2019, 11:30am

The problem with that approach is that it’s broadcasting a generator, which ends up materializing the generator first, hence the double allocations. If you’re intent on using broadcasting, you can do better by avoiding the generator as in this example:

exp.(1:10^6 .|>_-> rand())

As for performance, both versions benchmark about the same if using @btime, BUT it’s because @btime reports the minimum time of many invocations. Versions allocating more memory will have a higher average runtime due to less cache efficiency and garbage collection. See below; first @Tamas_Papp’s version:

julia> @benchmark [exp(rand()) for _ in 1:10^6]
  memory estimate:  7.63 MiB
  --------------
  minimum time:     11.669 ms (0.00% GC)
  median time:      12.656 ms (0.00% GC)
  mean time:        13.682 ms (7.57% GC)
  maximum time:     18.180 ms (17.96% GC)
  --------------
  samples:          366

Next my broadcasted version:

julia> @benchmark exp.(1:10^6 .|>_-> rand())
  memory estimate:  7.63 MiB
  --------------
  minimum time:     11.263 ms (0.00% GC)
  median time:      12.247 ms (0.00% GC)
  mean time:        13.254 ms (7.66% GC)
  maximum time:     17.265 ms (18.91% GC)
  --------------
  samples:          378

Finally the broadcasted generator (double allocations). Note how the minimum time is about the same, but average is quite a bit higher, and there’s more time spent GCing:

julia> @benchmark exp.(rand() for i=1:10^6)
  memory estimate:  15.26 MiB
  --------------
  minimum time:     12.164 ms (0.00% GC)
  median time:      17.054 ms (15.13% GC)
  mean time:        16.380 ms (13.03% GC)
  maximum time:     20.556 ms (19.80% GC)
  --------------
  samples:          305

robsmith11 · February 22, 2019, 12:27pm

Looking at the timings, I see that the two once-allocating versions suggested in this thread are actually slower than the twice-allocating version and the explicit map!() (which is the fastest):

julia> @benchmark exp.(rand(10^6))
BenchmarkTools.Trial:
  memory estimate:  15.26 MiB
  allocs estimate:  4
  --------------
  minimum time:     9.125 ms (0.00% GC)
  median time:      9.517 ms (2.78% GC)
  mean time:        9.590 ms (3.56% GC)
  maximum time:     50.124 ms (80.48% GC)
  --------------
  samples:          522
  evals/sample:     1

julia> @benchmark [exp(rand()) for _ in 1:10^6]
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     10.398 ms (0.00% GC)
  median time:      10.463 ms (0.00% GC)
  mean time:        10.710 ms (2.36% GC)
  maximum time:     51.215 ms (78.24% GC)
  --------------
  samples:          467
  evals/sample:     1

julia> @benchmark exp.(1:10^6 .|> _ -> rand())
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     10.348 ms (0.00% GC)
  median time:      10.412 ms (0.00% GC)
  mean time:        10.671 ms (2.41% GC)
  maximum time:     51.442 ms (78.33% GC)
  --------------
  samples:          469
  evals/sample:     1

julia> @benchmark (x = rand(10^6); map!(exp, x, x))
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     8.997 ms (0.00% GC)
  median time:      9.059 ms (0.00% GC)
  mean time:        9.350 ms (3.04% GC)
  maximum time:     49.337 ms (80.68% GC)
  --------------
  samples:          535
  evals/sample:     1

I guess it’s because the vectorized version of rand() is faster:

julia> @benchmark [rand() for _ in 1:10^6]
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     1.567 ms (0.00% GC)
  median time:      1.581 ms (0.00% GC)
  mean time:        1.673 ms (5.33% GC)
  maximum time:     42.036 ms (94.77% GC)
  --------------
  samples:          2987
  evals/sample:     1

julia> @benchmark rand(10^6)
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     716.757 μs (0.00% GC)
  median time:      731.317 μs (0.00% GC)
  mean time:        805.241 μs (8.98% GC)
  maximum time:     41.032 ms (96.89% GC)
  --------------
  samples:          6201
  evals/sample:     1

mauro3 · February 22, 2019, 12:37pm

Do you know why this is done/needed?

bennedich · February 22, 2019, 1:09pm

Yes. Generators are not broadcastable, so they will be materialized during broadcasting:

julia> a = Base.broadcasted(_->rand(), 1:3) # this is 1:3 .|>_-> rand()
Base.Broadcast.Broadcasted(getfield(Main, Symbol("##271#272"))(), (1:3,))

julia> Broadcast.broadcastable(a) # same object, no materialization
Base.Broadcast.Broadcasted(getfield(Main, Symbol("##271#272"))(), (1:3,))

julia> b = (rand() for i=1:3)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##273#274"))}(getfield(Main, Symbol("##273#274"))(), 1:3)

julia> Broadcast.broadcastable(b) # materialization happens here
3-element Array{Float64,1}:
 0.9143249336534409
 0.733436533928558
 0.9617715309192958

As for why it’s done this way, perhaps someone more familiar with the internals can clarify, but I believe that it has to do with generators not being indexable:

julia> a[2]
0.848578901008149

julia> b[2]
ERROR: MethodError: no method matching getindex(::Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##273#274"))}, ::Int64)

Seif_Shebl · February 22, 2019, 1:21pm

Also map(_ -> exp(rand()), 1:10^6), it’s elegant and has the same speed as a list comprehension.

julia> @benchmark map(_->exp(rand()),1:10^6)
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     11.519 ms (0.00% GC)
  median time:      11.833 ms (0.00% GC)
  mean time:        12.518 ms (6.99% GC)
  maximum time:     59.233 ms (80.46% GC)
  --------------
  samples:          400
  evals/sample:     1

julia> @benchmark [exp(rand()) for _ in 1:10^6]
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     11.605 ms (0.00% GC)
  median time:      11.933 ms (0.00% GC)
  mean time:        12.600 ms (6.97% GC)
  maximum time:     59.423 ms (80.24% GC)
  --------------
  samples:          397
  evals/sample:     1

Nosferican · February 22, 2019, 2:32pm

I would use,

rand(10^6) |> (x -> map!(exp, x, x))

Topic		Replies	Views
Why does broadcast operation allocate? Performance	3	1274	July 28, 2021
Performance of simple broadcasting operations with many arguments Performance performance , broadcast	15	1591	November 29, 2021
Understanding the behaviour of @. macro New to Julia	5	383	May 25, 2024
In-place assignment, without allocation Performance indexing , inplace	8	662	September 13, 2023
Strange allocations in broadcasting (with `^`?) Performance performance , potential-bug	3	491	December 1, 2021

In-place/fused broadcasting

Related topics