About dot notation for pre-allocated values: a .= f.(x)

Hi! This is maybe a bit of a random question, so I hope that this is the right place to ask it - my apologies if not.

In the Manual, in the section on Pre-allocating outputs under the Performance Tips, there are two examples: one that does and one that doesn’t pre-allocate the output of a function called in a loop over 10^7 values. Here are the two examples they provide:

Version 1 (without preallocation)

julia> function xinc(x)
           return [x, x+1, x+2]
       end;

julia> function loopinc()
           y = 0
           for i = 1:10^7
               ret = xinc(i)
               y += ret[2]
           end
           return y
       end;

and Version 2 (with preallocation):

julia> function xinc!(ret::AbstractVector{T}, x::T) where T
           ret[1] = x
           ret[2] = x+1
           ret[3] = x+2
           nothing
       end;

julia> function loopinc_prealloc()
           ret = Vector{Int}(undef, 3)
           y = 0
           for i = 1:10^7
               xinc!(ret, i)
               y += ret[2]
           end
           return y
       end;

and these are the performance results:

julia> @time loopinc()
  0.529894 seconds (40.00 M allocations: 1.490 GiB, 12.14% gc time)
50000015000000

julia> @time loopinc_prealloc()
  0.030850 seconds (6 allocations: 288 bytes)
50000015000000

Obviously pre-allocating is better, but they acknowledge that writing it like this can often be a bit ugly and cumbersome, which is why the syntax a .= f.(x) exists. And they say:

However, for “vectorized” (element-wise) functions, the convenient syntax x .= f.(y) can be used for in-place operations with fused loops and no temporary arrays (see the dot syntax for vectorizing functions).

Now coming to my questions: how would I use the suggested syntax x .= f.(y) to rewrite the above example? I was not able to figure that out.

I also tried to create a little example myself to compare the efficiency with and without dot broadcasting; however, in my example the version with the dot was actually worse than the other one, and it is not clear to me why. My example:

f(x) = x^2
y = rand(10^9);
function withDot(x)
    a = similar(x)
    a .= f.(x)
    return a
end
function withoutDot(x)
    b = f.(x)
    return b
end

@time withDot(y);
2.911010 seconds (2 allocations: 7.451 GiB, 22.19% gc time)

@time withoutDot(y);
1.736906 seconds (2 allocations: 7.451 GiB)

Here I don’t understand why the first one is worse and why the garbage collection time is so large.

To summarise: I don’t understand in what situation, and how, I would use the a .= f.(x) syntax correctly to improve performance. I would much appreciate it if someone could provide some examples and explain it a bit more (since unfortunately I don’t seem to have properly understood how it is meant to work from reading the Manual).

Thanks for the help! I hope this is the right place to be asking something like this. (I only started learning Julia about 2 days ago)

You’re using @time, so sometimes you hit GC and sometimes you don’t, and that’s just luck. You should use @btime from BenchmarkTools.jl.

In this case the first one clearly hit GC (look at the 22.19% gc time) and the second run just happened not to, but if you run each multiple times you will see what I mean.
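e.g. something along these lines (a rough sketch; the $ interpolation of the global variable is there so it doesn’t distort the timing):

using BenchmarkTools

f(x) = x^2
withDot(x)    = (a = similar(x); a .= f.(x); a)   # same as the versions above
withoutDot(x) = f.(x)

y = rand(10^6);   # smaller than 10^9 just so the benchmark finishes quickly

@btime withDot($y);
@btime withoutDot($y);

@btime runs the call many times and reports the minimum, so one unlucky GC pause doesn’t dominate the result.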

These two functions do the same thing. Writing b = f.(x) is (almost) the same as writing

temp = similar(x)
temp .= f.(x)
b = temp

As for your main question, it’s a little bit difficult to answer. I don’t know how much I am allowed to change your code. I don’t see a direct way to use in-place broadcasting for what you want, but I could write sum(2:10^7+1) to achieve the same result. What am I allowed to change?

And yes, use BenchmarkTools.jl for benchmarks.
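For reference, that closed form works because the loop only ever uses ret[2] == i + 1, so the whole computation collapses to a range sum:

# The loop accumulates xinc(i)[2] == i + 1 for i = 1:10^7,
# i.e. the sum of the range 2:(10^7 + 1).
sum(2:(10^7 + 1))   # 50000015000000, the same value loopinc() returns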

Thank you, I tried the benchmark tool and indeed got (pretty much) the same times then!

Thanks for clarifying that the two versions of my example do the same thing!

About the main question: the piece of code I pasted there is the example from the Manual. You can change it however much you like. In the Manual they just say that in situations like these you can use dot broadcasting, and I don’t understand how that would be done in their example.

But if that is not a good example to rewrite with dot broadcasting, I would also be happy to see any example with and without it where the dotted version is better (because, as you have clarified, the example I tried to come up with ended up being the same for both).

Just so that I can understand in what situations the dot broadcasting would actually be useful.

What’s ugly about the example is that there are three separate assignments which could be replaced by a single fused expression.

Here’s one way:

julia> function xinc!(ret::AbstractVector{T}, x::T) where T
           ys = (0, 1, 2)
           ret .= x .+ ys
           nothing
       end;
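Called like this (just a small usage sketch):

ret = Vector{Int}(undef, 3)
xinc!(ret, 5)
ret   # == [5, 6, 7]; the single fused broadcast filled all three slots in place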

Thank you for the answer! I tested this and found that in your example the “.=” is not better than simply using “=”. Following this example, I still don’t understand when “.=” could actually be used to improve code that would otherwise be much slower.

See the benchmark:

function xinc1(x)
    return [x, x+1, x+2]
end;

function xinc2!(ret::AbstractVector{T}, x::T) where T
    ret[1] = x
    ret[2] = x+1
    ret[3] = x+2
    nothing
end;

function xinc3!(ret::AbstractVector{T}, x::T) where T
    ys = (0, 1, 2)
    ret .= x .+ ys
    nothing
end;

function xinc4!(ret::AbstractVector{T}, x::T) where T
    ys = (0, 1, 2)
    ret = x .+ ys
    nothing
end;
function loopinc()
    y = 0
    for i = 1:10^7
        ret = xinc1(i)
        y += ret[2]
    end
    return y
end;

function loopinc_prealloc2()
    ret = Vector{Int}(undef, 3)
    y = 0
    for i = 1:10^7
        xinc2!(ret, i)
        y += ret[2]
    end
    return y
end;

function loopinc_prealloc3()
    ret = Vector{Int}(undef, 3)
    y = 0
    for i = 1:10^7
        xinc3!(ret, i)
        y += ret[2]
    end
    return y
end;

function loopinc_prealloc4()
    ret = Vector{Int}(undef, 3)
    y = 0
    for i = 1:10^7
        xinc4!(ret, i)
        y += ret[2]
    end
    return y
end;
using BenchmarkTools
@btime loopinc();
@btime loopinc_prealloc2();
@btime loopinc_prealloc3();
@btime loopinc_prealloc4();
201.271 ms (10000000 allocations: 762.94 MiB)
  5.205 ms (1 allocation: 80 bytes)
  38.481 ms (1 allocation: 80 bytes)
  15.716 ns (1 allocation: 80 bytes)

It is also not clear to me why your suggestion using .= takes about 40 ms while the version with a plain = takes only 15 ns! From everything I have seen so far in the answers, I don’t yet see any situation in which I would actually want to use “.=” (and I would like to understand in what situation it would be useful).

It’s not clear to me quite what you are asking. Do you want an example of how .= would be beneficial for this particular example code, or any use of it in any situation?

Note, by the way, that this function does not modify the input argument ret. It’s a function that does no work that can be seen from the outside, so the compiler can just skip it.
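You can see that directly (small sketch):

ret = zeros(Int, 3)
xinc4!(ret, 5)   # the `ret = x .+ ys` inside only rebinds the local name ret
ret              # still [0, 0, 0]; the caller’s array was never written to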


Either is fine. I just thought the example from the Manual might be useful for it, but I am happy to see any example. I am not looking to improve a specific section of code; I am looking for a conceptual understanding of the benefits of dot broadcasting through any concrete example.

Note, by the way, that this function does not modify the input argument ret. It’s a function that does no work that can be seen from the outside, so the compiler can just skip it.

This highlights my lack of understanding of the dot broadcasting :sweat_smile: The only difference between xinc3! and xinc4! is the line

ret .= x .+ ys

or

ret = x .+ ys

Why does the former modify ret but the latter doesn’t?

The former (ret .= x .+ ys) means: modify ret in place.

The latter (ret = x .+ ys) means: create a new array x .+ ys and name it ret. The old ret is unaffected.

The purpose of the dot broadcasting is syntactical convenience. You could always write a loop (or a nested loop) instead.
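For example, roughly speaking, y .= f.(x) is just sugar for an explicit in-place loop:

f(x) = x^2
x = rand(5)
y = similar(x)

# y .= f.(x) is, roughly, equivalent to:
for i in eachindex(y, x)
    y[i] = f(x[i])
end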

x = a .+ b creates a new array and calls it x, while x .= a .+ b modifies a pre-existing array x.
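A small sketch of the difference:

a = zeros(Int, 3)
alias = a        # alias and a are the same array

a .= [1, 2, 3]   # in-place: fills the existing array, so alias now also holds [1, 2, 3]
a = [4, 5, 6]    # rebinding: a brand-new array gets the name a; alias still holds [1, 2, 3]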


I must admit that I have never fully understood why
ret .= x + ys
allocates.

(Or rather, the syntax sometimes involves an annoying number of dots on the RHS.)

There is the @. macro for that particular problem.
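e.g. (small sketch):

x  = rand(5)
ys = rand(5)
ret = similar(x)

@. ret = x + ys   # the macro adds the dots for you; equivalent to ret .= x .+ ys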

Yes, but macros are kind of unsightly (matter of taste, clearly). It would be great if the .= were enough. Maybe there are good reasons for why not, but that’s beyond me.

Assigning to an existing buffer and broadcasting over the rhs are not the same, i.e., you might want to distinguish between

res = f.(x)  # broadcasting f and storing result
res .= f.(x)  # broadcast f and assign to an existing buffer

and

res .= g(x)  # call vector-valued function g and assign result to an existing buffer

where f acts on scalars and g on arrays.

All these cases could still be handled if .= propagated broadcasting to the right-hand side, i.e., were equivalent to @. res = f(x), but that would require a way to prevent broadcasting in the last example. Note that the usual Ref trick would not work here (as the Ref call would itself get broadcasted), and @. provides a special splicing syntax, $, for exactly that case:

ret = @. f(x)  # same as first above
@. ret = f(x)  # same as second above
@. ret = $g(x)  # same as third above

In the end, it’s just a question of whether broadcasting or not should be your default and Julia happened to choose the latter.
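For instance, with made-up f and g (f acting on scalars, g on arrays), those three forms look like:

f(x) = x^2          # scalar function
g(x) = cumsum(x)    # vector-valued function

x   = rand(5)
ret = similar(x)

ret = @. f(x)       # same as ret = f.(x): allocates a new array
@. ret = f(x)       # same as ret .= f.(x): in-place, f broadcast over x
@. ret = $g(x)      # same as ret .= g(x): in-place, $ keeps g from being dotted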


I had thought that dot broadcasting would just write a loop, but in this case it’s timing about 10x worse than a simple for loop.

On my machine (the loop variants are sketched below):
three explicit assignments: 4ms
for loop: 5ms
for loop with @inbounds annotation: 4ms
.= on mapped sum: 10ms
.= on dot broadcast .+: 44ms
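Roughly, the loop variants look like this (a sketch, not necessarily the exact code I ran):

function xinc_loop!(ret::AbstractVector{T}, x::T) where T
    for j in 1:3
        ret[j] = x + (j - 1)
    end
    nothing
end;

function xinc_inbounds!(ret::AbstractVector{T}, x::T) where T
    @inbounds for j in 1:3      # skip bounds checks inside the loop
        ret[j] = x + (j - 1)
    end
    nothing
end;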

Is there a rule of thumb to predict when broadcasting could cause such a 10x performance degradation?

Not an answer, but there was a similar report (although in the specific context of an Any array) a few days ago here:

I’d be keen to have an answer to your question myself.
