Performance of loops

Julia has several ways to define loops or otherwise repeat code, each with advantages and disadvantages. I am looking at the performance of some of them, also inspired by previous posts on this forum.

#!/usr/bin/env julia

# https://discourse.julialang.org/t/map-vs-loops-array-comprehensions-in-julia-1-0/16137
# Note, this thread is from 2018.

import BenchmarkTools
import Random

function loop_with_map(x)
    x = map(a -> 2 * a, x)
    return x
end

function loop_with_map_dot(x)
    x .= map(a -> 2 * a, x)     # Slowest
    return x
end

function loop_list_comprehension(x)
    # A list comprehension creates a new local scope and is therefore slow.
    x = [ 2 * val for val in x]
    return x
end

function loop_list_comprehension_dot(x)
    # A list comprehension creates a new local scope and is therefore slow.
    x .= [ 2 * val for val in x]
    return x
end

function for_loop(x)
    for i in range(1, length(x))
        x[i] = 2 * x[i]
    end
    return x
end

function for_loop_opt(x)
    for i in eachindex(x)
        @inbounds x[i] = 2 * x[i]
    end
    return x
end

function loop_via_broadcast(x)
    @. x = 2 * x
    return x
end

function setup_function(n)
    return Random.rand(Int64, n)
end

for n in [100, 1000, 10_000, 100_000, 1_000_000]
    println("\n\nn = ", n)
    BenchmarkTools.@btime loop_with_map(x_vec)                  setup = (x_vec = setup_function($n)) # Slow
    BenchmarkTools.@btime loop_with_map_dot(x_vec)              setup = (x_vec = setup_function($n)) # Very slow
    BenchmarkTools.@btime loop_list_comprehension(x_vec)        setup = (x_vec = setup_function($n)) # Slow
    BenchmarkTools.@btime loop_list_comprehension_dot(x_vec)    setup = (x_vec = setup_function($n)) # Very slow
    BenchmarkTools.@btime for_loop(x_vec)                       setup = (x_vec = setup_function($n)) # Fast
    BenchmarkTools.@btime for_loop_opt(x_vec)                   setup = (x_vec = setup_function($n)) # Fast
    BenchmarkTools.@btime loop_via_broadcast(x_vec)             setup = (x_vec = setup_function($n)) # Fast
end

Output:

2024_08_16 14:18:30 ~/temp $ loop_performance.jl


n = 100
  89.105 ns (1 allocation: 896 bytes)
  163.902 ns (1 allocation: 896 bytes)
  138.964 ns (1 allocation: 896 bytes)
  163.821 ns (1 allocation: 896 bytes)
  12.209 ns (0 allocations: 0 bytes)
  11.278 ns (0 allocations: 0 bytes)
  12.198 ns (0 allocations: 0 bytes)


n = 1000
  1.049 μs (1 allocation: 7.94 KiB)
  1.180 μs (1 allocation: 7.94 KiB)
  1.119 μs (1 allocation: 7.94 KiB)
  1.140 μs (1 allocation: 7.94 KiB)
  80.231 ns (0 allocations: 0 bytes)
  80.838 ns (0 allocations: 0 bytes)
  81.080 ns (0 allocations: 0 bytes)


n = 10000
  7.885 μs (2 allocations: 78.17 KiB)
  11.209 μs (2 allocations: 78.17 KiB)
  8.743 μs (2 allocations: 78.17 KiB)
  11.228 μs (2 allocations: 78.17 KiB)
  1.239 μs (0 allocations: 0 bytes)
  1.215 μs (0 allocations: 0 bytes)
  1.306 μs (0 allocations: 0 bytes)


n = 100000
  116.843 μs (2 allocations: 781.30 KiB)
  171.743 μs (2 allocations: 781.30 KiB)
  119.491 μs (2 allocations: 781.30 KiB)
  174.575 μs (2 allocations: 781.30 KiB)
  15.298 μs (0 allocations: 0 bytes)
  14.419 μs (0 allocations: 0 bytes)
  15.221 μs (0 allocations: 0 bytes)


n = 1000000
  1.428 ms (2 allocations: 7.63 MiB)
  2.420 ms (2 allocations: 7.63 MiB)
  1.420 ms (2 allocations: 7.63 MiB)
  2.573 ms (2 allocations: 7.63 MiB)
  696.670 μs (0 allocations: 0 bytes)
  664.293 μs (0 allocations: 0 bytes)
  698.923 μs (0 allocations: 0 bytes)
2024_08_16 14:21:31 ~/temp $  

Questions:

  1. loop_with_map: map has been reported to be slow before, but that was a long time ago. I am surprised that it is still slow today. Am I doing something wrong?
  2. loop_with_map_dot: This uses broadcasting to place the result of the calculation back in x. The idea is that no new memory for x has to be allocated; memory allocation is slow. Obviously this makes loop_with_map_dot slower, not faster. I think the reason is that map allocates memory anyway, and the broadcast then forces the old memory of x to be filled element by element instead of simply rebinding x to the array created by map. More on broadcasting below. Correct?
  3. loop_list_comprehension: List comprehensions create their own scope, which makes them slow. But I expected the creation of a new scope to take a fixed amount of time, i.e. independent of the number of loop iterations. What is going on?
  4. loop_list_comprehension_dot: Just as in loop_with_map_dot, the broadcast serves to re-use the memory that x points to, but as a result all of the values have to be copied over. Correct?
  5. for_loop_opt: Skips the bounds checks. Why is this faster than loop_via_broadcast? It suggests (falsely?) that loop_via_broadcast performs a bounds check, even though that cannot fail here since there is only one container involved. If two containers with different lengths were involved, I can imagine that such a check makes sense.
  6. loop_via_broadcast: Here the re-use of the memory of x actually works. This has been clarified here. Now I wonder if this is, in essence, the only situation in which this works.

Yes.

You can use map! if you want a map that updates the input, instead of allocating a result vector. Definitely prefer map!(f, x) over x .= map(f, x).


A two argument map! does not exist in Base.


map!(f, x, x), then.
E.g.

map!(x -> x + x, x, x)

The docs warn against the arguments sharing memory.

  │ Warning
  │
  │  Behavior can be unexpected when any mutated argument
  │  shares memory with any other argument.



Link to #50824 that added that aliasing warning and has discussion and links to some related issues. While map!(f, x, x) is not currently guaranteed to work, it is relatively uncontroversial that it ought to work in most circumstances. If you’re feeling slightly brave, you’ll probably never run into issues when the input is identical to the output (but more complicated aliasing could still be problematic). But broadcasting does guarantee that simple aliasing like x .= 2 .* x will work, so it is probably safer to use that (see also broadcast!).
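
For reference, a minimal sketch (mine, not from the post above) of the two guaranteed-safe in-place spellings on a plain Vector:

x = collect(1:5)
x .= 2 .* x                     # dotted assignment: simple self-aliasing is guaranteed to work
broadcast!(v -> 2 * v, x, x)    # functional spelling of the same in-place broadcast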

Simple counter-example with severe consequences:

julia> v = zeros(2); map!(x->x+1, v, v); v
2-element Vector{Float64}:
 1.0
 1.0

julia> using SparseArrays

julia> v = spzeros(2); map!(x->x+1, v, v); v
2-element SparseVector{Float64, Int64} with 2 stored entries:
  [1]  =  2.0
  [2]  =  2.0

Professionally, I never feel brave, sorry. My company makes extremely expensive machines, downtime of these machines is also very expensive… and rework of the products of these machines is costly as well. So I am looking for the safest, most robust ways to write loops without losing performance, so that I can promote such ways to colleagues.


SparseArrays has a specialized method for map!, which seems to cause the issue (but it could potentially be improved, at least for the trivial case where the output is === to one of the inputs). But in any case,

julia> v = spzeros(2); broadcast!(x->x+1, v, v); v
2-element SparseVector{Float64, Int64} with 2 stored entries:
  [1]  =  1.0
  [2]  =  1.0

so broadcast! is a suitable replacement for map!.


In that case, use one of the two above, with some added remarks:

  • Add a trailing ! to the function names to indicate mutation.
  • Remove @inbounds; with eachindex, bounds-check elision already happens automatically (for standard arrays).
  • You can use the nice update operator *=, so that a = 2 * a becomes a *= 2 (a dotted, in-place variant is sketched just below).
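
As a small aside (my addition, not part of the reply above): the update operator also has a dotted form, so the whole-array broadcast can itself be written as an in-place update:

x = collect(1:5)
x .*= 2    # same as x .= x .* 2: doubles every element in place, no new array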

I don’t know about this. Isn’t it just the allocation that’s the problem?


Yes, broadcasting is much more conservative when it comes to aliasing — broadcast! doesn’t have that warning, but you’ll also see that it (sometimes) allocates more than you’d expect. It errs on the side of (sometimes) making surprising defensive copies, which you may not want in cases where you’re chasing peak performance — which points us back on topic a bit more here.

In this case, with arrays that use the default broadcast implementation, loop_via_broadcast sees that it iterates both the LHS and RHS “in step” with each other and thus new outputs won’t overwrite yet-to-be-read inputs… and so the implementation needn’t make that defensive copy.
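
To make that concrete, a small illustration (mine, not from the post) of when the defensive copy is and is not made:

x = collect(1:5)
x .= 2 .* x                # whole array on both sides: iterated in step, no temporary copy;
                           # x is now [2, 4, 6, 8, 10]

x = collect(1:5)
@views x[2:5] .= x[1:4]    # overlapping, shifted views: broadcasting defensively copies the
                           # right-hand side first, so x becomes [1, 1, 2, 3, 4]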

The most important takeaways when looking at all of these different ways of computing 2x are:

  • Assignment (the single =) is just about names that you’re choosing to use to describe things. Choosing to change what a name means doesn’t have anything to do with “memory re-use”.
  • All arguments are always evaluated — separately! — before being passed to a function. It can sometimes be tricky to see what makes for a separate function call, especially in the face of broadcast dots and other syntax sugars (see the sketch after this list).

Finally, and perhaps most mind-bendingly,

  • The superpower of Julia is that the code you write can be as fast — or faster — than its “builtins.” Broadcasting is complicated, far more complicated than your simple loop. And simple is often fast.
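
As a concrete reading of the second bullet above (my sketch, not from this post): in x .= map(f, x) the right-hand side is a complete, separate function call that runs first.

x = rand(Int, 100)

# x .= map(a -> 2 * a, x) behaves roughly like this two-step version:
tmp = map(a -> 2 * a, x)    # 1. map runs to completion and allocates a fresh array
x .= tmp                    # 2. the dotted assignment then copies tmp back into x
# The allocation in step 1 is not avoided by the dot in step 2, which is why
# the _dot variants are the slowest in the benchmark above.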

The weak evidence about scope in list comprehensions is here. Checking this is above my pay grade. Also, I think a simple function has its own scope too, so I now doubt my own statement. I got it from Discourse somewhere, but I can’t find it again.

loop_with_map and loop_list_comprehension actually do not mutate their argument (to my surprise as well); the others do.


That is baffling indeed. Great catch. Although off-topic, if anybody understands this, please share!

loop_with_map and loop_list_comprehension are both of the form:

function f(x)
    x = #... something
end

This is just deciding you have a better use for the name x within the context of that function body. What x happened to be (or not) before the re-assignment has no effect. It’s just a name for us humans, no different than initially using the name “Matt” for me and then deciding that you’d rather use it for yourself. You’re not changing anything about either me or you by changing our nicknames!

[quote]
This is just deciding you have a better use for the name x within the context of that function body. What x happened to be (or not) before the re-assignment has no effect.
[/quote]

Correct, I spoke in haste. If x is bound to a new object inside the function, the old object is unchanged and that is what the caller still sees after the function returns. If the existing object is re-used, it is mutated and the changed values are visible on the caller side. I understand this.
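
A minimal sketch of that distinction (hypothetical function names, not from the thread):

function rebind(x)
    x = 2 .* x    # binds the local name x to a brand-new array; the caller's array is untouched
    return x
end

function double!(x)
    x .= 2 .* x   # writes into the array the caller passed in
    return x
end

a = [1, 2, 3]
rebind(a)         # returns [2, 4, 6], but a is still [1, 2, 3]
double!(a)        # a is now [2, 4, 6]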

List comprehensions create a closure, i.e. an anonymous function, so definitely a new scope. However, a new scope does not slow anything down (except possibly the compiler, though I doubt it’s very significant).

To actually answer your question: just write the loop. It is simple, expressive and fast. Everything else (like broadcasting) is eye candy/convenience in the end. You still can (and actually should) use the appropriate iteration function, i.e. eachindex.

So the best version of your function (which incorporates the tips by @DNF):

function for_loop_opt!(x)
    for i in eachindex(x)
        x[i] *= 2
    end
    return x
end

This

  • is easy to read and understand even when you are not a Julia professional
  • does not allocate
  • has no possible out-of-bounds memory access (@inbounds is dangerous!)
  • still has its bounds checks elided when the compiler can prove the accesses are in bounds (e.g. for standard Arrays)
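
A quick usage check of for_loop_opt! (my addition, just to show the in-place effect):

a = [1, 2, 3]
for_loop_opt!(a)
a == [2, 4, 6]    # true: the caller's array was updated in place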

But I was specifically referring to for_loop_opt and loop_via_broadcast, which both do mutate.