Performance of loops

Julia has several ways to define loops or otherwise repeat code, each with advantages and disadvantages. I am looking at the performance of some of them, also inspired by previous posts on this forum.

#!/usr/bin/env julia

# https://discourse.julialang.org/t/map-vs-loops-array-comprehensions-in-julia-1-0/16137
# Note, this thread is from 2018.

import BenchmarkTools
import Random

function loop_with_map(x)
    x = map(a -> 2 * a, x)
    return x
end

function loop_with_map_dot(x)
    x .= map(a -> 2 * a, x)     # Slowest
    return x
end

function loop_list_comprehension(x)
    # A list comprehension creates a new local scope and is therefore slow.
    x = [ 2 * val for val in x]
    return x
end

function loop_list_comprehension_dot(x)
    # A list comprehension creates a new local scope and is therefore slow.
    x .= [ 2 * val for val in x]
    return x
end

function for_loop(x)
    for i in range(1, length(x))
        x[i] = 2 * x[i]
    end
    return x
end

function for_loop_opt(x)
    for i in eachindex(x)
        @inbounds x[i] = 2 * x[i]
    end
    return x
end

function loop_via_broadcast(x)
    @. x = 2 * x
    return x
end

function setup_function(n)
    return Random.rand(Int64, n)
end

for n in [100, 1000, 10_000, 100_000, 1_000_000]
    println("\n\nn = ", n)
    BenchmarkTools.@btime loop_with_map(x_vec)                  setup = (x_vec = setup_function($n)) # Slow
    BenchmarkTools.@btime loop_with_map_dot(x_vec)              setup = (x_vec = setup_function($n)) # Very slow
    BenchmarkTools.@btime loop_list_comprehension(x_vec)        setup = (x_vec = setup_function($n)) # Slow
    BenchmarkTools.@btime loop_list_comprehension_dot(x_vec)    setup = (x_vec = setup_function($n)) # Very slow
    BenchmarkTools.@btime for_loop(x_vec)                       setup = (x_vec = setup_function($n)) # Fast
    BenchmarkTools.@btime for_loop_opt(x_vec)                   setup = (x_vec = setup_function($n)) # Fast
    BenchmarkTools.@btime loop_via_broadcast(x_vec)             setup = (x_vec = setup_function($n)) # Fast
end

Output:

2024_08_16 14:18:30 ~/temp $ loop_performance.jl


n = 100
  89.105 ns (1 allocation: 896 bytes)
  163.902 ns (1 allocation: 896 bytes)
  138.964 ns (1 allocation: 896 bytes)
  163.821 ns (1 allocation: 896 bytes)
  12.209 ns (0 allocations: 0 bytes)
  11.278 ns (0 allocations: 0 bytes)
  12.198 ns (0 allocations: 0 bytes)


n = 1000
  1.049 μs (1 allocation: 7.94 KiB)
  1.180 μs (1 allocation: 7.94 KiB)
  1.119 μs (1 allocation: 7.94 KiB)
  1.140 μs (1 allocation: 7.94 KiB)
  80.231 ns (0 allocations: 0 bytes)
  80.838 ns (0 allocations: 0 bytes)
  81.080 ns (0 allocations: 0 bytes)


n = 10000
  7.885 μs (2 allocations: 78.17 KiB)
  11.209 μs (2 allocations: 78.17 KiB)
  8.743 μs (2 allocations: 78.17 KiB)
  11.228 μs (2 allocations: 78.17 KiB)
  1.239 μs (0 allocations: 0 bytes)
  1.215 μs (0 allocations: 0 bytes)
  1.306 μs (0 allocations: 0 bytes)


n = 100000
  116.843 μs (2 allocations: 781.30 KiB)
  171.743 μs (2 allocations: 781.30 KiB)
  119.491 μs (2 allocations: 781.30 KiB)
  174.575 μs (2 allocations: 781.30 KiB)
  15.298 μs (0 allocations: 0 bytes)
  14.419 μs (0 allocations: 0 bytes)
  15.221 μs (0 allocations: 0 bytes)


n = 1000000
  1.428 ms (2 allocations: 7.63 MiB)
  2.420 ms (2 allocations: 7.63 MiB)
  1.420 ms (2 allocations: 7.63 MiB)
  2.573 ms (2 allocations: 7.63 MiB)
  696.670 μs (0 allocations: 0 bytes)
  664.293 μs (0 allocations: 0 bytes)
  698.923 μs (0 allocations: 0 bytes)
2024_08_16 14:21:31 ~/temp $  

Questions:

  1. loop_with_map: map has been reported to be slow before, but that was a long time ago. I am surprised that it is still slow today. Am I doing something wrong?
  2. loop_with_map_dot: This uses broadcasting to place the result of the calculation back in x. The idea is that no new memory for x has to be allocated; memory allocation is slow. Obviously this makes loop_with_map_dot slower, not faster. I think the reason is that map allocates memory anyway, and the broadcast then forces the old memory of x to be filled element by element instead of simply rebinding x to the array created by map. More on broadcasting below. Correct?
  3. loop_list_comprehension: List comprehensions create their own scope, which makes them slow. But I expected the creation of a new scope to take a fixed amount of time, i.e. independent of the number of loop iterations. What is going on?
  4. loop_list_comprehension_dot: Just as in loop_with_map_dot, the broadcast serves to re-use the memory that x points to, but as a result all of the values have to be copied over. Correct?
  5. for_loop_opt: Skips the bounds checks. Why is this faster than loop_via_broadcast? It suggests (falsely?) that loop_via_broadcast performs a bounds check, even though that cannot fail here since there is only one container involved. If two containers with different lengths were involved, I can imagine that such a check makes sense.
  6. loop_via_broadcast: Here the re-use of the memory of x actually works. This has been clarified here. Now I wonder if this is, in essence, the only situation in which this works.

Yes.

You can use map! if you want a map that updates the input, instead of allocating a result vector. Definitely prefer map!(f, x) over x .= map(f, x).


A two argument map! does not exist in Base.


map!(f, x, x), then.
E.g.

map!(x -> x + x, x, x)

The docs warn against the arguments sharing memory.

  │ Warning
  │
  │  Behavior can be unexpected when any mutated argument
  │  shares memory with any other argument.



Link to #50824 that added that aliasing warning and has discussion and links to some related issues. While map!(f, x, x) is not currently guaranteed to work, it is relatively uncontroversial that it ought to work in most circumstances. If you’re feeling slightly brave, you’ll probably never run into issues when the input is identical to the output (but more complicated aliasing could still be problematic). But broadcasting does guarantee that simple aliasing like x .= 2 .* x will work, so it is probably safer to use that (see also broadcast!).
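
For reference, a minimal sketch (mine, not from the post above) of the two guaranteed-safe in-place spellings on a plain Vector:

x = collect(1:5)
x .= 2 .* x                     # dotted assignment: simple self-aliasing is guaranteed to work
broadcast!(v -> 2 * v, x, x)    # functional spelling of the same in-place broadcast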

Simple counter-example with severe consequences:

julia> v = zeros(2); map!(x->x+1, v, v); v
2-element Vector{Float64}:
 1.0
 1.0

julia> using SparseArrays

julia> v = spzeros(2); map!(x->x+1, v, v); v
2-element SparseVector{Float64, Int64} with 2 stored entries:
  [1]  =  2.0
  [2]  =  2.0

Professionally, I never feel brave, sorry. My company makes extremely expensive machines, downtime of these machines is also very expensive… and rework of the products of these machines is costly as well. So I am looking for the safest, most robust ways to write loops without losing performance, so that I can promote such ways to colleagues.


SparseArrays has a specialized method for map!, which seems to cause the issue (but it could potentially be improved, at least for the trivial case where the output is === to one of the inputs). But in any case,

julia> v = spzeros(2); broadcast!(x->x+1, v, v); v
2-element SparseVector{Float64, Int64} with 2 stored entries:
  [1]  =  1.0
  [2]  =  1.0

so broadcast! is a suitable replacement for map!.


In that case, use one of the two above, with some added remarks:

  • Add a trailing ! to the function names to indicate mutation.
  • Remove @inbounds; with eachindex, bounds-check elision already happens automatically (for standard arrays).
  • You can use the nice update operator *=, so that a = 2 * a becomes a *= 2 (a dotted, in-place variant is sketched just below).
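
As a small aside (my addition, not part of the reply above): the update operator also has a dotted form, so the whole-array broadcast can itself be written as an in-place update:

x = collect(1:5)
x .*= 2    # same as x .= x .* 2: doubles every element in place, no new array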

I don’t know about this. Isn’t it just the allocation that’s the problem?


Yes, broadcasting is much more conservative when it comes to aliasing — broadcast! doesn’t have that warning, but you’ll also see that it (sometimes) allocates more than you’d expect. It errs on the side of (sometimes) making surprising defensive copies, which you may not want in cases where you’re chasing peak performance — which points us back on topic a bit more here.

In this case, with arrays that use the default broadcast implementation, loop_via_broadcast sees that it iterates both the LHS and RHS “in step” with each other and thus new outputs won’t overwrite yet-to-be-read inputs… and so the implementation needn’t make that defensive copy.
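
To make that concrete, a small illustration (mine, not from the post) of when the defensive copy is and is not made:

x = collect(1:5)
x .= 2 .* x                # whole array on both sides: iterated in step, no temporary copy;
                           # x is now [2, 4, 6, 8, 10]

x = collect(1:5)
@views x[2:5] .= x[1:4]    # overlapping, shifted views: broadcasting defensively copies the
                           # right-hand side first, so x becomes [1, 1, 2, 3, 4]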

The most important takeaways when looking at all of these different ways of computing 2x are:

  • Assignment (the single =) is just about names that you’re choosing to use to describe things. Choosing to change what a name means doesn’t have anything to do with “memory re-use”.
  • All arguments are always evaluated — separately! — before being passed to a function. It can sometimes be tricky to see what makes for a separate function call, especially in the face of broadcast dots and other syntax sugars (see the sketch after this list).

Finally, and perhaps most mind-bendingly,

  • The superpower of Julia is that the code you write can be as fast — or faster — than its “builtins.” Broadcasting is complicated, far more complicated than your simple loop. And simple is often fast.
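
As a concrete reading of the second bullet above (my sketch, not from this post): in x .= map(f, x) the right-hand side is a complete, separate function call that runs first.

x = rand(Int, 100)

# x .= map(a -> 2 * a, x) behaves roughly like this two-step version:
tmp = map(a -> 2 * a, x)    # 1. map runs to completion and allocates a fresh array
x .= tmp                    # 2. the dotted assignment then copies tmp back into x
# The allocation in step 1 is not avoided by the dot in step 2, which is why
# the _dot variants are the slowest in the benchmark above.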

The weak evidence about scope in list comprehensions is here. Checking this is above my pay grade. Also, I think a simple function has its own scope too, so I now doubt my own statement. I got it from Discourse somewhere, but I can’t find it again.

loop_with_map and loop_list_comprehension actually do not mutate their argument (to my surprise as well); the others do.


That is baffling indeed. Great catch. Although off-topic, if anybody understands this, please share!

loop_with_map and loop_list_comprehension are both of the form:

function f(x)
    x = #... something
end

This is just deciding you have a better use for the name x within the context of that function body. What x happened to be (or not) before the re-assignment has no effect. It’s just a name for us humans, no different than initially using the name “Matt” for me and then deciding that you’d rather use it for yourself. You’re not changing anything about either me or you by changing our nicknames!

[quote]
This is just deciding you have a better use for the name x within the context of that function body. What x happened to be (or not) before the re-assignment has no effect.
[/quote]

Correct, I spoke in haste. If x is bound to a new object inside the function, the old object is unchanged and that is what the caller still sees after the function returns. If the existing object is re-used, it is mutated and the changed values are visible on the caller side. I understand this.
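
A minimal sketch of that distinction (hypothetical function names, not from the thread):

function rebind(x)
    x = 2 .* x    # binds the local name x to a brand-new array; the caller's array is untouched
    return x
end

function double!(x)
    x .= 2 .* x   # writes into the array the caller passed in
    return x
end

a = [1, 2, 3]
rebind(a)         # returns [2, 4, 6], but a is still [1, 2, 3]
double!(a)        # a is now [2, 4, 6]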

List comprehensions create a closure, i.e. an anonymous function, so definitely a new scope. However, a new scope does not slow anything down (except possibly the compiler, though I doubt it’s very significant).

To actually answer your question: just write the loop. It is simple, expressive and fast. Everything else (like broadcasting) is eye candy/convenience in the end. You still can (and actually should) use the appropriate iteration function, i.e. eachindex.

So the best version of your function (which incorporates the tips by @DNF):

function for_loop_opt!(x)
    for i in eachindex(x)
        x[i] *= 2
    end
    return x
end

This

  • is easy to read and understand even when you are not a Julia professional
  • does not allocate
  • has no possible out-of-bounds memory access (@inbounds is dangerous!)
  • still has its bounds checks elided when the compiler can prove the accesses are in bounds (e.g. for standard Arrays)
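
A quick usage check of for_loop_opt! (my addition, just to show the in-place effect):

a = [1, 2, 3]
for_loop_opt!(a)
a == [2, 4, 6]    # true: the caller's array was updated in place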

But I was specifically referring to for_loop_opt and loop_via_broadcast, which both do mutate.