Why are for loops slower than broadcast with splatted arguments?

I’m broadcasting functions that return tuples, and have been converting the resulting arrays of tuples to tuples of arrays using an “unzip” function. For comparison, I also wrote for loops over each individual entry and re-assembled the output as a tuple.

using BenchmarkTools

function foo(a,b,c,d)
    ab = exp(a*b)
    return ab+c, ab-d
end

N = 10000
a,b,c,d = [randn(N) for i = 1:4]
X = (a,b)
Y = (c,d)

function foo_loop(N,a,b,c,d)
    for i = 1:N
        foo(a[i],b[i],c[i],d[i])
    end
end
@btime foo_loop($N,$a,$b,$c,$d)

function foo_loop_splat!(N,X,Y,X_entry,Y_entry)
    for i = 1:N
        for fld in eachindex(X)
            X_entry[fld] = X[fld][i]
            Y_entry[fld] = Y[fld][i]
        end
        foo(X_entry...,Y_entry...)
    end
end
X_entry, Y_entry = [zeros(length(X)) for i = 1:2]
@btime foo_loop_splat!($N,$X,$Y,$X_entry,$Y_entry)

I noticed that the splatted for loop is much slower than the plain for loop (and than the broadcasted “unzip” approach below):

  102.436 μs (0 allocations: 0 bytes)        # foo_loop
  1.046 ms (50000 allocations: 937.50 KiB)   # foo_loop_splat!

In comparison, using broadcasting and unzip

unzip(a) = map(x->getfield.(a, x), fieldnames(eltype(a)))
@btime unzip(foo.($X...,$Y...))
@btime unzip(foo.($a,$b,$c,$d))

both runtimes are about the same

  158.962 μs (8 allocations: 312.78 KiB)
  156.331 μs (8 allocations: 312.78 KiB)
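As a minimal sketch (a toy input of my own, not the benchmark data), `unzip` turns a vector of tuples into a tuple of vectors:

```julia
unzip(a) = map(x -> getfield.(a, x), fieldnames(eltype(a)))

# fieldnames(Tuple{Int,Int}) gives the indices (1, 2), and
# getfield.(a, i) collects the i-th entry of every tuple into a vector
unzip([(1, 2), (3, 4)])  # ([1, 3], [2, 4])
```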

Why does this occur? I’m not clear on why splatting adds so much extra time. Is it runtime dispatch?

You’re benchmarking in global scope, so the performance numbers are not particularly meaningful. See https://docs.julialang.org/en/v1/manual/performance-tips/index.html#Avoid-global-variables-1 and https://github.com/JuliaCI/BenchmarkTools.jl#quick-start


Just a quick addition to @rdeits’ response. Benchmarking in global scope really slows down the for loop code here. On my machine, I get:

using BenchmarkTools

function foo(a,b,c,d)
    ab = exp(a*b)
    return ab+c, ab-d
end

N = 1000
a,b,c,d = [randn(N) for i = 1:4]
X = (a,b)
Y = (c,d)

u,v = [zeros(N) for i = 1:2]
@btime begin
    for i = 1:N
        foo(a[i],b[i],c[i],d[i])
    end
end

which gives

  109.880 μs

Then, if I define

function foo_loop(N,a,b,c,d)
    for i = 1:N
        foo(a[i],b[i],c[i],d[i])
    end
end

and benchmark by calling the function and interpolating the input arguments into the benchmarking expression (https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#interpolating-values-into-benchmark-expressions) I get:

julia> @btime foo_loop($N,$a,$b,$c,$d)
  7.273 μs (0 allocations: 0 bytes)

Thanks @rdeits and @Pbellive! I’ve edited the original code snippet to interpolate values and avoid global scope timing.

function foo_loop(N,a,b,c,d)
    for i = 1:N
        foo(a[i],b[i],c[i],d[i])
    end
end
@btime foo_loop($N,$a,$b,$c,$d)

function foo_loop_splat!(N,X,Y,X_entry,Y_entry)
    for i = 1:N
        for fld in eachindex(X)
            X_entry[fld] = X[fld][i]
            Y_entry[fld] = Y[fld][i]
        end
        foo(X_entry...,Y_entry...)
    end
end
X_entry, Y_entry = [zeros(length(X)) for i = 1:2]
@btime foo_loop_splat!($N,$X,$Y,$X_entry,$Y_entry)

unzip(a) = map(x->getfield.(a, x), fieldnames(eltype(a)))
@btime unzip(foo.($X...,$Y...))
@btime unzip(foo.($a,$b,$c,$d))

The for loop is now faster than the unzip-and-broadcast code, though the splatted loop is still much slower:

  102.436 μs (0 allocations: 0 bytes)        # foo_loop
  1.046 ms (50000 allocations: 937.50 KiB)   # foo_loop_splat!
  158.962 μs (8 allocations: 312.78 KiB)     # unzip(foo.(X...,Y...))
  156.331 μs (8 allocations: 312.78 KiB)     # unzip(foo.(a,b,c,d))

I think I figured out why it’s so slow - I wasn’t being careful about the distinction between tuples and arrays.

In the original splatted loop, I’m splatting the X_entry and Y_entry arguments, which are Arrays, so the argument list has to be built by iterating them at call time:

function foo_loop_splat!(N,X,Y,Xi,Yi)
    for i = 1:N
        for fld in eachindex(X)
            Xi[fld] = X[fld][i]
            Yi[fld] = Y[fld][i]
        end
        foo(Xi...,Yi...)
    end
end

An old post by @rdeits on Array of tuples noted that tuples have much less overhead than arrays. If I replace the entry-copying loop with the (simpler) tuple-based version

function foo_loop_splat!(N,X,Y)
    for i = 1:N
        foo(getindex.(X,i)...,getindex.(Y,i)...)
    end
end

then the non-splatted and splatted for loops give nearly identical runtimes

@btime foo_loop($N,$a,$b,$c,$d)
@btime foo_loop_splat!($N,$X,$Y)
  1.049 ms (0 allocations: 0 bytes)
  1.093 ms (0 allocations: 0 bytes)
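As a side note, `getindex.(X, i)` broadcasts over the tuple `X` and therefore returns a tuple, so no intermediate array is ever created; a toy sketch:

```julia
X = ([10, 20], [30, 40])  # tuple of arrays, as in the benchmark setup

# broadcasting getindex over a tuple of arrays yields a tuple:
# the i-th entry of each array, with the length 2 known to the compiler
getindex.(X, 2)  # (20, 40)
```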

The key is that arrays don’t encode their lengths into the type system like tuples do. This means when you splat it into a function call, Julia doesn’t know (ahead of time) how many arguments there will be — and thus cannot predict which method will be called. This is essentially a type instability.
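A toy illustration of that difference (my own example, not from the thread): splatting a `Tuple` fixes the argument count at compile time, while splatting a `Vector` leaves it unknown until runtime:

```julia
nargs(args...) = length(args)

t = (1.0, 2.0)   # NTuple{2,Float64}: the length 2 is part of the type
v = [1.0, 2.0]   # Vector{Float64}: the length is only known at runtime

nargs(t...)  # inferable: the compiler knows this is a 2-argument call
nargs(v...)  # not inferable: the matching method must be found at runtime
```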


@mbauman - thanks for the interpretation as type instability! That’s helpful to know.

I converted my code to use tuples of arrays instead of arrays of arrays. However, the immutability of tuples prevents me from modifying them. Is there a recommended mutable alternative to tuples (StructArrays or StaticArrays?) that also encodes its length in the type?

You could try using an MArray from StaticArrays.jl: https://juliaarrays.github.io/StaticArrays.jl/latest/pages/api/#StaticArrays.MArray since they are mutable and fixed size.
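For example (a minimal sketch, assuming StaticArrays.jl is installed):

```julia
using StaticArrays

m = MVector{2, Float64}(0.0, 0.0)  # mutable, and the length 2 is in the type
m[1] = 3.5                         # in-place assignment works, unlike a Tuple
Tuple(m)                           # (3.5, 0.0): convert back to a tuple if needed
```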


I don’t think splatting MArrays will be as fast as tuples, though. The type-system-encoded length is necessary but not sufficient. I think tuples have some special builtin secret sauce — or they did last I checked.

It’s often possible to simply avoid splatting (or not worry about it).


I’m curious why this is. A StaticArray is just an NTuple underneath. Is it just a missing optimization?

Splatting tuples does seem much faster than splatting mutable or sized arrays. Running the same tests with SizedVectors

# replace tuples with sized vectors
X = SizedVector{2}(a,b)
Y = SizedVector{2}(c,d)
@btime foo_loop_splat!($N,$X,$Y)

is pretty slow compared to splatting tuples.

  106.395 μs (0 allocations: 0 bytes) # splatted tuples
  4.501 ms (170000 allocations: 5.19 MiB) # splatted SizedVector

I’ve gotten around immutability by overwriting tuples in global scope, e.g.

X = X .+ Y

It doesn’t seem that slow to me, but I can see where it would be limiting.
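A toy sketch of that pattern: broadcasting over tuples of arrays returns a new tuple, which then rebinds the name `X` rather than mutating anything:

```julia
X = ([1.0, 2.0], [3.0, 4.0])
Y = ([10.0, 20.0], [30.0, 40.0])

# .+ over two tuples applies + to each pair of entries (here, vector addition)
# and collects the results into a new tuple
X = X .+ Y
# X == ([11.0, 22.0], [33.0, 44.0])
```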

Splatting is “just” iteration, though, and that’s a tough thing to “see through” to the end in general at the type system level.