Why are for loops slower than broadcast with splatted arguments?

I’m broadcasting functions that return tuples, and have been converting the resulting arrays of tuples to tuples of arrays using an “unzip” function. For comparison, I also wrote for loops over each individual entry and re-assembled the output as a tuple.

using BenchmarkTools

function foo(a,b,c,d)
    ab = exp(a*b)
    return ab+c, ab-d
end

N = 10000
a,b,c,d = [randn(N) for i = 1:4]
X = (a,b)
Y = (c,d)

function foo_loop(N,a,b,c,d)
    for i = 1:N
        foo(a[i],b[i],c[i],d[i])
    end
end
@btime foo_loop($N,$a,$b,$c,$d)

function foo_loop_splat!(N,X,Y,X_entry,Y_entry)
    for i = 1:N
        for fld in eachindex(X)
            X_entry[fld] = X[fld][i]
            Y_entry[fld] = Y[fld][i]
        end
        foo(X_entry...,Y_entry...)
    end
end
X_entry, Y_entry = [zeros(length(X)) for i = 1:2]
@btime foo_loop_splat!($N,$X,$Y,$X_entry,$Y_entry)

I noticed that the splatted for loop is much slower than the plain for loop (and than the broadcasted “unzip” approach below):

  102.436 μs (0 allocations: 0 bytes)        # foo_loop
  1.046 ms (50000 allocations: 937.50 KiB)   # foo_loop_splat!

In comparison, using broadcasting and unzip

unzip(a) = map(x->getfield.(a, x), fieldnames(eltype(a)))
@btime unzip(foo.($X...,$Y...))
@btime unzip(foo.($a,$b,$c,$d))

both runtimes are about the same

  158.962 μs (8 allocations: 312.78 KiB)
  156.331 μs (8 allocations: 312.78 KiB)
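As a minimal sketch (a toy input of my own, not the benchmark data), `unzip` turns a vector of tuples into a tuple of vectors:

```julia
unzip(a) = map(x -> getfield.(a, x), fieldnames(eltype(a)))

# fieldnames(Tuple{Int,Int}) gives the indices (1, 2), and
# getfield.(a, i) collects the i-th entry of every tuple into a vector
unzip([(1, 2), (3, 4)])  # ([1, 3], [2, 4])
```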

Why does this occur? I’m not clear on why splatting adds so much extra time. Is it runtime dispatch?

You’re benchmarking in global scope, so the performance numbers are not particularly meaningful. See https://docs.julialang.org/en/v1/manual/performance-tips/index.html#Avoid-global-variables-1 and https://github.com/JuliaCI/BenchmarkTools.jl#quick-start


Just a quick addition to @rdeits’ response. Benchmarking in global scope really slows down the for loop code here. On my machine, I get:

using BenchmarkTools

function foo(a,b,c,d)
    ab = exp(a*b)
    return ab+c, ab-d
end

N = 1000
a,b,c,d = [randn(N) for i = 1:4]
X = (a,b)
Y = (c,d)

u,v = [zeros(N) for i = 1:2]
@btime begin
    for i = 1:N
        foo(a[i],b[i],c[i],d[i])
    end
end

which gives

  109.880 μs

Then, if I define

function foo_loop(N,a,b,c,d)
    for i = 1:N
        foo(a[i],b[i],c[i],d[i])
    end
end

and benchmark by calling the function and interpolating the input arguments into the benchmarking expression (https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#interpolating-values-into-benchmark-expressions) I get:

julia> @btime foo_loop($N,$a,$b,$c,$d)
  7.273 μs (0 allocations: 0 bytes)

Thanks @rdeits and @Pbellive! I’ve edited the original code snippet to interpolate values and avoid global scope timing.

function foo_loop(N,a,b,c,d)
    for i = 1:N
        foo(a[i],b[i],c[i],d[i])
    end
end
@btime foo_loop($N,$a,$b,$c,$d)

function foo_loop_splat!(N,X,Y,X_entry,Y_entry)
    for i = 1:N
        for fld in eachindex(X)
            X_entry[fld] = X[fld][i]
            Y_entry[fld] = Y[fld][i]
        end
        foo(X_entry...,Y_entry...)
    end
end
X_entry, Y_entry = [zeros(length(X)) for i = 1:2]
@btime foo_loop_splat!($N,$X,$Y,$X_entry,$Y_entry)

unzip(a) = map(x->getfield.(a, x), fieldnames(eltype(a)))
@btime unzip(foo.($X...,$Y...))
@btime unzip(foo.($a,$b,$c,$d))

The for loop is now faster than the unzip-and-broadcast code, though the splatted loop is still much slower:

  102.436 μs (0 allocations: 0 bytes)        # foo_loop
  1.046 ms (50000 allocations: 937.50 KiB)   # foo_loop_splat!
  158.962 μs (8 allocations: 312.78 KiB)     # unzip(foo.(X...,Y...))
  156.331 μs (8 allocations: 312.78 KiB)     # unzip(foo.(a,b,c,d))

I think I figured out why it’s so slow - I wasn’t being careful about the distinction between tuples and arrays.

In the original splatted loop, I’m splatting the X_entry and Y_entry arguments, which are Arrays, so the argument list has to be built by iterating them at call time:

function foo_loop_splat!(N,X,Y,Xi,Yi)
    for i = 1:N
        for fld in eachindex(X)
            Xi[fld] = X[fld][i]
            Yi[fld] = Y[fld][i]
        end
        foo(Xi...,Yi...)
    end
end

An old post by @rdeits on Array of tuples noted that tuples have much less overhead than arrays. If I replace the entry-copying loop with the (simpler) tuple-based version

function foo_loop_splat!(N,X,Y)
    for i = 1:N
        foo(getindex.(X,i)...,getindex.(Y,i)...)
    end
end

then the non-splatted and splatted for loops give nearly identical runtimes

@btime foo_loop($N,$a,$b,$c,$d)
@btime foo_loop_splat!($N,$X,$Y)
  1.049 ms (0 allocations: 0 bytes)
  1.093 ms (0 allocations: 0 bytes)
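As a side note, `getindex.(X, i)` broadcasts over the tuple `X` and therefore returns a tuple, so no intermediate array is ever created; a toy sketch:

```julia
X = ([10, 20], [30, 40])  # tuple of arrays, as in the benchmark setup

# broadcasting getindex over a tuple of arrays yields a tuple:
# the i-th entry of each array, with the length 2 known to the compiler
getindex.(X, 2)  # (20, 40)
```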

The key is that arrays don’t encode their lengths into the type system like tuples do. This means when you splat it into a function call, Julia doesn’t know (ahead of time) how many arguments there will be — and thus cannot predict which method will be called. This is essentially a type instability.
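A toy illustration of that difference (my own example, not from the thread): splatting a `Tuple` fixes the argument count at compile time, while splatting a `Vector` leaves it unknown until runtime:

```julia
nargs(args...) = length(args)

t = (1.0, 2.0)   # NTuple{2,Float64}: the length 2 is part of the type
v = [1.0, 2.0]   # Vector{Float64}: the length is only known at runtime

nargs(t...)  # inferable: the compiler knows this is a 2-argument call
nargs(v...)  # not inferable: the matching method must be found at runtime
```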


@mbauman - thanks for the interpretation as type instability! That’s helpful to know.

I converted my code to use tuples of arrays instead of arrays of arrays. However, the immutability of tuples prevents me from modifying them. Is there a recommended mutable alternative to tuples (StructArrays or StaticArrays?) that also encodes its length in the type?

You could try using an MArray from StaticArrays.jl: https://juliaarrays.github.io/StaticArrays.jl/latest/pages/api/#StaticArrays.MArray since they are mutable and fixed size.
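For example (a minimal sketch, assuming StaticArrays.jl is installed):

```julia
using StaticArrays

m = MVector{2, Float64}(0.0, 0.0)  # mutable, and the length 2 is in the type
m[1] = 3.5                         # in-place assignment works, unlike a Tuple
Tuple(m)                           # (3.5, 0.0): convert back to a tuple if needed
```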


I don’t think splatting MArrays will be as fast as tuples, though. The type-system-encoded length is necessary but not sufficient. I think tuples have some special builtin secret sauce — or they did last I checked.

It’s often possible to simply avoid splatting (or not worry about it).


I’m curious why this is. A StaticArray is just an NTuple underneath. Is it just a missing optimization?

Splatting tuples does seem much faster than splatting mutable or sized arrays. Running the same tests with SizedVectors

# replace tuples with sized vectors
X = SizedVector{2}(a,b)
Y = SizedVector{2}(c,d)
@btime foo_loop_splat!($N,$X,$Y)

is pretty slow compared to splatting tuples.

  106.395 μs (0 allocations: 0 bytes) # splatted tuples
  4.501 ms (170000 allocations: 5.19 MiB) # splatted SizedVector

I’ve gotten around immutability by overwriting tuples in global scope, e.g.

X = X .+ Y

It doesn’t seem that slow to me, but I can see where it would be limiting.
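A toy sketch of that pattern: broadcasting over tuples of arrays returns a new tuple, which then rebinds the name `X` rather than mutating anything:

```julia
X = ([1.0, 2.0], [3.0, 4.0])
Y = ([10.0, 20.0], [30.0, 40.0])

# .+ over two tuples applies + to each pair of entries (here, vector addition)
# and collects the results into a new tuple
X = X .+ Y
# X == ([11.0, 22.0], [33.0, 44.0])
```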

Splatting is “just” iteration, though, and that’s a tough thing to “see through” to the end in general at the type system level.