[ANN] UnzipLoops -- broadcasting and unzip the output without overhead

TL;DR GitHub - Suzhou-Tongyuan/UnzipLoops.jl: broadcast and unzip it! provides one single function broadcast_unzip – it works similar to broadcasting but unzip the output in the same loop for better performance.

Still need to wait 3 days to register it: New package: UnzipLoops v0.1.0 by JuliaRegistrator · Pull Request #70918 · JuliaRegistries/General · GitHub

Where the story begins

When broadcasting a function f that has multiple outputs, a need for unzip arises.

f(x, y) = x^y, x/y

X, Y = [1, 2, 3, 4], [4, 3, 2, 1]
out = f.(X, Y)

This out is of type Vector{Tuple{Int,Int}} . It’s often the case that we’ll need to unzip it to Tuple{Vector{Int}, Vector{Int}} . For most the case, we can trivially split it apart:

function g(X, Y)
    out = f.(X, Y)
    return getindex.(out, 1), getindex.(out, 2)
end

You may want to find some smart function to automatically unzip it so that g becomes g(X, Y) = unzip(f.(X, Y).

But here’s the conclusion – unzip alone, whether it is lazily- or eagerly- evaluated, isn’t the optimal solution. Making unzip lazily can save one allocation, but the additional loops are still there.

X, Y = rand(1:5, 1024), rand(1:5, 1024)
@btime f.($X, $Y) # 3.834 μs (1 allocation: 16.12 KiB)
@btime g($X, $Y) # 5.388 μs (4 allocations: 32.41 KiB)

By pre-allocating the output and manually writing the loops, we get a more performant version:

function g(X, Y)
    @assert size(X) == size(Y)
    Base.require_one_based_indexing(X, Y)

    T = promote_type(Float64, eltype(X), eltype(Y))
    N = ndims(X)
    Z1 = Array{T,N}(undef, size(X))
    Z2 = Array{T,N}(undef, size(X))
    @inbounds @simd for i in eachindex(X)
        v = f(X[i], Y[i])
        Z1[i] = v[1]
        Z2[i] = v[2]
    end
    return Z1, Z2
end

@btime g($X, $Y) # 3.999 μs (2 allocations: 16.25 KiB)

broadcast_unzip saves the labor work

Obviously, rewriting the trivial getindex solution into the verbose manual loop introduces much labor work and hurts the readability. This is why broadcast_unzip is introduced – it’s a combination of broadcasting and unzip. Most importantly, this is simple to use and yet fast:

g(X, Y) == broadcast_unzip(f, X, Y) # true
@btime broadcast_unzip(f, $X, $Y) # 4.009 μs (2 allocations: 16.25 KiB)

Additionally, broadcast_unzip accepts more inputs (just like map) as long as their sizes match and f outputs a Tuple of a scalar-like object.

X, Y, Z = rand(1:5, 1024), rand(1:5, 1024), rand(1:5, 1024)
f(x, y, z) = x ^ y ^ z, x / y / z, x * y * z, x / (y*z)
out = broadcast_unzip(f, X, Y, Z)
@assert out[1] == getindex.(f.(X, Y, Z), 1)

@btime map(f, $X, $Y, $Z) # 13.682 μs (2 allocations: 32.05 KiB)
@btime broadcast_unzip(f, $X, $Y, $Z) # 13.418 μs (6 allocations: 32.58 KiB)

This is a joint work (by-product) of me and @thautwarm when developing packages in Tongyuan and we believe the community needs this tool, too. – As far as I know, although this is a common need, such a tool didn’t exist before.

20 Likes

Interesting, how does something like this compare with, say, a StructArray and MappedArray composition, in case a map is equivalent to a broadcast?

julia> X, Y = [1, 2, 3, 4], [4, 3, 2, 1];

julia> f(x, y) = (; a=x^y, b=x/y)
f (generic function with 1 method)

julia> S = StructArray(mappedarray(f, X, Y))
4-element StructArray(::Vector{Int64}, ::Vector{Float64}) with eltype NamedTuple{(:a, :b), Tuple{Int64, Float64}}:
 (a = 1, b = 0.25)
 (a = 8, b = 0.6666666666666666)
 (a = 9, b = 1.5)
 (a = 4, b = 4.0)

julia> S.a
4-element Vector{Int64}:
 1
 8
 9
 4
5 Likes

Could this be a macro, so you can do

@unzip A, B = f.(X, Y)

?

3 Likes

This looks great!

No support for NamedTuples it looks like?

Nice … just as a side note: In case the returned values are of the same type, the split-apply-combine strategy can be used to achieve a similar effect:

julia> combinedims((SVector∘f).(X, Y))  # from SplitApplyCombine.jl
2×4 Matrix{Float64}:
 1.0   8.0       9.0  4.0
 0.25  0.666667  1.5  4.0

julia> rank"0 SVector∘f 0"(X, Y)  # from JJ.jl
2×4 Align{Float64, 2} with eltype Float64 with indices SOneTo(2)×Base.OneTo(4):
 1.0   8.0       9.0  4.0
 0.25  0.666667  1.5  4.0

Note that this

  • requires f to return a vector instead of a tuple (but thanks to StaticArrays this can be as efficient)
  • collects each returned value into the rows of an array

nice, it would be nice we have this in Base, like how we have @nexpr and nloops

Note that StructArrays + Base.Broadcast / Base.Iterators can do something similar. Perhaps this should be better advertised. One difference is (as noted above) is that UnzipLoops demands that sizes match, as does MappedArrays. Are there others?

julia> using StructArrays

julia> function unzip_broadcast(f, args...)
           bc = Broadcast.instantiate(Broadcast.broadcasted(f, args...))
           StructArrays.components(StructArray(bc))
       end;

julia> unzip_broadcast(tuple, [1,2,3], [4 5])  # broadcasts to 3×2
([1 1; 2 2; 3 3], [4 5; 4 5; 4 5])

julia> using UnzipLoops: broadcast_unzip

julia> broadcast_unzip(tuple, [1,2,3], [4 5])
ERROR: DimensionMismatch: size must match

julia> function unzip_map(f, args...)
           it = Iterators.map(f, args...)
           StructArrays.components(StructArray(it))
       end;

julia> broadcast_unzip((x,y) -> (x, x/y), [1,2,3], [4,5,6])
([1, 2, 3], [0.25, 0.4, 0.5])

julia> ans == unzip_map((x,y) -> (x, x/y), [1,2,3], [4,5,6])
true

julia> unzip_map((x,y) -> (x, x/y), [1,2,3], [4,5])  # stops short like zip
([1, 2], [0.25, 0.4])

julia> using MappedArrays

julia> mappedarray((x,y) -> (x, x/y), [1,2,3], [4,5])
ERROR: DimensionMismatch: arrays do not all have the same axes
1 Like

@jishnub Oh, nice trick! I was trying to find such a thing. It turns out that StructArray + MappedArray works quite efficiently already – only <10% performance difference here.

using BenchmarkTools
using StructArrays, MappedArrays
using UnzipLoops
using LinearAlgebra
using SplitApplyCombine
using StaticArrays

X, Y = Float64.(rand(1:5, 64)), Float64.(rand(1:5, 64))

# v1: UnzipLoops
f(x, y) = (x^y, x/y)
# unclear why, but inline is needed here
@inline function mydot_v1(f, X, Y)
    A, B = broadcast_unzip(f, X, Y)
    return dot(A, B)
end

@btime mydot_v1(f, $X, $Y)
# 64: 391.836 ns (2 allocations: 1.12 KiB)
# 1024: 7.536 μs (2 allocations: 16.25 KiB)

# v2: StructArrays + MappedArrays
f_named(x, y) = (; a=x^y, b=x/y)
@inline function mydot_v2(f, X, Y)
    S = StructArray(mappedarray(f, X, Y))
    return dot(S.a, S.b)
end

@assert mydot_v2(f_named, X, Y) ≈ mydot_v1(f, X, Y)
@btime mydot_v2(f_named, $X, $Y)
# 64: 425.266 ns (2 allocations: 1.12 KiB)
# 1024: 8.364 μs (2 allocations: 16.25 KiB)

# v3: SplitApplyCombine -- actually doesn't do the unzip work
function mydot_v3(f, X, Y)
    S = combinedims((SVector∘f).(X, Y))
    return dot(@view(S[1, :]), @view(S[2, :]))
end

@assert mydot_v3(f, X, Y) ≈ mydot_v1(f, X, Y)
@btime mydot_v3(f, $X, $Y)
# 64: 57.363 μs (14 allocations: 2.97 KiB)
# 1024: 71.938 μs (14 allocations: 33.09 KiB)

Re: @DNF

Could this be a macro?

It could, but it won’t make things easier to understand or use. Maybe we’ll introduce a macro in the future, but for the moment it isn’t something crucial.

Re: @pdeffebach

No support for NamedTuples it looks like?

Supporting NamedTuples won’t give anything useful here except for fusing compatibility – Julia optimizes it well so that unzipping a Tuple/NamedTuple introduces zero runtime overhead in practice. This package is mainly targeting unzipping “large” arrays – but we doesn’t optimize the unzip it alone here because we believe the main usage is broadcasting+unzip (they’re almost bundled when performance is a consideration).

Re: @bertschi

In case the returned values are of the same type, the split-apply-combine strategy can be used to achieve a similar effect:

Actually, I think they focus on different use cases. The SplitApplyCombine applies a “cat”-like operation after broadcasting, while UnzipLoops applies an unzip operation. I also included it in the above benchmark script and clearly that it doesn’t suit this usage.

Re: @mcabbott

One difference is (as noted above) is that UnzipLoops demands that sizes match, as does MappedArrays. Are there others?

We make the assumption simple for the first version to ensure that this trick works. But I do think it’s doable to make it a real broadcasting semantics (i.e., supporting unzip_broadcast(tuple, [1,2,3], [4 5])).

Meanwhile, I don’t think iterator will get the best performance here compared to mappedarray. – would be very easy to fall into Mapreduce is 10x slower than loop. · Issue #38558 · JuliaLang/julia · GitHub

2 Likes

I kind of disagree. The reason for suggesting it is that the terseness of the dot syntax is, I believe, key to its success. broadcast_unzip is a bit unwieldy (and has an underscore), and seems like an internal name more than a user-facing function.

Dot+macro decouples the concepts in a pleasing way, imo.

Just a suggestion, of course, thanks for writing this.

2 Likes

In addition to the StructArray solution to this isolated task, it’s often convenient to just keep those individual arrays together as a StructArray from the start. This is efficient and automatically maps/broadcasts into a StructArray.
Like:

data = StructArray(; X, Y)
data.X === X  # same array, not even a copy
f(a) = (X=a.X^a.X, Y=a.X/a.X)
out = f.(data)
# out.X and out.Y are individual component arrays

This way, your function f operates on the whole “object” with X and Y fields, and not on its components as separate arguments.

2 Likes

Can this functionality be handled by the extensible broadcast fusion mechanism?
So that unzip.(f.(X,Y))) will be fused into broadcast_unzip(...).
It seems like it has exactly same flavour as other fusion transformations.

Yes.

I think the easiest macro to write is one that applied only to the broadcasting bit, not to the assignment. And there’s a neat trick to implement this by hooking into how fused broadcasting works:

"""
    @unzipcast f(g(x))
Works like `unzip(@. expr)`, without materialising the broadcast before unzipping it.
"""
macro unzipcast(ex)
  bc = esc(Broadcast.__dot__(ex))
  :(_unz.($bc))
end
"""
    @unzip f.(g.(x))
Like `@unzipcast` but without `@.`, thus expects a broadcasting expression.
"""
macro unzip(ex)
  bc = esc(ex)
  :(_unz.($bc))
end

function _unz end  # this is never called
# (Reluctant to name it unzip, as if that were called with a . it would do something different)
Broadcast.broadcasted(::typeof(_unz), x) = _Unz(x)
struct _Unz{T}; bc::T; end
Broadcast.materialize(x::_Unz) = StructArrays.components(StructArray(Broadcast.instantiate(x.bc)))

f(x, y) = x, x/y;
@unzipcast f(1:3, [4 5])  # ([1 1; 2 2; 3 3], [0.25 0.2; 0.5 0.4; 0.75 0.6])

broadcast_unzip(f, 1:3, Broadcast.broadcasted(+, [4,5,7], 1))  # even with instantiate, eltype(bc) is Any
@unzip f.(1:3, [4,5,7] + 1)
5 Likes

Isn’t this problem solved in principle by doing the broadcasting (rather than the unzip) lazily? e.g. unzip(f(x) for x in v) instead of unzip(f.(v))? I for one almost always use unzip with comprehensions rather than broadcasting.
Currently the Unzip package is not very good performance-wise with generators like this, but I don’t see why it can’t be. I would prefer a better unzip implementation that could deal with generators rather than having to write things in terms of broadcasting (which can become cumbersome in cases like unzip(f(a.x, a.y) for a in v)).

Sorry, this is what I mean by NamedTuple support

julia> x = rand(10); y = rand(10);

julia> f(x, y) = (;x1 = 1, x2 = 2);

julia> h(x, y) = (1, 2);

julia> broadcast_unzip(f, x, y)
(Any[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],)

julia> broadcast_unzip(h, x, y)
([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

it would be great if broadcast_unzip(f, x, y) made a NamedTuple of vectors.

2 Likes

Sure, did not want to claim that it can replace unzipping – which is certainly more general. In the special situation of homogeneously typed return values, just concatenating all of them might be a useable alternative to unzipping.

Re @yha
Isn’t this problem solved in principle by doing the broadcasting (rather than the unzip) lazily? e.g. unzip(f(x) for x in v) instead of unzip(f.(v)) ?

Indeed, and my point is that unzip must be fused into the calculation loop. Generators currently does have the overhead as mentioned in Mapreduce is 10x slower than loop. · Issue #38558 · JuliaLang/julia · GitHub. This is exactly why we use MappedArrays/StructArrays and why I made the package – I wouldn’t try to build the broadcast_unzip stuff if I’ve known the StructArray solution @jishnub proposed.


With @jishnub and @mcabbott’s StructArray idea in mind, the new version v0.1.1 uses Broadcasted to do the loop work. A few things to note:

  • mixed axes are supported (backed by Broadcast)
  • offset arrays are supported
  • named tuple functions such as f(x, y) = (; a=x+y, b=x-y) are supported (@pdeffebach)

Benchmark still shows some maybe marginal performance differences to the StructArray solution:

f(x, y) = (x^y, x/y)
# Baseline
@inline function mydot_v1(f, X, Y)
    tmp = broadcast(f, X, Y)
    return dot(map(first, tmp), map(last, tmp))
end

# UnzipLoops
@inline function mydot_v2(f, X, Y)
    A, B = broadcast_unzip(f, X, Y)
    return dot(A, B)
end

# StructArrays
@inline function mydot_v3(f, X, Y)
    A, B = @unzip f.(X, Y)
    return dot(A, B)
end

# vector case
X, Y = rand(1024), rand(1024);
mydot_v1(f, X, Y) ≈ mydot_v2(f, X, Y) ≈ mydot_v3(f, X, Y)
# essential computation time
@btime f.($X, $Y); # 32.636 μs (1 allocation: 16.12 KiB)
@btime dot($X, $Y); # 51.022 ns (0 allocations: 0 bytes)
# unzip overhead: 6μs vs 0μs vs 0μs
@btime mydot_v1(f, $X, $Y); # 38.929 μs (3 allocations: 32.38 KiB)
@btime mydot_v2(f, $X, $Y); # 32.067 μs (2 allocations: 16.25 KiB)
@btime mydot_v3(f, $X, $Y); # 32.311 μs (2 allocations: 16.25 KiB)

# matrix case
X, Y = rand(1024), collect(rand(1024)');
mydot_v1(f, X, Y) ≈ mydot_v2(f, X, Y) ≈ mydot_v3(f, X, Y)
# essential computation time
@btime f.($X, $Y); # 32.766 ms (2 allocations: 16.00 MiB)
@btime dot($X, $Y); # 51.138 ns (0 allocations: 0 bytes)
# unzip overhead: 11ms vs 7ms vs 13ms
@btime mydot_v1(f, $X, $Y); # 43.507 ms (6 allocations: 32.00 MiB)
@btime mydot_v2(f, $X, $Y); # 39.685 ms (4 allocations: 16.00 MiB)
@btime mydot_v3(f, $X, $Y); # 45.969 ms (4 allocations: 16.00 MiB)

Since this is a zero-dependency solution, this might be a good candidate to propose in Base, thoughts?

2 Likes

I think Base should first have standalone unzip, which is more general, rather than the coupled broadcast_unzip, which is much more restricted.
For example, broadcast_unzip can’t easily express unzip(f(a.x, a.y) for a in v)). And when I want a progress bar and don’t need the extra performance boost, I can do

using ProgressLogging, Unzip
# ...
@progress res = [f(...) for ...]
a,b,c = unzip(res)