# Performance of collecting a zip where one argument to zip has eltype == Union{Nothing, Int}

I ran into an interesting performance issue when collecting a `zip`. Before you say “don’t collect!”: the next step is to feed the collected array into `sort`, which throws a `MethodError` if you pass it a zip iterator. The issue involves collecting a zip where one of the arguments has type `Vector{Union{Nothing, Int}}`. I thought that small unions were supposed to be efficient now, but this appears to be an exception to that rule. Or maybe there’s more going on here than I understand. Here are the input arrays:

``````
using BenchmarkTools

A = rand(['a', 'b', 'c'], 10_000);
x = ['b', 'c', 'a'];
y = indexin(A, x);  # eltype == Union{Nothing, Int}
y_pure = convert(Vector{Int}, y);  # eltype == Int
``````

At the REPL:

``````
julia> eltype(y)
Union{Nothing, Int64}

julia> eltype(y_pure)
Int64

julia> @btime collect(zip($A, $y));
1.279 ms (10004 allocations: 390.75 KiB)

julia> @btime collect(Tuple{Char, Int64}, zip($A, $y));
1.509 ms (19493 allocations: 617.14 KiB)

julia> @btime collect(zip($A, $y_pure));
10.625 μs (4 allocations: 156.38 KiB)
``````

As you can see, collecting the zip with `y_pure` is much faster and uses only 4 allocations, rather than 10,004. Does anyone understand why collecting the zip with the Union type performs so much worse?

It is probably because

``````
julia> isbitstype(Union{Nothing, Int})
false
``````

So a variable inferred to be of such a type needs to be boxed, and is therefore allocated on the heap. I think there are cases in which such boxing can be optimized away, but that does not seem to happen here. I would recommend analyzing the output of `@code_warntype` (or other related macros) to get a better idea of why this happens.
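For concreteness, a quick check of the types involved in this thread (`Tuple{Char, Union{Nothing, Int}}` is the element type of the array that `collect(zip(A, y))` produces):

``````
# The union itself is not a bits type...
isbitstype(Union{Nothing, Int})               # false
# ...and neither is a tuple containing it, which is the element
# type of the collected array from the question:
isbitstype(Tuple{Char, Union{Nothing, Int}})  # false
# With the pure-Int vector, the element type is a bits type:
isbitstype(Tuple{Char, Int})                  # true
``````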


Thanks! That makes sense. I looked at the results of `@code_warntype`, but they don’t mean a whole lot to me. At any rate, it doesn’t look like there are any type inference problems. Here are the outputs, if anyone is interested:

``````
julia> @code_warntype optimize=true collect(zip(A, y))
Variables
#self#::Core.Compiler.Const(collect, false)
itr::Base.Iterators.Zip{Tuple{Array{Char,1},Array{Union{Nothing, Int64},1}}}

Body::Array{Tuple{Char,Union{Nothing, Int64}},1}
1 ── %1  = Base.getfield(itr, :is)::Tuple{Array{Char,1},Array{Union{Nothing, Int64},1}}
│    %2  = Base.getfield(%1, 1, true)::Array{Char,1}
│    %3  = Base.arraysize(%2, 1)::Int64
│    %4  = Base.slt_int(%3, 0)::Bool
│    %5  = Base.ifelse(%4, 0, %3)::Int64
│    %6  = %new(Base.OneTo{Int64}, %5)::Base.OneTo{Int64}
│    %7  = Core.tuple(%6)::Tuple{Base.OneTo{Int64}}
│    %8  = Base.getfield(%1, 2, true)::Array{Union{Nothing, Int64},1}
│    %9  = Base.arraysize(%8, 1)::Int64
│    %10 = Base.slt_int(%9, 0)::Bool
│    %11 = Base.ifelse(%10, 0, %9)::Int64
│    %12 = %new(Base.OneTo{Int64}, %11)::Base.OneTo{Int64}
│    %13 = Core.tuple(%12)::Tuple{Base.OneTo{Int64}}
└───       goto #5 if not true
2 ── %15 = (%5 === %11)::Bool
│    %16 = Base.and_int(true, %15)::Bool
│    %17 = Base.not_int(%16)::Bool
└───       goto #4 if not %17
3 ── %19 = invoke Base.print_to_string("dimensions must match: a has dims "::String, %7::Vararg{Any,N} where N, ", b has dims ", %13, ", mismatch at ", 1)::String
│    %20 = %new(Base.DimensionMismatch, %19)::DimensionMismatch
│          Base.throw(%20)
└───       $(Expr(:unreachable))
4 ┄─       nothing
5 ┄─       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %27 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Tuple{Char,Union{Nothing, Int64}},1}, svec(Any, Int64), 0, :(:ccall), Array{Tuple{Char,Union{Nothing, Int64}},1}, :(%5), :(%5)))::Array{Tuple{Char,Union{Nothing, Int64}},1}
└───       goto #9
9 ── %29 = invoke Base.copyto!(%27::Array{Tuple{Char,Union{Nothing, Int64}},1}, _2::Base.Iterators.Zip{Tuple{Array{Char,1},Array{Union{Nothing, Int64},1}}})::Array{Tuple{Char,Union{Nothing, Int64}},1}
└───       goto #10
10 ─       return %29
``````

``````
julia> @code_warntype optimize=true collect(zip(A, y_pure))
Variables
#self#::Core.Compiler.Const(collect, false)
itr::Base.Iterators.Zip{Tuple{Array{Char,1},Array{Int64,1}}}

Body::Array{Tuple{Char,Int64},1}
1 ── %1  = Base.getfield(itr, :is)::Tuple{Array{Char,1},Array{Int64,1}}
│    %2  = Base.getfield(%1, 1, true)::Array{Char,1}
│    %3  = Base.arraysize(%2, 1)::Int64
│    %4  = Base.slt_int(%3, 0)::Bool
│    %5  = Base.ifelse(%4, 0, %3)::Int64
│    %6  = %new(Base.OneTo{Int64}, %5)::Base.OneTo{Int64}
│    %7  = Core.tuple(%6)::Tuple{Base.OneTo{Int64}}
│    %8  = Base.getfield(%1, 2, true)::Array{Int64,1}
│    %9  = Base.arraysize(%8, 1)::Int64
│    %10 = Base.slt_int(%9, 0)::Bool
│    %11 = Base.ifelse(%10, 0, %9)::Int64
│    %12 = %new(Base.OneTo{Int64}, %11)::Base.OneTo{Int64}
│    %13 = Core.tuple(%12)::Tuple{Base.OneTo{Int64}}
└───       goto #5 if not true
2 ── %15 = (%5 === %11)::Bool
│    %16 = Base.and_int(true, %15)::Bool
│    %17 = Base.not_int(%16)::Bool
└───       goto #4 if not %17
3 ── %19 = invoke Base.print_to_string("dimensions must match: a has dims "::String, %7::Vararg{Any,N} where N, ", b has dims ", %13, ", mismatch at ", 1)::String
│    %20 = %new(Base.DimensionMismatch, %19)::DimensionMismatch
│          Base.throw(%20)
└───       $(Expr(:unreachable))
4 ┄─       nothing
5 ┄─       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %27 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Tuple{Char,Int64},1}, svec(Any, Int64), 0, :(:ccall), Array{Tuple{Char,Int64},1}, :(%5), :(%5)))::Array{Tuple{Char,Int64},1}
└───       goto #9
9 ── %29 = invoke Base.copyto!(%27::Array{Tuple{Char,Int64},1}, _2::Base.Iterators.Zip{Tuple{Array{Char,1},Array{Int64,1}}})::Array{Tuple{Char,Int64},1}
└───       goto #10
10 ─       return %29
``````

Yes. I checked the difference between them, and the problem is just that the collected `Vector` has an element type that is not `isbitstype`, so each position of it is a pointer to a newly allocated small object on the heap. Memory allocation is not incredibly slow, but allocating so many small objects, compared to just writing `Int`s into a ‘compact’ vector of `Int`s, adds up to a large difference. I had a similar problem recently: a heuristic that runs a loop at least one million times and does not do a lot of work in each iteration. I reduced its runtime by about 6x by avoiding any allocations inside the loop.

I think what is meant by ‘small unions are efficient’ (someone correct me if I am wrong) is that they are not much worse than the single allocation needed to box the union, and in some cases (ones that do not create a new object with `Union` fields, i.e., where the heap object does not escape its scope) the boxing can be optimized away entirely.
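As an illustration of the escape point (a hypothetical loop, not from this thread): when the union-typed value is consumed behind an `isa` check inside a function, the compiler can union-split the branch into concrete cases, and no boxing is needed:

``````
# Hypothetical example: iterate a Vector{Union{Nothing, Int}} without allocating.
# The `isa` check lets the compiler split the union into concrete branches,
# so inside the branch `v` is a plain Int.
function sum_ints(y::Vector{Union{Nothing, Int}})
    s = 0
    for v in y
        if v isa Int
            s += v
        end
    end
    return s
end

sum_ints(Union{Nothing, Int}[1, nothing, 2])  # 3
``````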

I don’t think this is the case. The memory layout of the “data” portion of the array is identical to that of an `isbitstype` array, as long as the element type is a small union:

``````
julia> x = Union{Int,Nothing}[1, 2, 3]
3-element Array{Union{Nothing, Int64},1}:
1
2
3

julia> y = unsafe_wrap(Array, Ptr{Int}(pointer(x)), size(x))
3-element Array{Int64,1}:
1
2
3
``````

Besides, it doesn’t explain why `collect(Tuple{Char, Int64}, zip($A, $y))` is also slow.

I think the problem is the allocation of intermediate tuples, due to union splitting not working well with tuples of unions. With these manual implementations:

``````
function collect_int_char1(xs)
    ys = Vector{Tuple{Char,Int}}(undef, length(xs))
    a, b = xs.is
    for i in eachindex(ys, xs.is...)
        x1 = @inbounds a[i]
        x2 = @inbounds b[i]
        if x1 isa Char
            if x2 isa Int
                @inbounds ys[i] = (x1, x2)
            end
        end
    end
    return ys
end

function collect_int_char2(xs)
    ys = Vector{Tuple{Char,Int}}(undef, length(xs))
    a, b = xs.is
    for i in eachindex(ys, xs.is...)
        x1 = @inbounds a[i]
        x2 = @inbounds b[i]
        x = (x1, x2)
        if x isa Tuple{Char,Int}
            @inbounds ys[i] = x
        end
    end
    return ys
end
``````

I get

``````
julia> @btime collect_int_char1(zip($A, $y));
11.582 μs (4 allocations: 156.38 KiB)

julia> @btime collect_int_char2(zip($A, $y));
591.180 μs (10004 allocations: 468.88 KiB)
``````

while

``````
julia> @btime collect(Tuple{Char, Int64}, zip($A, $y));
795.722 μs (19493 allocations: 617.14 KiB)

julia> @btime collect(zip($A, $y_pure));
18.759 μs (4 allocations: 156.38 KiB)
``````

So, type-asserting before creating a tuple helps:

``````
julia> @btime collect(zip($A, (x::Int for x in $y)));
23.899 μs (5 allocations: 156.39 KiB)
``````
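One caveat worth noting (my own observation, not benchmarked above): the `x::Int` typeassert only works when `indexin` found a match for every element. If `y` contains a `nothing`, the generator throws a `TypeError` instead of collecting:

``````
# Small deterministic example: 'c' has no match in the lookup vector,
# so `z` contains a `nothing` and the typeassert fires during collection.
A2 = ['a', 'b', 'c']
z = indexin(A2, ['a', 'b'])             # [1, 2, nothing]
collect(zip(A2, (x::Int for x in z)))   # throws TypeError
``````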