Performance of collecting a zip where one argument to zip has eltype == Union{Nothing, Int}

I ran into an interesting performance issue when collecting a zip. Before you say “don’t collect!”: the next step is to feed the collected array into sort, which throws a MethodError if you pass it a zip iterator directly. The issue involves collecting a zip where one of the arguments to zip has type Vector{Union{Nothing, Int}}. I thought that small unions were supposed to be efficient now, but this appears to be an exception to that rule. Or maybe there’s more going on here than I understand. Here are the input arrays:

using BenchmarkTools

A = rand(['a', 'b', 'c'], 10_000);
x = ['b', 'c', 'a'];
y = indexin(A, x);  # eltype == Union{Nothing, Int}
y_pure = convert(Vector{Int}, y);  # eltype == Int

At the REPL:

julia> eltype(y)
Union{Nothing, Int64}

julia> eltype(y_pure)
Int64

julia> @btime collect(zip($A, $y));
  1.279 ms (10004 allocations: 390.75 KiB)

julia> @btime collect(Tuple{Char, Int64}, zip($A, $y));
  1.509 ms (19493 allocations: 617.14 KiB)

julia> @btime collect(zip($A, $y_pure));
  10.625 μs (4 allocations: 156.38 KiB)

As you can see, collecting the zip with y_pure is much faster and only uses 4 allocations, rather than 10,000 allocations. Does anyone understand why collecting the zip with the Union type performs so much worse?
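For completeness, here is the sort step that motivates collecting in the first place (a minimal sketch): sort accepts the collected Vector of tuples, but calling it on the zip iterator itself is what throws the MethodError.

```julia
# Minimal sketch: `sort` works on a Vector of tuples (lexicographic
# order on the tuples), but not on the zip iterator itself.
pairs = collect(zip(['b', 'a'], [1, 2]))
sort(pairs)                      # [('a', 2), ('b', 1)]
# sort(zip(['b', 'a'], [1, 2]))  # MethodError
```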

It is probably because

julia> isbitstype(Union{Nothing, Int})
false

So a variable inferred to be of such a type needs to be boxed, and so it is allocated on the heap. I think there are cases in which such boxing can be optimized away, but that does not seem to happen here. I would recommend analyzing the output of @code_warntype or other related macros to get a better idea of why this happens.
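To make this concrete, a quick check (a sketch) shows that not only the union itself but also a tuple containing it fails the isbits test, and a tuple of that shape is exactly what collect(zip(...)) has to store:

```julia
# isbitstype checks: the small union is not isbits, and neither is a
# tuple that contains it, which is the element type collect must store.
@show isbitstype(Int)                               # true
@show isbitstype(Union{Nothing, Int})               # false
@show isbitstype(Tuple{Char, Int})                  # true
@show isbitstype(Tuple{Char, Union{Nothing, Int}})  # false
```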


Thanks! That makes sense. I looked at the results of @code_warntype, but they don’t mean a whole lot to me. At any rate, it doesn’t look like there are any type inference problems. Here are the outputs, if anyone is interested:

julia> @code_warntype optimize=true collect(zip(A, y))
Variables
  #self#::Core.Compiler.Const(collect, false)
  itr::Base.Iterators.Zip{Tuple{Array{Char,1},Array{Union{Nothing, Int64},1}}}

Body::Array{Tuple{Char,Union{Nothing, Int64}},1}
1 ── %1  = Base.getfield(itr, :is)::Tuple{Array{Char,1},Array{Union{Nothing, Int64},1}}
│    %2  = Base.getfield(%1, 1, true)::Array{Char,1}
│    %3  = Base.arraysize(%2, 1)::Int64
│    %4  = Base.slt_int(%3, 0)::Bool
│    %5  = Base.ifelse(%4, 0, %3)::Int64
│    %6  = %new(Base.OneTo{Int64}, %5)::Base.OneTo{Int64}
│    %7  = Core.tuple(%6)::Tuple{Base.OneTo{Int64}}
│    %8  = Base.getfield(%1, 2, true)::Array{Union{Nothing, Int64},1}
│    %9  = Base.arraysize(%8, 1)::Int64
│    %10 = Base.slt_int(%9, 0)::Bool
│    %11 = Base.ifelse(%10, 0, %9)::Int64
│    %12 = %new(Base.OneTo{Int64}, %11)::Base.OneTo{Int64}
│    %13 = Core.tuple(%12)::Tuple{Base.OneTo{Int64}}
└───       goto #5 if not true
2 ── %15 = (%5 === %11)::Bool
│    %16 = Base.and_int(true, %15)::Bool
│    %17 = Base.not_int(%16)::Bool
└───       goto #4 if not %17
3 ── %19 = invoke Base.print_to_string("dimensions must match: a has dims "::String, %7::Vararg{Any,N} where N, ", b has dims ", %13, ", mismatch at ", 1)::String
│    %20 = %new(Base.DimensionMismatch, %19)::DimensionMismatch
│          Base.throw(%20)
└───       $(Expr(:unreachable))
4 ┄─       nothing
5 ┄─       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %27 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Tuple{Char,Union{Nothing, Int64}},1}, svec(Any, Int64), 0, :(:ccall), Array{Tuple{Char,Union{Nothing, Int64}},1}, :(%5), :(%5)))::Array{Tuple{Char,Union{Nothing, Int64}},1}
└───       goto #9
9 ── %29 = invoke Base.copyto!(%27::Array{Tuple{Char,Union{Nothing, Int64}},1}, _2::Base.Iterators.Zip{Tuple{Array{Char,1},Array{Union{Nothing, Int64},1}}})::Array{Tuple{Char,Union{Nothing, Int64}},1}
└───       goto #10
10 ─       return %29
julia> @code_warntype optimize=true collect(zip(A, y_pure))
Variables
  #self#::Core.Compiler.Const(collect, false)
  itr::Base.Iterators.Zip{Tuple{Array{Char,1},Array{Int64,1}}}

Body::Array{Tuple{Char,Int64},1}
1 ── %1  = Base.getfield(itr, :is)::Tuple{Array{Char,1},Array{Int64,1}}
│    %2  = Base.getfield(%1, 1, true)::Array{Char,1}
│    %3  = Base.arraysize(%2, 1)::Int64
│    %4  = Base.slt_int(%3, 0)::Bool
│    %5  = Base.ifelse(%4, 0, %3)::Int64
│    %6  = %new(Base.OneTo{Int64}, %5)::Base.OneTo{Int64}
│    %7  = Core.tuple(%6)::Tuple{Base.OneTo{Int64}}
│    %8  = Base.getfield(%1, 2, true)::Array{Int64,1}
│    %9  = Base.arraysize(%8, 1)::Int64
│    %10 = Base.slt_int(%9, 0)::Bool
│    %11 = Base.ifelse(%10, 0, %9)::Int64
│    %12 = %new(Base.OneTo{Int64}, %11)::Base.OneTo{Int64}
│    %13 = Core.tuple(%12)::Tuple{Base.OneTo{Int64}}
└───       goto #5 if not true
2 ── %15 = (%5 === %11)::Bool
│    %16 = Base.and_int(true, %15)::Bool
│    %17 = Base.not_int(%16)::Bool
└───       goto #4 if not %17
3 ── %19 = invoke Base.print_to_string("dimensions must match: a has dims "::String, %7::Vararg{Any,N} where N, ", b has dims ", %13, ", mismatch at ", 1)::String
│    %20 = %new(Base.DimensionMismatch, %19)::DimensionMismatch
│          Base.throw(%20)
└───       $(Expr(:unreachable))
4 ┄─       nothing
5 ┄─       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %27 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Tuple{Char,Int64},1}, svec(Any, Int64), 0, :(:ccall), Array{Tuple{Char,Int64},1}, :(%5), :(%5)))::Array{Tuple{Char,Int64},1}
└───       goto #9
9 ── %29 = invoke Base.copyto!(%27::Array{Tuple{Char,Int64},1}, _2::Base.Iterators.Zip{Tuple{Array{Char,1},Array{Int64,1}}})::Array{Tuple{Char,Int64},1}
└───       goto #10
10 ─       return %29

Yes. I checked the difference between them, and the problem is just that the element type of the collected Vector is not an isbitstype, so each position of it is a pointer to a newly allocated small object on the heap. Memory allocation is not incredibly slow, but allocating so many small objects, compared with just writing Ints into a ‘compact’ vector of Ints, adds up to a large difference. I had a similar problem recently: I have a heuristic that runs a loop at least one million times and does not do a lot of work in each iteration. I reduced its runtime by about 6x by avoiding any allocations inside the loop.

I think what is meant by ‘small unions are efficient’ (someone correct me if I am wrong) is that they are not much worse than the single allocation needed to box the union, and in some cases (when no new object is created with Union-typed fields, i.e., when the heap object does not escape its scope) the boxing can be optimized away entirely.
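For context, ‘union splitting’ means the compiler turns code over a small union into one branch per concrete type, so each branch works with an unboxed value. A hand-written sketch of the same idea (`sum_skipnothing` is a made-up example, not anything from Base):

```julia
# Hand-written version of what union splitting does automatically:
# branch on the concrete type, so the Int branch sees an unboxed
# machine integer (the Nothing case simply falls through).
function sum_skipnothing(v::AbstractVector{Union{Nothing, Int}})
    s = 0
    for x in v
        if x isa Int
            s += x  # here `x` is a plain Int
        end
    end
    return s
end

sum_skipnothing(Union{Nothing, Int}[1, nothing, 2])  # 3
```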

I don’t think this is the case. The memory layout of the “data” portion of the array is identical to that of an isbits element type, as long as the element type is a small union (the type tags are stored separately):

julia> x = Union{Int,Nothing}[1, 2, 3]
3-element Array{Union{Nothing, Int64},1}:
 1
 2
 3

julia> y = unsafe_wrap(Array, Ptr{Int}(pointer(x)), size(x))
3-element Array{Int64,1}:
 1
 2
 3

Besides, it doesn’t explain why collect(Tuple{Char, Int64}, zip($A, $y)) is also slow.
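Another way to see this, assuming a recent Julia where the internal helper Base.allocatedinline exists (it is not public API, so the name is a detail that could change):

```julia
# `Base.allocatedinline(T)` reports whether array elements of type T
# are stored inline in the array's data, rather than as heap pointers.
@show Base.allocatedinline(Union{Nothing, Int})  # true: isbits-union arrays store data + type tags inline
@show Base.allocatedinline(Tuple{Char, Union{Nothing, Int}})  # the tuple-of-union eltype is the problematic one
```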


I think the problem is the allocation of intermediate tuples, due to union splitting not working well with tuples of unions. With the following manual implementations

function collect_int_char1(xs)
    ys = Vector{Tuple{Char,Int}}(undef, length(xs))
    a, b = xs.is
    for i in eachindex(ys, xs.is...)
        x1 = @inbounds a[i]
        x2 = @inbounds b[i]
        # Branch on the element types first: union splitting gives an
        # unboxed Int here, so the tuple below is already isbits.
        if x1 isa Char
            if x2 isa Int
                @inbounds ys[i] = (x1, x2)
            end
        end
    end
    return ys
end

function collect_int_char2(xs)
    ys = Vector{Tuple{Char,Int}}(undef, length(xs))
    a, b = xs.is
    for i in eachindex(ys, xs.is...)
        x1 = @inbounds a[i]
        x2 = @inbounds b[i]
        # Build the tuple first: this creates an intermediate of type
        # Tuple{Char, Union{Nothing, Int}}, which is not isbits and allocates.
        x = (x1, x2)
        if x isa Tuple{Char,Int}
            @inbounds ys[i] = x
        end
    end
    return ys
end

I get

julia> @btime collect_int_char1(zip($A, $y));
  11.582 μs (4 allocations: 156.38 KiB)

julia> @btime collect_int_char2(zip($A, $y));
  591.180 μs (10004 allocations: 468.88 KiB)

while

julia> @btime collect(Tuple{Char, Int64}, zip($A, $y));
  795.722 μs (19493 allocations: 617.14 KiB)

julia> @btime collect(zip($A, $y_pure));
  18.759 μs (4 allocations: 156.38 KiB)

So, type-asserting before creating a tuple helps:

julia> @btime collect(zip($A, (x::Int for x in $y)));
  23.899 μs (5 allocations: 156.39 KiB)
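A generic version of the same trick (a sketch; `zip_asserted` is a made-up name), for when the union-typed argument is known to contain no nothing:

```julia
# Sketch: wrap the union-eltype argument in a type-asserting generator,
# so the tuples built by zip are already Tuple{Char, Int}.
zip_asserted(a, b, ::Type{T}) where {T} = zip(a, (x::T for x in b))

v = collect(zip_asserted(['a', 'b'], Union{Nothing, Int}[1, 2], Int))
# v == [('a', 1), ('b', 2)] with eltype Tuple{Char, Int64}
```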