Custom flatten version is faster than `flatten()`

Consider an array of matrices, e.g. `a = [rand(3, 3) for i = 1:1000]`. I have two ways to “flatten” this array:

function naive_flatten(a)
    ret = Array{Float64}(undef, 0)
    @inbounds for i = 1:length(a)
        append!(ret, vec(a[i]))
    end
    return ret
end

using BenchmarkTools; using Base.Iterators; 
@btime aa = naive_flatten($a)
## built-in function
@btime aa = collect(flatten($a))

  61.487 μs (2012 allocations: 222.64 KiB)
  204.570 μs (18015 allocations: 819.22 KiB)

Why is there such a big difference? My function shouldn’t even do that well, since I’m not preallocating anything…
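For what it’s worth, here is a sketch of a version that does preallocate, using the fact that the total element count is cheap to compute up front (the name `prealloc_flatten` is just for illustration, and it assumes an array of `Float64` arrays as in the example above):

```julia
function prealloc_flatten(a)
    n = sum(length, a)                     # total number of elements, known up front
    ret = Vector{Float64}(undef, n)        # single allocation for the output
    k = 1
    for ai in a
        copyto!(ret, k, ai, 1, length(ai)) # bulk copy in column-major order
        k += length(ai)
    end
    return ret
end
```

The single up-front allocation avoids the repeated growth that `append!` into an empty vector incurs.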

I think the problem is that the current implementation of `Iterators.flatten` does not take advantage of the information it has: first of all, it treats arrays generically, as if we didn’t know their actual length. In fact, if you look at the source, you’ll see that

`flatten_iteratorsize` does not specialize on `Union{HasLength, HasShape}` unless the iterator’s element type is a tuple or a number.

I don’t know whether there is a deeper motivation for that.
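This is easy to check from the REPL: for a vector of matrices the trait falls back to `SizeUnknown`, while a vector of tuples hits the specialized method. Since `collect` relies on this trait to preallocate its output, `SizeUnknown` forces the slow push-one-at-a-time path:

```julia
using Base.Iterators

# Element type is Matrix{Float64}, so no specialization applies:
a = [rand(3, 3) for i = 1:1000]
Base.IteratorSize(typeof(flatten(a)))   # Base.SizeUnknown()

# Element type is a tuple, so the specialized method fires:
b = [(1.0, 2.0) for i = 1:1000]
Base.IteratorSize(typeof(flatten(b)))   # Base.HasLength()
```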

This is not so much about `flatten_iteratorsize` as about the performance difference between

julia> function naive_flatten2(a)
           ret = Array{Float64}(undef, 0)
           for ai in a
               for aij in ai
                   push!(ret, aij)
               end
           end
           return ret
       end
naive_flatten2 (generic function with 1 method)

julia> function naive_flatten3(a)
           ret = Array{Float64}(undef, 0)
           for ai in Iterators.flatten(a)
               push!(ret, ai)
           end
           return ret
       end
naive_flatten3 (generic function with 1 method)

julia> @btime aa = naive_flatten2($a);
  100.138 μs (14 allocations: 256.70 KiB)

julia> @btime aa = naive_flatten3($a);
  239.802 μs (18015 allocations: 819.22 KiB)
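As a side note (not a fix for `flatten` itself): Base has a specialized `reduce(vcat, ...)` method for collections of arrays that computes the total length up front and copies in one pass, so this workaround should perform much closer to the hand-written loop than `collect(flatten(a))` does:

```julia
a = [rand(3, 3) for i = 1:1000]

# vec.(a) reshapes each matrix into a vector (no copy), and the
# specialized reduce(vcat, ...) method concatenates them in one pass:
aa = reduce(vcat, vec.(a))
```

The result matches `collect(Iterators.flatten(a))`, since both traverse each matrix in column-major order.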

Is this something worth opening an issue for?

FWIW, I’d think it is.