Iterating over two vectors with vcat is faster than Iterators.flatten and CatViews?

Hey,

I have two independent vectors which change in size over time, but sometimes I need to iterate over both of them. I tried vcat, Iterators.flatten and CatViews, and vcat needs the least runtime and memory, quite the opposite of what I was expecting.

using BenchmarkTools, CatViews

function mysum(x)
    s = zero(eltype(x))
    for e in x
        s += e # only a dummy operation
    end
    s
end

a = rand(1_000);
b = rand(1_000);

@btime vcatvec = vcat(a, b);
  1.809 μs (1 allocation: 15.75 KiB)
@btime mysum(vcatvec)
  2.093 μs (1 allocation: 16 bytes)

@btime flatted = Base.Iterators.flatten((a, b));
  268.347 ns (2 allocations: 48 bytes)
@btime mysum(flatted)
  30.454 μs (4001 allocations: 125.02 KiB)

@btime catview = CatView(a, b);
  492.593 ns (7 allocations: 256 bytes)
@btime mysum(catview)
  911.768 μs (12989 allocations: 765.66 KiB)

I used Julia 1.0.0 and triggered compilation before all measurements.
Is there a faster way to iterate over two vectors?

Make sure to interpolate the variables, always.
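
For example, defining the containers at top level and then interpolating them into @btime with $:

vcatvec = vcat(a, b);
@btime mysum($vcatvec)
flatted = Base.Iterators.flatten((a, b));
@btime mysum($flatted)
catview = CatView(a, b);
@btime mysum($catview)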


Apparently, adding $ doesn’t help much here. It seems that sum(CatView(...)) at least is type unstable; I’m not sure what’s causing it.

Yes,
computers don’t work the way we like to think they do.

Cache friendliness and branch-prediction friendliness matter an immense amount.
When you do a vcat you pay the one-off price of copying everything into contiguous memory,
but then you get to do an operation that the CPU is really well optimized for.
(This is a hardware thing, not a Julia thing.)

For large enough arrays CatViews should overtake vcat, which currently costs:

@btime vcatvec = vcat(a, b);
  1.809 μs (1 allocation: 15.75 KiB)
@btime mysum(vcatvec)
  2.093 μs (1 allocation: 16 bytes)
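
To see that the copy really is a one-off price, here is a minimal sketch (array sizes and the pass count are my own choices) that pays for vcat once and then reuses the contiguous buffer over many passes:

using BenchmarkTools

a = rand(10_000);
b = rand(10_000);

function sum_many_passes(v, n)
    s = zero(eltype(v))
    for _ in 1:n
        s += sum(v)  # each pass scans contiguous memory
    end
    s
end

# Copy once, then do 100 contiguous passes; the vcat cost is amortized.
@btime sum_many_passes(vcat($a, $b), 100);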

What about this?

julia> function broadcast_sum(x, y)
    output = zero(eltype(x))
    for i in eachindex(x)  # assumes x and y share the same indices
        output += x[i] + y[i]
    end
    output
end
broadcast_sum (generic function with 1 method)

julia> @btime broadcast_sum(a,b)
  1.007 μs (1 allocation: 16 bytes)
julia> @btime broadcast_sum(a,b)
  996.286 ns (1 allocation: 16 bytes)

Another version

 @btime sum(broadcast((x,y)->x+y,a,b))
  509.019 ns (2 allocations: 7.95 KiB)

But it’s not just a bit slower, it’s a couple of orders of magnitude slower, with lots of allocations.

I tried to write my own custom sum for CatViews:

a = rand(1_000);
b = rand(1_000);
v = CatView(a, b)
mysum(v::CatView) = sum(sum.(v.arr))

Then

julia> @btime mysum($v);
  915.714 ns (4 allocations: 80 bytes)

vs

julia> @btime sum($v);
  191.408 μs (6995 allocations: 109.78 KiB)

And here’s a very barebones toy implementation of something similar (but of course less general):

struct MyCatView{N, T}
    arr::NTuple{N, Vector{T}}    
end
MyCatView(arrs::Vector{T}...) where {T} = MyCatView(arrs)
mysum(v::MyCatView) = sum(sum.(v.arr))

Then

julia> a, b = rand(1000), rand(1000);

julia> v = MyCatView(a, b);

julia> @btime mysum($v)
  192.796 ns (0 allocations: 0 bytes)

That’s three orders of magnitude faster(!) I think there must be something else going on.


I don’t understand a couple of implementation details in CatViews:

  • How is getindex supposed to be fast at all? It doesn’t properly propagate bounds checking, and a for-loop runs on every access. Using the tuple’s getindex accessor makes it faster, but it’s still type unstable. (See the sketch below this list.)
  • As already noticed, it isn’t type stable; maybe this came with the 1.0 update, where a lot of type-parameter declarations were removed.
  • The perf.jl file doesn’t actually contain usable comparison benchmarks.
  • What’s the purpose of the inner field in CatView? It seems to never be used.
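
For illustration, here is a rough sketch (my own code, not CatViews’ actual implementation) of a getindex for the toy MyCatView above that propagates bounds checks and stays type stable because all parts share one vector type:

Base.@propagate_inbounds function Base.getindex(v::MyCatView, i::Int)
    # v.arr is an NTuple of same-typed vectors, so this loop is type stable.
    for p in v.arr
        n = length(p)
        i <= n && return p[i]  # the access inherits the caller's @inbounds
        i -= n
    end
    throw(BoundsError(v, i))  # index past the last part
end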

I’d stay away from CatViews until it has proven that the original benefits are still valid in Julia 1.0.

Maybe @rdeits can chime in on performance problems with CatView, as he upgraded CatViews for Julia 1.0.

For anyone who isn’t aware,
it is worth noting that CatViews is quite old.
It was designed for Julia 0.3, IIRC.
I think its type-stability issues actually came from the 0.3->0.4 transition.
(or maybe 0.4->0.5)

RecursiveArrayTools.jl is a good alternative, IIRC.
But I’ve not benchmarked it on this.

Thanks to you all for the quick answers.
@oxinabox RecursiveArrayTools.jl seems very promising. I tried this:

using RecursiveArrayTools

vofp = ArrayPartition(a, b)
@btime mysum(vofp)
  15.247 μs (2001 allocations: 31.27 KiB)
@btime sum(vofp)
  929.750 ns (7 allocations: 112 bytes)

In mysum I had to adjust this:

s = zero(eltype(eltype(x)))

They report “Broadcasting like f.(A) is efficient.” on their GitHub page. I’ll try this later in my project.
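
If that claim holds, usage would look something like this (an untested sketch based only on that documented claim):

using RecursiveArrayTools

vofp = ArrayPartition(a, b)
doubled = 2 .* vofp   # broadcasts over both parts, returns an ArrayPartition
s = sum(abs2.(vofp))  # broadcast followed by a reduction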

IMO the really annoying thing is that the flattened iterator allocates during iteration, which makes mysum(flattened) so slow. Looking at @code_warntype, the bad object is of type ::Union{Nothing, Tuple{Float64,Tuple{Int64,Array{Float64,1},Int64}}}.

In the case of flattening a tuple of iterators or an array of iterators (anything that can be indexed with an Int), there is no good reason for the iterator state not to be a bitstype (it should be a tuple of integers, such that the next element is iterate(flattened_iterators[i], j)). So I’d say this is either a problem with Base.Iterators.flatten or yet another manifestation of the problem with heap-allocating small immutable non-bitstypes.
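
To sketch what I mean (toy code of my own, not Base’s implementation), here is a flatten over a tuple of same-typed vectors whose state is just two integers, so iterating it doesn’t allocate:

struct TupleFlatten{T<:Tuple}
    its::T
end

# Element type for the homogeneous case (a tuple of Vector{T}).
Base.eltype(::Type{<:TupleFlatten{<:Tuple{Vararg{Vector{T}}}}}) where {T} = T

function Base.iterate(f::TupleFlatten, state=(1, 1))
    i, j = state  # (which array, which index within it): plain bits
    while i <= length(f.its)
        arr = f.its[i]
        if j <= length(arr)
            return arr[j], (i, j + 1)
        end
        i += 1  # this array is exhausted, move to the next one
        j = 1
    end
    return nothing
end

mysum(TupleFlatten((a, b)))  # same loop as above, no per-element allocations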
