Iterating over two vectors with vcat is faster than Iterators.flatten and CatViews?

control13 · September 30, 2018, 2:48pm

Hey,

I have two independant vectors, which change in size over time, but sometimes I need to iterate over them both. I tried vcat, Iterators.flatten and CatViews and vcat needs the least amount of runtime and memory, quite the opposite of what I was expecting.

using BenchmarkTools, CatViews

function mysum(x)
    s = zero(eltype(x))
    for e in x
        s += e # only a dummy operation
    end
    s
end

a = rand(1_000);
b = rand(1_000);

@btime vcatvec = vcat(a, b);
  1.809 μs (1 allocation: 15.75 KiB)
@btime mysum(vcatvec)
  2.093 μs (1 allocation: 16 bytes)

@btime flatted = Base.Iterators.flatten((a, b));
  268.347 ns (2 allocations: 48 bytes)
@btime mysum(flatted)
  30.454 μs (4001 allocations: 125.02 KiB)

@btime catview = CatView(a, b);
  492.593 ns (7 allocations: 256 bytes)
@btime mysum(catview)
  911.768 μs (12989 allocations: 765.66 KiB)

I used Julia 1.0.0 and triggered the compilation bevor all measurments.
Is there a faster way for iterating over two vectors?

DNF · September 30, 2018, 4:34pm

Make sure to interpolate the variables, always.

DNF · September 30, 2018, 4:54pm

Apparently, adding $ doesn’t help much here. It seems that sum(CatViews(...)) at least is type unstable, not sure what’s causing it.

oxinabox · October 1, 2018, 3:38am

Yes,
computers don’t work the way we like to think they do.

Cache-friendlyness and Branch-prediction friendlness matters an immense amount.
when you do a vcat you pay the one off price for the coping of everything to contagious memory,
but then you get to be doing an operation that the CPU is really well optimized for.
(This is a hardware thing, not a julia thing)

For large enough arrays CatViews should overtake

Ajaychat3 · October 1, 2018, 6:20am

@btime vcatvec = vcat(a, b);
  1.809 μs (1 allocation: 15.75 KiB)
@btime mysum(vcatvec)
  2.093 μs (1 allocation: 16 bytes)

What about this?

julia>  function broadcast_sum(x,y)
    output = zero(eltype(x)) 
    for i in eachindex(x) 
        output += x[i]+y[i]
    end
    output
end
broadcast_sum (generic function with 1 method)

julia> @btime broadcast_sum(a,b)
  1.007 μs (1 allocation: 16 bytes)
julia> @btime broadcast_sum(a,b)
  996.286 ns (1 allocation: 16 bytes)

Another version

 @btime sum(broadcast((x,y)->x+y,a,b))
  509.019 ns (2 allocations: 7.95 KiB)

DNF · October 1, 2018, 6:29am

But it’s not just a bit slower, it’s couple of orders of magnitude, with lots of allocations.

I tried to write my own custom sum for CatViews:

a = rand(1_000);
b = rand(1_000);
v = CatView(a, b)
mysum(v::CatView) = sum(sum.(v.arr))

Then

julia> @btime mysum($v);
  915.714 ns (4 allocations: 80 bytes)

vs

julia> @btime sum($v);
  191.408 μs (6995 allocations: 109.78 KiB)

DNF · October 1, 2018, 6:48am

And here’s a very barebones toy implementation of something similar (but of course less general):

struct MyCatView{N, T}
    arr::NTuple{N, Vector{T}}    
end
MyCatView(arrs::Vector{T}...) where {T} = MyCatView(arrs)
mysum(v::MyCatView) = sum(sum.(v.arr))

Then

julia> a, b = rand(1000), rand(1000);

julia> v = MyCatView(a, b);

julia> @btime mysum($v)
  192.796 ns (0 allocations: 0 bytes)

That’s three orders of magnitude faster(!) I think there must be something else going on.

laborg · October 1, 2018, 7:46am

I don’t understand a couple of implementation details in CatViews:

How should getindex be fast at all? It’s doesn’t properly propagate bounds checking and a for-loop is called for each access. Using the tuple getindex accessor makes it faster, but its still type unstable.
As already noticed it isn’t typestable - maybe this came with the 1.0 update, where a lot of type params declarations were removed.
The perf.jl file doesn’t actually contain usable comparison benchmarks.
What’s the purpose of the inner field in CatView? It’s never used?

I’d stay away from CatViews until it has proven that the original benefits are still valid in Julia 1.0.

Maybe @rdeits can chime in on performance problems with CatView, as he upgrade CatViews for Julia 1.0.

oxinabox · October 1, 2018, 8:21am

For anyone who isn’t aware,
it is worth noting CatViews is quiet old.
It was designed for Julia 0.3 IIRC.
I think it’s type stability issues actually came from the 0.3->0.4 transition.
(or maybe 0.4->0.5)

RecurisveArrayTools.jl is a good alternative, IIRC.
But I’ve not benchmarked it on this.

control13 · October 1, 2018, 2:04pm

Thx to you all for the quick answers.
@oxinabox RecurisveArrayTools.jl seems to be very promising. I tried this:

vofp = ArrayPartition(a, b)
@btime mysum(vofp)
  15.247 μs (2001 allocations: 31.27 KiB)
@btime sum(vofp)
  929.750 ns (7 allocations: 112 bytes)

In mysum I had to adjust this:

s = zero(eltype(eltype(x)))

They report “Broadcasting like f.(A) is efficient.” on their github page. I’ll try this later in my project.

foobar_lv2 · October 1, 2018, 4:07pm

Imo the really annoying thing is that the flattened iterator allocates during iteration, which makes mysum(flattened) so slow. Looking at @code_warntype, the bad object is of type ::Union{Nothing, Tuple{Float64,Tuple{Int64,Array{Float64,1},Int64}}}.

In the case of flattening over a tuple of iterators or an array of iterators (anything that can be indexed with Int), there is no good reason for having an iterator-state that is not bitstype (it should be a tuple of integers, such that the next element is iterate(flattened_iterators[i], j)). So I’d say that this is either a problem of Base.Iterators.flatten or yet another manifestation of the problem with heap-allocating small immutable non-bitstypes.

Topic		Replies	Views
Slow `reduce(vcat, itr)` Performance cat	26	1392	September 13, 2022
How to concatenate matrices views without copying memory? Performance question , matrices , views	3	182	August 29, 2024
Vcat slow? General Usage	12	1682	June 15, 2020
Concatenating iterables without allocating memory Performance	34	7318	January 16, 2020
For-loop over two arrays, sequentially General Usage	3	1381	May 26, 2022

Iterating over two vectors with vcat is faster than Iterators.flatten and CatViews?

Related topics