Very best way to concatenate an array of arrays

Sometimes not so marginal. A while ago I wrote this (simplified here):

function merge_dict(ds...)
    t = eltype(ds)        # common Dict type of the splatted arguments
    merge_dict!(t(), ds)  # merge everything into a fresh, empty dict
end

function merge_dict!(d::AbstractDict, others)
    for other in others
        @inbounds for (k, v) in other
            d[k] = v      # later pairs overwrite earlier ones, as in merge
        end
    end
    return d
end

julia> v = [Dict(rand(1:1000) => rand() for _ in 1:10) for _ in 1:10^4];

julia> @btime merge_dict(v...);
  2.191 ms (21 allocations: 327.00 KiB)

julia> @btime merge(v...);
ERROR: StackOverflowError:

julia> v = [Dict(rand(1:1000) => rand() for _ in 1:10) for _ in 1:10^3];

julia> @btime merge_dict(v...);
  236.527 μs (21 allocations: 116.06 KiB)

julia> @btime merge(v...);
  49.820 ms (11963 allocations: 27.52 MiB)

I submitted a PR about a year ago making this a method for merge. But my version breaks type stability, which is important in some cases (obviously not here). Still, my code is useful in many situations. (I actually use a version that takes a Vector rather than splatting, although splatting 10^3 or 10^4 arguments entails very little performance penalty.)
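
Something like this (a minimal sketch, not my exact code):

# Hypothetical Vector-taking variant: same merge_dict! core, no splatting.
merge_dicts(ds::AbstractVector{<:AbstractDict}) = merge_dict!(eltype(ds)(), ds)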

So, upon a suggestion, I changed the PR to improve the efficiency of reduce(merge, dicts) instead. There were many choices, but getting the basic implementation working was quick. Pushing the PR through was an odyssey: because reduce, Dicts, and merge are so widely used, things have to be done carefully. I had forgotten (I just looked now) that the PR still has not been merged. It was bumped less than a day ago.

It was a great learning experience, but quite a bit of effort. This depends also on how experienced you are at making PRs to base.

2 Likes

Just curious: what is the reason to use @inbounds in the merge_dict! function?

(BTW, I needed to add the word “function” to avoid an Interrobang :slight_smile: )

1 Like

That’s a good question. It doesn’t seem to make sense. My best recollection is that I copied it from code in Base. Or maybe I tried it and got a performance improvement.

https://github.com/JuliaDiffEq/RecursiveArrayTools.jl/issues/49

1 Like

FWIW, the relevant concept to implement here is possibly flatmap; then flatten(vv) = flatmap(x -> x, vv).
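
A minimal sketch (flatmap is not in Base; the names here are illustrative):

# Hypothetical flatmap, built on reduce(vcat, ...):
flatmap(f, xs) = reduce(vcat, map(f, xs))
flatten(vv) = flatmap(x -> x, vv)

flatten([[1, 2], [3, 4]])  # == [1, 2, 3, 4]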

1 Like

My understanding is that with the PR above having been merged, this is now handled nicely in Base with reduce(vcat, ...).
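
For example (output from a recent Julia session):

julia> reduce(vcat, [[1, 2], [3, 4], [5, 6]])
6-element Vector{Int64}:
 1
 2
 3
 4
 5
 6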

2 Likes

This is the one function that I keep wanting in core Julia, because I keep reimplementing or googling it so often. I don’t think splatting huge arrays gives you good performance. Something like concatenate, or vconcatenate and hconcatenate, analogous to maximum and max, would be really nice.
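
Something like this, say (hypothetical names, not in Base; thin wrappers over the reduce forms):

# Hypothetical verbs: vconcatenate/hconcatenate are suggested names only.
vconcatenate(vs) = reduce(vcat, vs)
hconcatenate(vs) = reduce(hcat, vs)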

Tamas’s solution is the one implemented in Base, intended to solve this exact problem.

[ Scratch the flatmap suggestion (for now). ]

reduce(vcat, vector_of_vectors) seems to imply iteratively applying vcat, which seems like a possibly inefficient way of doing this, and certainly there is nothing in the documentation to suggest otherwise. You can look at the code, but we shouldn’t want people to have to dive through reduce.jl just to understand what their code is doing. There are some hints about reduce having specific implementations for different operations, which I mostly read as “this will do what you want, just trust us, but you will never know what your code does”.

I suppose it would be nice to know (i.e. for it to be true and documented) that there isn’t one vector allocation happening per vector in vector_of_vectors. The other thing one might hope for would be just one allocation, though I suppose O(log N) allocations, where N is the total size, might not be that bad (e.g. for appending to a growing vector). In any case, someone using reduce(vcat, vector_of_vectors) is going to find it hard to know whether there are N allocations happening, or log(N), or just 1, and whether either of the two “efficient” flatten approaches (filling a pre-allocated vector, or appending to a growing vector) is being used, or something else altogether.

The documentation for reduce says:

" Reductions for certain commonly-used operators may have special implementations, and should be used instead: maximum(itr) , minimum(itr) , sum(itr) , prod(itr) , any(itr) , all(itr) ."

So I suppose that, to be consistent with the design intent expressed in the documentation, there should be a new function for the special implementation needed for reduce with the vcat operator.
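
To make the concern concrete, here is the difference I have in mind between a naive left fold and a specialized path (a sketch; the helper names are illustrative):

vv = [rand(10) for _ in 1:1000]

# foldl is not special-cased: it builds a new intermediate vector at every
# step, so this does O(length(vv)) allocations of growing size.
naive_flatten(vv) = foldl(vcat, vv)

# reduce(vcat, vv) is the form that can hit a special implementation,
# if one exists.
fast_flatten(vv) = reduce(vcat, vv)

naive_flatten(vv) == fast_flatten(vv)  # true; same result, different cost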

You seem to be misunderstanding something: there is a specialized implementation of reduce(vcat, ...), implemented by the PR linked above. The entry point is here.

It does the right thing, no extra allocations. This should be both efficient and clear, a win-win.

This is not true: of course you can learn what the code does, rather easily. Julia is open source.

If you think this could be documented better, please make a pull request.

(Also, generally, it is not recommended to resurrect old topics unless there is a good reason.)

6 Likes

Thanks. I also just used @benchmark to see how many allocations were being made and observed just 1.
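
Roughly this kind of check (a sketch; the numbers are from my run and will vary):

using BenchmarkTools

vv = [rand(10) for _ in 1:10^3];

# The memory estimate in the @benchmark report showed a single allocation,
# i.e. just the result vector itself.
@benchmark reduce(vcat, $vv)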

Where would the documentation of this feature go? It’s a feature of reduce, but someone looking for advice on how to properly flatten a vector of vectors would be unlikely to look there.

On a completely personal note: I find myself not wanting to use reduce, because the documentation explicitly states that there are no guarantees, either now or in the future, on what it does.

1 Like

I guess the docstring for reduce should mention this. Anyway, somebody looking for this is more likely to look at ?reduce than at the docstring of a function whose name is precisely what they are looking for!

I’m surprised by the recommendation to use special methods like maximum, as these are defined precisely in terms of reduce. Maybe the docs are out of date. Can you file an issue on GitHub?

1 Like

If the docs are giving you that impression we should tweak them a bit. Do you refer to the following?

If provided, the initial value init must be a neutral element for op that will be returned for empty collections. It is unspecified whether init is used for non-empty collections.

This does seem strange, especially given that one of the examples violates this rule by using something other than the multiplicative identity in reduce(*, [2; 3; 4]; init=-1) == -24, demonstrating that the value of init is used in this case.
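
Both behaviors are easy to check (output from a recent Julia session):

julia> reduce(*, Int[]; init=-1)      # empty collection: init is returned
-1

julia> reduce(*, [2, 3, 4]; init=-1)  # non-empty: init is used here too
-24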

3 Likes

The wording looks somewhat out of date, but there’s also the matter of what’s easier to read. Personally I’d prefer to see code using sum rather than reduce(+, xs). Having said that, I do wonder whether a sufficiently nice syntax for reduction could let us delete some of the special-purpose reduction verbs in the future. There was some interesting speculation about this at Julep: Syntax for reduction (overloadable splatting?) · Issue #32860 · JuliaLang/julia · GitHub (read further than the initial proposal; there’s a range of options discussed there).

I think it’s interesting that we currently conflate binary operators like vcat(a,b) with a lifted version of the binary operator, vcat_(xs...) = reduce(vcat, xs). This seems similar to the conflation of vectorized functions with their scalar versions prior to broadcast syntax being introduced. The introduction of broadcast syntax let us delete all the “broadcast-lifted” (i.e., vectorized) methods. Similarly, having a reduction syntax would let us delete all the “reduction-lifted” methods.

7 Likes

This is not just a matter of being easier to read. reduce(+, xs) must return the result of applying + repeatedly. The sum function on the other hand is free to have a different return type than that of +, reducing the risk of integer overflow.

julia> xs = Int8[100 27 1];

julia> sum(xs)
128

julia> reduce(+, xs)
-128

7 Likes

On this topic, I think that a flatmap verb would still be useful:

  1. It allows different typing of the results.
  2. It allows mapping, not just flattening.
  3. It is a verb that is obviously correct and optimized.
  4. It maps directly onto a concept that people know from other languages.

Point (3) is not just about docs; the optimization is non-apparent from the name. Hence it would often be better to spell this like reduce(vcat, stuff) # special-cased in Base, faster than it looks. Being correct is not enough; code should try to be obviously correct (which is a much higher bar). That is a question of coding style, though.

I’ve been thinking that this is natural, but maybe that is because of my positive experience writing Lisp, where (+), (+ x), (+ x y), (+ x y z), … are all defined. But even without mentioning Lisp, I think the fact that the unary + x is the identity function in math already motivates that +(args...) should be reduce(+, args).

Would it be crazy to use init = 0 :: Int for reduce(+, xs) when eltype(xs) <: Union{Int8,Int16,Int32}? From how reduce is specified in the documentation, this is a valid definition, AFAICT. Though maybe it was decided that it would be too much magic, given that sum is implemented differently (i.e., using add_sum instead of +)?
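
Concretely, something like this (a sketch of the idea; the name is hypothetical and this is not how sum is implemented):

# The Int init makes the accumulator an Int from the start, so the small-int
# elements are promoted during accumulation and cannot overflow.
widening_sum(xs::AbstractArray{<:Union{Int8,Int16,Int32}}) = reduce(+, xs; init=0)

widening_sum(Int8[100, 27, 1])  # == 128, an Int, unlike reduce(+, xs) above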

1 Like

Yes, when I say I think it’s “interesting” I really mean it: I don’t know whether it’s good or bad overall :slight_smile: However I can think of two ways it leads to inconsistency:

  1. The existence of a “reduction-lifted” method depends on whether the author of a given binary operator saw fit to include it. The user of the reduction can’t directly control that.
  2. The “reduction-lifted” methods are inefficient when combined with splatting, so vcat(xs...) is very natural but nevertheless not recommended.

A reduction syntax could solve both these issues at once. Whether it’s worthwhile depends on whether it can really bring a lot of value. If it could be made to compose really nicely with broadcast that might be fairly compelling.

2 Likes