Best way to append vectors of various types

pdeffebach · August 2, 2018, 9:05pm

lets say you have a list (in vector form, or a set or something) of N vectors and you want to concatenate all of them. However the vectors might have various lengths and types.

An efficient way to do this would be to find the length of the new vector first and allocate an empty version.

x = [rand(rand(1:5) for i in 1:10]
newvec = Vector{Float64}(sum(length(ix) for ix in x)))
... fill it in

But if you don’t know the types, what do you do? I was thinking of using reduce.

newvec = reduce(append!, x)

However I am guessing there is a better way. Is there?

For context, I am trying to work on DataFramesMeta’s transform(g::GroupedDataFrame, ...) function.

using DataFramesMeta
df = DataFrame(a = [1, 1, 2, 2], b = [4,5, missing, missing])
df2 = @linq df |>
    groupby(df, :a) |>
    transform(t = a - mean(a))

This will throw an error, because DataFramesMeta first allocates an empty array with the type of the first returned vector, for the first group. So it will create Vector{Float64} and throw an error when missings are added.

However I think this question applies more generally to a number contexts.

bkamins · August 2, 2018, 10:51pm

Your code for the @linq operation seems broken. I guess you wanted:

df2 = @linq df |>
    groupby(:a) |>
    transform(t = :b .- mean(:b))

right? And it fails exactly for the reason you mention.

The way of concatenation you propose is most efficient AFAIK. And it is implemented in vcat (a bit differently in Base and in DataFrames - the difference is how common type is identified, in Base it is more standard and uses promote_type function).

EDIT: actually here is a recent relevant issue Grouped by issue with type conversion · Issue #98 · JuliaData/DataFramesMeta.jl · GitHub.

pdeffebach · August 2, 2018, 11:53pm

yeah I’ve been exploring this issue.

Weirdly, I get type promotion errors in the DataFramesMeta approach but can’t replicate it elsewhere.

x = [[1.0, 2.0, 3.0], [missing, missing, missing]]
t = x -> x .+ 1
reduce(append!, t(ix) for ix in x) # no error

But when I have similar code in DataFramesMeta, it throws an error.

We can continue this discussion in the PR, but maybe someone has some insight on the source of the error.

stevengj · August 3, 2018, 1:29am

You could compute combined type, e.g.

using Compat # for julia 0.6
function joinvecs(X)
  T = mapreduce(eltype, promote_type, Union{}, X)
  Y = Vector{T}(undef, mapreduce(length, +, 0, X))
  i = 1
  for x in X
    copyto!(Y, i, x, firstindex(x), length(x))
    i += length(x)
  end
  return Y
end

(Alternatively, you could use typejoin instead of promote_type for the type computation, depending on what behavior you want.)

pdeffebach · August 3, 2018, 2:14am

Thanks, this definitely works.

I am also more confident now that my append! example above doesn’t work. Though I was sure it was working earlier.

I am going to search for this machinery in DataFrames and see if there is a function from there I can call.

Tamas_Papp · August 3, 2018, 9:49am

I would take a step back, and think about whether a Vector is the best data structure for heterogeneous collections. It is surely the most convenient, but if you have a large number of types, the speed of concatenation may be the least of your worries.

If you just have a small number of types that work with the Union optimization in v0.7, I would just leave it to vcat, and hope that implements the best solution (and if not, open an issue). In particular, if you are mixing T and Union{Missing,T}, the broadening of the type will just happen once, which is relatively low-cost.

pdeffebach · August 3, 2018, 11:36am

Ultimately I think this concatenation will be done on a generator, looping through subdataframes of a grouped dataframe.

Here is the relevant PR. I think I found some promote_type code in DataFrames to use that handles some edge cases better.

pdeffebach · August 3, 2018, 5:00pm

Could you comment on the performance of that function you proposed? It seems to loop through the vector X a three times. Once for the types, a second for the length of x, and a third time to input the values themselves. Is that performant, or something to be avoided? Thanks.

stevengj · August 3, 2018, 5:40pm

The cost of the first two loops should be negligible compared to the third (assuming the vectors have non-negligible lengths on average), because only the third loop over X also loops over the individual elements.

pdeffebach · August 3, 2018, 6:18pm

Thanks!

assuming the vectors have non-negligible lengths on average

I’m not sure thats true in this context. I am imagining a dataset with 3 time periods per person and 1 million people, and you want to make a by-group variable without collapsing the dataset.

That said, I’m sure there are some optimizations to be done, and some benchmarking.

Topic		Replies	Views
Stack overflow in DataFrames group by Data	16	4072	October 15, 2017
Concatenating vectors of different length into a dataframe or table General Usage question	9	4920	October 6, 2020
Push! is very slow on Vector{Union{...}} Internals & Design	2	843	July 12, 2018
Concatenating two collections into a new one General Usage question	9	844	December 6, 2016
Create a vector from n different vectors New to Julia question	9	1981	February 27, 2017

Best way to append vectors of various types

Related topics