Best way to append vectors of various types


#1

lets say you have a list (in vector form, or a set or something) of N vectors and you want to concatenate all of them. However the vectors might have various lengths and types.

An efficient way to do this would be to find the length of the new vector first and allocate an empty version.

x = [rand(rand(1:5) for i in 1:10]
newvec = Vector{Float64}(sum(length(ix) for ix in x)))
... fill it in

But if you don’t know the types, what do you do? I was thinking of using reduce.

newvec = reduce(append!, x)

However I am guessing there is a better way. Is there?

For context, I am trying to work on DataFramesMeta’s transform(g::GroupedDataFrame, ...) function.

using DataFramesMeta
df = DataFrame(a = [1, 1, 2, 2], b = [4,5, missing, missing])
df2 = @linq df |>
    groupby(df, :a) |>
    transform(t = a - mean(a)) 

This will throw an error, because DataFramesMeta first allocates an empty array with the type of the first returned vector, for the first group. So it will create Vector{Float64} and throw an error when missings are added.

However I think this question applies more generally to a number contexts.


#2

Your code for the @linq operation seems broken. I guess you wanted:

df2 = @linq df |>
    groupby(:a) |>
    transform(t = :b .- mean(:b))

right? And it fails exactly for the reason you mention.

The way of concatenation you propose is most efficient AFAIK. And it is implemented in vcat (a bit differently in Base and in DataFrames - the difference is how common type is identified, in Base it is more standard and uses promote_type function).

EDIT: actually here is a recent relevant issue https://github.com/JuliaStats/DataFramesMeta.jl/issues/98.


#3

yeah I’ve been exploring this issue.

Weirdly, I get type promotion errors in the DataFramesMeta approach but can’t replicate it elsewhere.

x = [[1.0, 2.0, 3.0], [missing, missing, missing]]
t = x -> x .+ 1
reduce(append!, t(ix) for ix in x) # no error

But when I have similar code in DataFramesMeta, it throws an error.

We can continue this discussion in the PR, but maybe someone has some insight on the source of the error.


#4

You could compute combined type, e.g.

using Compat # for julia 0.6
function joinvecs(X)
  T = mapreduce(eltype, promote_type, Union{}, X)
  Y = Vector{T}(undef, mapreduce(length, +, 0, X))
  i = 1
  for x in X
    copyto!(Y, i, x, firstindex(x), length(x))
    i += length(x)
  end
  return Y
end

(Alternatively, you could use typejoin instead of promote_type for the type computation, depending on what behavior you want.)


#5

Thanks, this definitely works.

I am also more confident now that my append! example above doesn’t work. Though I was sure it was working earlier.

I am going to search for this machinery in DataFrames and see if there is a function from there I can call.


#6

I would take a step back, and think about whether a Vector is the best data structure for heterogeneous collections. It is surely the most convenient, but if you have a large number of types, the speed of concatenation may be the least of your worries.

If you just have a small number of types that work with the Union optimization in v0.7, I would just leave it to vcat, and hope that implements the best solution (and if not, open an issue). In particular, if you are mixing T and Union{Missing,T}, the broadening of the type will just happen once, which is relatively low-cost.


#7

Ultimately I think this concatenation will be done on a generator, looping through subdataframes of a grouped dataframe.

Here is the relevant PR. I think I found some promote_type code in DataFrames to use that handles some edge cases better.


#8

Could you comment on the performance of that function you proposed? It seems to loop through the vector X a three times. Once for the types, a second for the length of x, and a third time to input the values themselves. Is that performant, or something to be avoided? Thanks.


#9

The cost of the first two loops should be negligible compared to the third (assuming the vectors have non-negligible lengths on average), because only the third loop over X also loops over the individual elements.


#10

Thanks!

assuming the vectors have non-negligible lengths on average

I’m not sure thats true in this context. I am imagining a dataset with 3 time periods per person and 1 million people, and you want to make a by-group variable without collapsing the dataset.

That said, I’m sure there are some optimizations to be done, and some benchmarking.