Let's say you have a collection (a vector, a set, or something similar) of N vectors and you want to concatenate all of them. However, the vectors might have different lengths and element types.
An efficient way to do this would be to find the length of the new vector first and allocate an empty version.
x = [rand(rand(1:5)) for i in 1:10]
newvec = Vector{Float64}(undef, sum(length(ix) for ix in x))
... fill it in
But if you don’t know the types, what do you do? I was thinking of using reduce.
newvec = reduce(append!, x)
However, I am guessing there is a better way. Is there?
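One caveat about the `reduce(append!, x)` idea, sketched here assuming Julia 1.x: `append!` mutates its first argument, so the first vector in `x` ends up holding everything, and it cannot widen the element type. `reduce(vcat, x)` is a non-mutating alternative that picks a common element type. The variable names are purely illustrative.

```julia
# Illustrative data: Float64 vectors held in a vector of vectors.
first_vec = [1.0, 2.0]
x = [first_vec, [3.0], [4.0, 5.0]]

# reduce(append!, x) grows x[1] in place, so the input is modified.
combined = reduce(append!, x)
println(combined === first_vec)   # the result is the very same object as x[1]
println(length(first_vec))        # first_vec now holds all elements

# reduce(vcat, y) allocates a new vector and widens the element type if needed.
y = Any[[1.0, 2.0], [missing, 3.0]]
joined = reduce(vcat, y)
println(eltype(joined))
```

If mutating the inputs is not acceptable, `reduce(vcat, x)` is probably the safer of the two one-liners.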
For context, I am trying to work on DataFramesMeta’s transform(g::GroupedDataFrame, ...) function.
using DataFramesMeta
using Statistics  # for mean (Julia 0.7 and later)
df = DataFrame(a = [1, 1, 2, 2], b = [4, 5, missing, missing])
df2 = @linq df |>
    groupby(:a) |>
    transform(t = :b .- mean(:b))
This will throw an error, because DataFramesMeta first allocates an empty array with the type of the first returned vector, for the first group. So it will create Vector{Float64} and throw an error when missings are added.
However, I think this question applies more generally to a number of contexts.
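The failure mode described above can be reproduced without DataFrames at all. This is a minimal sketch in plain Julia 1.x, not DataFramesMeta's actual code: the output is allocated from the first group's element type and the second group then fails to fit.

```julia
# Preallocate using the element type of the first group's result (Float64).
out = Vector{Float64}(undef, 4)
out[1:2] = [-0.5, 0.5]            # first group: plain Float64 values

# The second group's result contains missing, which Float64 cannot store.
err = try
    out[3] = missing
    nothing
catch e
    e
end
println(typeof(err))

# Allocating with a widened element type accepts both kinds of values.
wide = Vector{Union{Missing, Float64}}(undef, 4)
wide[1:2] = [-0.5, 0.5]
wide[3:4] = [missing, missing]
println(wide)
```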
right? And it fails exactly for the reason you mention.
The concatenation approach you propose is the most efficient one AFAIK, and it is what vcat implements (slightly differently in Base and in DataFrames: the difference is how the common type is identified. Base is more standard and uses the promote_type function).
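To make the role of `promote_type` concrete, here is a small sketch (assuming Julia 1.x and Base's `vcat` only, not the DataFrames variant) showing the common type it picks and how `vcat` reflects it:

```julia
# promote_type picks a type that both inputs can be converted to.
println(promote_type(Int, Float64))
println(promote_type(Float64, Missing))

# Base's vcat uses the promoted element type for the result.
numbers = vcat([1, 2], [3.5])          # Int and Float64 inputs
with_missing = vcat([1.0], [missing])  # Float64 and Missing inputs
println(eltype(numbers))
println(eltype(with_missing))
```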
using Compat # for julia 0.6
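function joinvecs(X)
    # Common element type of all vectors (Union{} is the identity for promote_type).
    # The `init` keyword is the Julia 0.7+ spelling; Julia 0.6 took the
    # initial value as a positional argument instead.
    T = mapreduce(eltype, promote_type, X; init = Union{})
    # Allocate the result once, using the total length.
    Y = Vector{T}(undef, mapreduce(length, +, X; init = 0))
    i = 1
    for x in X
        # Copy each vector into its slot, converting elements to T.
        copyto!(Y, i, x, firstindex(x), length(x))
        i += length(x)
    end
    return Y
end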
(Alternatively, you could use typejoin instead of promote_type for the type computation, depending on what behavior you want.)
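The difference between the two choices can be sketched as follows, assuming Julia 1.x. `common_eltype` is a hypothetical helper introduced only for this illustration; it is not part of Base or DataFrames.

```julia
# promote_type converts to a common concrete type where one exists;
# typejoin returns the closest common supertype and converts nothing.
println(promote_type(Int, Float64))
println(typejoin(Int, Float64))

# Hypothetical helper: fold an element-type combiner over a collection of vectors.
common_eltype(combine, X) = mapreduce(eltype, combine, X; init = Union{})

X = Any[[1, 2], [3.5]]
T_promote = common_eltype(promote_type, X)
T_join = common_eltype(typejoin, X)
println(T_promote)
println(T_join)

# With the promoted type the Ints are converted to Float64;
# with the joined type they are stored as-is in an abstractly typed vector.
promoted = T_promote[1, 2, 3.5]
joined = T_join[1, 2, 3.5]
println(typeof(promoted[1]))
println(typeof(joined[1]))
```

Roughly, `promote_type` tends to give a concretely typed (and usually faster) result at the cost of converting values, while `typejoin` preserves every value exactly but may yield an abstract element type.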
I would take a step back, and think about whether a Vector is the best data structure for heterogeneous collections. It is surely the most convenient, but if you have a large number of types, the speed of concatenation may be the least of your worries.
If you just have a small number of types that work with the Union optimization in v0.7, I would just leave it to vcat, and hope that it implements the best solution (and if not, open an issue). In particular, if you are mixing T and Union{Missing,T}, the broadening of the type will just happen once, which is relatively low-cost.
Could you comment on the performance of the function you proposed? It seems to loop through the vector X three times: once for the types, a second time for the lengths, and a third time to copy the values themselves. Is that performant, or something to be avoided? Thanks.
The cost of the first two loops should be negligible compared to the third (assuming the vectors have non-negligible lengths on average), because only the third loop over X also loops over the individual elements.
assuming the vectors have non-negligible lengths on average
I’m not sure that's true in this context. I am imagining a dataset with 3 time periods per person and 1 million people, and you want to make a by-group variable without collapsing the dataset.
That said, I’m sure there are some optimizations to be done, and some benchmarking.
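As a starting point for such benchmarking, here is a rough sketch (assuming Julia 1.x and only Base's `@elapsed`) that times the three passes separately for the scenario above: 1 million vectors of length 3. `fill_from!` is a hypothetical helper name, and a dedicated benchmarking package would give more reliable numbers than single `@elapsed` calls.

```julia
# Hypothetical setup: 1 million groups ("people") with 3 observations each.
X = [rand(3) for _ in 1:1_000_000]

# Hypothetical helper: the copying pass, writing each vector into its slot in Y.
function fill_from!(Y, X)
    i = 1
    for x in X
        copyto!(Y, i, x, firstindex(x), length(x))
        i += length(x)
    end
    return Y
end

# Run each pass once to obtain the values (this also triggers compilation).
T = mapreduce(eltype, promote_type, X; init = Union{})
n = mapreduce(length, +, X; init = 0)
Y = fill_from!(Vector{T}(undef, n), X)

# Time each pass separately on a second run.
t_type = @elapsed mapreduce(eltype, promote_type, X; init = Union{})
t_len = @elapsed mapreduce(length, +, X; init = 0)
t_copy = @elapsed fill_from!(Y, X)

println("type pass:   ", t_type, " s")
println("length pass: ", t_len, " s")
println("copy pass:   ", t_copy, " s")
```

With vectors this short, all three passes do work proportional to the number of vectors, so the first two are not necessarily negligible; the timings depend on the machine and Julia version.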