Best way to iteratively add to a DataFrame?

dataframes
#1

I would like to do something like this:

using DataFrames
dflong = DataFrame()
for i = 1:3
    df = DataFrame(a = rand(i))
    vcat(dflong, df)
end

I understand that this doesn’t work for two reasons:

  1. dflong cannot be modified inside the local for scope
  2. Even if it could, dflong and df have a different number of columns.

I have devised a solution that works, but seems very ugly, inelegant, and perhaps inefficient:

using DataFrames

dflong = DataFrame()
first = true

for i = 1:3
    df = DataFrame(a = rand(i))

    global dflong
    global first

    if first
        dflong = similar(df, 0)  # empty frame with the same columns as df
        first = false
    end
    dflong = vcat(dflong, df)    # append every df, including the first one
end

Can you suggest a better way to do this?
I am new to Julia so probably just not getting something basic here about the proper way to adapt to for loops with local scope.

#2
julia> reduce(vcat, [DataFrame(a = rand(i)) for i in 1:5])
15×1 DataFrame
│ Row │ a         │
│     │ Float64   │
├─────┼───────────┤
│ 1   │ 0.0250787 │
│ 2   │ 0.144394  │
│ 3   │ 0.216657  │
│ 4   │ 0.761747  │
│ 5   │ 0.351675  │
│ 6   │ 0.284681  │
│ 7   │ 0.106181  │
│ 8   │ 0.551472  │
│ 9   │ 0.523894  │
│ 10  │ 0.51445   │
│ 11  │ 0.587754  │
│ 12  │ 0.878151  │
│ 13  │ 0.985698  │
│ 14  │ 0.504822  │
│ 15  │ 0.788035  │
#3

Thanks, this is helpful.
In practice (outside of my simple example) I would like to do many operations inside the for loop before concatenating the data frames, so I can’t just use a constructor.
What’s a good solution for those types of situations?

#4

The solution is correct, but I have some minor additional notes.

reduce(vcat, [DataFrame(a = rand(i)) for i in 1:5])

is only minimally faster than

vcat([DataFrame(a = rand(i)) for i in 1:5]...)

(The change was merged to master yesterday and has not been released yet; before that, splatting was the recommended approach.)
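For a loop that does more work per iteration, one option is to collect the intermediate data frames in a vector and combine them once at the end. This is only a minimal sketch: the names `frames`, `combined`, and `splatted` are made up for illustration, and the comment inside the loop stands in for whatever processing is needed.

```julia
using DataFrames

frames = DataFrame[]                 # collect per-iteration results here
for i in 1:5
    df = DataFrame(a = rand(i))
    # ... any further processing of df would go here ...
    push!(frames, df)
end

combined = reduce(vcat, frames)      # combine once, without splatting
splatted = vcat(frames...)           # splatting form, same result
```

Both calls should give the same 15-row data frame; the difference is only in how the arguments are passed.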

Also, creating intermediate data frames is not efficient. The recommended way to add rows to a data frame is:

using DataFrames
dflong = DataFrame(a=Float64[])
for i = 1:3
    push!(dflong, (rand(),))  # one row per call; the tuple holds one value per column
end

(See the documentation of push! for the accepted row types; in particular, you can push! a NamedTuple, a dictionary, a vector, or a tuple.)
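As a rough illustration of those row types, here is a small sketch. The two-column layout (`a`, `b`) and the values are invented for the example; the dictionary is assumed to use Symbol keys matching the column names.

```julia
using DataFrames

df = DataFrame(a = Float64[], b = String[])

push!(df, (1.0, "tuple"))                  # tuple: values in column order
push!(df, [2.0, "vector"])                 # vector: values in column order
push!(df, (a = 3.0, b = "named tuple"))    # NamedTuple: matched by column name
push!(df, Dict(:a => 4.0, :b => "dict"))   # Dict: matched by column name
```

Each call adds exactly one row, so `df` ends up with four rows.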

If you really have to create intermediate DataFrames, you can also use append!, which is also relatively fast (and you do not have to store all the data frames in memory before vcat-ing them):

using DataFrames
dflong = DataFrame(a=Float64[])
for i = 1:3
    append!(dflong, DataFrame(a=rand(i)))
end
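For the situation described in #3, where each intermediate data frame needs several processing steps before it is added, the same append! pattern might look roughly like this. The extra columns `a2` and `i` are hypothetical and only stand in for real processing; the target frame is created with the same columns, in the same order, as the frames being appended.

```julia
using DataFrames

# Empty target with the final column names and element types.
dflong = DataFrame(a = Float64[], a2 = Float64[], i = Int[])

for i in 1:3
    df = DataFrame(a = rand(i))
    df.a2 = df.a .^ 2            # placeholder for real processing steps
    df.i = fill(i, nrow(df))     # record which iteration produced each row
    append!(dflong, df)          # add the processed rows in place
end
```

Only one intermediate data frame is alive at a time here, which is the memory advantage over collecting everything before calling vcat.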
#5

This is a situation where allowing push! and append! to add new columns if the data frame has zero columns would be convenient. Not sure whether that justifies this exception.

#6

You should also be able to vcat a DataFrame with a Dict provided the symbols are the same as the DataFrame’s columns. Since a Dict is lighter weight (I think), this might be a solution, depending on the details of your problem.

#7

push!ing onto TypedTables is possible with an issue I created.

#8

append! should be OK, but push! is problematic, because:

  1. if what we push is a vector/tuple we do not have column names
  2. if what we push is a dict/named tuple the current behavior of push! is to add only a selection of columns that already exist in a DataFrame, so we would add no columns.
#9

Yeah, that would only work when pushing a named tuple or DataFrameRow…
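
For reference, a sketch of what the named tuple case looks like. This assumes a recent DataFrames.jl release (1.x) in which push! creates the columns when the data frame has none; that is an assumption about the installed version, not something established in this thread. The column names and values are made up.

```julia
using DataFrames

df = DataFrame()                  # no columns yet
push!(df, (a = 0.5, b = "x"))     # assumed: columns :a and :b are created from the named tuple
push!(df, (a = 1.5, b = "y"))     # later rows must match the existing columns
```

With a plain tuple or vector there are no names to create columns from, which is the problem described in #8.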