Appending rows to a dataframe is seemingly inconsistent and confusing

Hi Everybody

Coming from Python and R, I find Julia a breath of fresh air. It is usually so much better.

But I find dataframes, and especially adding rows to dataframes, very confusing and much harder than in R or Python. Here is a link to a thread with 3 methods (push!, append! and vcat): Adding a new row to a DataFrame

Suppose I have a df and try

vcat(d,last(d)) I get a 2 element array of df
vcat(d,last(d,1)) appends the row

push!(d,last(d)) appends
push!(d,last(d,1)) fails

append!(d,last(d,1)) appends
append!(d,last(d)) fails

I am sure there is a very good reason for this seeming inconsistency, and wanted to ask you all what it is.

And ultimately, is there a canonical writeup of how to add rows to dataframes?

Thanks a lot for this wonderful language

best, Jack

I’d say what you’re mainly observing is the difference between calling last(df) and last(df, 1):

julia> df = DataFrame(rand(5, 2), :auto)
5×2 DataFrame
 Row │ x1        x2
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.792271  0.51255
   2 │ 0.924156  0.913761
   3 │ 0.488379  0.635471
   4 │ 0.333049  0.366065
   5 │ 0.707473  0.660831

julia> last(df)
DataFrameRow
 Row │ x1        x2
     │ Float64   Float64
─────┼────────────────────
   5 │ 0.707473  0.660831

julia> last(df, 1)
1×2 DataFrame
 Row │ x1        x2
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.707473  0.660831

i.e. last(df) returns a DataFrameRow, while last(df, 1) returns a DataFrame. This is because last(df, n) is normally called with n > 1 (because for n = 1 there’s the single argument method).

Hopefully that makes the behaviour you’re seeing more intuitive. In particular

julia> vcat(df, last(df))
2-element Vector{Any}:
 5×2 DataFrame
 Row │ x1        x2
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.792271  0.51255
   2 │ 0.924156  0.913761
   3 │ 0.488379  0.635471
   4 │ 0.333049  0.366065
   5 │ 0.707473  0.660831
 DataFrameRow
 Row │ x1        x2
     │ Float64   Float64
─────┼────────────────────
   5 │ 0.707473  0.660831

as you can see, does not give you a 2-element array of DataFrames as stated in your post; it gives you a 2-element Vector{Any}, where the first element is a DataFrame and the second element is a DataFrameRow. That’s because there isn’t a method in DataFrames to concatenate a DataFrame and a DataFrameRow:

julia> @which vcat(df, df)
vcat(dfs::AbstractDataFrame...; cols, source) in DataFrames at .../.julia/packages/DataFrames/ORSVA/src/abstractdataframe/abstractdataframe.jl:1679

julia> @which vcat(df, last(df))
vcat(X...) in Base at abstractarray.jl:1772

so it falls back onto generic vcat which then creates a heterogeneous array.
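
If the goal is to concatenate rather than mutate, one way around the fallback is to turn the row back into a one-row data frame first. The following is only a minimal sketch, assuming DataFrames.jl 1.x, where the DataFrame constructor accepts a DataFrameRow; the column values are made up for illustration:

```julia
using DataFrames

df = DataFrame(x1 = [1.0, 2.0], x2 = [3.0, 4.0])

row = last(df)             # DataFrameRow (1-dimensional)
onerow = DataFrame(row)    # one-row DataFrame built from that row

# Both arguments are data frames now, so the DataFrames.jl vcat method is used
out = vcat(df, onerow)
```

Calling last(df, 1) directly gives the same one-row data frame without the conversion step.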

Similarly, push! is meant to add a single row to a collection, not one collection to another collection. This is consistent with Julia Base:

julia> push!([1, 2, 3], 4)
4-element Vector{Int64}:
 1
 2
 3
 4

julia> push!([1, 2], 3)
3-element Vector{Int64}:
 1
 2
 3

julia> push!([1, 2], [3, 4])
ERROR: MethodError: Cannot `convert` an object of type Vector{Int64} to an object of type Int64

append! on the other hand happily appends one collection to another one:

julia> append!([1,2], [3, 4])
4-element Vector{Int64}:
 1
 2
 3
 4

Just some general thoughts:

From the linked thread:

It’s quite surprising that the DataFrames package documentation doesn’t provide a canonical way of
adding a new record to a df.

But now it is documented (perhaps it wasn’t in 2017):
https://dataframes.juliadata.org/stable/lib/functions/#Mutating-and-transforming-data-frames-and-grouped-data-frames

Providing a canonical way depends on the question. Asking for “adding a record” is not specific enough for a canonical answer, as “record” is not well defined.

For what you ask, “adding rows to dataframes”, I would say the canonical way is push!, as the documentation says: Use push! to add individual rows to a data frame.
And a row is of type DataFrameRow which is also important here when talking about canonical ways.
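
As a rough illustration of what push! accepts as a single row, here is a small sketch assuming DataFrames.jl 1.x with default keyword arguments; the data frame and its values are invented for the example:

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

push!(df, (3, "z"))            # Tuple: values matched to columns by position
push!(df, [4, "w"])            # Vector: values matched to columns by position
push!(df, (b = "v", a = 5))    # NamedTuple: values matched by column name
push!(df, last(df))            # DataFrameRow: values matched by column name
```

Each call adds exactly one row in place, so df ends up with six rows.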


Thanks!

It’s much clearer now. I had not appreciated that a DataFrameRow is different from a row in a dataframe, or how push! differs from append! and vcat.

thanks, all Jack


To sum up the discussion: the design of DataFrames.jl is about consistency. In order to understand the design of DataFrames.jl, you first need to understand how functions in Julia Base work.

julia> x = [1,2,3]
3-element Vector{Int64}:
 1
 2
 3

julia> last(x)
3

julia> last(x, 1)
1-element Vector{Int64}:
 3

So, as you can see, writing last(x) drops a dimension and last(x, 1) does not.

The same holds for DataFrames.jl. If you write last(df), a dimension is dropped (from a 2-dimensional DataFrame to a 1-dimensional DataFrameRow). If you write last(df, 1), the dimension is not dropped and you get a 2-dimensional DataFrame with one row.
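
The same distinction shows up with plain indexing. This is a small sketch, assuming DataFrames.jl 1.x indexing rules, with an example data frame made up for the purpose:

```julia
using DataFrames

df = DataFrame(x1 = [1.0, 2.0, 3.0], x2 = [4.0, 5.0, 6.0])

r1 = last(df)          # DataFrameRow
r2 = df[end, :]        # DataFrameRow: an integer row index drops a dimension
d1 = last(df, 1)       # one-row DataFrame
d2 = df[end:end, :]    # one-row DataFrame: a range row index keeps both dimensions
```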

Now regarding push!, append! and vcat.

See what happens in Julia Base:

julia> x = [(a="1",), (a="2",), (a="3",)]
3-element Vector{NamedTuple{(:a,), Tuple{String}}}:
 (a = "1",)
 (a = "2",)
 (a = "3",)

julia> push!(x, last(x))
4-element Vector{NamedTuple{(:a,), Tuple{String}}}:
 (a = "1",)
 (a = "2",)
 (a = "3",)
 (a = "3",)

julia> append!(x, last(x))
ERROR: MethodError: Cannot `convert` an object of type String to an object of type NamedTuple{(:a,), Tuple{String}}

So you can push! but cannot in general append! the value of last(x) to x.

Now the reverse:

julia> x = [(a="1",), (a="2",), (a="3",)]
3-element Vector{NamedTuple{(:a,), Tuple{String}}}:
 (a = "1",)
 (a = "2",)
 (a = "3",)

julia> append!(x, last(x, 1))
4-element Vector{NamedTuple{(:a,), Tuple{String}}}:
 (a = "1",)
 (a = "2",)
 (a = "3",)
 (a = "3",)

julia> push!(x, last(x, 1))
ERROR: MethodError: Cannot `convert` an object of type Vector{NamedTuple{(:a,), Tuple{String}}} to an object of type NamedTuple{(:a,), Tuple{String}}

so you can append! the value of last(x, 1) but in general cannot push! it.

As for vcat consider the following:

julia> a = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> b = [1, 2]
2-element Vector{Int64}:
 1
 2

julia> vcat(a, b)
ERROR: ArgumentError: number of columns of each array must match (got (2, 1))

so you are not allowed to vcat a 1-dimensional and 2-dimensional object.

What is allowed in Julia Base is:

julia> a = [1, 2][:, 1:1]
2×1 Matrix{Int64}:
 1
 2

julia> b = [3, 4]
2-element Vector{Int64}:
 3
 4

julia> vcat(a, b)
4×1 Matrix{Int64}:
 1
 2
 3
 4

but I would say that no one would want to allow vcat of a 1-column data frame with a multi-column DataFrameRow like this, so this is not allowed.

In summary Julia Base and DataFrames.jl work in exactly the same way (except for the last case where the behavior of Julia Base is clearly not desirable). Additionally this design is made to be logically consistent with the notion of dimensionality of different objects.

Indeed I know that R and Python are much more flexible in allowing combination of objects of different dimensions, but I personally do not like it as most of the time it leads to hard-to-catch logical bugs in user’s code. On the other hand Julia provides you all the tools you might need to explicitly control the dimension of objects you produce, e.g. last(df) drops a dimension and produces a DataFrameRow and last(df, 1) does not drop a dimension and produces a DataFrame.
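
Putting the pieces together, the three combinations from the original question that do work can be sketched as follows (assuming DataFrames.jl 1.x; the data is invented for illustration):

```julia
using DataFrames

df = DataFrame(x1 = [1.0, 2.0], x2 = [3.0, 4.0])

push!(df, last(df))            # adds one row (a DataFrameRow) in place
append!(df, last(df, 1))       # appends a one-row DataFrame in place
df2 = vcat(df, last(df, 1))    # returns a new DataFrame; df is not modified
```

After these calls df has four rows and df2 has five.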


As for a description and performance comparison of the most common ways of adding a row to a data frame, see Benchmarking push! in DataFrames.jl | Blog by Bogumił Kamiński.

In the DataFrames.jl 1.4 release we will add insert! and pushfirst! to give you more flexibility in where the additional row is added, see here.
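
For illustration, a minimal sketch of how these two functions can be used, assuming DataFrames.jl 1.4 or later and that the row argument follows the same rules as for push!; the example data is made up:

```julia
using DataFrames  # assumes DataFrames.jl 1.4 or later

df = DataFrame(a = [1, 2], b = ["x", "y"])

pushfirst!(df, (a = 0, b = "first"))      # the new row becomes row 1
insert!(df, 2, (a = 10, b = "second"))    # the new row becomes row 2
```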


Bkamins

Thanks for the clear discussion. I agree with your last paragraph on flexibility, both Python and R (and Matlab) are more flexible, but they also give you many ways to shoot yourself in the foot, and the strictness of Julia is much better. The day I will fully understand Julia scope and types, and implicit type conversions (which the Parquet package and SQLite annoyingly do), I will be a happy camper.

The reason I was using last(df) to append to a df is that I could not find another way. My df has 50 columns, each with a Float64 except one Int64 and one string, and the data I wanted to append came in a vector + the string and int. last() + copying the data in the right place was the best way to add it. But it is clumsy. In this case performance is irrelevant.

I am looking forward to DataFrames 1.4!

Thanks all for making Julia so wonderful.

all the best, Jack


Since performance is irrelevant maybe you can use something like this:

julia> using DataFrames

julia> df = DataFrame(x1=0.0, x2=0.0, x3=0.0, i=0, s="text")
1×5 DataFrame
 Row │ x1       x2       x3       i      s
     │ Float64  Float64  Float64  Int64  String
─────┼──────────────────────────────────────────
   1 │     0.0      0.0      0.0      0  text

julia> v=[1.0, 2.0, 3.0]; int=4; str="more text";

julia> push!(df, [v; int; str])
2×5 DataFrame
 Row │ x1       x2       x3       i      s
     │ Float64  Float64  Float64  Int64  String
─────┼─────────────────────────────────────────────
   1 │     0.0      0.0      0.0      0  text
   2 │     1.0      2.0      3.0      4  more text


Thanks Sudete,

looks really simple and useful, will give it a try.

best, Jack

I have read that your data frame has many columns. In this case it might be useful (depending on how your data is organized) to use a NamedTuple, so that you do not have to pay much attention to the order of the values, even though constructing a NamedTuple is not the most intuitive operation that Julia makes available.


using DataFrames

df = DataFrame(x1=0.0, x2=0.0, x3=0.0, i=0, s="text")

v=[1.0, 2.0, 3.0]; int=4; str="more text";

push!(df, [v; int; str])

# To add a new row (as the last one), you can also use a NamedTuple.

# Written out by hand like this, it would not be convenient in your case
push!(df, (; i=5, s="yet another text", x1=11, x3=13, x2=12))

# Here the row is built as a merge of three NamedTuples.
# First collect the names of the Float64 columns and the values for them
fnames = tuple(Symbol.(names(df, Float64))...)
fvalues = [21, 22, 23]

# then merge them with the Int64 and String entries
rowtoadd = merge((i=5,), NamedTuple{fnames}(fvalues), (; s="yet another text"))

# and push the resulting NamedTuple; values are matched by column name
push!(df, rowtoadd)

You can push! a Dict, which is better with many columns.
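
A minimal sketch of that approach, assuming DataFrames.jl 1.x and Symbol keys that match the column names; the data frame here is a shortened, invented stand-in for the 50-column one:

```julia
using DataFrames

df = DataFrame(x1 = 0.0, x2 = 0.0, i = 0, s = "text")

# Keys are the column names; the order of the entries does not matter
row = Dict(:s => "more text", :i => 4, :x2 => 2.0, :x1 => 1.0)

push!(df, row)
```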
