How to get one dataframe from many dataframes in Julia (cf. rbindlist in R)

bertlhf · July 19, 2022, 12:36pm

Below is a schematic of how I’m doing this in R. How can I (a Julia newbie) do this in Julia?

# file abc.R
df <- rbindlist(list(df,
data.frame(A = "abc1", B = 1),
data.frame(A = "abc2", B = 2)))

# file bcd.R
df <- rbindlist(list(df,
data.frame(A = "bcd1", B = 1),
data.frame(A = "bcd2", B = 2)))

# file df.R
source("abc.R")
source("bcd.R")

# R repl
source("df.R")

# dataframe df
# A    B
# abc1 1
# abc2 2
# bcd1 1
# bcd2 2

nilshg · July 19, 2022, 12:56pm

julia> using DataFrames

julia> df1 = DataFrame(A = ["abc1", "abc2"], B = 1:2)
2×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2

julia> df2 = DataFrame(A = ["bcd1", "bcd2"], B = 1:2)
2×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ bcd1        1
   2 │ bcd2        2

julia> vcat(df1, df2)
4×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2

If you have a vector of many DataFrames, use reduce:

julia> reduce(vcat, [df1, df2])
4×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2

bertlhf · July 19, 2022, 1:32pm

Thanks for your quick reply and solution! I have two points about this.

I want to have the dataframe input in “row format”, e.g. DataFrame(A = "abc1", B = 1), DataFrame(A = "abc2", B = 2). This is easier and more intuitive for an end user. You can see this in my R example above.
I only want to define a name for the final dataframe (“df” in my R example). I don’t want to give names to intermediate dataframes (“df1” and “df2” in your solution), since I have many thousands of them in my R application.

Can these points be satisfied in a Julia solution?

nilshg · July 19, 2022, 1:48pm

Yes, you can simply push! a tuple onto an existing DataFrame (here df is the result of the vcat above):

julia> push!(df, ("cde1", 1))
5×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2
   5 │ cde1        1

or if you want column names so as not to rely on the ordering of fields in the tuple, you can use a NamedTuple:

julia> push!(df, (B = 3, A = "cde1"))
6×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2
   5 │ cde1        1
   6 │ cde1        3
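To connect this back to the file layout in the R example, here is a minimal sketch, assuming the final DataFrame is created once as an empty, typed table and every row is then added with push!. The file names abc.jl, bcd.jl and df.jl are hypothetical stand-ins for the R files; only the column names A and B come from the question.

```julia
using DataFrames

# Start from an empty DataFrame with typed columns (names taken from the R example).
df = DataFrame(A = String[], B = Int[])

# Contents of a hypothetical abc.jl: one push! per row, no intermediate names.
push!(df, ("abc1", 1))
push!(df, ("abc2", 2))

# Contents of a hypothetical bcd.jl.
push!(df, ("bcd1", 1))
push!(df, ("bcd2", 2))

# In a file-based setup, a hypothetical df.jl would instead contain:
#   include("abc.jl")
#   include("bcd.jl")

nrow(df)  # 4
```

Because include evaluates each file in the global scope of the calling module, the push! calls in the separate files would all act on the same global df.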

bertlhf · July 19, 2022, 2:05pm

Thanks, I think this is what I was looking for. I will try this out. I’m very curious about the performance comparison R-Julia for this functionality.

bkamins · July 19, 2022, 8:14pm

If you push!, it is faster to use a Tuple rather than a NamedTuple, because then column names are not checked.
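A small sketch of the difference between the two forms, assuming the same two-column layout used earlier in the thread. It only illustrates how values are matched to columns and makes no claim about timings.

```julia
using DataFrames

df = DataFrame(A = String[], B = Int[])

# Tuple: values are matched to columns by position, so the order must be (A, B).
push!(df, ("abc1", 1))

# NamedTuple: values are matched by column name, so the field order is free.
push!(df, (B = 2, A = "abc2"))

df.A  # ["abc1", "abc2"]
df.B  # [1, 2]
```

The positional form skips the name matching, which is the check mentioned above, but it relies on every row being written in column order.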

bertlhf · July 28, 2022, 7:27am

Based on the tips I got from this thread, I tried out Julia DataFrames and compared them with my R application, which uses data frames with rbindlist.
I want to give feedback on the result here.
A small script was used to convert the R data source files to .jl files: at present 9749 files covering 102433 rows with 6 columns.
A first conclusion is that both solutions are about equally fast in loading the data: 28.3 sec user time (31.5 sec elapsed) in R vs 26.5 sec in Julia.
Later I want to benchmark the queries on the data and report on that as well.
At present my lack of knowledge of the Julia query commands is causing a delay.
I have a feeling, based on a few small trials, that Julia will perform better there.

R

> tb_load()
   user  system elapsed 
 28.303   1.984  31.529 

> nrow(triple)
[1] 102433

> names(triple)
[1] "SUBJECT"   "PRIMES"    "PREDICATE" "PRIMEP"    "OBJECT"    "PRIMEO"

> tb_count_out()
Triples: 102433, entities: 36750, pairs: 10076, files: 9749

Julia

julia> tb_load()
 26.549103 seconds (7.53 M allocations: 372.071 MiB, 0.60% gc time)
102433×6 DataFrame

julia> nrow(triple)
102433

julia> names(triple)
6-element Vector{String}:
 "SUBJECT"
 "PRIMES"
 "PREDICATE"
 "PRIMEP"
 "OBJECT"
 "PRIMEO"

nilshg · July 28, 2022, 3:30pm

People on here are generally very happy to help with optimizing code for speed, so if you can create a minimal working example (i.e. a bit of code that generates the files you’re reading in with dummy data, and produces the data set you’re looking to produce), you’ll probably get helpful comments on it.
