How to get one dataframe from many dataframes in Julia (cf. rbindlist in R)

Below is a schematic of how I'm doing this in R. How can I (a Julia newbie) do the same in Julia?

# file abc.R
df <- rbindlist(list(df,
data.frame(A = "abc1", B = 1),
data.frame(A = "abc2", B = 2)))

# file bcd.R
df <- rbindlist(list(df,
data.frame(A = "bcd1", B = 1),
data.frame(A = "bcd2", B = 2)))

# file df.R
source("abc.R")
source("bcd.R")

# R repl
source("df.R")

# dataframe df
# A    B
# abc1 1
# abc2 2
# bcd1 1
# bcd2 2
julia> df1 = DataFrame(A = ["abc1", "abc2"], B = 1:2)
2×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2

julia> df2 = DataFrame(A = ["bcd1", "bcd2"], B = 1:2)
2×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ bcd1        1
   2 │ bcd2        2

julia> vcat(df1, df2)
4×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2

If you have a vector of many DataFrames do:

julia> reduce(vcat, [df1, df2])
4×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2
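
The vector passed to reduce does not have to be built from named variables. A minimal sketch, assuming DataFrames.jl is installed and reusing the row values from the R example; the one-row data frames are written inline, so only the final result gets a name:

```julia
using DataFrames

# One-row data frames written inline, so no intermediate names are needed.
df = reduce(vcat, [
    DataFrame(A = "abc1", B = 1),
    DataFrame(A = "abc2", B = 2),
    DataFrame(A = "bcd1", B = 1),
    DataFrame(A = "bcd2", B = 2),
])

println(df)
```

This allocates a small data frame per row, so it is meant as an illustration of the pattern rather than a recommendation for thousands of rows.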

Thanks for your quick reply and solution! I have two points to this.

  1. I want to enter the data in "row format", e.g. DataFrame(A = "abc1", B = 1), DataFrame(A = "abc2", B = 2). This is easier and more intuitive for an end user. You can see this in my R example above.

  2. I only want to name the final dataframe ("df" in my R example). I don't want to name the intermediate dataframes ("df1" and "df2" in your solution), since I have many thousands of them in my R application.

Can these points be satisfied in a Julia solution?

Yes, you can simply push! a tuple onto an existing DataFrame:

julia> push!(df, ("cde1", 1))
5×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2
   5 │ cde1        1

or if you want column names so as not to rely on the ordering of fields in the tuple, you can use a NamedTuple:

julia> push!(df, (B = 3, A = "cde1"))
6×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2
   5 │ cde1        1
   6 │ cde1        3
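
Putting the two forms together, the whole data frame can be built row by row from an empty one. A minimal sketch, assuming DataFrames.jl is installed; the column types String and Int are an assumption based on the examples above:

```julia
using DataFrames

# Start from an empty data frame with explicit column types
# (String and Int are assumed here, matching the examples above).
df = DataFrame(A = String[], B = Int[])

# Rows given as plain Tuples are matched to columns by position.
push!(df, ("abc1", 1))
push!(df, ("abc2", 2))

# Rows given as NamedTuples are matched by column name,
# so the order of the fields does not matter.
push!(df, (B = 1, A = "bcd1"))
push!(df, (A = "bcd2", B = 2))

println(df)
```

Each source file could then contain only push! calls on the shared df and be loaded with include, much like source in R. That file layout is an assumption for illustration, not something shown in this thread.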

Thanks, I think this is what I was looking for. I will try this out. I’m very curious about the performance comparison R-Julia for this functionality.

If you use push!, a Tuple is faster than a NamedTuple, because with a Tuple the column names are not checked.
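
One way to check this on your own machine is to time both variants. A rough sketch, assuming DataFrames.jl is installed; the helper names fill_with_tuples and fill_with_namedtuples are hypothetical, and the timings will depend on the Julia and DataFrames.jl versions:

```julia
using DataFrames

# Hypothetical helper: push n rows given as plain Tuples (matched by position).
function fill_with_tuples(n)
    df = DataFrame(A = String[], B = Int[])
    for i in 1:n
        push!(df, ("row$i", i))
    end
    return df
end

# Hypothetical helper: push n rows given as NamedTuples (matched by name).
function fill_with_namedtuples(n)
    df = DataFrame(A = String[], B = Int[])
    for i in 1:n
        push!(df, (A = "row$i", B = i))
    end
    return df
end

# Run both once on a tiny input so compilation time is not measured below.
fill_with_tuples(10)
fill_with_namedtuples(10)

t_tuple = @elapsed df_t = fill_with_tuples(100_000)
t_named = @elapsed df_n = fill_with_namedtuples(100_000)

println("Tuple:      ", t_tuple, " s")
println("NamedTuple: ", t_named, " s")
```

Both helpers produce the same data frame; only the time spent in push! differs.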

  • Based on the tips I got in this thread, I tried out Julia dataframes and compared them with my R dataframes-with-rbindlist application.
  • I want to give feedback on the result here.
  • A little script was used to convert the R data source files to .jl files, at present 9749 files covering 102433 rows with 6 columns.
  • A first conclusion is that both solutions are about equally fast in loading the data: 28.3 sec (R) vs 26.5 sec (Julia).
  • At a later moment I want to benchmark the queries on the data and report on that as well.
  • At present my lack of knowledge of the Julia query commands causes a delay.
  • I have a feeling, based on a few small trials, that here Julia will have a better performance.

R

> tb_load()
   user  system elapsed 
 28.303   1.984  31.529 

> nrow(triple)
[1] 102433

> names(triple)
[1] "SUBJECT"   "PRIMES"    "PREDICATE" "PRIMEP"    "OBJECT"    "PRIMEO"

> tb_count_out()
Triples: 102433, entities: 36750, pairs: 10076, files: 9749

Julia

julia> tb_load()
 26.549103 seconds (7.53 M allocations: 372.071 MiB, 0.60% gc time)
102433×6 DataFrame

julia> nrow(triple)
102433

julia> names(triple)
6-element Vector{String}:
 "SUBJECT"
 "PRIMES"
 "PREDICATE"
 "PRIMEP"
 "OBJECT"
 "PRIMEO"

People here are generally very willing to help with optimizing code for speed, so if you can create a minimal working example (i.e. a bit of code that generates the files you're reading in with dummy data, and produces the data set you're looking to produce), you'll probably get helpful comments on it.
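
Such a minimal working example could look roughly like the sketch below. It assumes DataFrames.jl is installed. The column names come from the output above, but the file format (tab-separated text), the column types, and the helper names write_dummy_files and load_dummy are all assumptions made for illustration; the real application loads .jl files instead.

```julia
using DataFrames

# Write `nfiles` dummy files with `rows_per_file` tab-separated rows each.
# The file layout and the values are made up for illustration only.
function write_dummy_files(dir, nfiles, rows_per_file)
    for f in 1:nfiles
        open(joinpath(dir, "file$f.tsv"), "w") do io
            for r in 1:rows_per_file
                println(io, join(("s$f", f, "p$r", r, "o$f-$r", f + r), '\t'))
            end
        end
    end
end

# Read every file in `dir` and push each row onto one data frame.
# Column types (String and Int) are assumed, not taken from the thread.
function load_dummy(dir)
    triple = DataFrame(SUBJECT = String[], PRIMES = Int[],
                       PREDICATE = String[], PRIMEP = Int[],
                       OBJECT = String[], PRIMEO = Int[])
    for path in readdir(dir; join = true)
        for line in eachline(path)
            s, ps, p, pp, o, po = split(line, '\t')
            push!(triple, (s, parse(Int, ps), p, parse(Int, pp), o, parse(Int, po)))
        end
    end
    return triple
end

dir = mktempdir()
write_dummy_files(dir, 100, 10)
@time triple = load_dummy(dir)
println(nrow(triple), " rows loaded")
```

Scaling nfiles and rows_per_file toward the real numbers (9749 files, 102433 rows) would make the timing comparable to the figures reported above.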