Below is a schematic of how I'm doing this in R. How can I (a Julia newbie) do this in Julia?
# file abc.R
df <- rbindlist(list(df,
data.frame(A = "abc1", B = 1),
data.frame(A = "abc2", B = 2)))
# file bcd.R
df <- rbindlist(list(df,
data.frame(A = "bcd1", B = 1),
data.frame(A = "bcd2", B = 2)))
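# file df.R
library(data.table)  # provides rbindlist()
df <- NULL           # start empty; rbindlist() skips NULL entries
source("abc.R")
source("bcd.R")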
# R repl
source("df.R")
# dataframe df
# A B
# abc1 1
# abc2 2
# bcd1 1
# bcd2 2
nilshg, July 19, 2022, 12:56pm
julia> df1 = DataFrame(A = ["abc1", "abc2"], B = 1:2)
2×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
julia> df2 = DataFrame(A = ["bcd1", "bcd2"], B = 1:2)
2×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ bcd1        1
   2 │ bcd2        2
julia> vcat(df1, df2)
4×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2
If you have a vector of many DataFrames do:
julia> reduce(vcat, [df1, df2])
4×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2
Thanks for your quick reply and solution! I have two points to this.
I want to enter the dataframe input in "row format", e.g. DataFrame(A = "abc1", B = 1), DataFrame(A = "abc2", B = 2). This is easier and more intuitive for an end user. You can see this in my R example above.
I only want to define a name for the final dataframe (in my R example, "df"). I don't want to give names to intermediate dataframes (in your solution, "df1" and "df2"), since I have many thousands of them in my R application.
Can these points be satisfied in a Julia solution?
nilshg, July 19, 2022, 1:48pm
Yes, you can simply push! a tuple onto an existing DataFrame:
julia> push!(df, ("cde1", 1))
5×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2
   5 │ cde1        1
or, if you want column names so as not to rely on the ordering of fields in the tuple, you can use a NamedTuple:
julia> push!(df, (B = 3, A = "cde1"))
6×2 DataFrame
 Row │ A       B
     │ String  Int64
─────┼───────────────
   1 │ abc1        1
   2 │ abc2        2
   3 │ bcd1        1
   4 │ bcd2        2
   5 │ cde1        1
   6 │ cde1        3
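To mirror the file layout from the R example, something like the following might work. This is only a sketch: the file names abc.jl and bcd.jl are hypothetical counterparts of abc.R and bcd.R, and they are written to a temporary directory here purely so the example is self-contained. Each file contains nothing but push! calls on a global df, so no intermediate DataFrames need names.

```julia
using DataFrames

# Hypothetical stand-ins for abc.R / bcd.R, written to a temp directory
# only so this sketch runs on its own.
dir = mktempdir()
write(joinpath(dir, "abc.jl"), """
push!(df, ("abc1", 1))
push!(df, ("abc2", 2))
""")
write(joinpath(dir, "bcd.jl"), """
push!(df, ("bcd1", 1))
push!(df, ("bcd2", 2))
""")

# Empty DataFrame with typed columns; the included files append rows to it.
df = DataFrame(A = String[], B = Int[])

# Counterpart of source("abc.R"); source("bcd.R") in df.R.
for file in ["abc.jl", "bcd.jl"]
    include(joinpath(dir, file))
end

println(df)
```

Because include evaluates each file in the module's global scope, the push! calls inside the files see the df defined before the loop.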
Thanks, I think this is what I was looking for. I will try this out. I'm very curious how R and Julia compare in performance for this functionality.
If you push!, it is faster to use a Tuple rather than a NamedTuple, as then column names are not checked.
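A small sketch of the difference between the two forms, assuming the same two-column layout (A, B) used earlier in the thread. It only illustrates how values are matched to columns and makes no claim about timings:

```julia
using DataFrames

df = DataFrame(A = String[], B = Int[])

# Tuple: values are matched to columns by position, so order matters.
push!(df, ("abc1", 1))

# NamedTuple: values are matched to columns by name, so field order is free.
push!(df, (B = 2, A = "abc2"))

println(df)
```

Both calls append one row; the Tuple form relies on the values being in column order.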
Based on the tips I got in this thread, I tried out Julia DataFrames and compared them with my R application, which uses data frames with rbindlist.
I want to give feedback on the result here.
A small script was used to convert the R data source files to .jl files, at present 9749 files covering 102433 rows with 6 columns.
A first conclusion is that both solutions load the data about equally fast: 28.3 s user time (31.5 s elapsed) in R vs 26.5 s in Julia.
Later on I want to benchmark queries on the data and report on that as well.
For now, my limited knowledge of Julia's query commands is holding that up.
Based on a few small trials, I have a feeling that Julia will perform better there.
R
> tb_load()
   user  system elapsed
 28.303   1.984  31.529
> nrow(triple)
[1] 102433
> names(triple)
[1] "SUBJECT" "PRIMES" "PREDICATE" "PRIMEP" "OBJECT" "PRIMEO"
> tb_count_out()
Triples: 102433, entities: 36750, pairs: 10076, files: 9749
Julia
julia> tb_load()
26.549103 seconds (7.53 M allocations: 372.071 MiB, 0.60% gc time)
102433×6 DataFrame
julia> nrow(triple)
102433
julia> names(triple)
6-element Vector{String}:
"SUBJECT"
"PRIMES"
"PREDICATE"
"PRIMEP"
"OBJECT"
"PRIMEO"
nilshg, July 28, 2022, 3:30pm
People on here are generally very willing to help with optimizing code for speed, so if you can create a minimal working example (i.e. a bit of code that generates the files you're reading in with dummy data, and produces the data set you're looking to produce) you'll probably get helpful comments on it.
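One possible shape for such a minimal working example is sketched below. Everything in it is an assumption made for illustration: the helper make_dummy_files, the file names, the three String columns, and the file and row counts do not reflect the real application or its schema.

```julia
using DataFrames

# Hypothetical generator: writes `nfiles` .jl files, each of which pushes
# `nrows` dummy rows onto a global DataFrame named `triple`.
function make_dummy_files(dir, nfiles, nrows)
    for i in 1:nfiles
        open(joinpath(dir, "file$(i).jl"), "w") do io
            for j in 1:nrows
                println(io, "push!(triple, (\"s$(i)\", \"p$(j)\", \"o$(j)\"))")
            end
        end
    end
end

dir = mktempdir()
make_dummy_files(dir, 100, 10)

# Column names and types are assumptions, not the real schema.
triple = DataFrame(SUBJECT = String[], PREDICATE = String[], OBJECT = String[])

# Load every generated file and time the whole pass.
elapsed = @elapsed begin
    for file in readdir(dir; join = true)
        include(file)
    end
end

println("Loaded $(nrow(triple)) rows in $(round(elapsed; digits = 3)) s")
```

Scaling nfiles and nrows up toward the real numbers would give others something concrete to profile.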