What I mean is that it is better to populate the vectors first, and then create a data frame from them, e.g.:
julia> using DataFrames, BenchmarkTools
julia>
julia> function test1()
           df = DataFrame(a=Vector{Int}(undef, 10^6),
                          b=Vector{String}(undef, 10^6),
                          c=Vector{Int}(undef, 10^6), copycols=false)
           for i in 1:10^6
               df.a[i] = 1
               df.b[i] = "1"
               df.c[i] = 1.0
           end
           return df
       end
test1 (generic function with 1 method)
julia>
julia> function test2()
           nt = (a=Vector{Int}(undef, 10^6),
                 b=Vector{String}(undef, 10^6),
                 c=Vector{Int}(undef, 10^6))
           for i in 1:10^6
               nt.a[i] = 1
               nt.b[i] = "1"
               nt.c[i] = 1.0
           end
           return DataFrame(nt, copycols=false)
       end
test2 (generic function with 1 method)
julia>
julia> @btime test1();
244.934 ms (5998506 allocations: 160.20 MiB)
julia> @btime test2();
4.881 ms (27 allocations: 22.89 MiB)
Since a DataFrame object is not type stable, it is best suited for operations that work on whole columns, as then type instability is not an issue.
The benefit of not being type stable is that we can accommodate very wide data frames without huge compilation overhead, and you can easily change the schema of a DataFrame.
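If you do need to start from an existing data frame and fill it row by row, one common way to sidestep the type instability is a function barrier: extract the columns once and pass them to a helper whose arguments are concretely typed, so the hot loop is compiled for those types. Below is a minimal sketch of that idea. The names fill_columns! and test3 are made up for illustration and are not part of DataFrames.jl, and I have not benchmarked this version here, so take it as a pattern to try rather than a measured result.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): inside this function
# the element types of a, b and c are known to the compiler, so the loop
# itself is type stable.
function fill_columns!(a::Vector{Int}, b::Vector{String}, c::Vector{Int})
    for i in eachindex(a, b, c)
        a[i] = 1
        b[i] = "1"
        c[i] = 1
    end
    return nothing
end

# Hypothetical variant of test1: the columns are looked up only once
# (that lookup is type unstable), and the per-row work happens behind
# the function barrier.
function test3()
    df = DataFrame(a=Vector{Int}(undef, 10^6),
                   b=Vector{String}(undef, 10^6),
                   c=Vector{Int}(undef, 10^6), copycols=false)
    fill_columns!(df.a, df.b, df.c)
    return df
end
```

The point of the design is that df.a, df.b and df.c are each resolved a single time instead of on every loop iteration, which is what made test1 slow and allocation-heavy.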