How to initialize an empty DataFrame of a specified size

What I mean is that it is better to populate the vectors first, and then create a data frame from them, e.g.:

julia> using DataFrames, BenchmarkTools

julia> function test1()
           df = DataFrame(a=Vector{Int}(undef, 10^6),
                          b=Vector{String}(undef, 10^6),
                          c=Vector{Int}(undef, 10^6), copycols=false)
           for i in 1:10^6
               df.a[i] = 1
               df.b[i] = "1"
               df.c[i] = 1.0
           end
           return df
       end
test1 (generic function with 1 method)

julia> function test2()
           nt = (a=Vector{Int}(undef, 10^6),
                 b=Vector{String}(undef, 10^6),
                 c=Vector{Int}(undef, 10^6))
           for i in 1:10^6
               nt.a[i] = 1
               nt.b[i] = "1"
               nt.c[i] = 1.0
           end
           return DataFrame(nt, copycols=false)
       end
test2 (generic function with 1 method)

julia> @btime test1();
  244.934 ms (5998506 allocations: 160.20 MiB)

julia> @btime test2();
  4.881 ms (27 allocations: 22.89 MiB)

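Since a DataFrame object is not type stable, it is best suited for operations that work on whole columns, because then the type instability is not an issue. In test1 every `df.a[i]` access inside the loop has to resolve the column type at run time, which is where the extra allocations come from.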
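If you do need a row-by-row loop over an existing data frame, one common way around the instability is a function barrier: pull the columns out once and pass them to a function, so the loop runs on concretely typed vectors. This is only a minimal sketch; `sum_product` is a hypothetical helper made up for illustration, not part of DataFrames.jl.

```julia
using DataFrames

# Hypothetical kernel: it receives plain vectors, so the compiler knows
# their element types and the loop body is type stable.
function sum_product(a::AbstractVector, c::AbstractVector)
    total = zero(promote_type(eltype(a), eltype(c)))
    for i in eachindex(a, c)
        total += a[i] * c[i]
    end
    return total
end

df = DataFrame(a=[1, 2, 3], c=[4, 5, 6])

# The column types are resolved once, at this call (the function barrier),
# instead of on every iteration of the loop.
result = sum_product(df.a, df.c)
```

The dynamic lookup still happens, but only once per call rather than once per element, which is the same reason test2 above is fast.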

The benefit of not being type stable is that we can accommodate very wide data frames without huge compilation overhead, and you can easily change the schema of a DataFrame.
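To illustrate the schema point, here is a small sketch of changing columns in place. The column names and values are made up for the example.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])

# Add a column with a new element type.
df.b = ["x", "y", "z"]

# Replace an existing column with one of a different element type.
df.a = Float64.(df.a)

# Record the column element types after the changes.
types_after_change = eltype.(eachcol(df))

# Drop a column in place.
select!(df, Not(:a))
```

None of these changes alter the type of `df` itself, since the column types are not part of the `DataFrame` type.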
