How to initialize an empty DataFrame of a specified size

Hi, my question is as stated in the title. I want to know the easiest way to create a data frame with the column names, the column types, and the number of rows all specified up front. I know you can do the following:

using DataFrames

testDF = DataFrame(columnA = String[], columnB = Int64[], columnC = Float64[])
push!(testDF,["hi", 3, 78.9])

But this is very inefficient if you already know what the size of the dataframe is going to be.

So how could I also specify the size of the dataframe when initializing it?

Most likely this is going to be inefficient. How are you planning to use this data frame later?
Such a method existed in the past, but it was dropped because it was often used in ways that led to inefficient code.

Having said that, you can do e.g. (I have used 10 as the number of rows):

DataFrame(["column" * i => Vector{T}(undef, 10) for (i, T) in zip('A':'C', [String, Int, Float64])], copycols=false)

or

DataFrame([Vector{T}(undef, 10) for T in [String, Int, Float64]], "column" .* ('A':'C'), copycols=false)

Thanks, this worked! I am filling the dataframe with calculated values in a for loop. Since I already know the data I am going to loop over, I also know what the size of the dataframe will be. Isn’t it then more efficient to specify the size of the dataframe beforehand instead of appending a new row in each iteration?

What I mean is that it is better to populate the vectors first, and then create a data frame from them, e.g.:

julia> using DataFrames, BenchmarkTools

julia> function test1()
           df = DataFrame(a=Vector{Int}(undef, 10^6),
                          b=Vector{String}(undef, 10^6),
                          c=Vector{Int}(undef, 10^6), copycols=false)
           for i in 1:10^6
               df.a[i] = 1
               df.b[i] = "1"
               df.c[i] = 1.0
           end
           return df
       end
test1 (generic function with 1 method)

julia> function test2()
           nt = (a=Vector{Int}(undef, 10^6),
                 b=Vector{String}(undef, 10^6),
                 c=Vector{Int}(undef, 10^6))
           for i in 1:10^6
               nt.a[i] = 1
               nt.b[i] = "1"
               nt.c[i] = 1.0
           end
           return DataFrame(nt, copycols=false)
       end
test2 (generic function with 1 method)

julia> @btime test1();
  244.934 ms (5998506 allocations: 160.20 MiB)

julia> @btime test2();
  4.881 ms (27 allocations: 22.89 MiB)

Since a DataFrame object is not type stable, it is best suited for operations that work on whole columns, because then the type instability is not an issue.

The benefit of not being type stable is that we can accommodate very wide data frames without huge compilation overhead, and you can easily change the schema of a DataFrame.
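If you do want to keep the preallocated data frame and still fill it in a loop, one option is to pass the columns to a separate function, so that the loop runs on concretely typed vectors. This is only a sketch: `fill_columns!` and the values written are hypothetical and not from the benchmark above.

```julia
using DataFrames

# Hypothetical helper: inside this function the element types of `a` and `b`
# are known, so the compiler can specialize the loop.
function fill_columns!(a::Vector{Int}, b::Vector{String})
    for i in eachindex(a, b)
        a[i] = i
        b[i] = string(i)
    end
    return nothing
end

n = 10
df = DataFrame(a = Vector{Int}(undef, n),
               b = Vector{String}(undef, n), copycols=false)

# The type-unstable column lookups df.a and df.b happen once, here,
# instead of once per loop iteration.
fill_columns!(df.a, df.b)
```

Because `df.a` and `df.b` return the stored vectors without copying, the data frame is filled in place.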

Also note that what I have written above is relevant ONLY IF storing the data in the data frame is the part of the code that is expensive (as in my examples).

Usually processing the data is much more expensive than storing or retrieving it. In such a case it does not matter that much how you do it, so I normally use push!, as it is really nice to work with IMO (it has some overhead, but is still fast enough not to cause a computational bottleneck, assuming the other operations you do are expensive). Note that in the examples above storing 10^6 rows is a sub-second operation (with push! it would be more expensive but still under 1 second).
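For completeness, here is a minimal sketch of the push! pattern I mean. The column names and the `sqrt` call are just placeholders for whatever expensive computation you actually run per row.

```julia
using DataFrames

# Start from typed, empty columns and append one row per iteration.
results = DataFrame(id = Int[], label = String[], value = Float64[])

for i in 1:5
    value = sqrt(i)  # placeholder for an expensive computation
    # Pushing a NamedTuple matches the values to columns by name.
    push!(results, (id = i, label = "row$i", value = value))
end
```

Pushing a NamedTuple rather than a plain vector keeps each value tied to its column name, which is harder to get wrong when the schema changes.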
