How to initialize empty dataframe of specified size

Cevheriferd · August 31, 2021, 11:55am

Hi, my question is as stated in the title. I want to know the easiest way to create a dataframe where the column names, the column types and the number of rows are all specified up front. I know you can do the following:

using DataFrames

testDF = DataFrame(columnA = String[], columnB = Int64[], columnC = Float64[])
push!(testDF,["hi", 3, 78.9])

But this is very inefficient if you already know what the size of the dataframe is going to be.

So how could I also specify the size of the dataframe when initializing it?

bkamins · August 31, 2021, 12:08pm

Most likely this is going to be inefficient. How are you planning to use this data frame later?
Such a method existed in the past, but it was dropped because it was later used in a way that led to inefficient code.

Having said that, you can do, e.g. (I have used 10 as the number of rows):

DataFrame(["column" * i => Vector{T}(undef, 10) for (i, T) in zip('A':'C', [String, Int, Float64])], copycols=false)

or

DataFrame([Vector{T}(undef, 10) for T in [String, Int, Float64]], "column" .* ('A':'C'), copycols=false)
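For a small, fixed set of columns, the same preallocation can also be written with keyword arguments, which some readers may find easier to follow. This is only a sketch: the variable name `nrows` is an assumption for illustration, and the cells are uninitialized until they are assigned (reading an unassigned `String` cell throws an `UndefRefError`).

```julia
using DataFrames

nrows = 10  # hypothetical: number of rows known in advance

# Each column is an uninitialized vector of the requested element type.
df = DataFrame(columnA = Vector{String}(undef, nrows),
               columnB = Vector{Int}(undef, nrows),
               columnC = Vector{Float64}(undef, nrows))

size(df)  # (10, 3)
```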

Cevheriferd · August 31, 2021, 12:24pm

Thanks, this worked! I am filling the dataframe with some calculated values in a for loop. Since I already know the data I am going to loop over, I also know what the size of the dataframe will be. Isn’t it then more efficient to specify the size of the dataframe beforehand instead of appending a new row on each iteration?

bkamins · August 31, 2021, 12:50pm

What I mean is that it is better to populate the vectors first, and then create a data frame from them, e.g.:

julia> using DataFrames, BenchmarkTools

julia> function test1()
           df = DataFrame(a=Vector{Int}(undef, 10^6),
                          b=Vector{String}(undef, 10^6),
                          c=Vector{Int}(undef, 10^6), copycols=false)
           for i in 1:10^6
               df.a[i] = 1
               df.b[i] = "1"
               df.c[i] = 1.0
           end
           return df
       end
test1 (generic function with 1 method)

julia> function test2()
           nt = (a=Vector{Int}(undef, 10^6),
                 b=Vector{String}(undef, 10^6),
                 c=Vector{Int}(undef, 10^6))
           for i in 1:10^6
               nt.a[i] = 1
               nt.b[i] = "1"
               nt.c[i] = 1.0
           end
           return DataFrame(nt, copycols=false)
       end
test2 (generic function with 1 method)

julia> @btime test1();
  244.934 ms (5998506 allocations: 160.20 MiB)

julia> @btime test2();
  4.881 ms (27 allocations: 22.89 MiB)

Since a DataFrame object is not type stable, it is best suited for operations that work on whole columns, because then the type instability is not an issue.

The benefit of not being type stable is that we can accommodate very wide data frames without huge compilation overhead, and that you can easily change the schema of a DataFrame.
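If the data frame has to exist before the loop runs, one common way to get whole-column behavior is a function barrier: extract the columns once and pass them to a function, so the loop is compiled for concrete vector types. The sketch below is an illustration, not something benchmarked in this thread; `fill_columns!` is a hypothetical helper name, and the values written are placeholders.

```julia
using DataFrames

# Hypothetical helper: inside this function the column types are concrete,
# so the loop does not pay for the type instability of the data frame.
function fill_columns!(a::Vector{Int}, b::Vector{String}, c::Vector{Float64})
    for i in eachindex(a, b, c)
        a[i] = i
        b[i] = string(i)
        c[i] = i / 2
    end
    return nothing
end

df = DataFrame(a = Vector{Int}(undef, 5),
               b = Vector{String}(undef, 5),
               c = Vector{Float64}(undef, 5))

# df.a, df.b and df.c return the stored columns without copying,
# so the function fills the data frame in place.
fill_columns!(df.a, df.b, df.c)
```

The column lookup then happens once per column rather than once per row, which is the same idea as populating the vectors before constructing the data frame.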

bkamins · August 31, 2021, 1:16pm

Also note that what I have written above is relevant ONLY IF storing the data in the data frame is the part of the code that is expensive (as in my examples).

Usually processing the data is much more expensive than storing or retrieving it. In such a case it does not matter much how you do it, so I normally use push!, as it is really nice to work with IMO (it has some overhead, but is still fast enough not to cause a computational bottleneck, assuming the other operations you do are expensive). Note that in the examples above storing 10^6 rows is a sub-second operation (with push! it would be more expensive, but still under 1 second).
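For reference, a minimal sketch of the push! pattern with a row given as a NamedTuple, so that each value is matched to its column by name. The data frame name `results` and the computed values are placeholders for whatever the real loop produces.

```julia
using DataFrames

# Start from typed, empty columns and append one row per iteration.
results = DataFrame(columnA = String[], columnB = Int[], columnC = Float64[])

for i in 1:5
    # Stand-in for the expensive per-row computation.
    value = sqrt(i)
    push!(results, (columnA = "row$i", columnB = i, columnC = value))
end
```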
