Efficiently creating a data frame that is made up of smaller data frames

Hi all,

I’m trying to create a “long” data frame (i.e. several observations per individual, as in a longitudinal study). However, the data frame for each individual depends on certain parameters, so the operation is not the same for all N individuals: some will have more rows, some fewer. I was able to do it by creating an empty data frame and then repeatedly updating it within a for loop using vcat.

In terms of pseudocode, this is how it currently looks:

df = DataFrame(ID = Int64[], nᵢ = Int64[], ...) # Define the actual data frame I want

for i in 1:N 
    dftmp = DataFrame(ID = i, nᵢ = Int64[], ...) # Create a temporary data frame (same args as df)
    .
    .
    .
    df = vcat(df, dftmp) # Combine dftmp with df (i.e. Update df)
end

However, I feel like there must be a more efficient way of doing this and would appreciate some guidance. I come from an R background and have manipulated data frames using the tidyverse.

Thanks in advance.

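What you are doing is going to be slow, because every vcat inside the loop allocates a new data frame and copies all the rows accumulated so far.

Instead, do almost the same thing, but create a vector of data frames first and then vcat them all in one shot: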

reduce(vcat, [DataFrame(ID = i, nᵢ = ...) for i in 1:N])

If your source data frames have different sets of columns please read vcat documentation about options to handle this case.
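To make the one-shot pattern concrete, here is a minimal runnable sketch. The column `visit` and the rule that individual `i` contributes `i` rows are made up for illustration; they stand in for whatever the `...` does in the real code.

```julia
using DataFrames

N = 3

# Hypothetical rule: individual i has i observations (illustration only).
# The scalar ID = i is repeated to match the length of the visit column.
parts = [DataFrame(ID = i, visit = 1:i) for i in 1:N]

# Concatenate all the pieces in a single call instead of growing df in the loop.
df = reduce(vcat, parts)
```

The individuals end up with different numbers of rows, but since every piece has the same columns, no extra keyword arguments are needed here.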

Thank you for the response.

However, I am not sure how the command below works:

[DataFrame(ID = i], , nᵢ = ...) for i in 1:N])

In addition, the source data frames will have the same rows but different sets of columns.
Pseudocode:

for i in 1:N 
    dftmp = DataFrame(ID = i, nᵢ = Int64[], ...) # Create a temporary data frame (same args as df)
    .
    if (arg is true)
        dftmp = ...
    else
        dftmp = ...
    end

    df = vcat(df, dftmp) # Combine dftmp with df (i.e. Update df)
end

Is there a better way to handle the if/else inside the loop?

The OP used ... as a placeholder for some other operations, so I reused it. Clearly, the code was not meant to be runnable as written.

the source data frames will have the same rows but different sets of columns.

As I commented above, please read the vcat documentation. Here is the relevant part:

The cols keyword argument determines the columns of the returned data frame:
• :setequal: require all data frames to have the same column names disregarding order. If they
appear in different orders, the order of the first provided data frame is used.
• :orderequal: require all data frames to have the same column names and in the same order.
• :intersect: only the columns present in all provided data frames are kept. If the
intersection is empty, an empty data frame is returned.
• :union: columns present in at least one of the provided data frames are kept. Columns not
present in some data frames are filled with missing where necessary.
• A vector of Symbols or strings: only listed columns are kept. Columns not present in some
data frames are filled with missing where necessary.
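As a small illustration of two of these options, the sketch below concatenates two toy data frames that share only the `ID` column. The column names and values are invented for the example.

```julia
using DataFrames

a = DataFrame(ID = [1, 1], x = [0.1, 0.2])
b = DataFrame(ID = [2], y = ["late"])

# :union keeps ID, x and y; cells with no source value become missing.
u = vcat(a, b; cols = :union)

# :intersect keeps only the columns present in both inputs, here just ID.
k = vcat(a, b; cols = :intersect)
```

With the default `cols = :setequal`, the same call would throw an error, because the two data frames do not have the same set of column names.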

Now regarding your question:

Pseudocode:

Please share your full code if you want fully working code in response.

[DataFrame(ID = i, nᵢ = ...) for i in 1:N] is a comprehension: it evaluates the expression once for each i in 1:N and collects the results into a vector.
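If the syntax is unfamiliar, it may help to see a comprehension next to the loop it replaces. This toy sketch uses squares rather than data frames to keep it short; the variable names are arbitrary.

```julia
N = 4

# Comprehension: build a vector by evaluating i^2 for each i in 1:N.
squares = [i^2 for i in 1:N]

# The same result written as an explicit loop.
squares_loop = Int[]
for i in 1:N
    push!(squares_loop, i^2)
end
```

Replacing `i^2` with an expression that returns a data frame gives a vector of data frames, which is what reduce(vcat, ...) expects.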

Alternatively you can define e.g. a function taking one argument (i):

function gen_df(i)
    ...
end

and then use broadcasting like this:

reduce(vcat, gen_df.(1:N), cols=:union) # I use :union as I understand you want union of the columns
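Since the real per-individual logic was not shared, the following is only a sketch of how the if/else could move inside such a function. The rule (odd IDs get a `dose` column, even IDs get a `score` column and one more row) and all column names are hypothetical.

```julia
using DataFrames

# Hypothetical per-individual rule, standing in for the elided logic:
# odd IDs get a `dose` column, even IDs get a `score` column and an extra row.
function gen_df(i)
    if isodd(i)
        return DataFrame(ID = i, visit = 1:2, dose = [10.0, 20.0])
    else
        return DataFrame(ID = i, visit = 1:3, score = [1, 2, 3])
    end
end

N = 4

# Broadcast gen_df over the IDs, then concatenate once;
# columns missing from a piece are filled with missing.
df = reduce(vcat, gen_df.(1:N); cols = :union)
```

Keeping the branching inside gen_df means the concatenation step stays a single line no matter how complicated the per-individual logic becomes.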

Just a quick note: I was referring to the part " ID = i], ", which seems to have one "]," too many. (Anyway, not really relevant ;))

Ah - OK. Fixed