Is there a better way to do this? many calculated columns

Lincoln_Hannah · September 23, 2020, 5:33am

I’m creating a DataFrame with a large number of calculated columns.
To make the logic as clear as possible, I’m avoiding the vectorised operators and instead putting all the logic within an eachrow loop.

new_DF = DataFrame()

for r = eachrow(base_DF)
    d = DotMap()

    d.Col1   =    somelogic
    d.Col2   =    somelogic
     ...
    d.Col20  =    somelogic

    append!( new_DF,  d ) 
end

I’d like to avoid the procedural elements:

        new_DF = DataFrame()      and
        append!(new_DF, d)

For a single vector there is the Python-style list comprehension [x for x in list].

Is there an equivalent for DataFrames?

Can you create all the fields in a loop like this and have them automatically output a DataFrame?

remi-garcia · September 23, 2020, 6:21am

Do you mean something like:

julia> DataFrame([[i*2 for i in 12:20] [i+1 for i in 12:20]], ["name_1", "name_2"])
9×2 DataFrame
│ Row │ name_1 │ name_2 │
│     │ Int64  │ Int64  │
├─────┼────────┼────────┤
│ 1   │ 24     │ 13     │
│ 2   │ 26     │ 14     │
│ 3   │ 28     │ 15     │
│ 4   │ 30     │ 16     │
│ 5   │ 32     │ 17     │
│ 6   │ 34     │ 18     │
│ 7   │ 36     │ 19     │
│ 8   │ 38     │ 20     │
│ 9   │ 40     │ 21     │

[A B] corresponds to hcat. You could write hcat([i*2 for i in 12:20], [i+1 for i in 12:20]) instead.
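
As a small check of that equivalence, here is a sketch using only Base Julia. The variable names a, b, m1 and m2 are made up for illustration.

```julia
# Two column vectors built with comprehensions
a = [i * 2 for i in 12:20]
b = [i + 1 for i in 12:20]

# Space-separated values inside brackets concatenate horizontally,
# which is the same operation hcat performs
m1 = [a b]
m2 = hcat(a, b)

println(m1 == m2)
println(size(m1))
```

Both forms produce a 9×2 matrix, one column per comprehension.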

Lincoln_Hannah · September 23, 2020, 6:34am

More like this:

DataFrame(
       col_1 = 1:100,
       col_2 = col_1^2,
       col_3 = col_2 + 10
)

Can you just define all the fields, with dependence on each other?

remi-garcia · September 23, 2020, 7:40am

I don’t think you can make fields depend on each other in one line.

This becomes harder and harder to read, but:

julia> DataFrame([collect(1:10) collect(1:10).^2 (collect(1:10).^2).+10])
10×3 DataFrame
│ Row │ x1    │ x2    │ x3    │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 11    │
│ 2   │ 2     │ 4     │ 14    │
│ 3   │ 3     │ 9     │ 19    │
│ 4   │ 4     │ 16    │ 26    │
│ 5   │ 5     │ 25    │ 35    │
│ 6   │ 6     │ 36    │ 46    │
│ 7   │ 7     │ 49    │ 59    │
│ 8   │ 8     │ 64    │ 74    │
│ 9   │ 9     │ 81    │ 91    │
│ 10  │ 10    │ 100   │ 110   │

works. Now you have to hope that the compiler optimizes the repeated computations.

Maybe this could be a solution:

julia> col_1 = 1:10; col_2 = col_1.^2; col_3 = col_2 .+ 10;

julia> DataFrame([col_1, col_2, col_3])
10×3 DataFrame
│ Row │ x1    │ x2    │ x3    │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 11    │
│ 2   │ 2     │ 4     │ 14    │
│ 3   │ 3     │ 9     │ 19    │
│ 4   │ 4     │ 16    │ 26    │
│ 5   │ 5     │ 25    │ 35    │
│ 6   │ 6     │ 36    │ 46    │
│ 7   │ 7     │ 49    │ 59    │
│ 8   │ 8     │ 64    │ 74    │
│ 9   │ 9     │ 81    │ 91    │
│ 10  │ 10    │ 100   │ 110   │

julia> col_1 = col_2 = col_3 = nothing;
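
A related sketch that avoids the temporary variables is to assign columns to the DataFrame one at a time, so each new column can read the ones already added. It assumes DataFrames.jl is installed. The name df is made up here, and the column names follow the ones used earlier in the thread.

```julia
using DataFrames

# Start from the single independent column
df = DataFrame(col_1 = 1:10)

# Each assignment adds a column; later columns can read earlier ones
df.col_2 = df.col_1 .^ 2
df.col_3 = df.col_2 .+ 10

println(df)
```

This still uses the broadcast operators, but each line reads as one column definition and nothing needs to be cleaned up afterwards.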

mariok90 · September 23, 2020, 8:11am

You can also construct DataFrames with NamedTuples:

DataFrame(
       (col_1 = i,
       col_2 = i^2,
       col_3 = i^2 + 10) for i in 1:100
)

For more complex operations you could define functions beforehand.

func1(x) = x^2
func2(x) = func1(x) + 10

DataFrame(
       (col_1 = i,
       col_2 = func1(i),
       col_3 = func2(i)) for i in 1:100
)

nalimilan · September 23, 2020, 11:55am

Depending on exactly what the operations do, you could use select(base_DF, ...) or combine(base_DF, ...).
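
One way this might look, as a sketch only: it uses transform (a sibling of select that keeps the existing columns) together with ByRow so the logic stays scalar. The data and the names step1 and new_DF are assumptions for illustration. The calls are chained because, as far as I know, a single call computes every new column from the input table, so a column cannot depend on another column created in the same call.

```julia
using DataFrames

# Hypothetical input table
base_DF = DataFrame(col_1 = 1:5)

# First call: derive col_2 from col_1, row by row
step1 = transform(base_DF, :col_1 => ByRow(x -> x^2) => :col_2)

# Second call: col_2 now exists, so col_3 can depend on it
new_DF = transform(step1, :col_2 => ByRow(x -> x + 10) => :col_3)

println(new_DF)
```

transform returns a new DataFrame each time, so base_DF is left untouched.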

Lincoln_Hannah · September 24, 2020, 12:40am

Thanks for your help, everyone.
I think I like this approach best: vectorise an anonymous function that outputs a DotMap, then
convert the result to a DataFrame. This allows each new column to use previous columns as input.
It would be nice if you could just add new columns to the DataFrameRow variable, but I don’t think this is possible.

new_DF = DataFrame(  ( function( row );     d = DotMap()

    d.Col_1   =    somelogic( row )
    d.Col_2   =    somelogic( row, d.Col_1 )
     ...
    d.Col_20  =    somelogic( row, d.Col_1, ... d.Col_19 )

    d end).(eachrow(base_DF)))
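
For comparison, the same idea can be sketched without DotMap by returning a NamedTuple from an ordinary function, along the lines of the NamedTuple constructor suggested earlier in the thread. The function make_row, the input column x and the formulas are hypothetical stand-ins for the real logic.

```julia
using DataFrames

# Hypothetical input table
base_DF = DataFrame(x = 1:5)

# Hypothetical row function: plain local variables let later values
# depend on earlier ones, and the NamedTuple keys become column names
function make_row(row)
    col_1 = row.x
    col_2 = col_1^2
    col_3 = col_2 + 10
    return (col_1 = col_1, col_2 = col_2, col_3 = col_3)
end

# Map over the rows, then build the DataFrame from the NamedTuples
new_DF = DataFrame(map(make_row, eachrow(base_DF)))

println(new_DF)
```

Because every row yields a NamedTuple with the same concrete field types, the resulting columns are concretely typed, here Int64.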
