Is there a better way to do this? Many calculated columns

I'm creating a DataFrame with a large number of calculated columns.
To make the logic as clear as possible, I'm avoiding the vectorised operators and instead putting all the logic within an eachrow loop.

new_DF = DataFrame()

for r = eachrow(base_DF)
    d = DotMap()

    d.Col1   =    somelogic
    d.Col2   =    somelogic
     ...
    d.Col20  =    somelogic

    append!( new_DF,  d ) 
end

I’d like to avoid the procedural elements:

        new_DF = DataFrame()
and
        append!(new_DF, d)

For a single vector there is the Python-style list comprehension [x for x in list].

Is there an equivalent for DataFrames?

Can you create all the fields in a loop like this and have them automatically output as a DataFrame?

Do you mean something like:

julia> DataFrame([[i*2 for i in 12:20] [i+1 for i in 12:20]], ["name_1", "name_2"])
9×2 DataFrame
│ Row │ name_1 │ name_2 │
│     │ Int64  │ Int64  │
├─────┼────────┼────────┤
│ 1   │ 24     │ 13     │
│ 2   │ 26     │ 14     │
│ 3   │ 28     │ 15     │
│ 4   │ 30     │ 16     │
│ 5   │ 32     │ 17     │
│ 6   │ 34     │ 18     │
│ 7   │ 36     │ 19     │
│ 8   │ 38     │ 20     │
│ 9   │ 40     │ 21     │

[A B] corresponds to hcat. You could write hcat([i*2 for i in 12:20], [i+1 for i in 12:20]) instead.
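
A quick way to convince yourself of that equivalence (just a small check, with the variable names a, b, m1 and m2 made up for illustration):

```julia
a = [i*2 for i in 12:20]
b = [i+1 for i in 12:20]

m1 = [a b]        # space-separated vectors are concatenated as columns
m2 = hcat(a, b)   # the explicit function form

m1 == m2          # true
size(m1)          # (9, 2)
```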

More like this:

DataFrame(
       col_1 = 1:100,
       col_2 = col_1^2,
       col_3 = col_2 + 10
)

Can you just define all the fields, with dependence on each other?

I don’t think you can make fields depend on each other in one line.

That becomes harder and harder to read, but:

julia> DataFrame([collect(1:10) collect(1:10).^2 (collect(1:10).^2).+10])
10×3 DataFrame
│ Row │ x1    │ x2    │ x3    │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 11    │
│ 2   │ 2     │ 4     │ 14    │
│ 3   │ 3     │ 9     │ 19    │
│ 4   │ 4     │ 16    │ 26    │
│ 5   │ 5     │ 25    │ 35    │
│ 6   │ 6     │ 36    │ 46    │
│ 7   │ 7     │ 49    │ 59    │
│ 8   │ 8     │ 64    │ 74    │
│ 9   │ 9     │ 81    │ 91    │
│ 10  │ 10    │ 100   │ 110   │

This works, but now you have to hope that the compiler optimizes away the repeated computations.

Maybe this could be a solution:

julia> col_1 = 1:10; col_2 = col_1.^2; col_3 = col_2 .+ 10;

julia> DataFrame([col_1, col_2, col_3])
10×3 DataFrame
│ Row │ x1    │ x2    │ x3    │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 11    │
│ 2   │ 2     │ 4     │ 14    │
│ 3   │ 3     │ 9     │ 19    │
│ 4   │ 4     │ 16    │ 26    │
│ 5   │ 5     │ 25    │ 35    │
│ 6   │ 6     │ 36    │ 46    │
│ 7   │ 7     │ 49    │ 59    │
│ 8   │ 8     │ 64    │ 74    │
│ 9   │ 9     │ 81    │ 91    │
│ 10  │ 10    │ 100   │ 110   │

julia> col_1 = col_2 = col_3 = nothing;

You can also construct DataFrames with NamedTuples:

DataFrame(
       (col_1 = i,
       col_2 = i^2,
       col_3 = i^2 + 10) for i in 1:100
)

For more complex operations you could define functions beforehand.

func1(x) = x^2
func2(x) = func1(x) + 10

DataFrame(
       (col_1 = i,
       col_2 = func1(i),
       col_3 = func2(i)) for i in 1:100
)

Depending on exactly what the operations do, you could use select(base_DF, ...) or combine(base_DF, ...).
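
As a rough sketch of that idea, here is the related transform function, which takes the same pair syntax but keeps the existing columns. The small base_DF and the column names are made up for illustration, and this assumes a DataFrames.jl version that supports the source => function => target syntax and ByRow. Chaining two calls lets the second one read a column created by the first:

```julia
using DataFrames

# Stand-in for the real base_DF (hypothetical data).
base_DF = DataFrame(col_1 = 1:10)

# First call adds col_2 from col_1; ByRow applies the function to each element.
step1 = transform(base_DF, :col_1 => ByRow(x -> x^2) => :col_2)

# Second call can read col_2 because it already exists in step1.
new_DF = transform(step1, :col_2 => ByRow(x -> x + 10) => :col_3)
```

select accepts the same pairs but keeps only the columns you list, so it may be the better fit if you don't want to carry every column of base_DF into the result.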

Thanks for your help, everyone.
I think I like this approach best: vectorise an anonymous function that outputs a DotMap, then
pass the result to DataFrame. This allows each new column to use previous columns as input.
It would be nice if you could just add new columns to the DataFrameRow variable, but I don’t think this is possible.

new_DF = DataFrame((function (row)
    d = DotMap()

    d.Col_1   =    somelogic( row )
    d.Col_2   =    somelogic( row, d.Col_1 )
     ...
    d.Col_20  =    somelogic( row, d.Col_1, ... d.Col_19 )

    d
end).(eachrow(base_DF)))
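
For anyone who would rather not depend on DotMap, a similar shape seems possible with a plain NamedTuple, as in the NamedTuple suggestion earlier in the thread. This is only a sketch: calc_row, the x column, and the three formulas are hypothetical stand-ins for the real somelogic calls.

```julia
using DataFrames

# Stand-in for the real base_DF (hypothetical data).
base_DF = DataFrame(x = 1:5)

# Hypothetical per-row function: ordinary local variables let later
# columns build on earlier ones, and the NamedTuple field names
# become the column names of the result.
function calc_row(row)
    col_1 = row.x
    col_2 = col_1^2
    col_3 = col_2 + 10
    return (Col_1 = col_1, Col_2 = col_2, Col_3 = col_3)
end

# Broadcasting over eachrow gives a vector of NamedTuples,
# which the DataFrame constructor accepts as a table of rows.
new_DF = DataFrame(calc_row.(eachrow(base_DF)))
```

The trade-off compared with the DotMap version is that the column names have to be repeated in the returned NamedTuple, but there is no extra package and no explicit DataFrame() / append! pair.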