I am creating a DataFrame with a large number of calculated columns.
To make the logic as clear as possible, I am avoiding the vectorised operators and instead putting all of the logic within an eachrow loop.
new_DF = DataFrame()
for r in eachrow(base_DF)
    d = DotMap()
    d.Col1 = somelogic
    d.Col2 = somelogic
    ...
    d.Col20 = somelogic
    append!( new_DF, d )
end
I'd like to avoid the procedural elements `new_DF = DataFrame()` and `append!(new_DF, d)`.
For a single vector there is the Python-style list comprehension [x for x in list].
Is there an equivalent for DataFrames?
Can you create all the fields in a loop like this and have them automatically output as a DataFrame?
Do you mean something like:
julia> DataFrame([[i*2 for i in 12:20] [i+1 for i in 12:20]], ["name_1", "name_2"])
9×2 DataFrame
│ Row │ name_1 │ name_2 │
│     │ Int64  │ Int64  │
├─────┼────────┼────────┤
│ 1   │ 24     │ 13     │
│ 2   │ 26     │ 14     │
│ 3   │ 28     │ 15     │
│ 4   │ 30     │ 16     │
│ 5   │ 32     │ 17     │
│ 6   │ 34     │ 18     │
│ 7   │ 36     │ 19     │
│ 8   │ 38     │ 20     │
│ 9   │ 40     │ 21     │
[A B] corresponds to hcat. You could write hcat([i*2 for i in 12:20], [i+1 for i in 12:20]) instead.
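As a quick check of that equivalence, here is a minimal sketch using only Base Julia. The variable names a, b, m1 and m2 are made up for illustration.

```julia
# Two column vectors built with comprehensions, as in the example above.
a = [i * 2 for i in 12:20]
b = [i + 1 for i in 12:20]

# Space-separated concatenation and an explicit hcat call
# both build the same 9×2 matrix.
m1 = [a b]
m2 = hcat(a, b)

println(m1 == m2)   # the two matrices compare equal
println(size(m1))   # 9 rows, 2 columns
```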
More like this:
DataFrame(
col_1 = 1:100,
col_2 = col_1^2,
col_3 = col_2 + 10
)
Can you just define all the fields, with dependence on each other?
I don't think that you can make fields depend on each other in one line.
It becomes harder and harder to read, but:
julia> DataFrame([collect(1:10) collect(1:10).^2 (collect(1:10).^2).+10])
10×3 DataFrame
│ Row │ x1    │ x2    │ x3    │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 11    │
│ 2   │ 2     │ 4     │ 14    │
│ 3   │ 3     │ 9     │ 19    │
│ 4   │ 4     │ 16    │ 26    │
│ 5   │ 5     │ 25    │ 35    │
│ 6   │ 6     │ 36    │ 46    │
│ 7   │ 7     │ 49    │ 59    │
│ 8   │ 8     │ 64    │ 74    │
│ 9   │ 9     │ 81    │ 91    │
│ 10  │ 10    │ 100   │ 110   │
This works, but now you have to hope that the compiler optimizes away the repeated computations.
Maybe this could be a solution:
julia> col_1 = 1:10; col_2 = col_1.^2; col_3 = col_2 .+ 10;
julia> DataFrame([col_1, col_2, col_3])
10×3 DataFrame
│ Row │ x1    │ x2    │ x3    │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 11    │
│ 2   │ 2     │ 4     │ 14    │
│ 3   │ 3     │ 9     │ 19    │
│ 4   │ 4     │ 16    │ 26    │
│ 5   │ 5     │ 25    │ 35    │
│ 6   │ 6     │ 36    │ 46    │
│ 7   │ 7     │ 49    │ 59    │
│ 8   │ 8     │ 64    │ 74    │
│ 9   │ 9     │ 81    │ 91    │
│ 10  │ 10    │ 100   │ 110   │
julia> col_1 = col_2 = col_3 = nothing;
You can also construct DataFrames with NamedTuples:
DataFrame(
(col_1 = i,
col_2 = i^2,
col_3 = i^2 + 10) for i in 1:100
)
For more complex operations you could define functions beforehand.
func1(x) = x^2
func2(x) = func1(x) + 10
DataFrame(
(col_1 = i,
col_2 = func1(i),
col_3 = func2(i)) for i in 1:100
)
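If later columns need to reuse earlier ones, one option is to move the whole row into a helper that returns a NamedTuple. This is only a sketch: make_row is a hypothetical name, the column logic is copied from the toy example above, and the (; a, b) shorthand assumes Julia 1.5 or newer.

```julia
using DataFrames

# Hypothetical helper: builds one row as a NamedTuple.
# Each local variable can depend on the ones defined before it.
function make_row(i)
    col_1 = i
    col_2 = col_1^2       # uses col_1
    col_3 = col_2 + 10    # uses col_2
    return (; col_1, col_2, col_3)
end

# A generator of NamedTuples is turned into a DataFrame, one row per tuple.
new_DF = DataFrame(make_row(i) for i in 1:100)
```

Each intermediate value is computed once per row, and the field names of the NamedTuple become the column names.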
Depending on exactly what the operations do, you could use select(base_DF, ...) or combine(base_DF, ...).
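As a rough illustration of that route, the sketch below uses transform, the sibling of select that keeps the existing columns. The base_DF contents and the intermediate name step1 are invented for the example, and ByRow assumes a reasonably recent DataFrames.jl. Transformations within a single call all read from the source DataFrame, so a column that depends on a newly created column needs its own call.

```julia
using DataFrames

# Hypothetical input data for the example.
base_DF = DataFrame(col_1 = 1:5)

# First call: derive col_2 from col_1, applying the function row by row.
step1 = transform(base_DF, :col_1 => ByRow(x -> x^2) => :col_2)

# Second call: col_3 depends on col_2, which now exists in step1.
new_DF = transform(step1, :col_2 => ByRow(x -> x + 10) => :col_3)
```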
Thanks for your help, everyone.
I think I like this approach best: vectorise an anonymous function that outputs a DotMap, then
convert the result to a DataFrame. This allows each new column to use previous columns as input.
It would be nice if you could just add new columns to the DataFrameRow variable, but I don't think this is possible.
new_DF = DataFrame((function (row)
    d = DotMap()
    d.Col_1 = somelogic( row )
    d.Col_2 = somelogic( row, d.Col_1 )
    ...
    d.Col_20 = somelogic( row, d.Col_1, ... d.Col_19 )
    d
end).(eachrow(base_DF)))
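For comparison, a similar per-row pattern can be sketched without DotMap by returning a NamedTuple from a do-block. This is not the code above, just a variant under assumptions: the base_DF columns a and b and the arithmetic are placeholders standing in for somelogic.

```julia
using DataFrames

# Hypothetical input; a and b stand in for the real base columns.
base_DF = DataFrame(a = 1:3, b = [10.0, 20.0, 30.0])

# map over the rows; each do-block call returns one NamedTuple.
rows = map(eachrow(base_DF)) do row
    Col_1 = row.a * 2          # depends on the input row only
    Col_2 = Col_1 + row.b      # reuses Col_1 computed just above
    (; Col_1, Col_2)
end

# A vector of NamedTuples converts directly to a DataFrame.
new_DF = DataFrame(rows)
```

Because the columns are ordinary local variables inside the block, each one can refer to any column defined before it, which is the property the DotMap version relies on.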