When I think about using DataFrames or similar data science packages, I always worry about the potential to calculate a column `C` from an existing dataset, then calculate a column `D` from `C`, then later recalculate `C` and leave `D` inconsistent. Here is one way to avoid that with DataFrames, and I wonder what people think of it. Are there any footguns here? Would this be useful for other workflows?
The idea is to make a `LazyCol` with a reference to the DataFrame, the columns it's calculated from, and a function to calculate the values. The values aren't instantiated until queried. I believe something like this is common in databases as a "computed column", but here we have the full power of Julia rather than a database-specific set of functions to work with.
```julia
using DataFrames

struct LazyCol <: AbstractVector{Float64}
    parent      # the DataFrame this column belongs to
    arg_cols    # names of the columns the values are calculated from
    f           # function applied to the argument columns
end

function Base.getindex(v::LazyCol, i)
    arg_df = v.parent[i, v.arg_cols]
    return float.(v.f.(eachcol(arg_df)...))
end

Base.size(v::LazyCol) = (size(v.parent, 1),)

# Make eachcol also work on a DataFrameRow (note: this is type piracy,
# since both the function and the type belong to DataFrames)
DataFrames.eachcol(dfr::DataFrameRow) = values(NamedTuple(dfr))
```
```julia
df = DataFrame(A=1:4, B=["M", "F", "F", "M"], C=5:8);
df.D = LazyCol(df, ["A","C"], (A,C)->A+C);
df.E = LazyCol(df, ["D"], D->D*2);
```

Output:
```
julia> df
4×5 DataFrame
 Row │ A      B       C      D        E
     │ Int64  String  Int64  Float64  Float64
─────┼─────────────────────────────────────────
   1 │     1  M           5      6.0     12.0
   2 │     2  F           6      8.0     16.0
   3 │     3  F           7     10.0     20.0
   4 │     4  M           8     12.0     24.0
```
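Because a `LazyCol` recomputes on every access, edits to a source column show up immediately in the derived one. A quick check of that claim (the definitions are repeated here so the snippet runs standalone, with a compact scalar `getindex` that avoids the `eachcol` overload):

```julia
using DataFrames

struct LazyCol <: AbstractVector{Float64}
    parent      # the DataFrame this column belongs to
    arg_cols    # names of the columns the values are calculated from
    f           # function applied to the argument columns
end
# Look up each argument column at row i and apply f
Base.getindex(v::LazyCol, i::Int) = float(v.f((v.parent[i, c] for c in v.arg_cols)...))
Base.size(v::LazyCol) = (size(v.parent, 1),)

df = DataFrame(A=1:4, C=5:8)
df.D = LazyCol(df, ["A", "C"], (A, C) -> A + C)

df.D[1]          # 6.0
df.C = 10:13     # replace the source column
df.D[1]          # recomputed on access: 1 + 10 = 11.0
```

Note the flip side: every display of the frame re-runs the function for every row, so this trades staleness for repeated computation.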
Then if I re-define `D`, `E` is still 2 times `D`'s value. Also, if I use a callable struct for the function, I have access to a complete graph of the way each column was calculated.
```
julia> df.D = LazyCol(df, ["A","C"], (A,C)->A-C);

julia> df
4×5 DataFrame
 Row │ A      B       C      D        E
     │ Int64  String  Int64  Float64  Float64
─────┼─────────────────────────────────────────
   1 │     1  M           5     -4.0     -8.0
   2 │     2  F           6     -4.0     -8.0
   3 │     3  F           7     -4.0     -8.0
   4 │     4  M           8     -4.0     -8.0
```
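The callable-struct idea could look something like this. It is only a hedged sketch: `Described` and `lineage` are hypothetical names, and `LazyCol` is repeated (with a compact `getindex`) so the snippet runs on its own:

```julia
using DataFrames

struct LazyCol <: AbstractVector{Float64}
    parent
    arg_cols
    f
end
Base.getindex(v::LazyCol, i::Int) = float(v.f((v.parent[i, c] for c in v.arg_cols)...))
Base.size(v::LazyCol) = (size(v.parent, 1),)

# Hypothetical callable struct: behaves like the anonymous functions above,
# but also carries a human-readable description of the computation.
struct Described
    desc::String
    f::Function
end
(d::Described)(args...) = d.f(args...)

# Walk the frame and collect "derived column => (inputs, description)" edges.
lineage(df) = Dict(String(name) => (col.arg_cols, col.f.desc)
                   for (name, col) in pairs(eachcol(df)) if col isa LazyCol)

df = DataFrame(A=1:4, C=5:8)
df.D = LazyCol(df, ["A", "C"], Described("A + C", (A, C) -> A + C))
df.E = LazyCol(df, ["D"], Described("D * 2", D -> D * 2))

lineage(df)["E"]   # (["D"], "D * 2")
```

Since `LazyCol` already stores `arg_cols`, the dependency edges are recoverable even without the wrapper; the callable struct adds a place to hang extra metadata such as the formula text.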