When I think about using DataFrames or similar data science packages, I always worry about the potential to calculate column C from an existing dataset, then calculate column D from C, and then later recalculate C, leaving D inconsistent. Here is one way to avoid that with DataFrames, and I wonder what people think of it. Are there any footguns here? Would this be useful for other workflows?
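For concreteness, here is the failure mode with eagerly computed columns, shown with plain vectors (an illustration; the same thing happens with ordinary DataFrame columns):

```julia
C = [5, 6, 7, 8]
D = C .+ 1      # D computed eagerly from C
C = C .* 10     # C recomputed...
# ...but D still holds the old values, silently inconsistent with the new C
```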
The idea is to make a LazyCol holding a reference to the DataFrame, the columns it's calculated from, and a function to compute the values. The values aren't materialized until queried. I believe something like this is common in databases as a "computed column", but here we have the full power of Julia rather than a database-specific set of functions to work with.
using DataFrames
struct LazyCol <: AbstractVector{Float64}
    parent      # the DataFrame this column belongs to
    arg_cols    # names of the columns the values are computed from
    f           # function applied row-wise to those columns
end

# Compute values on demand: look up the argument columns for index i
# and apply f to them.
function Base.getindex(v::LazyCol, i)
    arg_df = v.parent[i, v.arg_cols]
    return float.(v.f.(eachcol(arg_df)...))
end

Base.size(v::LazyCol) = (size(v.parent, 1),)

# Let a DataFrameRow (from a scalar index) be splatted like eachcol of a
# sub-DataFrame. (Note: this is type piracy on a DataFrames type.)
DataFrames.eachcol(dfr::DataFrameRow) = values(NamedTuple(dfr))
df = DataFrame(A=1:4, B=["M", "F", "F", "M"], C=5:8);
df.D = LazyCol(df, ["A","C"], (A,C)->A+C);
df.E = LazyCol(df, ["D"], D->D*2);
Output:
julia> df
4×5 DataFrame
 Row │ A      B       C      D        E
     │ Int64  String  Int64  Float64  Float64
─────┼────────────────────────────────────────
   1 │     1  M           5      6.0     10.0
   2 │     2  F           6      8.0     12.0
   3 │     3  F           7     10.0     14.0
   4 │     4  M           8     12.0     16.0
Then if I redefine D, E is still 2 times D's value. Also, if I use a callable struct for the function, I have access to a complete graph of how each column was calculated.
julia> df.D = LazyCol(df, ["A","C"], (A,C)->A-C);

julia> df
4×5 DataFrame
 Row │ A      B       C      D        E
     │ Int64  String  Int64  Float64  Float64
─────┼────────────────────────────────────────
   1 │     1  M           5     -4.0     -8.0
   2 │     2  F           6     -4.0     -8.0
   3 │     3  F           7     -4.0     -8.0
   4 │     4  M           8     -4.0     -8.0
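A minimal sketch of that callable-struct idea (the names TrackedFn and deps are hypothetical, not part of the code above): wrapping the function together with the column names it reads means each LazyCol carries its own dependency edges, and following them recovers the full computation graph.

```julia
struct TrackedFn
    deps::Vector{String}   # names of the columns this computation reads
    f::Function            # the actual computation
end

# Make instances callable, so a TrackedFn can be passed anywhere a
# plain function is expected (e.g. as the `f` field of a LazyCol).
(t::TrackedFn)(args...) = t.f(args...)

add_AC   = TrackedFn(["A", "C"], (A, C) -> A + C)
double_D = TrackedFn(["D"], D -> D * 2)

add_AC(1, 5)        # behaves like the wrapped anonymous function
double_D.deps       # one edge of the column-dependency graph
```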