Lazy columns in dataframes?

When I think about using DataFrames or similar data science packages, I always worry about the possibility of calculating column C from an existing dataset, then calculating column D from C, then recalculating C and ending up with D being inconsistent. Here is one way to avoid that with DataFrames, and I wonder what people think of it. Are there any footguns here? Would this be useful for other workflows?

The idea is to make a LazyCol that holds a reference to the dataframe, the columns it's calculated from, and a function to calculate the values. The values aren't instantiated until queried. I believe something like this is common in databases as a "computed column", but here we have the full power of Julia rather than a database-specific set of functions to work with.

using DataFrames

# A lazy column: values are computed from other columns of `parent` on access.
struct LazyCol <: AbstractVector{Float64}
	parent      # the DataFrame this column belongs to
	arg_cols    # names of the columns the values are computed from
	f           # function applied elementwise to those columns
end

# Treat a DataFrameRow's values like columns, so the same getindex code handles
# both scalar and range indexing (note: this is type piracy on DataFrames.jl).
DataFrames.eachcol(dfr::DataFrameRow) = values(NamedTuple(dfr))

function Base.getindex(v::LazyCol, i)
	arg_df = v.parent[i, v.arg_cols]
	return float.(v.f.(eachcol(arg_df)...))
end
Base.size(v::LazyCol) = (nrow(v.parent),)

df = DataFrame(A=1:4, B=["M", "F", "F", "M"], C=5:8);
df.D = LazyCol(df, ["A", "C"], (A, C) -> A + C);
df.E = LazyCol(df, ["D"], D -> D * 2);

output

julia> df
4Γ—5 DataFrame
 Row β”‚ A      B       C      D        E
     β”‚ Int64  String  Int64  Float64  Float64
─────┼────────────────────────────────────────
   1 β”‚     1  M           5      6.0     10.0
   2 β”‚     2  F           6      8.0     12.0
   3 β”‚     3  F           7     10.0     14.0
   4 β”‚     4  M           8     12.0     16.0

Then if I redefine D, E is still 2 times D's value. Also, if I use a callable struct for the function, I have access to a complete graph of how each column was calculated (see the sketch after the output below).

julia> df.D = LazyCol(df, ["A","C"], (A,C)->A-C);

julia> df
4Γ—5 DataFrame
 Row β”‚ A      B       C      D        E
     β”‚ Int64  String  Int64  Float64  Float64
─────┼────────────────────────────────────────
   1 β”‚     1  M           5     -4.0     -8.0
   2 β”‚     2  F           6     -4.0     -8.0
   3 β”‚     3  F           7     -4.0     -8.0
   4 β”‚     4  M           8     -4.0     -8.0

This is a cool idea! It would make a great package that, of course, would not need to rely on DataFrames.jl at all.

I think one thing to consider, though, is what assumptions DataFrames.jl makes about vectors. There might be some methods you need to implement, or else DataFrames.jl might make a copy and materialize the array.

Can you elaborate on that? Do you mean it could be generalized to the Tables.jl API or something?

I was imagining something where LazyCol only holds vectors in its object, and the computations are performed on the vectors directly.

julia> struct LazyCol
           f
           vecs
       end;

julia> function Base.getindex(v::LazyCol, i)
           vecs = v.vecs
           inputs = ntuple(k -> vecs[k][i], length(vecs))
           v.f(inputs...)
       end;

julia> t = LazyCol(-, [[1, 2], [100, 200]]);

julia> t[1]
-99

But that idea doesn’t really work if the input arrays get copied somehow, so you do need to know something about the containing data frame.
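To illustrate, continuing the toy LazyCol above (the point is just that it holds references to the original vectors):

```julia
julia> a = [1, 2]; b = [100, 200];

julia> t = LazyCol(-, [a, b]);

julia> a[1] = 10; t[1]   # in-place mutation is visible to the lazy column
-90

julia> a = copy(a); a[1] = 0; t[1]   # a copy (or rebinding) is not
-90
```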

What about GitHub - JuliaArrays/MappedArrays.jl: Lazy in-place transformations of arrays?

Notice that once you’ve made a mapped array, you can stick it into a DataFrame without problems, since it’s still a <: AbstractVector.
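Something like this, for example (a sketch; note that `mappedarray` wraps the column vectors themselves, so it tracks in-place edits like `df.A[1] = 10`, but not a later `df.A = new_vector` that rebinds the column):

```julia
using DataFrames, MappedArrays

df = DataFrame(A=1:4, C=5:8)

# mappedarray returns a lazy AbstractVector; elements are computed on access
df.D = mappedarray(+, df.A, df.C)

df.D[2]   # 8, computed from df.A[2] and df.C[2] when indexed
```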


In any example, though, unless you are really careful with copycols = false or stick to mutating ! operations, any select(df, ...) that copies vectors is going to break a lazy scheme.
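For instance, a sketch assuming the LazyCol from the first post is defined (the generic copy falls back to materializing into a plain Vector{Float64}):

```julia
df2 = select(df, :)                   # default copycols=true copies every column
typeof(df2.D)                         # Vector{Float64}: the lazy column has been materialized

df3 = select(df, :, copycols=false)   # keeps the original column objects
typeof(df3.D)                         # LazyCol, but its `parent` field still points at df
```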

Also, support for operations like push!, deleteat!, etc. needs to be considered (of course, they could also just be left unsupported, assuming the user does not perform such operations).