When I think about using DataFrames or similar data science packages, I always worry about the potential to calculate a column `C` from an existing dataset, then calculate a column `D` from `C`, then later recalculate `C` and leave `D` inconsistent. Here is one way to avoid that with DataFrames, and I wonder what people think of it. Are there any footguns here? Would this be useful for other workflows?
The idea is to make a `LazyCol` with a reference to the DataFrame, the columns it's calculated from, and a function to calculate the values. The values aren't instantiated until queried. I believe something like this is common in databases as a "computed column", but here we have the full power of Julia rather than a database-specific set of functions to work with.
```julia
using DataFrames

struct LazyCol <: AbstractVector{Float64}
    parent      # the DataFrame this column belongs to
    arg_cols    # names of the columns the values are calculated from
    f           # function applied to the argument columns
end

function Base.getindex(v::LazyCol, i)
    arg_df = v.parent[i, v.arg_cols]
    return float.(v.f.(eachcol(arg_df)...))
end

Base.size(v::LazyCol) = (size(v.parent, 1),)

# Make eachcol also work on a DataFrameRow (note: this is type piracy,
# since both the function and the type belong to DataFrames)
DataFrames.eachcol(dfr::DataFrameRow) = values(NamedTuple(dfr))
```
```julia
df = DataFrame(A=1:4, B=["M", "F", "F", "M"], C=5:8);
df.D = LazyCol(df, ["A","C"], (A,C)->A+C);
df.E = LazyCol(df, ["D"], D->D*2);
```

Output:
```
julia> df
4×5 DataFrame
 Row │ A      B       C      D        E
     │ Int64  String  Int64  Float64  Float64
─────┼─────────────────────────────────────────
   1 │     1  M           5      6.0     12.0
   2 │     2  F           6      8.0     16.0
   3 │     3  F           7     10.0     20.0
   4 │     4  M           8     12.0     24.0
```
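Because a `LazyCol` recomputes on every access, edits to a source column show up immediately in the derived one. A quick check of that claim (the definitions are repeated here so the snippet runs standalone, with a compact scalar `getindex` that avoids the `eachcol` overload):

```julia
using DataFrames

struct LazyCol <: AbstractVector{Float64}
    parent      # the DataFrame this column belongs to
    arg_cols    # names of the columns the values are calculated from
    f           # function applied to the argument columns
end
# Look up each argument column at row i and apply f
Base.getindex(v::LazyCol, i::Int) = float(v.f((v.parent[i, c] for c in v.arg_cols)...))
Base.size(v::LazyCol) = (size(v.parent, 1),)

df = DataFrame(A=1:4, C=5:8)
df.D = LazyCol(df, ["A", "C"], (A, C) -> A + C)

df.D[1]          # 6.0
df.C = 10:13     # replace the source column
df.D[1]          # recomputed on access: 1 + 10 = 11.0
```

Note the flip side: every display of the frame re-runs the function for every row, so this trades staleness for repeated computation.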
Then if I re-define `D`, `E` is still 2 times `D`'s value. Also, if I use a callable struct for the function, I have access to a complete graph of the way each column was calculated.
```
julia> df.D = LazyCol(df, ["A","C"], (A,C)->A-C);

julia> df
4×5 DataFrame
 Row │ A      B       C      D        E
     │ Int64  String  Int64  Float64  Float64
─────┼─────────────────────────────────────────
   1 │     1  M           5     -4.0     -8.0
   2 │     2  F           6     -4.0     -8.0
   3 │     3  F           7     -4.0     -8.0
   4 │     4  M           8     -4.0     -8.0
```
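The callable-struct idea could look something like this. It is only a hedged sketch: `Described` and `lineage` are hypothetical names, and `LazyCol` is repeated (with a compact `getindex`) so the snippet runs on its own:

```julia
using DataFrames

struct LazyCol <: AbstractVector{Float64}
    parent
    arg_cols
    f
end
Base.getindex(v::LazyCol, i::Int) = float(v.f((v.parent[i, c] for c in v.arg_cols)...))
Base.size(v::LazyCol) = (size(v.parent, 1),)

# Hypothetical callable struct: behaves like the anonymous functions above,
# but also carries a human-readable description of the computation.
struct Described
    desc::String
    f::Function
end
(d::Described)(args...) = d.f(args...)

# Walk the frame and collect "derived column => (inputs, description)" edges.
lineage(df) = Dict(String(name) => (col.arg_cols, col.f.desc)
                   for (name, col) in pairs(eachcol(df)) if col isa LazyCol)

df = DataFrame(A=1:4, C=5:8)
df.D = LazyCol(df, ["A", "C"], Described("A + C", (A, C) -> A + C))
df.E = LazyCol(df, ["D"], Described("D * 2", D -> D * 2))

lineage(df)["E"]   # (["D"], "D * 2")
```

Since `LazyCol` already stores `arg_cols`, the dependency edges are recoverable even without the wrapper; the callable struct adds a place to hang extra metadata such as the formula text.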