Creating an AbstractDataFrame subtype

I wish to create a new subtype of AbstractDataFrame that has a single DataFrame member. I then want to forward the AbstractDataFrame interface to the DataFrame member. Something like:

using DataFrames
using Lazy

struct DFWrapper <: AbstractDataFrame
    df::DataFrame
    other
    stuff
end

@forward DFWrapper.df DataFrames.describe, Base.summary, Base.hcat, Base.vcat,
    Base.repeat, Base.names, DataFrames.rename!, Base.length, Base.size,
    Base.first, Base.last, Base.convert, DataFrames.completecases,
    DataFrames.dropmissing, DataFrames.dropmissing!, DataFrames.nonunique,
    Base.unique, Base.unique!, DataFrames.disallowmissing, DataFrames.disallowmissing!,
    DataFrames.allowmissing, DataFrames.allowmissing!, DataFrames.categorical, DataFrames.categorical!,
    Base.similar, Base.filter, Base.filter!

DFW = DFWrapper(DataFrame(A=1:10), 1.0, false)

I obtained the list of methods being forwarded from here.

Running in the REPL, I get:

Error showing value of type DFWrapper:
ERROR: StackOverflowError:
Stacktrace:
 [1] _check_consistency(::DFWrapper) at /dataframe/dataframe.jl:296 (repeats 80000 times)

Line 296 is

_check_consistency(df::AbstractDataFrame) = _check_consistency(parent(df))

Forwarding DataFrames._check_consistency then produces

Error showing value of type DFWrapper:
ERROR: MethodError: no method matching getindex(::DFWrapper, ::typeof(!), ::Symbol)
Closest candidates are:
  getindex(::AbstractDataFrame, ::Integer, ::Union{Regex, AbstractArray{T,1} where T, All, Between, InvertedIndex}) at /Users/gerlacar/.julia/packages/DataFrames/S3ZFo/src/dataframerow/dataframerow.jl:90
  getindex(::AbstractDataFrame, ::Integer, ::Colon) at /Users/gerlacar/.julia/packages/DataFrames/S3ZFo/src/dataframerow/dataframerow.jl:92
  getindex(::AbstractDataFrame, ::CartesianIndex{2}) at /Users/gerlacar/.julia/packages/DataFrames/S3ZFo/src/other/broadcasting.jl:3

I feel like I am entering a rabbit hole and that there has to be a better way to do this. Any suggestions? Thanks.

What are you trying to accomplish here? The DataFrames interface is not trivial, and to be honest it’s a lot of work to make a sub-dataframe type.

I think, but am not certain, that you need to implement getproperty yourself; it’s not on the list of forwarded functions. You can see that in the file you linked to they define a lot of functions themselves.

Why does it even need to be a subtype? Your code works if you just drop <: AbstractDataFrame.

EDIT: Well… most of the functions work. Some do not.

You can get a rough idea of what is needed to define a custom AbstractDataFrame by looking at the methods that are implemented for SubDataFrame. In general, as @pdeffebach noted, although this is doable it is really hard to do (AbstractDataFrame has a lot of “convenience” functionality and “special cases” that were added over the years at the request of users, and this is simply hard to reproduce with 100% accuracy quickly).

This is a great question. I will admit that I took a bit of a random walk to get here, so a course correction may be in order. I am writing this for a specific internal application, so I don’t actually need to implement the entire DataFrame interface.

Maybe it’s best to explain my path here. I have data stored in CSV files from various tests that I load into a DataFrame. I want to be able to dispatch on this particular data for various plotting recipes, so I made a struct like:

struct DFWrapper
    df::DataFrame
    other
    stuff
end

where other and stuff are common metadata for test data, e.g. test date/time.

I would also like the ability to query and filter DFWrapper.df with tools like Query.jl and DataFramesMeta.jl. That led me to this. I thought “that sounds easy enough”… and here I am. At this point, given the feedback, I don’t think it’s worth doing this for the project at hand.

I am all ears for better ideas. In essence, I want something that stores a DataFrame, acts like a DataFrame, but that I can also dispatch on.

I suggest that when you want to treat your object dw::DFWrapper as a DataFrame, you just use dw.df and work with the DataFrame field directly.

Invenia has a package that subtypes DataFrames, and it’s just a bit too much maintenance work for too little gain, so we are looking to retire it.
https://github.com/invenia/KeyedFrames.jl/issues/19

100% you want to just work with the DataFrame directly and use x.df when you do things to your data frame.
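
As a rough illustration of that advice, here is a minimal sketch, assuming DataFrames.jl is installed. The function describe_test is a made-up name, used only to show dispatch on the wrapper; the wrapper itself is the plain struct from earlier in the thread.

```julia
using DataFrames

# Plain wrapper: deliberately NOT a subtype of AbstractDataFrame.
struct DFWrapper
    df::DataFrame
    other
    stuff
end

# Hypothetical function that dispatches on the wrapper type.
describe_test(w::DFWrapper) = "$(nrow(w.df)) rows, other = $(w.other)"

dfw = DFWrapper(DataFrame(A = 1:10), 3.0, true)

# Work on the inner DataFrame directly, then re-wrap with the same metadata.
filtered = filter(:A => >(5), dfw.df)
dfw_filtered = DFWrapper(filtered, dfw.other, dfw.stuff)
```

The metadata travels with the wrapper, while every DataFrames operation sees an ordinary DataFrame, so nothing needs to be forwarded.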

This makes a lot of sense. I wondered if I was trying to be cute with this.

I saw that the interface for DataFrames for Query was pretty simple. So, as an exercise I tried extending that to my type and this seems pretty workable.

using DataFrames
using IterableTables
using Query
using Lazy

struct DFWrapper
    df::DataFrame
    other
    stuff
end

# Source
IterableTables.IteratorInterfaceExtensions.isiterable(x::DFWrapper) = true
IterableTables.TableTraits.isiterabletable(x::DFWrapper) = true

function IterableTables.TableTraits.getiterator(dfw::DFWrapper)
    return IterableTables.TableTraitsUtils.create_tableiterator(getfield(dfw.df, :columns), names(dfw.df))
end

# Sink
function _DFWrapper(dfw::DFWrapper, x)
    cols, names = IterableTables.create_columns_from_iterabletable(x, na_representation=:missing)
    df = DataFrames.DataFrame(cols, names)
    return DFWrapper(df,dfw.other, dfw.stuff)
end

DFWrapper(dfw::DFWrapper, x::AbstractVector{T}) where {T<:NamedTuple} = _DFWrapper(dfw, x)

function DFWrapper(dfw::DFWrapper, x)
    if IterableTables.TableTraits.isiterabletable(x)
        return _DFWrapper(dfw, x)
    else
        df = convert(DataFrames.DataFrame, x)
        return DFWrapper(df, dfw.other, dfw.stuff)
    end
end

function DFWrapper(dfw::DFWrapper)
    return x->DFWrapper(dfw, x)
end

DFW = DFWrapper(DataFrame(A=1:10), 3.0, true)

x = DFW |>
  @filter(_.A>5) |> DFWrapper(DFW) # produces DFWrapper(DataFrame(A=6:10), 3.0, true)

Yeah that’s definitely a solution. You can also implement the Tables.jl interface and get Query for free.
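
For reference, a minimal sketch of what the Tables.jl route might look like, assuming both DataFrames.jl and Tables.jl are installed. It simply delegates the column-access part of the interface to the wrapped DataFrame; whether Query.jl then accepts the wrapper without further work is not verified here.

```julia
using DataFrames, Tables

struct DFWrapper
    df::DataFrame
    other
    stuff
end

# Declare the wrapper a column-accessible table and delegate to the inner DataFrame.
Tables.istable(::Type{DFWrapper}) = true
Tables.columnaccess(::Type{DFWrapper}) = true
Tables.columns(w::DFWrapper) = Tables.columns(w.df)
Tables.schema(w::DFWrapper) = Tables.schema(w.df)

w = DFWrapper(DataFrame(A = 1:10), 3.0, true)

# Any Tables.jl sink can now consume the wrapper, e.g. the DataFrame constructor.
roundtrip = DataFrame(w)
as_named_tuple = Tables.columntable(w)
```

Note that the metadata fields (other, stuff) are not part of the table, so a sink only ever sees the columns.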

But still, this seems like a bit of overkill.

I’m going to necromance this thread.

It would be useful to define

abstract type SpecializedDataFrame <: AbstractDataFrame end
get_df(special_df::SpecializedDataFrame) = special_df.df
# generically forward methods
# e.g.:
#     getindex(specdf::SpecializedDataFrame, ...) = getindex(get_df(specdf), ...)

Which would allow people to conveniently define AbstractDataFrame datatypes in which a DataFrame is a field. All methods defined for a DataFrame would be applied to a SpecializedDataFrame by using the get_df() function.

I think this would be very helpful for extending dataframe functionality.

In my case, I want to construct a Panel data frame. A panel has two units of observation: a nominal one (e.g. person id) and an ordinal one (e.g. month, or survey number). Panel data permits many statistical operations that are not feasible with time series, cross-sections, or repeated cross sections.

All functions defined on a dataframe are applicable to a panel. But additional functions are applicable to panels, especially computing lags. (Julia has a FixedEffectModels.jl package, which demeans the target variable after grouping by id. This is one way to compute fixed effects, but it has different asymptotic properties than taking first differences and is not always the correct way to do it. Taking first differences requires computing lags.)

If the SpecializedDataFrame abstract type existed and forwarded methods appropriately, it would be nearly trivial to develop a PanelDataFrames package that extends DataFrames to exploit this data structure. It would implement lag operators, allow for lag operators in @formula syntax, and other things, e.g.:

abstract type AbstractPanel <: SpecializedDataFrame end
struct PanelDataFrame <: AbstractPanel
    df::DataFrame
    panel_indices::Vector{Vector{Int}}
end
lag(pdf::PanelDataFrame, var::Symbol, lags=1) = #etc

There are many other data structures that can be similarly exploited, so the SpecializedDataFrame type could get a lot of use helping to implement different packages.
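
To make the lag idea concrete, here is a rough sketch that does not rely on the proposed SpecializedDataFrame (which does not exist). It uses a plain, non-subtyped wrapper and assumes DataFrames.jl is installed. The three-argument constructor and the name panel_lag are my own illustrative choices, not part of any package.

```julia
using DataFrames

# Plain wrapper (no subtyping). panel_indices[i] holds the row numbers of
# unit i, ordered by time.
struct PanelDataFrame
    df::DataFrame
    panel_indices::Vector{Vector{Int}}
end

# Hypothetical constructor: sort by id and time, then record each unit's rows once.
function PanelDataFrame(df::DataFrame, id::Symbol, time::Symbol)
    sorted = sort(df, [id, time])
    ids = sorted[!, id]
    indices = [findall(==(u), ids) for u in unique(ids)]
    return PanelDataFrame(sorted, indices)
end

# Hypothetical lag operator: shifts within each unit, never across units.
function panel_lag(p::PanelDataFrame, var::Symbol, lags::Int = 1)
    col = p.df[!, var]
    out = Vector{Union{Missing, eltype(col)}}(missing, length(col))
    for rows in p.panel_indices, k in (lags + 1):length(rows)
        out[rows[k]] = col[rows[k - lags]]
    end
    return out
end

raw = DataFrame(id = [2, 1, 1, 2, 1], t = [1, 2, 1, 2, 3],
                y = [20.0, 11.0, 10.0, 21.0, 12.0])
panel = PanelDataFrame(raw, :id, :t)
y_lag = panel_lag(panel, :y)
```

The stored indices are what saves recomputing the grouping on every call; everything else is ordinary DataFrame access through panel.df.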

This is a good idea, panel data is an important structure for the Julia data ecosystem to support.

I think the main problem is that there isn’t a well defined AbstractDataFrame API. What list of functions does a type need to implement? No one has compiled this list, which will make it very hard to implement a <:AbstractDataFrame. This is something that should be done eventually, but requires significant work.

I also think a lot of what you want can be done with grouped data frames. See DataFramesMeta, for example:

julia> begin
       using DataFramesMeta, Dates, ShiftedArrays
       years = 1985:2000
       state_ids = 1:50
       df = DataFrame(year = Int[], state_id = Int[])
       for i in vec(collect(Base.Iterators.product(years, state_ids)))
           push!(df, i)
       end
       sort!(df, [:year, :state_id])
       df.y = randn(nrow(df))
       @by df :state_id :y_lag = lag(:y)
       end

Obviously it would be nice not to have to re-compute the indices of all the groups every time, which is presumably the whole point of a specialized panel data frame. But if you have a persistent GroupedDataFrame it might make things easier.
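
A small sketch of the persistent GroupedDataFrame idea, assuming DataFrames.jl is installed. To keep it dependency-free it uses a hand-written lag1 helper (my own, hypothetical) rather than ShiftedArrays.lag.

```julia
using DataFrames

# Hypothetical one-step lag helper; ShiftedArrays.lag would serve the same purpose.
lag1(v) = [missing; v[1:end-1]]

df = DataFrame(state_id = repeat(1:3, inner = 4),
               year = repeat(1985:1988, outer = 3))
df.y = Float64.(1:nrow(df))

# Group once and keep the GroupedDataFrame around.
gdf = groupby(df, :state_id)

# Reuse the same grouping for several operations.
lagged = transform(gdf, :y => lag1 => :y_lag)
sizes = combine(gdf, nrow => :n_years)
```

Both calls reuse the grouping stored in gdf, so the groups are computed only once.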

As I have commented several times, creating such a list is not a problem. Just do:

methodswith(DataFrame)

and you will see what needs to be implemented. You do not have to implement everything if you do not want or need to.
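
For anyone following along outside the REPL, a short sketch of that call, assuming DataFrames.jl is installed. In a script, methodswith comes from the InteractiveUtils standard library; the supertypes keyword additionally includes methods written against AbstractDataFrame.

```julia
using DataFrames, InteractiveUtils

# Methods whose signature mentions DataFrame itself.
direct = methodswith(DataFrame)

# Also include methods defined for supertypes such as AbstractDataFrame.
inherited = methodswith(DataFrame; supertypes = true)

# Collapse to a sorted list of distinct function names for easier reading.
function_names = sort!(unique([m.name for m in inherited]))

println(length(direct), " methods mention DataFrame directly")
println(length(function_names), " distinct function names including supertypes")
```

The counts depend on the installed DataFrames version and on which other packages are loaded, so treat the list as a starting point rather than a specification.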

The effort that would have to be made is to track down the places in the code (there are several) where we explicitly assume that the only AbstractDataFrame types are DataFrame or SubDataFrame, and which might work incorrectly if a third type were introduced. If you feel it is worthwhile, please open an issue, and we can plan to fix all such places.

Ah, and if you wanted a SubDataFrame to be stored inside your wrapper, you would need to handle it separately.

A follow up question on this.

How is the constructor working (or not working, as it were) in the example OP presented?

As a silly example, I can execute

struct NumberWrapper <: Number 
    my_number::Number
end 
NumberWrapper(1) 

without throwing an exception.

But

struct DataFrameWrapper <: AbstractDataFrame 
    df::DataFrame 
end 
df = DataFrame([1 2; 3 4])
DataFrameWrapper(df) 

generates the stack overflow error OP mentions. Why is _check_consistency even called at all when constructing a DataFrameWrapper? Does the DataFrames.jl package somehow define methods for default constructors on AbstractDataFrames (e.g. something like writing a specialized method of Base.new(df::AbstractDataFrame) for AbstractDataFrame types, or something)?

The error message only refers to one line – I don’t know what calls _check_consistency. I cannot find information in the Julia documentation for constructors that can help to explain this. It seems that there is some crucial information about how Julia constructors work under the hood that I do not understand.

Thanks!

This is an understandable StackOverflowError.

_check_consistency(df::AbstractDataFrame) = _check_consistency(parent(df))

combined with parent(df::AbstractDataFrame)

julia> df = DataFrame(a = [1, 2])
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2

julia> parent(df)
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2

julia> @which parent(df)
parent(adf::AbstractDataFrame) in DataFrames at C:\Users\Peter\.julia\packages\DataFrames\nxjiD\src\abstractdataframe\abstractdataframe.jl:1898

That’s what’s causing the StackOverflowError.

Defining parent(df::DataFrameWrapper) fixes this problem (and gets you to printing errors, which are easily fixable).

Yes, for a custom data frame type you need to define parent if it is a wrapper around a data frame. parent does not need to be defined if your custom data frame stores its columns directly in some way.
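
To see the mechanism in isolation, here is a toy model in plain Julia. It is not DataFrames.jl code: AbstractFrame, Frame, FrameWrapper, frame_parent and check_consistency are made-up stand-ins that mimic the two definitions quoted above.

```julia
# Toy stand-ins for AbstractDataFrame and DataFrame; not DataFrames.jl code.
abstract type AbstractFrame end

struct Frame <: AbstractFrame
    columns::Vector{Vector{Int}}
end

struct FrameWrapper <: AbstractFrame
    df::Frame
end

# Generic fallback: a frame is its own parent.
frame_parent(f::AbstractFrame) = f

# The generic check forwards to the parent; the concrete type does the real work.
check_consistency(f::AbstractFrame) = check_consistency(frame_parent(f))
check_consistency(f::Frame) =
    all(c -> length(c) == length(f.columns[1]), f.columns)

w = FrameWrapper(Frame([[1, 2], [3, 4]]))

# With only the fallback, the wrapper's parent is the wrapper itself, so
# check_consistency(w) would keep calling itself on the same object.
fallback_returns_self = invoke(frame_parent, Tuple{AbstractFrame}, w) === w

# The fix: a more specific method that unwraps one level.
frame_parent(w::FrameWrapper) = w.df

consistent = check_consistency(w)
```

For the wrapper in this thread, the analogous definition would presumably be Base.parent(w::DataFrameWrapper) = w.df, which is the fix described above.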

Thanks!

I understand where the error comes from now. The inner constructor new(...) does not throw the exception. Rather, when a value is constructed at the REPL, Base.show is called to display it. Base.show(io::IO, df::AbstractDataFrame) calls _show, which calls _check_consistency(df::AbstractDataFrame). Since _check_consistency(df::AbstractDataFrame) = _check_consistency(parent(df)), and parent(df::AbstractDataFrame) = df, we have an obvious stack overflow.

The calls to show and _show were not showing up on my stacktrace, so I had assumed the inner constructor was to blame and not the functions called to display what was just (successfully) constructed.

Thanks again!

This is indeed what happens.

I may be missing something, but I think that DataFrame should be able to handle panel data just fine. IDs and survey information just become additional columns.

(If this is not the case, an MWE would help focus this discussion.)

I think the point is about making time/unit-based operations a bit more convenient, the way they are in, say, Stata once you’ve used tsset. E.g. you might have lag(::PanelDataFrame, ::String, n) give a lagged column respecting the groups, which currently would require going via groupby (which I personally find unproblematic, but I can understand people who do this a lot and are used to working in Stata being a bit put off by some of the verbosity of DataFrames here). It’d certainly be interesting to see some attempts at innovation in this space!

That said, this thread also gives me the opportunity to bring up my favourite bit of pandas trivia, which is that pandas is actually named after panel data, and did indeed feature a Panel class as an alternative to the usual DataFrame early on. I remember trying to use it in the early days of my PhD (maybe ~10 years ago?), but found it unnecessarily cumbersome, and indeed it was deprecated a few years later (apparently in 2017). Maybe that is a cautionary tale with respect to the benefits of having panel data as a distinct type, but at the same time Julia isn’t Python, of course, and might allow for a better design!