Query.jl: How can I write type-preserving methods for iterable tables?

question

#1

In MLJ.jl, we are trying to write a data-agnostic machine-learning framework and are presently trying to make this work using Query, to which I am new.

We would like to write methods that take an iterable table as input and that output a table of identical type (assuming the type is sink-supported). For example, a function to standardise the numerical features (columns) of some table, or a function to project the table onto a smaller training set of rows. If I give my method a DataFrame, I want a DataFrame as output. If I give it a TypedTable, then the output should be a TypedTable.

Here’s my attempt at a function to select a subset of columns from some Query iterable table, returning an object of the same type:

function getcols(X::T, c::AbstractArray{I}) where {T,I<:Union{Symbol,Integer}}

    TableTraits.isiterabletable(X) || error("Argument is not an iterable table.")

    row_iterator = @from row in X begin
        @select project(row, c)
        @collect T
    end
end

Here project(row, c) is just the projection of the named tuple row onto a named tuple with only those labels/indices specified by c.

Now getcols works for a DataFrame but not for, say, a TypedTable. The problem is that TypedTable has type parameters, and these will be different for the output than for the input. In other words, T in this case is TypedTables.Table{NamedTuple{(:x1, :x2........ in the signature, but T in the @collect statement needs to be just TypedTable, without parameters, to work (I guess).

So what’s a way to do this that works?


#2

BTW: In Tables.jl there is a way to do this using the materializer function (Tables.materializer), which returns the sink function for a given table or table type.
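
For illustration, here is a minimal sketch of how that might look. The helper name selectcols is made up for this example, and it assumes the input implements the Tables.jl interface and that its table type defines Tables.materializer (types that do not will fall back to a named tuple of column vectors):

```julia
using Tables

# Hypothetical helper (not part of Tables.jl): keep only the columns named in `c`
# and rebuild a table of the same kind as the input.
function selectcols(X, c::AbstractVector{Symbol})
    Tables.istable(X) || error("Argument is not a Tables.jl table.")
    cols = Tables.columns(X)
    # A NamedTuple of column vectors is itself a valid Tables.jl source.
    selected = NamedTuple{Tuple(c)}(Tuple(Tables.getcolumn(cols, name) for name in c))
    # Tables.materializer(X) returns the sink function associated with X's table type.
    return Tables.materializer(X)(selected)
end

# Usage with a plain column table (a NamedTuple of vectors):
X = (x1 = [1, 2, 3], x2 = [4.0, 5.0, 6.0], x3 = ["a", "b", "c"])
Y = selectcols(X, [:x1, :x3])
```

Because the sink is looked up from the input rather than written into the method, the type parameters of the output are computed from the selected columns instead of being copied from the input.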


#3

Sorry for not catching this earlier; somehow it slipped by.

In general the various Query.jl operators quite consciously avoid this kind of design, where the output type in some way is determined by the input type. The “why” is probably easier to see with the method piping syntax:

source |> @select(:colA, :colB) |> @filter(_.colA=="foo") |> DataFrame

This is a pretty canonical query, expressed as a pipeline. Now, one could of course design this so that @select returns a DataFrame if source is a DataFrame, and so on. But that would be quite inefficient: here we want to make sure that the projection and the filter get fused into one loop only, and that we don’t allocate an intermediate DataFrame that then gets passed on to the @filter operation.

So, instead, all operations in Query.jl are lazy. If you just run source |> @select(:colA, :colB) |> @filter(_.colA=="foo"), no data is actually touched. All we are constructing is a chained series of lazy iterators that can be executed at some point, and then everything gets fused. What triggers execution of this query is that it gets passed to DataFrame (or some other table type, or a simple collect call).
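
A small sketch of the above, assuming Query.jl and DataFrames.jl are installed. The sample data and the helper sink_like are made up for illustration; sink_like is not part of Query.jl, and it relies on Base.typename(...).wrapper, an internal detail of Julia, to strip the type parameters from the input's type. Whether the resulting type accepts a query as a constructor argument depends on the table package:

```julia
using Query, DataFrames

df = DataFrame(colA = ["foo", "bar", "foo"], colB = [1, 2, 3])

# Building the pipeline does not iterate over `df`; `q` is a lazy query object.
q = df |> @select(:colA, :colB) |> @filter(_.colA == "foo")

# Execution happens only when the query is handed to a sink.
result = q |> DataFrame   # materialise as a DataFrame
rows = q |> collect       # or as a Vector of named tuples

# Hypothetical helper (an assumption, not Query.jl API): pick the sink from the
# input's type with its parameters stripped, so the caller gets the same kind of table.
sink_like(X) = Base.typename(typeof(X)).wrapper

same_kind = q |> sink_like(df)
```

The point of the design is that the caller, not the operators, chooses the sink, and does so once at the end of the pipeline, so no intermediate table is allocated between @select and @filter.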