Enforcing Schema on Data Frame Passed as Function Argument

I have been working on some code where DataFrame structures seem to apply quite well. However, I was wondering if it is possible to enforce, at a minimum, a certain pattern of data types, if not both data types and column names, for a DataFrame that is passed as an argument to a function. An example of the current behavior and what I would like to happen is below:

using DataFrames
using Dates

# This is what is available, as far as I can tell
function test1(df::DataFrame)
    println(first(df, 1))
end

# I have tried this syntax but it does not work; for illustrative purposes only
function test2(df::DataFrame{Date, Int64, Int64, Int64, Int64})
    println(first(df, 1))
end

function main()
    test_df1 = DataFrame("A" => Date("2020"), "B" => 1, "C" => 2,
                         "D" => 3, "E" => 4)
    test_df2 = DataFrame("A" => Date("2020"), "B" => 1, "C" => 2,
                         "D" => 3)
    test1(test_df1) # This should work fine
    test1(test_df2) # This should work fine too
    test2(test_df1) # This should work fine
    test2(test_df2) # This should not work, as the schema does not match
end

main()

Is this something that is possible with a DataFrame? If not, what are the potential alternatives? I have considered defining a custom type and then requiring an array holding that type, but I like how DataFrames make it easy to pull out ranges of rows based on tests of one column, which is more difficult with other structures. Plus, I can add columns as I please.

DataFrame is not typed like that, so it won’t work. You will need to check each column and do something like:

julia> tt = eltype.(eachcol(test_df2))
4-element Array{DataType,1}:
 Date
 Int64
 Int64
 Int64

but this is more or less meaningless, because you can permute the order of the columns and it will still be the same dataframe (to me at least), as long as each column (given its name) has the same element type as before.

The compare function in Schemata.jl compares a table to a schema, where the schema can be specified in code but is more easily specified as a YAML file.

Your other option is to use a data structure that carries the full type information of the columns, e.g.
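One such structure, sketched here with only Base Julia: a NamedTuple of column vectors carries the column names and element types in its type, so a method can dispatch on the exact schema (TypedTables.jl's Table works on the same principle). The names `MySchema`, `tbl_ok`, and `tbl_bad` below are illustrative only, mirroring the original question.

```julia
using Dates

# A NamedTuple of column vectors encodes names and eltypes in its type,
# so dispatch can enforce an exact schema.
const MySchema = NamedTuple{(:A, :B, :C, :D, :E),
                            Tuple{Vector{Date}, Vector{Int64}, Vector{Int64},
                                  Vector{Int64}, Vector{Int64}}}

test2(tbl::MySchema) = println(first(tbl.A))

tbl_ok  = (A = [Date("2020")], B = [1], C = [2], D = [3], E = [4])
tbl_bad = (A = [Date("2020")], B = [1], C = [2], D = [3])
test2(tbl_ok)    # dispatches fine
# test2(tbl_bad) # MethodError: no matching method, schema does not match
```

The trade-off is that every distinct schema is a distinct type, so generic code gets recompiled per schema.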


I had the same issue:

My solution was to define a Dict with the required column names and types (abstract supertypes, where applicable) and a function to check if they are in the DataFrame (there may be more columns in the DataFrame, which is fine for me):

const INPUT_ELTYPES = Dict(
    :field_a => AbstractString,
    :field_b => ItemTypes,  # application-specific type, defined elsewhere
    :field_c => AbstractString,
    :field_x => Union{AbstractString, Nothing},
    :field_y => Real,
    :value => Union{Real, Missing},
)

const COLUMN_NAMES = collect(keys(INPUT_ELTYPES))

function check_input_data(df::AbstractDataFrame)
    @assert COLUMN_NAMES ⊆ propertynames(df)
    for (col_name, col_type) in INPUT_ELTYPES
        @assert eltype(df[!, col_name]) <: col_type
    end
end

Maybe something like this could be added to DataFrames.jl, or a typed data frame could be offered as an alternative type. The latter could be just a thin wrapper over the standard DataFrame type (with untyped columns) to avoid recompilation, with the sole purpose of defining schemas.
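A minimal sketch of that thin-wrapper idea, assuming a hypothetical `SchemaDF` type (not an actual DataFrames.jl API): the type parameter is only a schema label, so methods dispatch on the label while the wrapped DataFrame keeps its ordinary untyped columns.

```julia
using DataFrames, Dates

# Hypothetical thin wrapper: the parameter S is just a schema label; the
# wrapped DataFrame keeps ordinary untyped columns, so methods taking
# SchemaDF{S} are not recompiled for every column-type combination.
struct SchemaDF{S}
    df::DataFrame
end

# Validate once, at construction, against a name => eltype mapping.
function SchemaDF{S}(df::DataFrame, eltypes::Dict{Symbol,<:Type}) where {S}
    for (col, T) in eltypes
        hasproperty(df, col) || error("missing column $col")
        eltype(df[!, col]) <: T || error("column $col is not <: $T")
    end
    return SchemaDF{S}(df)
end

const INPUT = Dict(:A => Date, :B => Integer)  # hypothetical schema
checked = SchemaDF{:input}(DataFrame(A = [Date("2020")], B = [1]), INPUT)
consume(x::SchemaDF{:input}) = nrow(x.df)  # dispatch on the label only
```

The check runs once at construction; downstream functions trust the label.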

I would recommend a trait using the Tables.schema interface:

using Tables, DataFrames

f(table) = f_typed(Tables.schema(table), table)
f_typed(::Tables.Schema{(:a, :b), Tuple{Int64, Char}}, table) = "hey"

df = DataFrame(a = [1], b = ['x'])
f(df) # "hey"

There are a couple of other options that haven’t been mentioned.

Wrap a data frame in a struct

You could create various custom types with constructors that represent different tables in your data flow. These constructors can provide some of the safety you are looking for. However, I’ve tried this before, and the thing I didn’t like about it is that all the actions become nouns (the constructors) rather than verbs.
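A sketch of that pattern, with a hypothetical `ScheduleTable` type and an assumed two-column schema: the inner constructor validates once, so every instance in the program is known to be well-formed.

```julia
using DataFrames, Dates

# Hypothetical wrapper: the inner constructor rejects any DataFrame whose
# columns don't match the expected names and element types.
struct ScheduleTable
    df::DataFrame
    function ScheduleTable(df::DataFrame)
        propertynames(df) == [:A, :B] || error("unexpected column names")
        eltype(df.A) <: Date || error("column A must hold Dates")
        eltype(df.B) <: Integer || error("column B must hold integers")
        return new(df)
    end
end

# Downstream functions then just require the wrapper type:
total(t::ScheduleTable) = sum(t.df.B)

t = ScheduleTable(DataFrame(A = [Date("2020"), Date("2021")], B = [1, 2]))
total(t)  # 3
```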

Let it fail

This is the approach I’ve taken. When I’m writing internal code (in other words, code that is not user facing), I can just assume that I, the programmer, am smart enough to provide the right kind of data frame. In fact, this approach applies to a lot of things. When you are writing internal code, you have to assume to some degree that the developer (not the user) knows what they are doing and is providing the correct input to functions. You typically don’t fill your code with assertions (@assert) in every function.


I think for now I will use @CameronBieganek’s suggestion to simply let it fail. The code I am working on should not be user facing (I think). That said, something along the lines of what @lungben described would be best for my particular application: I also don’t care if there are extra columns, or if they are in a different order, only that a minimum set for the given function is present.
