I have been working on some code where DataFrame structures seem to apply quite well. However, I was wondering whether it is possible to enforce, at a minimum, a certain pattern of data types (and ideally both data types and column names) for a DataFrame that is passed as an argument to a function. Below is an example showing the current behavior and what I would like to happen:
using DataFrames
using Dates
# This is what is available, as far as I can tell
function test1(df::DataFrame)
    println(first(df, 1))
end
# I have tried this syntax but it does not appear to work, for illustrative purposes only
function test2(df::DataFrame{Date, Int64, Int64, Int64, Int64})
    println(first(df, 1))
end
function main()
    test_df1 = DataFrame("A" => Date("2020"), "B" => 1, "C" => 2,
                         "D" => 3, "E" => 4)
    test_df2 = DataFrame("A" => Date("2020"), "B" => 1, "C" => 2,
                         "D" => 3)
    test1(test_df1) # This should work fine
    test1(test_df2) # This should work fine too
    test2(test_df1) # This should work fine
    test2(test_df2) # This should not work, as the schema does not match
end

main()
Is this something that is possible with a DataFrame? If not, what are the potential alternatives? I have considered defining a custom type and then requiring an array holding that type, but I like how DataFrames make it easy to pull out ranges of rows based on tests of one column, which is more difficult with other structures. Plus, I can add columns as I please.
DataFrame is not parameterized by its column types, so that won't work. You will need to check each column yourself, with something like:
julia> tt = eltype.(eachcol(test_df2))
4-element Array{DataType,1}:
 Date
 Int64
 Int64
 Int64
But checking the types by position is more or less meaningless, because you can permute the order of the columns and it will still be the same data frame (to me at least), as long as each named column has the same type as before.
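One way to make the comparison independent of column order is to key the element types by column name rather than by position. The following is a minimal sketch, not something DataFrames.jl provides directly: the helper name `column_types` is hypothetical, and the data mirrors `test_df2` from the question.

```julia
using DataFrames
using Dates

# Hypothetical helper: map each column name to its element type.
column_types(df::AbstractDataFrame) = Dict(names(df) .=> eltype.(eachcol(df)))

df = DataFrame(A = [Date(2020, 1, 1)], B = [1], C = [2], D = [3])
permuted = select(df, ["D", "C", "B", "A"])  # same columns, different order

# The positional comparison is sensitive to column order...
same_by_position = eltype.(eachcol(df)) == eltype.(eachcol(permuted))

# ...while the name-keyed comparison is not.
same_by_name = column_types(df) == column_types(permuted)

println((same_by_position, same_by_name))
```

Here `same_by_position` is `false` and `same_by_name` is `true`, which matches the intuition that the two data frames have the same schema.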
The compare function in Schemata.jl compares a table to a schema, where the schema can be specified in code but is more easily specified as a YAML file.
My solution was to define a Dict with the required column names and types (abstract supertypes, where applicable) and a function to check if they are in the DataFrame (there may be more columns in the DataFrame, which is fine for me):
const INPUT_ELTYPES = Dict(
    :field_a => AbstractString,
    :field_b => ItemTypes,
    :field_c => AbstractString,
    :field_x => Union{AbstractString, Nothing},
    :field_y => Real,
    :value => Union{Real, Missing},
)

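function check_input_data(df::AbstractDataFrame)
    # All required columns must be present; extra columns are allowed.
    # propertynames returns Symbols, matching the keys of INPUT_ELTYPES.
    @assert keys(INPUT_ELTYPES) ⊆ propertynames(df)
    for (col_name, col_type) in INPUT_ELTYPES
        # The element type of each required column must be a subtype of the declared type.
        @assert eltype(df[!, col_name]) <: col_type
    end
end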
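
As a variation on the check above, the function can collect every mismatch and return them instead of stopping at the first failed assertion, which tends to give more useful feedback. This is only a sketch under assumed names: the schema `REQUIRED`, the function `schema_problems`, and the example columns are all hypothetical.

```julia
using DataFrames

# Hypothetical schema: required column names and the supertypes their elements must satisfy.
const REQUIRED = Dict(:name => AbstractString, :value => Union{Real, Missing})

# Return a list of human-readable problems; an empty list means the check passed.
function schema_problems(df::AbstractDataFrame, schema::AbstractDict)
    problems = String[]
    for (col, T) in schema
        if !(col in propertynames(df))
            push!(problems, "missing column $col")
        elseif !(eltype(df[!, col]) <: T)
            push!(problems, "column $col has eltype $(eltype(df[!, col])), expected <: $T")
        end
    end
    return problems
end

# Extra columns are allowed, so this data frame passes.
ok = DataFrame(name = ["a", "b"], value = [1.0, missing], extra = [true, false])

# Wrong eltype for :name and no :value column, so two problems are reported.
bad = DataFrame(name = [1, 2])

ok_problems = schema_problems(ok, REQUIRED)
bad_problems = schema_problems(bad, REQUIRED)
foreach(println, bad_problems)
```

A caller can then decide whether to throw, log, or ignore the problems, depending on whether the code is user facing.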
Maybe something like this could be added to DataFrames.jl, or a typed data frame could be offered as an alternative type. The latter could be just a thin wrapper over the standard DataFrame type (with untyped columns) to avoid recompilation, with the sole purpose of defining schemas.
There are a couple of other options that haven’t been mentioned.
Wrap a data frame in a struct
You could create various custom types with constructors that represent different tables in your data flow. These custom types with constructors can provide some of the safety you are looking for. However, I’ve tried this before and the thing I didn’t like about it is that all the actions become nouns (the constructors) rather than verbs.
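A minimal sketch of what such a wrapper might look like is below. The type name `DailyCounts`, its required columns, and the function `first_date` are all hypothetical and chosen only for illustration; the inner constructor validates the schema once, and functions can then dispatch on the wrapper type.

```julia
using DataFrames
using Dates

# Hypothetical wrapper type: constructing it validates the wrapped data frame once.
struct DailyCounts
    df::DataFrame

    function DailyCounts(df::DataFrame)
        required = Dict(:A => Date, :B => Integer)  # assumed schema for illustration
        for (col, T) in required
            col in propertynames(df) ||
                throw(ArgumentError("missing required column $col"))
            eltype(df[!, col]) <: T ||
                throw(ArgumentError("column $col must have eltype <: $T"))
        end
        return new(df)
    end
end

# Functions that accept the wrapper can rely on the schema through dispatch.
first_date(t::DailyCounts) = first(t.df.A)

# Extra columns (here :C) are accepted.
good = DailyCounts(DataFrame(A = [Date(2020, 1, 1)], B = [1], C = [2]))
println(first_date(good))

# A data frame without column :B is rejected at construction time.
rejected = try
    DailyCounts(DataFrame(A = [Date(2020, 1, 1)]))
    false
catch err
    err isa ArgumentError
end
```

Note that this only checks the schema at construction. The wrapped `DataFrame` is still mutable, so columns could be removed or replaced afterwards unless the code that uses the wrapper avoids doing so.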
Let it fail
This is the approach I’ve taken. When I’m writing internal code (in other words, code that is not user facing), I can just assume that I, the programmer, am smart enough to provide the right kind of data frame. In fact, this approach applies to a lot of things. When you are writing internal code, you have to assume to some degree that the developer (not the user) knows what they are doing and is providing the correct input to functions. You typically don’t fill your code with assertions (@assert) in every function.
I think for now I will be using @CameronBieganek’s suggestion to simply let it fail. The code I am working on now should not be user facing (I think). That said, I think something along the lines of what @lungben described would be best for my particular application. I also don’t care if there are extra columns or if they are in a different order, only that the minimum set required by the given function is present.