ANN: IterableTables.jl - a lightweight abstract table interface

Hi all,

I just tagged v0.0.2 of IterableTables.jl. IterableTables is a lightweight abstract table interface that makes it easier to convert between different table types. It also lets you use many table types in situations that traditionally required a DataFrame. Finally, it integrates tightly with Query.jl, so that any iterable table can be queried using that package.

Overview

The package implements the iterable tables interface for the following types/packages: DataFrames, DataStreams (including CSV, Feather, SQLite, ODBC), DataTables, IndexedTables, TimeSeries, TypedTables, DifferentialEquations (any DESolution) and any iterator that produces elements of type NamedTuple. Essentially, all of these types become iterable tables once you load IterableTables, and they can be passed to any function or constructor that expects an iterable table.
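To make the last item in that list more concrete, here is a minimal sketch of an iterator that produces NamedTuple elements. It uses only Base Julia and assumes a Julia version where NamedTuple is built in (1.0 or later). The variable names are made up for illustration, and nothing here calls into IterableTables itself.

```julia
# Hypothetical illustration: a lazy iterator whose elements are named tuples.
person_names = ["John", "Sally", "Jim"]
person_ages = [34.0, 25.0, 67.0]

# Each element is one row with named fields
rows = ((Name = n, Age = a) for (n, a) in zip(person_names, person_ages))

# Rows are produced on demand, and fields are accessed by name
first_row = first(rows)
println(first_row.Name, " is ", first_row.Age)
```

The point of the sketch is only that a "table" here is nothing more than a stream of rows with named fields.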

Iterable tables can be consumed by a number of functions. In IterableTables the functions that can consume iterable tables can broadly be grouped into a) constructors of table types and b) functions that traditionally expected a DataFrame argument.

IterableTables adds constructors that expect an iterable table to the following types: DataFrames, DataTables, IndexedTables, TimeSeries and TypedTables. If you call any of these constructors and pass an iterable table as the argument, the package will create an instance of the respective type and copy the data from the argument into that new instance.
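Conceptually, such a constructor walks over the rows of its argument and copies the values into freshly allocated storage. The following is a rough sketch of that idea in Base Julia (1.0 or later), not the code the package actually uses; `collect_columns` is a hypothetical helper, and the "table" it returns is just a named tuple of column vectors.

```julia
# Hypothetical sink: copies every row of a named-tuple iterator
# into new column vectors, leaving the source untouched.
function collect_columns(rows)
    collected = collect(rows)            # materialize the rows once
    isempty(collected) && error("this sketch needs at least one row")
    colnames = keys(first(collected))    # tuple of column names, e.g. (:Name, :Age)
    # Build one new vector per column
    columns = map(name -> [getfield(row, name) for row in collected], colnames)
    return NamedTuple{colnames}(columns)
end

source = [(Name = "John", Age = 34.0), (Name = "Sally", Age = 25.0)]
table = collect_columns(source)
```

After this, `table.Name` and `table.Age` are new vectors that hold copies of the source data.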

IterableTables also adds methods that expect an iterable table to a number of functions in various packages. These include all the modeling functions in DataFrames and StatsModels (for example, you can now run a linear regression on data stored in a DataTable using the GLM package). It also adds methods to Gadfly and VegaLite that allow you to plot any iterable table, not just DataFrames. Finally, it integrates with the DataStreams stack so that you can write out any iterable table as either a CSV or a Feather file.

The integration with Query allows you to query an iterable table source, and you can materialize any query that creates a table-like result into any of the types that have a constructor that accepts an iterable table.

Examples

Let's say you start with a DataFrame:

using DataFrames

df = DataFrame(Name=["John", "Sally", "Jim"], Age=[34.,25.,67.], Children=[2,0,3])

Using IterableTables you can easily convert this DataFrame into many other table types:

using DataTables, TypedTables, IterableTables

# Convert to a DataTable
dt = DataTable(df)

# Convert to a TypedTable
tt = Table(df)

These conversions work in any direction, i.e. you could also have started with a DataTable or any of the other supported table types.

The integration with packages like Gadfly or the statistical functionality is equally simple. For example, to run a regression on the TypedTable we just created, you would simply write:

using GLM

# Run a regression on a TypedTable
lm(@formula(Children ~ Age), tt)

Or, to plot the DataTable we created, you would write:

using Gadfly

# Plot a DataTable
plot(dt, x=:Age, y=:Children, Geom.line)

Finally, this is completely integrated with Query. For example, the following query starts with a TypedTable and then materializes the results into a DataTable:

new_dt = @from i in tt begin
    @where i.Age > 30
    @select {i.Name, i.Children}
    @collect DataTable
end
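If the query syntax is new to you, it may help to see what this particular query computes, written as a plain comprehension over named tuples. This is only a sketch in Base Julia (1.0 or later) with made-up variable names; it says nothing about how Query evaluates the query internally, and it produces a plain vector rather than a DataTable.

```julia
# Same three rows as the DataFrame above, as a vector of named tuples
people = [
    (Name = "John", Age = 34.0, Children = 2),
    (Name = "Sally", Age = 25.0, Children = 0),
    (Name = "Jim", Age = 67.0, Children = 3),
]

# Keep rows with Age > 30 and project onto Name and Children
result = [(Name = i.Name, Children = i.Children) for i in people if i.Age > 30]
```

Here `result` holds the rows for John and Jim, since Sally is filtered out by the age condition.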

Development status

The package should be fairly stable at this point: it has a pretty comprehensive test suite and decent documentation. The various integrations have different levels of polish, i.e. some are highly optimized (DataTables) while others could use some performance work. For now I want to keep all the integration code inside the IterableTables package, mainly because there might be one big change coming in how I handle missing data, and that will be a lot easier if the integration code is not spread out over many different packages. Long term, I hope to convince the maintainers of the various integrated packages to move the code that is specific to their packages into their own code bases.

If you have a package where you either want to consume an iterable table or you would like to expose your own type as an iterable table, please open an issue in the IterableTables repository, so that we can figure out the best way to achieve that.

And of course, as always, any help with the package would be greatly appreciated!

Wow. Thanks for the package. I have for a while been vaguely aware that the data ecosystem is more than just DataFrames and that I should check out the “anthoffverse”, but reading the overview here has really moved that to the top of the priority list!