Hi all,
I just tagged version v0.0.2 of IterableTables.jl. IterableTables is a lightweight abstract table interface that makes it easier to convert between different table types. It also enables the use of many table types in situations that traditionally required a DataFrame. Finally, it integrates tightly with Query.jl, so that any iterable table can be queried using that package.
Overview
The package implements the iterable tables interface for the following types/packages: DataFrames, DataStreams (including CSV, Feather, SQLite, ODBC), DataTables, IndexedTables, TimeSeries, TypedTables, DifferentialEquations (any DESolution) and any iterator that produces elements of type NamedTuple. Essentially, all of these types are iterable tables once you load IterableTables, and they can be passed to any function or constructor that expects an iterable table.
Iterable tables can be consumed by a number of functions. In IterableTables, the functions that consume iterable tables fall broadly into two groups: a) constructors of table types and b) functions that traditionally expected a DataFrame argument.
IterableTables adds constructors that expect an iterable table to the following types: DataFrames, DataTables, IndexedTables, TimeSeries and TypedTables. If you call any of these constructors and pass an iterable table as the argument, the package will create an instance of the respective type and copy the data from the argument into the new instance.
IterableTables also adds methods that expect an iterable table to a number of functions in various packages. These include all the modeling functions in DataFrames and StatsModels (for example, you can now run a linear regression on data stored in a DataTable using the GLM package). It also adds methods to Gadfly and VegaLite that allow you to plot any iterable table, not just DataFrames. Finally, it integrates with the DataStreams stack so that you can write out any iterable table as either a CSV or Feather file.
The integration with Query allows you to query any iterable table source, and you can materialize any query that creates a table-like result into any of the types that have a constructor that accepts an iterable table.
Examples
Let's say you start with a DataFrame:
using DataFrames
df = DataFrame(Name=["John", "Sally", "Jim"], Age=[34.,25.,67.], Children=[2,0,3])
Using IterableTables you can easily convert this DataFrame into many other table types:
using DataTables, TypedTables, IterableTables
# Convert to a DataTable
dt = DataTable(df)
# Convert to a TypedTable
tt = Table(df)
These conversions work in any direction, i.e. you could also have started with a DataTable or any of the other supported table types.
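As a minimal sketch of the reverse direction: this assumes the package versions current at the time of this release, and that DataTable accepts the same keyword-argument constructor as DataFrame (an assumption, it is not shown elsewhere in this post). The variable names are only for illustration.

```julia
using DataFrames, DataTables, IterableTables

# Start from a DataTable instead of a DataFrame
# (keyword constructor assumed to mirror the DataFrame one)
dt2 = DataTable(Name=["John", "Sally", "Jim"], Age=[34.,25.,67.], Children=[2,0,3])

# Convert to a DataFrame; the data is copied into the new instance
df2 = DataFrame(dt2)
```

The same pattern should apply to the other supported types: pass the source table to the constructor of the target type.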
The integration with packages like Gadfly or the statistical functionality is equally simple. For example, to run a regression on the TypedTable we just created, you would simply write:
using GLM
# Run a regression on a TypedTable
lm(@formula(Children ~ Age), tt)
Or to plot the DataTable we created you would do:
using Gadfly
# Plot a DataTable
plot(dt, x=:Age, y=:Children, Geom.line)
Finally, this is completely integrated with Query. For example, the following query starts with a TypedTable and then materializes the results into a DataTable:
using Query
# Query a TypedTable and collect the result into a DataTable
new_dt = @from i in tt begin
    @where i.Age > 30
    @select {i.Name, i.Children}
    @collect DataTable
end
Development status
The package should be fairly stable at this point: it has a pretty comprehensive test suite and decent documentation. The various integrations have different levels of polish, i.e. some are highly optimized (DataTables) while others could use some performance work. For now I want to keep all the integration code inside the IterableTables package, mainly because there might be one big change coming in how I handle missing data, and it will be a lot easier if the integration code is not spread out over many different packages. Long term I hope to convince the maintainers of the various integrated packages to move the code that is specific to their packages into their own code bases.
If you have a package where you either want to consume an iterable table or you would like to expose your own type as an iterable table, please open an issue in the IterableTables repository, so that we can figure out the best way to achieve that.
And of course, as always, any help with the package would be greatly appreciated!