[ANN] TableIO - simple reading and writing of tabular data (experimental)

Hi,

I created a small “glue” package for reading and writing tabular data. It aims to provide a uniform API for multiple sources.
This package is “intelligent” in the sense that it automatically selects the right reading / writing method depending on the given file extension.

Supported file / database formats:

  • CSV
  • Zipped CSV
  • JDF
  • Parquet
  • Excel (xlsx)
  • SQLite
  • PostgreSQL

As data source / data sink, the Tables.jl interface is used (e.g. supporting DataFrames.jl).

The package is still very experimental and not registered yet.
In particular, the dependencies of this package are not modular yet, i.e. you have to install the dependencies for all supported file / database formats, even if you need only one or a few of them.

Comments are welcome!

I think it is a great idea, because it makes it easier to read and work with different formats. In many apps you want to be able to read/write in different formats, and being able to do it transparently could simplify things a lot.

Compare and contrast with FileIO.jl?

Can you support fst via https://github.com/xiaodaigh/FstFileFormat.jl?

Also check out R’s {rio}, as format parity/supremacy vs {rio} would be a nice goal.

Compared to FileIO, the differences are (as far as I can see):

  • I am concentrating on table formats, whereas FileIO supports a large variety of different file formats (not intrinsically tabular). The supported file formats are orthogonal (cf. https://github.com/JuliaIO/FileIO.jl/blob/master/docs/registry.md).
  • TableIO is just a (very thin) wrapper over existing Julia libraries for reading / writing tabular data. It is more for user convenience than implementing “new” logic. FileIO is by far the more sophisticated package :wink:

I am happy to add Fst support (I am not an R user therefore it was not on my radar).
JDF is already supported.
I would also like to add JLD and HDF support in the future (ideally including reading HDF files created in Python), but to my knowledge there are no “ready-to-use” tabular interface packages available for them.

Thanks for the pointer to rio. I was not aware of it, but it seems to have the same intention I had in mind. Feature parity, however, will still be quite some work :wink:

Installing RCall.jl (required for FstFileFormat.jl) does not work for me out-of-the-box (I do not have R installed). Therefore I am reluctant to make it a hard dependency of this package.
It could be an optional dependency, ideally using first-class conditional dependencies in Pkg (JuliaLang/Pkg.jl issue #1285, if this comes for Julia 1.6), or otherwise Requires.jl.

Cool. Are there plans to support SAS files, as in here

https://github.com/queryverse/StatFiles.jl

Done - https://github.com/lungben/TableIO.jl/pull/8

It is just a wrapper around StatFiles.jl load, please let me know if you encounter any issues.

What's the compilation time for this package? It must be enormous, right? Wouldn't a better solution be something like

  1. Make a special type that allows you to dispatch on the file extension
  2. Write an interface for that type along the lines of
read(Sink,  SpecialType)

Then a package will do

read(sink, f::SpecialType{:csv}) = sink(CSV.File(f.filepath))
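
For readers who want to try the idea, here is a minimal self-contained sketch. It is only an assumption about how such an interface could look, not code from TableIO or any existing package: FileRef, fileref and read_rows are hypothetical names, and a toy comma splitter stands in for a real parser such as CSV.jl.

```julia
# Hypothetical names throughout; a toy parser stands in for CSV.jl.

# Wrapper type that carries the file extension as a type parameter,
# so that methods can dispatch on it.
struct FileRef{ext}
    filepath::String
end

# Build a FileRef from a path, e.g. "data.csv" -> FileRef{:csv}.
function fileref(path::AbstractString)
    ext = lowercase(lstrip(splitext(path)[2], '.'))
    return FileRef{Symbol(ext)}(String(path))
end

# Generic fallback for extensions nobody has implemented.
read_rows(sink, ::FileRef{ext}) where {ext} =
    error("no reader registered for .$(ext) files")

# A format package would add one method per extension it supports.
function read_rows(sink, f::FileRef{:csv})
    rows = [split(line, ',') for line in eachline(f.filepath)]
    return sink(rows)
end

# Usage: write a small file, then read it back through the generic entry point.
path = joinpath(mktempdir(), "demo.csv")
write(path, "a,b\n1,2\n")
rows = read_rows(identity, fileref(path))
```

The fallback method keeps the failure mode explicit: an unsupported extension gives a readable error instead of a MethodError.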

I think this is the ideal situation, but it would not work out-of-the-box; it would basically require PRs to be approved in all of the "parent" packages.

But a longer term project might be to open those PRs, and then if/as they are accepted, drop the dependencies :man_shrugging:

This could definitely live in DataAPI, though, I think. So we already have a “parent package” for this, given that the implementation is probably pretty simple.

The read / write functions are already organized in a similar way:

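# One singleton type per supported format
struct CSVFormat <: AbstractFormat end

# Entry point: determine the format type from the file name (via _get_file_type,
# defined elsewhere in the package), instantiate it, and dispatch on it
function read_table(filename::AbstractString, args...; kwargs...)
    data_type = _get_file_type(filename)()
    read_table(data_type, filename, args...; kwargs...)
end

# Format-specific method: forwards to the upstream package
function read_table(::CSVFormat, filename::AbstractString; kwargs...)
    return CSV.File(filename; kwargs...)
end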

The main performance issue is the import time: it currently takes about 30 s to import TableIO, and the vast majority of that is spent importing the upstream packages.
Is there any way to “delay” the import of a package until it is really needed? I think most users will not need more than 1 or 2 file formats at a time. I tried calling @eval import CSV (etc.) directly from the read / write functions, but this failed with a world-age error (“method too new to be called from this world context”).
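
The world-age error arises because methods that come into existence while a function is already running (for example through @eval import ...) are newer than the world that function executes in. Base provides Base.invokelatest for this situation. The sketch below reproduces the situation with a function created at run time instead of a real package import, so it runs without any dependencies; call_fresh_method is a hypothetical name, and whether this pattern would suit TableIO is an assumption.

```julia
# Stand-in for "load code at run time, then call it from the same function".
function call_fresh_method(x)
    # Created by eval while call_fresh_method is already running, so its
    # method is newer than the world this function executes in.
    f = @eval (y -> 2y)
    # Calling f(x) directly here raises the "method too new" MethodError;
    # invokelatest performs the call in the latest world instead.
    return Base.invokelatest(f, x)
end

result = call_fresh_method(21)
```

Calls made through invokelatest are dispatched dynamically, which costs a little at each call; for a function that reads a whole file this is likely negligible.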

It’s more than just import time. When a package has this many dependencies it’s easy to end up in “dependency hell” where you can’t find overlapping available versions of other packages.

I am experimenting with Requires.jl to make most dependencies optional:

https://github.com/lungben/TableIO.jl/pull/9

This makes the package much more lightweight and faster to import, but requires the user to install and import the underlying packages.
Nevertheless, I think this is the better way to go for now - what do you think?

I’m not familiar with how Requires works.

I think you should petition DataAPI to support something akin to what I described above. That way these packages can import the lightweight package DataAPI and things will “just work”

Your suggestion is to add the generic functions read_table and write_table (and probably the file type identifier?) to DataAPI.jl and then let the data parsing packages implement this interface, correct?
This would definitely be helpful for making the package APIs more uniform.

Some of the slightly more complex operations (like zipped csv files or fast Postgres data upload via csv) probably still require dedicated packages, but they could also use the DataAPI interface.
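
A rough sketch of that arrangement, with everything in one file so that it runs without dependencies. TableAPISketch, ToyCSV and ToyFormat are hypothetical names, not DataAPI.jl or CSV.jl; the point is only that the lightweight module declares the generic functions without methods and each parser package adds its own.

```julia
# Hypothetical stand-in for a lightweight interface package such as DataAPI.jl.
module TableAPISketch
    abstract type AbstractFormat end
    # Generic functions declared without methods; parser packages extend them.
    function read_table end
    function write_table! end
end

# Hypothetical stand-in for a parser package that implements the interface.
module ToyCSV
    import ..TableAPISketch: AbstractFormat, read_table, write_table!
    struct ToyFormat <: AbstractFormat end

    # Write each row as one comma-separated line.
    function write_table!(::ToyFormat, filename::AbstractString, rows)
        open(filename, "w") do io
            for row in rows
                println(io, join(row, ','))
            end
        end
        return filename
    end

    # Read the file back as a vector of string vectors.
    read_table(::ToyFormat, filename::AbstractString) =
        [String.(split(line, ',')) for line in eachline(filename)]
end

# User code only needs the generic functions plus a format object.
using .TableAPISketch: read_table, write_table!
file = joinpath(mktempdir(), "demo.csv")
write_table!(ToyCSV.ToyFormat(), file, [["a", "b"], ["1", "2"]])
table = read_table(ToyCSV.ToyFormat(), file)
```

With this layout the interface module has no dependencies of its own, so loading it stays cheap regardless of how many formats implement it.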

The package is now registered.
JSON is now included; Feather and HDF are still on my to-do list. For Fst, I am still struggling with the R installation part (I have never used R before).

Please let me know if you agree with the chosen API:

read_table(filename / db connection, [tablename]) # -> Tables.jl table
write_table!(filename / db connection, [tablename], input_table)

If this turns out to be helpful, we could then try to align the APIs of the various packages themselves, as @pdeffebach suggested, using DataAPI.

Hasn’t FileIO.jl found a solution to avoid loading all dependencies until they are actually needed? I guess so or it would be unusable.

You mean =>?

This was meant to be just pseudo-code; I edited it above to make that clearer.

Yes, I am currently using Requires.jl for this purpose.
My current problem is that installing the binary dependencies of Fst on the R side does not work out-of-the-box (CRAN's binary handling is unfortunately not as smooth as that of Julia's Pkg).