I created a small “glue” package for reading and writing tabular data. It aims to provide a uniform API for multiple sources.
This package is “intelligent” in the sense that it automatically selects the right reading / writing method based on the given file extension.
Supported file / database formats:
CSV
Zipped CSV
JDF
Parquet
Excel (xlsx)
SQLite
PostgreSQL
Data sources and data sinks go through the Tables.jl interface, so any Tables.jl-compatible object (e.g. a DataFrame from DataFrames.jl) can be used.
The package is still very experimental and not registered yet.
In particular, the dependencies of this package are not modular yet, i.e. you have to install the dependencies for all supported file / database formats, even if you only need one or a few of them.
I think it is a great idea, because it makes it easier to read and work with different formats. In many apps you want to be able to read/write in different formats, and being able to do it in a transparent way could simplify that a lot.
TableIO is just a (very thin) wrapper over existing Julia libraries for reading / writing tabular data. It is more about user convenience than about implementing “new” logic. FileIO is by far the more sophisticated package.
I am happy to add Fst support (I am not an R user therefore it was not on my radar).
JDF is already supported.
I would also like to add JLD and HDF support in the future (ideally including reading HDF files created in Python), but to my knowledge there are no “ready-to-use” tabular interface packages available for them.
Thanks for the pointer to rio. I was not aware of it, but it seems to have the same intention I had in mind. Feature parity, however, will still be quite some work.
I think this could definitely live in DataAPI, though. So we already have a “parent package” for this, given that the implementation is probably pretty simple.
The read / write functions are already organized in a similar way:
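import CSV

abstract type AbstractFormat end
struct CSVFormat <: AbstractFormat end

# Generic entry point: map the file extension to a format type, then dispatch on it.
# _get_file_type is defined elsewhere in the package and returns a type such as CSVFormat.
function read_table(filename::AbstractString, args...; kwargs...)
    data_type = _get_file_type(filename)()
    return read_table(data_type, filename, args...; kwargs...)
end

# Format-specific method: CSV.File is itself a Tables.jl-compatible source.
function read_table(::CSVFormat, filename::AbstractString; kwargs...)
    return CSV.File(filename; kwargs...)
end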
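The extension lookup itself is not shown above. A minimal sketch of how such a lookup could work, using only Base so it runs without any parser package installed. The names `FORMATS`, `get_file_type` and `describe` and the `JSONFormat` type are hypothetical and only for illustration; they are not taken from TableIO:

```julia
abstract type AbstractFormat end
struct CSVFormat  <: AbstractFormat end
struct JSONFormat <: AbstractFormat end

# Hypothetical lookup table from (lowercased) file extension to format type.
const FORMATS = Dict(".csv" => CSVFormat, ".json" => JSONFormat)

# Return the format type for a file name, or throw for unknown extensions.
function get_file_type(filename::AbstractString)
    ext = lowercase(splitext(filename)[2])
    haskey(FORMATS, ext) || throw(ArgumentError("unsupported file extension: \"$ext\""))
    return FORMATS[ext]
end

# Stand-ins for the real readers: each method only reports what it would do.
describe(::CSVFormat, filename)  = "would read $filename with a CSV reader"
describe(::JSONFormat, filename) = "would read $filename with a JSON reader"

# Generic entry point: instantiate the format type and let dispatch pick the method.
describe(filename::AbstractString) = describe(get_file_type(filename)(), filename)
```

Adding a format then means adding one struct, one dictionary entry and one method, without touching the generic entry point.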
The main performance issue is the import time: it currently takes about 30 s to import TableIO, and the vast majority of that is spent importing the upstream packages.
Is there any way to “delay” the import of a package until it is really needed? I think most users will not need more than 1 or 2 file formats at a time. I tried calling @eval import CSV (etc.) directly from the read / write functions, but this did not work because of a world age error (“method too new to be called from this world context”).
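One commonly used workaround for this kind of world age error is to route the call through Base.invokelatest, which runs the callee in the newest world. The sketch below uses the standard library Dates as a stand-in for CSV so that it runs without extra packages; `parse_date_lazily` and `_parse_date` are hypothetical names, and whether this pattern is a good fit for TableIO is an open question:

```julia
# Helper that touches the lazily imported module. The global name `Dates`
# is only looked up when this function actually runs.
_parse_date(s) = Dates.Date(s)

function parse_date_lazily(s::AbstractString)
    # Import on demand; after the first call the module is already loaded.
    @eval import Dates
    # Run the helper in the latest world so that everything made available
    # by the import above is visible to it.
    return Base.invokelatest(_parse_date, s)
end
```

Note that Base.invokelatest goes through dynamic dispatch, so it adds some overhead per call. For a one-off file read this is usually negligible compared with the I/O itself.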
It’s more than just import time. When a package has this many dependencies it’s easy to end up in “dependency hell” where you can’t find overlapping available versions of other packages.
This makes this package much less heavy-weight and its import faster, but requires installation and importing of the underlying packages by the user.
Nevertheless, I think this is the better way to go for now - what do you think?
I think you should petition DataAPI to support something akin to what I described above. That way these packages can import the lightweight package DataAPI and things will “just work”.
Your suggestion is to add the generic functions read_table and write_table (and probably the file type identifier?) to DataAPI.jl and then let the data parsing packages implement this interface, correct?
This would definitely be helpful for making the package APIs more uniform.
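To make the idea concrete, here is a rough sketch of that pattern with two toy modules. `MiniDataAPI` is a hypothetical stand-in for a lightweight interface package (it is not the real DataAPI.jl), and `ToyCSV` stands in for a parser package. The interface module only declares the function names, and the parser module adds methods for its own format type:

```julia
module MiniDataAPI            # hypothetical lightweight interface package
    function read_table end   # generic functions without any methods
    function write_table! end
end

module ToyCSV                 # hypothetical parser package
    import ..MiniDataAPI

    struct Format end

    # Read comma-separated text into a vector of NamedTuples (all values as strings).
    function MiniDataAPI.read_table(::Format, io::IO)
        lines = filter(!isempty, readlines(io))
        names = Tuple(Symbol.(split(lines[1], ',')))
        return [NamedTuple{names}(Tuple(String.(split(l, ',')))) for l in lines[2:end]]
    end

    # Write the rows back as comma-separated text with a header line.
    function MiniDataAPI.write_table!(::Format, io::IO, rows)
        isempty(rows) && return io
        println(io, join(keys(first(rows)), ','))
        for row in rows
            println(io, join(values(row), ','))
        end
        return io
    end
end

rows = MiniDataAPI.read_table(ToyCSV.Format(), IOBuffer("a,b\n1,2\n3,4"))
```

Code that only depends on the interface module can call `MiniDataAPI.read_table` without loading any parser until a concrete format is needed. A vector of NamedTuples is used as the return value here because it is one of the simplest row-table shapes that Tables.jl accepts.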
Some of the slightly more complex operations (like zipped csv files or fast Postgres data upload via csv) probably still require dedicated packages, but they could also use the DataAPI interface.
The package is now registered.
JSON is now included; Feather and HDF are still on my to-do list. For Fst, I am still struggling with the R installation part (I have never used R before).
Please let me know if you agree with the chosen API:
read_table(filename / db connection, [tablename]) # -> Tables.jl table
write_table!(filename / db connection, [tablename], input_table)
If this turns out to be helpful, we could then try to align the APIs of the various packages themselves, as @pdeffebach suggested, using DataAPI.
Yes, I am currently using Requires.jl for this purpose.
My current problem is that installing the binary dependencies of Fst on the R side does not work out of the box (CRAN's binary handling is unfortunately not as smooth as that of Julia's Pkg).