Working DataFrame input and output formats in Julia 1.0, August 2018

I am trying to find out which data formats Julia 1.0 can currently [August 2018] use to read and write data frames. As a sample, I am using something simple like

julia> df
6×4 DataFrame
│ Row │ n1 │ n2   │ n3        │ n4  │
├─────┼────┼──────┼───────────┼─────┤
│ 1   │ 99 │ 9801 │ -0.999207 │ 'a' │
│ 2   │ 1  │ 1    │ 0.841471  │ 'b' │
│ 3   │ 3  │ 9    │ 0.14112   │ 'c' │
│ 4   │ 5  │ 25   │ -0.958924 │ 'd' │
│ 5   │ 7  │ 49   │ 0.656987  │ 'e' │
│ 6   │ 9  │ 81   │ 0.412118  │ 'f' │

I plan to retry this with missing once I know I have the basics working.

  • I know that Serializer works, but it is not a long-term storage format.

  • The most important I/O format may well be csv files.

    • I do not know if DelimitedFiles works with data frames. I have confirmed that some readdlm() calls work well, but the following throws ‘ERROR: MethodError: no method matching iterate(::DataFrame)’:
fo= open("sample-df.tab", "w"); writedlm(fo, df, '\t'); close(fo);
    • or is CSV.jl now the preferred CSV reader/writer?

    • is there native .csv.gz support, or do I write to a pipe?

    • CSV reading and writing, especially from gzip-compressed files, needs to be fast.

    • ideally, I get one great package, rather than a few almost-working ones.

  • JLD does not work: the "using JLD" statement itself dies.

  • is there an SQLite writer?

  • are there other important input/output formats for data frames?
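
On the writedlm error above: writedlm expects a vector, a matrix, or an iterable of rows, and a DataFrame is none of these. One possible workaround, assuming a DataFrames.jl version where Matrix(df) is defined (older releases may need convert(Matrix, df) instead), is to convert first. This is only a sketch: it drops the column names, and the small table and temporary path are made up for illustration.

```julia
using DataFrames, DelimitedFiles

# Hypothetical all-integer sample table (not the one shown above).
df = DataFrame(n1 = [99, 1, 3], n2 = [9801, 1, 9])

# Illustrative temporary location for the tab-separated file.
path = joinpath(mktempdir(), "sample-df.tab")

# Convert to a plain matrix so writedlm has something it can iterate over.
open(path, "w") do fo
    writedlm(fo, Matrix(df), '\t')
end

# Read the values back as an integer matrix (column names are not stored).
m = readdlm(path, '\t', Int)
```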

pointers appreciated.

regards, /iaw

The DataStreams.jl package defines a common “table IO” interface that packages can implement to get automatic integration with one another. DataFrames.jl implements it, so CSV.jl, SQLite.jl, Feather.jl, ODBC.jl, MySQL.jl, LibPQ.jl, and others all automatically work with DataFrames.
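
As a rough sketch of what that integration looks like from the user's side, a CSV round trip could look like the following. It assumes reasonably recent CSV.jl and DataFrames.jl releases; the exact calls have changed between versions, so check them against whatever is installed. The small table and the temporary file path are made up for illustration.

```julia
using CSV, DataFrames

# Hypothetical sample table, loosely modelled on the one in the question.
df = DataFrame(n1 = [99, 1, 3], n2 = [9801, 1, 9], n4 = ["a", "b", "c"])

# Write to a temporary CSV file (path is illustrative only).
path = joinpath(mktempdir(), "sample-df.csv")
CSV.write(path, df)

# Read it back; CSV.File yields a table that DataFrame can consume.
df2 = CSV.File(path) |> DataFrame
```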

There’s also the TableTraits.jl package, which defines a similar, more minimal interface for tabular formats and also provides easy interop between formats (see IterableTables.jl for a list of implementations).

A bunch of us hacked together at JuliaCon in London a few weeks ago to come up with ideas for a new Tables.jl package that aims to unite these two common table interface packages into a single, simplified interface that should provide even wider integration and ease of use. It’s not registered yet as there are a few details being worked out, but stay tuned for more to come soon!


thx, q. I was looking for a high-level description of DataStreams.jl, but could only find the lower-level function docs.

Is DataStreams.jl something that is exposed to the user (like Plots.jl), and handles the handing off to reader/writer packages; or is it something that is used internally by package writers, e.g., for CSV.jl, etc.?

If it is the former, are there a few usage examples?