ANN File IO for tabular data

announcement

#1

I created a family of new packages that provide a nice user API for doing file IO with tabular data. These packages all integrate with FileIO.jl and IterableTables.jl. Currently I support loading of CSV files (CSVFiles.jl), Excel files (ExcelFiles.jl), Feather files (FeatherFiles.jl) and various statistical software file formats (Stata, SPSS, or SAS, StatFiles.jl). I also support saving of tabular data as CSV files (CSVFiles.jl) and Feather files (FeatherFiles.jl). Both loading and saving of tabular files works with any of the more than a dozen tabular data structures that are integrated with IterableTables.jl, e.g. all your favorites like DataFrames, DataTables etc. work, but also more recent additions to tabular data like IndexedTables or JuliaDB and more.

The basic usage looks like this:

df = DataFrame(load("data.csv"))

save("output.csv", df)

This pattern works with any of the supported file formats (i.e. if you use load("data.feather") it will load a feather file etc.), and the return value of the load function can be passed to any function that accepts an iterable table, which includes all the constructors of the data structures that are integrated with iterable tables. The save function accepts any iterable table as its second argument, and can save tables as CSV or feather files. If you want to use a different data structure than DataFrame, you can just use the same type of code from above by replacing the call to DataFrame with a call to some other data structure.

I also recently added a new piping syntax to the whole IterableTables.jl and Query.jl family of packages, which enables things like this:

load("data.csv") |>
@query(i, begin
    @where i.age > 30
    @select {i.name, i.age}
end) |>
save("result.feather")

To learn more about additional arguments for the load and save function for specific file formats, take a look at the documentation for the four packages that implement this functionality: CSVFiles.jl, ExcelFiles.jl, FeatherFiles.jl and StatFiles.jl.


What fileformat to use to load data for high performance computing
#2

This is great! I really enjoy the IterableTables.jl interface over DiffEq because then it allows the results of simulations to truly act like generated tabular data. These tools really show off how abstraction can really make tools much nicer to work with. Keep up the good work!


#3

That’s really cool that StatFiles will load Stata .dta files. I wouldn’t have expected Julia to have support for something that exotic yet.


#4

Looks really nice.

@davidanthoff, there are currently 3 things that are preventing me from fully embracing your very nice ecosystem right now, I wonder if you could address these

  1. Query is usually very slow on “large” (~10GB) datasets. We’ve discussed this before, but I certainly never came to any understanding of why it is slow.

  2. We need some sort of batching for datasets which are too large to fit in memory. My approach has been to use my PR for Feather to pull only chunks of feather files at a time. Even that, however, does not allow me to easily do Query in batches (as far as I know).

  3. It’s still really hard to get things into actual machine-ingestable formats, and then back into datasets from those formats. Essentially one needs to spit out a rank-N tensor of floats and then take a rank-M tensor of floats and stick it back into the same (or similar) dataset. This is still the problem I spend most of my time dealing with. There are millions of tools for manipulating tables, but practically none for achieving this final step.


#5

Is StatFiles supposed to work automatically? I get

INFO: Changing Directory to /home/andrew/.julia/v0.6/ReadStat/deps/src/readstat-0.1.1
/usr/bin/ld: cannot find -llzma
collect2: error: ld returned 1 exit status
make: *** [libreadstat.la] Error 1
==============================[ ERROR: ReadStat ]===============================

LoadError: failed process: Process(`make`, ProcessExited(2)) [2]
while loading /home/andrew/.julia/v0.6/ReadStat/deps/build.jl, in expression starting on line 34

I see this note in ReadStat.jl saying “To use the module, you first need to ensure that libreadstat.dylib is in Julia’s load path. You can compile it from ReadStat”. Isn’t the package trying to do this automatically?


#6

It might well be because I haven’t spent much time on performance work. There are a few query operations that I tried to make somewhat performant so far (@select, @where and some @group statements), the rest has not been optimized at all. I think (hope?) that the overal design doesn’t have any problem with very large datasets, but I think we will really only know once I or someone else has made sure what we have now is actually optimized code end-to-end. I should also add that there are many layers at which one can work on performance. Right now I’m mostly focusing on an iteration based backend for Query.jl, but the overall design should allow for many alternative backends. Those are a lot more work to implement, but if we run into a hard wall with the iteration based backend, we should still be able to achieve much better performance by moving to some other backend approach.

In general, if you have a concrete example of a query that is slow, please open an issue. That tends to trigger my curiosity and get me to try to optimize that :slight_smile:

For some queries, Query.jl should work in a full streaming mode. I think the data source that is being queried could implement a batching scheme if that is required and then things should just work. For other queries, for example those that sort, or group or join, things get a lot more complicated. At the end of the day one would reimplement much of a traditional database if those operations should work with data that doesn’t fit into memory. I think in general Query.jl should be compatible with such a design, if one leaves the iteration based backend and writes a custom backend, but it would be a lot of work.

I’m not very familiar with that use case, if you could post a more concrete example, it would help me a lot!


#7

Yes, it should just work automatically. I think you are running into some problem with ReadStat.jl, the package that is used under the hood to read those files. I spent a lot of time getting that package to work on Windows last year, but I’m a lot less familiar with Linux/Mac systems, and I think you are running one of those? Probably best if you open an issue over in ReadStat.jl about this.


#8

At some point I’ll have an example you can go to, but for now let’s try the following description.

Suppose I have a table that has a DateTime column, a String column and two Float64 columns. I may, for example, need to get this into the form of an Matrix{Float32}. The String may represent categorical data, so it may have to be mapped to integer designations, which may then be converted to floats (in some cases this will be further transformed into a “one hot” representation, but that can usually be achieved fairly easily within the machine learning framework itself). The DateTime might have to be converted to Float32's representing, for instance, the number of seconds past a reference time. After feeding this Matrix into some machine learning, I’ll get back a Matrix that I’ll need to append to the original dataset in some way.

This gives a rough idea of the most basic problem. Things get way more complicated when you start doing stuff with time series and require rank-3 tensors, but even this most basic case often requires a surprising amount of manual work.

Addendum By the way, I did a little messing around today, and Query was not being nearly as slow as I remember. As I’ve said in the past, because of my points 2 and 3 above, I haven’t had that much incentive to look carefully at performance issues in Query. If it some point I think it’s a good path forward for me, I’ll get a much better understanding of the circumstances in which it is slow. In general I have found the performance rather unpredictable, this may well be my fault.


#9

I think unpredictable is what I would expect at this point :wink: Given that I optimized random parts of it, but not others, and no one other than me knows which parts that would be, it almost must appear unpredictable to folks :slight_smile: I hope to find some time this summer to do one systematic performance sweep over all parts.