Suggested formats for saving and serialization

package
data

#1

I was wondering what the suggested formats for saving and serializing are these days. I am really confused about the landscape of DataTables, DataFrames, databases, etc. I was hoping to write a few functions for the DiffEq solution type to save to some common data formats, but am not sure what I should be targeting. It should be something that would work well with statistics, plotting, and machine learning libraries. Or maybe the approach is just to go generic: I know there are things like DataStreams which are “independent readers”. Is there something that works in the reverse direction, which I can target so that the user can choose which data type they want out? Would that even be necessary? I am hoping someone could guide me in the right direction here. Thanks!

For completeness, I opened an issue on DifferentialEquations.jl related to this topic (and it shows how idea-less I am, except I have had requests for something of this nature):

Thanks in advance for any ideas.

Also, is there by any chance a form of serialization for types which hold functions? I know JLD hasn’t worked since v0.5, and am wondering if there’s anything along these lines.


#2

CSV and HDF5 are probably the most widespread open formats for numerical data.
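As a rough illustration of the CSV route, here is a minimal sketch that uses only Julia's standard library `DelimitedFiles` (assuming Julia 1.x). The arrays `t` and `u` and the column names `t`, `u1`, `u2` are made-up stand-ins for a solution's time points and states; this is not the actual DiffEq solution type or a recommended layout.

```julia
using DelimitedFiles

# Hypothetical stand-in for a solution: time points and one state vector per point.
t = [0.0, 0.5, 1.0]
u = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]]

# Stack the state vectors into a matrix with one row per time point,
# then prepend the time column: each row is [t u1 u2].
states = reduce(vcat, permutedims.(u))
table = hcat(t, states)

# Write a header row followed by the numeric rows, comma-separated.
path = joinpath(mktempdir(), "solution.csv")
open(path, "w") do io
    writedlm(io, ["t" "u1" "u2"], ',')
    writedlm(io, table, ',')
end

# Read the numbers back, skipping the header line.
roundtrip = readdlm(path, ',', Float64; skipstart = 1)
```

A flat "one row per time point" layout like this is easy to open in other tools, but it loses anything that is not a number, such as interpolation data or the problem definition.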


#3

DataStreams are not just for reading, they are also for writing. I would suggest implementing a DataStreams Source and letting the user choose what output format they want to use (possibly with a default format if you want).


#4

Are there any examples of how to set up a DataStreams Source anywhere? Are there examples of how to take an arbitrary Source and write it to a DataFrame?


#5

You wouldn’t deal with DataFrame at all. You would just implement the Source interface, and the code living in DataFrames would take care of creating the object. For an example, you can have a look at the DataFrames code implementing a Source: https://github.com/JuliaStats/DataFrames.jl/pull/1174

CSV.jl is another possibly useful example.


#6

I know I don’t need to write that code. But I was wondering what the code looks like for generating a DataFrame from an arbitrary Source.


#7

I just put together a quick and dirty integration with IterableTables in this PR https://github.com/davidanthoff/IterableTables.jl/pull/22. With that you can easily convert a DESolution into any of the supported table sink types, e.g. things like DataTable(sol) will work to create a DataTable from a DESolution instance. Essentially you get support for all the sinks that are listed in the README, plus of course full Query integration, i.e. one can easily run queries against a DESolution instance. You also get integration with DataStreams from that “for free”: you can use IterableTables.get_datastreams_source(sol) to create a DataStreams.Source from your solution (I’m still trying to figure out an easier way to handle that particular integration from a user point of view).

I’m currently just waiting for the package to be registered: https://github.com/JuliaLang/METADATA.jl/pull/8878. And then we would have to clean up that PR a bit more before I could merge it.
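To make the "solution as a table source" idea above concrete, here is a package-free sketch (assuming Julia 1.x). It is not the IterableTables or DataStreams API: `ToySolution`, `rows`, and the column names `timestamp`, `value1`, `value2` are all hypothetical, chosen only to show how a solution can be exposed as an iterator of rows that any sink can consume.

```julia
# Hypothetical stand-in for a DE solution: time points plus one state vector per point.
struct ToySolution
    t::Vector{Float64}
    u::Vector{Vector{Float64}}
end

# Expose the solution as an iterator of rows, one NamedTuple per time point.
# This toy version assumes exactly two state components.
rows(sol::ToySolution) =
    ((timestamp = sol.t[i], value1 = sol.u[i][1], value2 = sol.u[i][2])
     for i in eachindex(sol.t))

sol = ToySolution([0.0, 0.5, 1.0], [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]])

# A "sink" only needs to know how to consume rows. Here two trivial sinks:
# a vector of rows, and a NamedTuple of column vectors.
table = collect(rows(sol))
columns = (timestamp = [r.timestamp for r in table],
           value1    = [r.value1 for r in table],
           value2    = [r.value2 for r in table])
```

The point of the design is that the solution type only describes its rows once, and the choice of output container is left to whoever consumes them.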


#8

That sounds great! The API looks very simple too. That looks like the solution I was needing. Thanks! I’ll comment on the PR.


#9

Don’t say that, it will just discourage me from writing the documentation I really should be writing for this :wink: