JDF is the Julia DataFrames serialization format. It’s a specialised serialization format and hence, unlike JLD2 and JLSO, doesn’t support arbitrary objects. This loss of generality is more than made up for by gains in speed and reliability when saving and loading DataFrames.
JDF now supports DataFrames containing these types:
WeakRefStrings.StringVector
Vector{T}
CategoricalArrays.CategoricalVector{T}
where T can be String, Bool, and isbits types, i.e. UInt*, Int*,
Float*, and Date* types, etc.
RLEVectors support will be considered in the future when missing support
arrives for RLEVectors.jl.
Also, there is now the ability to load only the columns you select. For example
This is interesting for us. We use DataFrames in DrWatson’s collect-results functionality (see “Running & Listing Simulations” in the DrWatson docs). We scan your directory, gather all your simulations into a DataFrame, and then save it. We re-use existing DataFrames, which can also get big, so read and write performance gains are important.
At the moment, though, I don’t think we can move to JDF because of the type limitations. I’ll keep watching this post to see if there are fewer restrictions on types as time progresses. (@JonasIsensee, I’m tagging you as this may be interesting for you.)
What types do you need? If it’s a small list I can try to prioritise them.
In JDF.jl v0.3 (the next version) there won’t be type restrictions, but some types might be slow to save, as there may not be specialised algorithms for saving them.
Yeah, that’s the problem: I don’t know in advance what types users may have created in their simulations that they want to save. Seems like 0.3 does exactly what we need though!
By the way, I think Symbol should also have a “fast” implementation. At the moment I use Symbols as parameters to represent complicated functions that I don’t want to save in my DataFrame, and during my simulations I @eval those symbols.
In that case, JDF.jl may only yield a speed benefit if you use Julia 1.3, thanks to multithreading, because JDF would need to rely on JLSO.jl or the like to serialize arbitrary types anyway.
Funny that! I was just thinking about Symbols. JDF.jl v0.2.1 it is!
I have one comment on the types:
At the moment we have columns with custom types (or collections of types, i.e. Array of Any)
in our aggregated DataFrames, but I suppose this is not really necessary if we find a different proper representation.
Strings are probably not effective as you can’t query into the values easily anymore.
Converting to NamedTuples could be an option, or would that also not be fast?
But you have to make sure that every element in your Vector{NamedTuple} has the same NamedTuple structure. Otherwise it will fail, and I need better error messages for that case.
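One way to catch that failure before saving is to check that every element shares the same concrete NamedTuple type. This is a minimal sketch in plain Julia; `same_structure` is a hypothetical helper for illustration and is not part of JDF.jl:

```julia
# Hypothetical helper (not part of JDF.jl): returns true when every element
# of `v` has the same concrete type, i.e. the same field names and field types.
same_structure(v::AbstractVector) =
    isempty(v) || all(x -> typeof(x) === typeof(first(v)), v)

# Both elements have fields (a::Int, b::Float64).
uniform = [(a = 1, b = 2.0), (a = 3, b = 4.0)]

# The second element has a different field name and type.
mixed = Any[(a = 1, b = 2.0), (a = 3, c = "x")]

same_structure(uniform)  # true
same_structure(mixed)    # false
```

Running a check like this on a column before saving would at least tell you which columns mix structures, without relying on the error raised during serialization.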
In fact, all isbits types are supported, as are structs whose fields are all isbits types. TimeZones.jl support is coming in an upcoming release, too.
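Whether a type falls into this category can be checked with Base’s `isbitstype`. The sketch below uses two made-up structs, `Point` and `Labelled`, purely for illustration; it only shows how Julia classifies the types and does not call JDF.jl:

```julia
# An immutable struct whose fields are all isbits types is itself isbits.
struct Point
    x::Float64
    y::Int32
end

# A struct holding a String is not isbits, because String is heap-allocated.
struct Labelled
    name::String
    value::Float64
end

isbitstype(Int64)     # true
isbitstype(Point)     # true
isbitstype(String)    # false
isbitstype(Labelled)  # false

# For a column, check the element type of the vector.
col = [Point(1.0, 2), Point(3.0, 4)]
isbitstype(eltype(col))  # true
```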