[ANN] JDF.jl v0.2.0 - Julia DataFrames serialization format

JDF is the Julia DataFrames serialization format. It’s a specialised format, so unlike JLD2 and JLSO it doesn’t support arbitrary objects. This loss of generality is more than made up for by gains in speed and reliability when saving and loading DataFrames.

JDF now supports DataFrames containing these column types:

  • WeakRefStrings.StringVector
  • Vector{T}
  • CategoricalArrays.CategoricalVector{T}

where T can be String, Bool, or any isbits type, e.g. the UInt*, Int*,
Float* and Date* types.
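
To see which element types fall into the isbits group, Base's `isbitstype` can be queried directly. The snippet below is only a sketch of that distinction and says nothing about how JDF itself decides what to accept; the variable names are made up for illustration.

```julia
using Dates

# Candidate element types for a column.
candidates = [Int64, UInt8, Float32, Bool, Date, DateTime, String, Symbol]

# isbitstype(T) is true for immutable types that hold no references to other objects.
isbits_types     = filter(isbitstype, candidates)
non_isbits_types = filter(!isbitstype, candidates)
```

String is not an isbits type, which is consistent with it being listed separately above.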

RLEVectors support will be considered in the future when missing support
arrives for RLEVectors.jl.

Also, there is now the ability to load only the columns you select. For example

a2_selected = loadjdf("iris.jdf", cols = [:species, :sepalLength, :petalWidth])

From JDF.jl v0.2 onwards, I am committed to making all JDF files loadable in ALL future versions of JDF.jl.

Please see GitHub

If you find JDF.jl useful, please do star the GitHub repo. It keeps me going :slight_smile:


This is interesting for us. We use DataFrames in DrWatson’s collect-results functionality: Running & Listing Simulations · DrWatson. We scan your directory, collect all your simulations into a DataFrame, and then save it. We re-use existing DataFrames, which can also get big, so read and write performance gains are important.

At the moment, though, I don’t think we can move to JDF because of the type limitations. I’ll keep watching this post to see whether there are fewer restrictions on types as time progresses. ( @JonasIsensee I’m tagging you as this may be interesting for you )

What types do you need? If it’s a small list I can try to prioritise them.

In JDF.jl v0.3 (the next version) there won’t be type restrictions, but some types might be slow to save, as there may not be specialised algorithms for saving them.


Yeah, that’s the problem: I don’t know in advance what types users may have created in their simulations that they want to save. Seems like 0.3 does exactly what we need though!

By the way, I think Symbol should also have a “fast” implementation. At the moment I use Symbols as parameters to represent complicated functions that I don’t want to save in my DataFrame, and during my simulations I @eval those symbols.

In that case, JDF.jl may only yield a speed benefit if you use Julia 1.3, thanks to multithreading, because JDF would need to rely on JLSO.jl or the like to serialize arbitrary types anyway.
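
As a rough illustration of why multithreading can help even with a generic serializer, here is a sketch that serializes each column on its own task. It is not JDF's implementation: the `to_bytes` helper, the NamedTuple standing in for a DataFrame, and the use of the Serialization standard library are all assumptions made for the example. `Threads.@spawn` requires Julia 1.3 or later.

```julia
using Serialization

# Hypothetical helper: serialize one column into its own byte buffer.
function to_bytes(col)
    io = IOBuffer()
    serialize(io, col)
    return take!(io)
end

# Stand-in for a table: a NamedTuple of column vectors (not a real DataFrame).
columns = (id = collect(1:1_000), score = rand(1_000), name = string.(1:1_000))

# One task per column; tasks run in parallel when Julia is started with several threads.
tasks   = [Threads.@spawn(to_bytes(col)) for col in columns]
buffers = fetch.(tasks)

# Round trip: each buffer deserializes back to its column.
restored = [deserialize(IOBuffer(b)) for b in buffers]
```

With a single thread the tasks simply run one after another, so any gain depends on starting Julia with more than one thread.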

Funny that! I was just thinking about Symbols. JDF.jl v0.2.1 it is!

Hey, this is cool work!

I have one comment on the types:
At the moment we have columns with custom types (or collections of types, i.e. an Array of Any)
in our aggregated DataFrames, but I suppose this is not really necessary if we find a different, proper representation.
Strings are probably not effective, as you can’t easily query into the values anymore.
Converting to NamedTuples could be an option, or would that also not be fast?

You can already save NamedTuples, provided that all the values inside each NamedTuple are isbits and all the NamedTuples have the same structure. See this example:

using DataFrames, JDF

adf = DataFrame(a = [(ok = 2, lah = 2), (ok = 3, lah = 3)])

savejdf(adf, "c:/plsdel.jdf")
adf_copy = loadjdf("c:/plsdel.jdf") # same as adf

adf_copy == adf # true

and it’s pretty fast and that’s because

isbits((ok = 2, lah = 2)) # is true

But you have to make sure that every element in your Vector{NamedTuple} has the same NamedTuple structure. Otherwise it will fail, and I need better error messages for that case.
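
One way to check this up front is to compare element types before saving. The helper below is a hypothetical sketch (the name `same_structure` is not part of JDF) that uses only Base functions.

```julia
# Hypothetical helper: true when every element has exactly the same concrete type,
# i.e. the same field names and the same field types.
same_structure(v) = !isempty(v) && all(x -> typeof(x) === typeof(first(v)), v)

good = [(ok = 2, lah = 2), (ok = 3, lah = 3)]
bad  = Any[(ok = 2, lah = 2), (ok = 3, other = 3.0)]

same_structure(good)  # true
same_structure(bad)   # false
```

Running such a check before saving would give a clearer failure point than an error raised partway through writing the file.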

Just tagged JDF.jl 0.2.1 with Symbol support. It doesn’t satisfy your use case yet, but it’s something.

See New version: JDF v0.2.1 by JuliaRegistrator · Pull Request #4657 · JuliaRegistries/General · GitHub

This is great. Can it handle Dates and DateTimes?

Yes. See example

using DataFrames, JDF, Dates

df = DataFrame(d = DateTime.(2013:2014), d1 = Date.(2013:2014))

savejdf(df, "date.jdf")

loadjdf("date.jdf")

In fact, all isbits types are supported, as are structs whose fields are all isbits types. TimeZones.jl support is coming in an upcoming release, too.
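
To check whether a struct of your own qualifies, `isbitstype` gives a direct answer. The two structs below are made up purely for illustration.

```julia
# Hypothetical struct whose fields are all isbits types.
struct Reading
    id::Int32
    value::Float64
end

# Hypothetical struct with a String field, which is not an isbits type.
struct Labelled
    id::Int32
    label::String
end

isbitstype(Reading)      # true
isbitstype(Labelled)     # false
isbits(Reading(1, 2.0))  # true
```

Note that a struct declared with `mutable struct` is never an isbits type, even if all of its fields are.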

I just tried it out - very fast and small file sizes.
Great, thanks for your work!
