[ANN] JDF.jl v0.2.0 - Julia DataFrames serialization format

xiaodai · October 23, 2019, 12:18pm

JDF is the Julia DataFrames serialization format. It’s a specialised serialization format and hence doesn’t support arbitrary objects like JLD2 and JLSO. This loss of generality is more than made up for in gains in speed and reliability for saving and loading DataFrames.

JDF now supports DataFrames containing these types:

WeakRefStrings.StringVector
Vector{T}
CategoricalArrays.CategoricalVector{T}

where T can be String, Bool, or any isbits type, e.g. the UInt*, Int*,
Float*, and Date* types.

RLEVectors support will be considered in the future when missing support
arrives for RLEVectors.jl.

Also, you can now load only the columns you select. For example:

a2_selected = loadjdf("iris.jdf", cols = [:species, :sepalLength, :petalWidth])

From JDF.jl v0.2 onwards, I am committed to making all JDF files loadable in ALL future versions of JDF.jl.

Please see GitHub.

If you find JDF.jl useful, please star the GitHub repo. It keeps me going.

Datseris · October 23, 2019, 12:32pm

This is interesting for us. We use DataFrames in DrWatson’s collect-results functionality (see “Running & Listing Simulations” in the DrWatson docs). We scan your directory, gather all your simulations into a DataFrame, and then save it. We re-use existing DataFrames, which can also get big, so read and write performance gains are important.

At the moment, though, I don’t think we can move to JDF because of the type limitations. I’ll keep watching this post to see whether there are fewer restrictions on types as time progresses. (@JonasIsensee, I’m tagging you since this may be interesting for you.)

xiaodai · October 23, 2019, 12:34pm

What types do you need? If it’s a small list I can try to prioritise them.

In JDF.jl v0.3 (the next version) there won’t be type restrictions, but some types might be slow to save, as there may not be specialised algorithms for saving them.

Datseris · October 23, 2019, 12:40pm

Yeah, that’s the problem: I don’t know in advance what types users may have created in their simulations that they want to save. Seems like 0.3 does exactly what we need though!

Datseris · October 23, 2019, 12:47pm

By the way, I think Symbol should also have a “fast” implementation. At the moment I use Symbols as parameters to represent complicated functions that I don’t want to save in my DataFrame, and during my simulations I @eval those symbols.

xiaodai · October 23, 2019, 12:57pm

In that case, JDF.jl may only yield a speed benefit if you use Julia 1.3, because of multithreading; JDF would need to rely on JLSO.jl or the like to serialize arbitrary types anyway.

Funny that! I was just thinking about Symbols. JDF.jl v0.2.1 it is!

JonasIsensee · October 23, 2019, 1:01pm

Hey, this is cool work!

I have one comment on the types:
At the moment we have columns with custom types (or collections of types -> Array of Any)
in our aggregated DataFrames, but I suppose this is not really necessary if we find a different proper representation.
Strings are probably not effective, as you can’t easily query into the values anymore.
Converting to NamedTuples could be an option, or would that also not be fast?

xiaodai · October 23, 2019, 1:07pm

You can already save NamedTuples, provided that all the variables inside each NamedTuple are isbits and the NamedTuples all have the same structure. See this example:

using DataFrames, JDF

adf = DataFrame(a = [(ok = 2, lah = 2), (ok = 3, lah = 3)])

savejdf(adf, "c:/plsdel.jdf")
adf_copy = loadjdf("c:/plsdel.jdf") # same as adf

adf_copy == adf # true

And it’s pretty fast, because

isbits((ok = 2, lah = 2)) # is true

But you have to make sure that every element in your Vector{NamedTuple} has the same NamedTuple structure, or it will fail. I still need to add better error messages for that case.
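As a minimal sketch of how one might check this condition before saving, the snippet below uses only Base Julia. The helper name `is_uniform_isbits` is hypothetical and is not part of JDF.jl; it simply tests whether the vector's element type is a single concrete isbits type.

```julia
# Two vectors of NamedTuples: one homogeneous, one with mixed structure
same  = [(ok = 2, lah = 2), (ok = 3, lah = 3)]
mixed = [(ok = 2, lah = 2), (ok = 3,)]

# Hypothetical helper (an assumption, not a JDF.jl function):
# isbitstype is true only for a concrete, immutable, reference-free type,
# so a vector whose elements differ in structure fails the check.
is_uniform_isbits(v::AbstractVector) = isbitstype(eltype(v))

is_uniform_isbits(same)   # true
is_uniform_isbits(mixed)  # false
```

Running a check like this up front gives a clearer failure than attempting the save and hitting an error midway.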

xiaodai · October 23, 2019, 2:00pm

Just tagged JDF.jl 0.2.1 with Symbol support. It doesn’t satisfy your use case yet, but it’s something.

See New version: JDF v0.2.1 by JuliaRegistrator · Pull Request #4657 · JuliaRegistries/General · GitHub

mwsohn · October 25, 2019, 12:09am

This is great. Can it handle Dates and DateTimes?

xiaodai · October 25, 2019, 12:22am

Yes. See this example:

using DataFrames, JDF, Dates

df = DataFrame(d = DateTime.(2013:2014), d1 = Date.(2013:2014))

savejdf(df, "date.jdf")

loadjdf("date.jdf")

In fact, all isbits types are supported. Structs whose fields are all isbits types are supported as well. TimeZones.jl support is coming in an upcoming release, too.
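To see which of your own types fall into this category, Base Julia's `isbitstype` can be applied to the type itself. The sketch below uses two hypothetical structs, `Reading` and `Labelled`, invented purely for illustration; it says nothing about JDF.jl internals.

```julia
using Dates

# Hypothetical struct whose fields are all isbits types
struct Reading
    id::Int32
    value::Float64
end

# Hypothetical struct with a String field, which holds a reference
struct Labelled
    id::Int32
    name::String
end

isbitstype(Date)       # true
isbitstype(DateTime)   # true
isbitstype(Reading)    # true
isbitstype(Labelled)   # false, because String is not an isbits type
```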

lungben · May 19, 2020, 8:38pm

I just tried it out - very fast and small file sizes.
Great, thanks for your work!
