Save and restore DataFrame, and serialize()/deserialize()


#1

dear j wizards: is there a mechanism to save and restore a DataFrame in binary format to disk, with all type information? serialize() and deserialize() do not work for DataFrame in DataFrames.jl 0.11.5.

incidentally, serialize() and deserialize() only have an IOStream-based interface, not a filename-based one. a filename version would be nice to have built in, and should be easy to implement.

regards, /iaw


#2

#3

Also worth trying: https://github.com/MikeInnes/BSON.jl


#4

JLD2 seems to work:

julia> using DataFrames

julia> df = DataFrame(A = [1,2,3], B=[1.0, 2.0, 3.0], C=["1", "2", "3"])
3×3 DataFrames.DataFrame
│ Row │ A │ B   │ C   │
├─────┼───┼─────┼─────┤
│ 1   │ 1 │ 1.0 │ "1" │
│ 2   │ 2 │ 2.0 │ "2" │
│ 3   │ 3 │ 3.0 │ "3" │

Save:

julia> using JLD2

julia> jldopen("df.jld2", "w") do file
         file["df"] = df
       end;

Load:

julia> jldopen("df.jld2") do file
           file["df"]
       end
3×3 DataFrames.DataFrame
│ Row │ A │ B   │ C   │
├─────┼───┼─────┼─────┤
│ 1   │ 1 │ 1.0 │ "1" │
│ 2   │ 2 │ 2.0 │ "2" │
│ 3   │ 3 │ 3.0 │ "3" │

#5

Feather is also quite nice, and is a standard format that is also available in other languages. I have a PR to it which drastically speeds it up and offers some really nice random access features. I think it is having to wait on some other things in the ecosystem to mature a little, but I hope to get it merged at some point in the not so distant future. The existing master should work fine though.


#6

thanks, everyone. I think the *SON formats lose type information.

JLD2 looks good. alas, is this specialized and possibly slower overkill? (it seems designed as an interchange format, rather than as a "native" representation.)

Would it not be desirable/easy(?) to have a standard function, like serialize (or an extension of serialize), that is the generic equivalent of copy() (or deepcopy()), but that binary-writes the full internal julia object representation to disk rather than to another memory location? (Could it then be read again?) I do not know the julia internals, so this may be an absurd and ignorant suggestion.

related question: Is it possible to read back and evaluate the output of dump()?

regards,

/iaw


#7

Isn't that exactly what serialize already does?

julia> serialize(open("my_data", "w"), df)

julia> deserialize(open("my_data"))
3×3 DataFrames.DataFrame
│ Row │ A │ B   │ C   │
├─────┼───┼─────┼─────┤
│ 1   │ 1 │ 1.0 │ "1" │
│ 2   │ 2 │ 2.0 │ "2" │
│ 3   │ 3 │ 3.0 │ "3" │

The disadvantage of serialize is that it's doing exactly as you suggest and writing a fairly literal dump of Julia's model of an object to disk. That means that loading that file on a different Julia version is likely to be impossible.
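For readers on Julia 1.x, where serialize and deserialize live in the Serialization standard library rather than in Base (as in the session above), a minimal round-trip sketch might look like the following. It uses a plain NamedTuple of typed vectors as a stand-in for the DataFrame so that it runs without extra packages; the name `table` and the temporary path are illustrative assumptions, not part of the original example.

```julia
using Serialization  # standard library on Julia 1.x

# Stand-in for the DataFrame above: three typed columns (illustrative data).
table = (A = [1, 2, 3], B = [1.0, 2.0, 3.0], C = ["1", "2", "3"])

path = tempname()  # hypothetical scratch file

# Write: the do-block form closes the file when the block finishes.
open(path, "w") do io
    serialize(io, table)
end

# Read back (in the same Julia version that wrote the file).
restored = open(deserialize, path)

# Show the column types after the round trip.
println(typeof(restored.A), " ", typeof(restored.B), " ", typeof(restored.C))
```

The same pattern should apply to a DataFrame, with the caveat from above that the file is only expected to load in the Julia (and package) versions that wrote it.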

JLD2, JSON, BSON, Feather, etc. all use widely-supported formats so that it's more likely you can read your data in the future or in another language. As for performance, despite the fact that it's not as literal as serialize, JLD2 is quite fast, and Simon showed in his JuliaCon 2017 presentation that JLD2 can even be faster than Julia's built-in serialize in some cases: https://www.youtube.com/watch?v=qc0dw-cAmLY

I would suggest trying out one of the approaches and measuring the performance for yourself. As for reading the output of dump(), you could write a parser for it, but I would not recommend it. There's no guarantee that dump() prints something that you could eval (it generally doesn't) and there's no guarantee that the type you're using has a constructor that matches the way data is printed by dump(). It really doesn't seem worthwhile.


#8

thx, rdeits. embarrassingly, I had omitted the first argument to serialize, and then misinterpreted the resulting error as serialize not being able to work on dataframes.

I was also surprised about your sticking the open into the first argument. interesting. does it flush and close the file at IOStream() destruction time?

regards,

/iaw


#9

does it flush and close the file at IOStream() destruction time?

No, the file may not be closed until Julia decides to finalize it. For a more robust version, you can use the open() do...end construction:

julia> open("my_data", "w") do file
         serialize(file, df)
       end

which does guarantee that the file will be flushed and closed at the end of the do...end block. This is in the help for ?open:

open(f::Function, args...)

  Apply the function f to the result of open(args...) and close the resulting file descriptor upon completion.
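
A small sketch of that guarantee, assuming Julia 1.x; the file is a temporary path chosen for illustration, and the stream handle is deliberately returned from the block only so its state can be inspected afterwards.

```julia
path = tempname()  # hypothetical scratch file

# The do-block form: `open` calls the block with the stream and then
# closes the stream, even if the block throws.
leaked = open(path, "w") do io
    write(io, "hello")
    io  # return the handle only so we can check it below
end

println(isopen(leaked))      # the stream is already closed at this point
println(read(path, String))  # the data has been flushed to disk
```

By contrast, a bare serialize(open("my_data", "w"), df) leaves closing the stream to the finalizer, so there is no fixed point at which the data is known to be on disk.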

#10

pity. the do end construction is a little more verbose. :frowning:

regards,

/iaw


#11

True, but you can write the same thing as a one-liner if you want:

open(file -> serialize(file, df), "my_data", "w")

and keep the guarantees that the file will be immediately closed.

There's also a proposed shorthand anonymous function syntax that might make it into v1.0 or v1.1 that would make serialize(_, df) equivalent to file -> serialize(file, df), so you could do:

open(serialize(_, df), "my_data", "w")

which is pretty good, I think :smile:
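
Independently of that proposal, a similar effect seems achievable on Julia 1.x with `Base.Fix2`, which fixes the second argument of a two-argument function. This is only a sketch, assuming serialize comes from the Serialization standard library; `payload` and the temporary path are illustrative names.

```julia
using Serialization  # standard library on Julia 1.x

payload = (A = [1, 2, 3], C = ["1", "2", "3"])  # illustrative data
path = tempname()                               # hypothetical scratch file

# Base.Fix2(serialize, payload) behaves like io -> serialize(io, payload),
# so `open` calls it with the stream and closes the file afterwards.
open(Base.Fix2(serialize, payload), path, "w")

# Read the value back to confirm the file is complete.
restored = open(deserialize, path)
```

It is more verbose than the underscore form, but it keeps the same closing guarantee as the explicit anonymous function.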


#12

ah, I see. this is what the open(f::Function, ...) method is for. that had not clicked before. (julia innovation, I think.)

and the _ anonymous notation is even nicer.

/iaw


#13

With Julia moving so rapidly, the serialization format changes accordingly. If you dig up these files in a few months, your future self will thank you for choosing JLD2 (or a similar format, like BSON).