dear j wizards: is there a mechanism to save and restore a DataFrame in binary format to disk, with all type information? serialize() and deserialize() do not work for DataFrame in DataFrames.jl 0.11.5.
incidentally, serialize() and deserialize() only have an IOStream-based interface, not a filename-based one. a filename version would be nice to have built in, and should be easy to implement.
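For what it's worth, a filename-based wrapper is only a couple of lines. The following is a minimal sketch, assuming Julia 1.x, where serialize and deserialize live in the Serialization standard library. The helper names write_serialized and read_serialized are made up for illustration, and a NamedTuple of column vectors stands in for the DataFrame so that nothing outside the standard library is needed.

```julia
using Serialization

# Hypothetical helper names; neither is part of Base or DataFrames.jl.
write_serialized(path::AbstractString, x) = open(io -> serialize(io, x), path, "w")
read_serialized(path::AbstractString) = open(deserialize, path, "r")

# Stand-in for a DataFrame: a NamedTuple of typed column vectors.
table = (id = Int32[1, 2, 3], name = ["a", "b", "c"], score = [1.5, missing, 3.0])

path = tempname()
write_serialized(path, table)
restored = read_serialized(path)

# Element types survive the round trip, including the Union with Missing.
@show eltype(restored.id) eltype(restored.score)
rm(path)
```

On Julia 1.1 and later the standard library also accepts a filename directly, as serialize(path, x) and deserialize(path), so the wrappers are only needed on older versions.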
Feather is also quite nice, and is a standard format that is also available in other languages. I have a PR to it which drastically speeds it up and offers some really nice random access features. I think it is having to wait on some other things in the ecosystem to mature a little, but I hope to get it merged at some point in the not so distant future. The existing master should work fine though.
thanks, everyone. I think the *SON formats lose type information.
JLD2 looks good. alas, is this specialized and possibly slower overkill? (it seems designed as an interchange format, rather than as a "native" representation.)
Would it not be desirable/easy(?) to have a standard function, like serialize (or an extension of serialize), that is the generic equivalent of copy() (or deepcopy()), but binary-writes the full internal julia object representation to disk rather than to another memory location? (Could it then be read again?) I do not know the julia internals, so this may be an absurd and ignorant suggestion.
related question: Is it possible to read back and evaluate the output of dump()?
The disadvantage of serialize is that it's doing exactly as you suggest and writing a fairly literal dump of Julia's model of an object to disk. That means that loading that file on a different Julia version is likely to be impossible.
JLD2, JSON, BSON, Feather, etc. all use widely-supported formats so that it's more likely you can read your data in the future or in another language. As for performance, despite the fact that it's not as literal as serialize, JLD2 is quite fast, and Simon showed in his JuliaCon 2017 presentation that JLD2 can even be faster than Julia's built-in serialize in some cases: JuliaCon 2017 | JLD2: High-Performance Serialization | Simon Kornblith - YouTube
I would suggest trying out one of the approaches and measuring the performance for yourself. As for reading the output of dump(), you could write a parser for it, but I would not recommend it. There's no guarantee that dump() prints something that you could eval (it generally doesn't) and there's no guarantee that the type you're using has a constructor that matches the way data is printed by dump(). It really doesn't seem worthwhile.
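Measuring it yourself could look something like the sketch below. It assumes Julia 1.x with the Serialization standard library and times only the built-in serialize; the helper names are hypothetical, and the NamedTuple is a stand-in for your own DataFrame. The same pattern should carry over to JLD2, BSON or Feather by swapping the two helper bodies for that package's save and load calls.

```julia
using Serialization

# Hypothetical helpers, not part of Base or any package.
write_serialized(path, x) = open(io -> serialize(io, x), path, "w")
read_serialized(path) = open(deserialize, path, "r")

# Stand-in for a DataFrame: a NamedTuple of column vectors.
n = 100_000
table = (x = rand(n), label = [string("row", i) for i in 1:n])

path = tempname()
write_serialized(path, table)   # warm-up run, so compilation is not timed
read_serialized(path)

t_write = @elapsed write_serialized(path, table)
t_read = @elapsed read_serialized(path)
println("write: ", t_write, " s   read: ", t_read, " s   size: ", filesize(path), " bytes")
rm(path)
```

A single timed run is noisy, so repeat it a few times, and use data of roughly the size and column types you actually work with before drawing conclusions.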
thx, rdeits. embarrassingly, I had omitted the first argument to serialize, and then misinterpreted the resulting error as serialize not being able to work on dataframes.
I was also surprised about your sticking the open into the first argument. interesting. does it flush and close the file at IOStream() destruction time?
No, the file may not be closed until Julia decides to finalize it. For a more robust version, you can use the open() do...end construction:
julia> open("my_data", "w") do file
           serialize(file, df)
       end
which does guarantee that the file will be flushed and closed at the end of the do...end block. This is in the help for ?open:
open(f::Function, command, mode::AbstractString="r", stdio=DevNull)
Similar to open(command, mode, stdio), but calls f(stream) on the resulting read or write stream, then closes the
stream and waits for the process to complete. Returns the value returned by f.
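To make that guarantee concrete, here is a small sketch, assuming Julia 1.x with the Serialization standard library. A Dict stands in for df, and stream_ref is an illustrative name that exists only so the stream can be inspected after the block has exited. The second half spells out the try/finally pattern that the function form of open applies for you.

```julia
using Serialization

data = Dict("a" => 1, "b" => 2)   # stand-in for df
path = tempname()

# do-block form: the stream is closed as soon as the block exits.
stream_ref = Ref{IO}()
open(path, "w") do file
    stream_ref[] = file           # kept only so we can inspect it afterwards
    serialize(file, data)
end
@show isopen(stream_ref[])

# Roughly the same thing written out by hand:
file = open(path, "w")
try
    serialize(file, data)
finally
    close(file)                   # runs even if serialize throws
end
```

The finally clause is the important part: the file gets flushed and closed even when the write fails halfway, which a bare serialize(open("my_data", "w"), df) does not promise.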
True, but you can write the same thing as a one-liner if you want:
open(file -> serialize(file, df), "my_data", "w")
and keep the guarantees that the file will be immediately closed.
There's also a proposed shorthand anonymous function syntax that might make it into v1.0 or v1.1 that would make serialize(_, df) equivalent to file -> serialize(file, df), so you could do:
open(serialize(_, df), "my_data", "w")
With Julia moving so rapidly, the serialization format changes accordingly. If you dig up these files in a few months, your future self will thank you for choosing JLD2 (or a similar format, like BSON).