I serialized a DataFrame from DataFrames.jl@1.3. Can I convert it to be compatible with DataFrames.jl@1.4 so I can load it in an environment with the new version?
One option is to read it in with the old version, write it back out as an Arrow file, and then read that file with the newer DataFrames.
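A minimal sketch of that Arrow round trip, assuming the columns hold types Arrow.jl can store directly. The file name `df.arrow` and the small stand-in data frame are placeholders, and the two halves are meant to run in separate Julia sessions (old environment first, then the new one).

```julia
using Arrow, DataFrames

# Step 1: run in the environment that still has DataFrames.jl 1.3.
# `df` is a stand-in for the data frame you deserialized there.
df = DataFrame(id = [1, 2, 3], label = ["a", "b", "c"])
Arrow.write("df.arrow", df)

# Step 2: run in the environment with DataFrames.jl 1.4.
# The resulting columns are backed by the Arrow file and are read-only;
# call copy(df_new) if you need ordinary mutable vectors.
df_new = DataFrame(Arrow.Table("df.arrow"))
```

The Arrow file does not store any DataFrames.jl internals, which is why the DataFrames.jl version no longer matters on the reading side.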
This dataframe has complicated column types that aren’t natively supported by Arrow, so I’d have to write a lot of extra code to make that work.
Arrow should support self-descriptive structs: the file carries metadata about them, so as long as your struct definition can be loaded again, you can load the same Arrow file again.
Do the following:
- convert the data frame using `Tables.columntable`
- serialize the produced `NamedTuple`
- update to DataFrames.jl 1.4
- deserialize the produced `NamedTuple`
- convert the `NamedTuple` to a `DataFrame`
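The steps above can be sketched as follows. The data frame and the file name `df.bin` are stand-ins, and in practice the two halves run in different sessions with the package update in between. Note that Julia's `Serialization` format is not guaranteed to be readable across Julia versions, so this assumes the same Julia version on both sides.

```julia
using DataFrames, Tables, Serialization

# --- under DataFrames.jl 1.3 ---
df = DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])  # stand-in for your data
cols = Tables.columntable(df)   # NamedTuple of plain column vectors
serialize("df.bin", cols)

# --- under DataFrames.jl 1.4 (fresh session after updating) ---
cols_restored = deserialize("df.bin")
df_restored = DataFrame(cols_restored)
```

The point of the detour is that a `NamedTuple` of vectors contains no `DataFrame` struct, so nothing in the file depends on how a given DataFrames.jl version lays out its types.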
@bkamins is there a programmatic way to do this, if I have a couple of dozen JLD files based on DataFrames 1.3?
An example.
Under DataFrames.jl 1.3, for all files you have:
- read the `filename.jld` JLD file into a `df` data frame
- run `serialize("filename.bin", Tables.columntable(df))`
Now update to DataFrames.jl 1.4 and for all files you have:
- run `df = DataFrame(deserialize("filename.bin"))`
- write `df` as a JLD file
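For a couple of dozen files, the two halves could be wrapped in loops along these lines. This is only a sketch: `export_all` and `import_all` are hypothetical helper names, and it assumes every JLD file stores its data frame under the variable name `"df"`. Run `export_all` under DataFrames.jl 1.3, restart Julia with DataFrames.jl 1.4, then run `import_all`.

```julia
using DataFrames, Tables, JLD, Serialization

# Old environment: write a .bin file next to every .jld file in `dir`.
function export_all(dir; varname = "df")
    for file in filter(endswith(".jld"), readdir(dir; join = true))
        df = load(file, varname)  # read one named variable from the JLD file
        serialize(splitext(file)[1] * ".bin", Tables.columntable(df))
    end
end

# New environment: rebuild each data frame and save it as JLD in `outdir`,
# so the original files are never overwritten.
function import_all(dir, outdir; varname = "df")
    mkpath(outdir)
    for file in filter(endswith(".bin"), readdir(dir; join = true))
        df = DataFrame(deserialize(file))
        name = splitext(basename(file))[1] * ".jld"
        save(joinpath(outdir, name), varname, df)
    end
end
```

Writing the converted files to a separate directory keeps the originals around in case something goes wrong halfway through.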
Thanks. Do I need to restart Julia after/before updating the DataFrames pkg? I think so.
EDIT: I note that I disabled Revise, as it gave me tons of complaints.
Yes, you need to restart Julia.
While this is your solution, both it and converting to Arrow seem very inconvenient, even when done programmatically. Can’t DataFrames.jl 1.4 (or 1.5?) do this for you when you open older data (by default, or as opt-in if that is problematic)? What is done in other languages, e.g. pandas (or pickle): do they just support data saved by older versions?
Along the lines of your idea: I keep thinking that a smart person would be able to start up multiple Julia sessions that talk to each other (the most primitive variant would be through binary files on disk), one using DataFrames 1.3 and the other using DataFrames 1.4.
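The most primitive variant of that idea could look like the sketch below: a driver that starts two separate Julia processes, each with its own project environment, exchanging data only through a file on disk. The environment directories `old` and `new` and the script names `old.jl` and `new.jl` are assumptions for illustration, and `conversion_commands` is a hypothetical helper.

```julia
# Build the two commands; each process gets its own project environment,
# so one can load DataFrames.jl 1.3 and the other DataFrames.jl 1.4.
function conversion_commands(; old_env = "old", new_env = "new",
                             old_script = "old.jl", new_script = "new.jl")
    julia = Base.julia_cmd()  # the Julia executable running this driver
    return [
        `$julia --project=$old_env $old_script`,  # writes the intermediate file
        `$julia --project=$new_env $new_script`,  # reads it back and saves
    ]
end

# To actually run both steps in order (the scripts must exist):
# foreach(run, conversion_commands())
```

Because each step is a fresh process, there is no need to restart anything by hand between the two package versions.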
What you ask for cannot be done in DataFrames.jl as it is not DataFrames.jl that “opens” the stored data. It is JLD.jl in this case that would need to add such an extension (which is not very likely I assume).
Indeed the process is a bit cumbersome, but it is a one-time action. Here are scripts that do everything that is needed. I assume that you want to convert "old.jld" to "new.jld" (not tested - I have written it from memory).
main.sh
# run from the directory that holds old.jld, old.jl and new.jl
mkdir -p old new
julia old.jl
julia new.jl
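old.jl
using Pkg
Pkg.activate("old")
Pkg.add(name="DataFrames", version="1.3.6")
Pkg.add(name="JLD", version="XXXX") # JLD version you used to save data
Pkg.add("Tables")
using DataFrames
using JLD
using Serialization
using Tables
# load(file) alone returns a Dict of all stored variables, so name the one to read;
# "df" assumes the data frame was saved under that name
df = load("old.jld", "df")
# a NamedTuple of plain vectors does not depend on DataFrames.jl internals
serialize("df.bin", Tables.columntable(df))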
new.jl
using Pkg
Pkg.activate("new")
Pkg.add(name="DataFrames", version="1.4.2")
Pkg.add(name="JLD", version="XXXX") # JLD version you want to use to save data with
using DataFrames
using JLD
using Serialization
df = DataFrame(deserialize("df.bin")) # or whatever name you want to use
save("new.jld", "df", df) # save(file, name, value): store df under the name "df"
You really should be using Arrow.jl; you just get a DataFrame back every time.
I assume @bernhard uses JLD because of custom columns that Arrow.jl cannot represent properly.
OK, I see it wasn’t strictly a problem with DataFrames.jl. I had assumed from “I serialized a DataFrame from DataFrames.jl@1.3” that some DataFrames.jl function was used. It’s a one-time thing for now, until maybe a later version. If it’s a problem with Julia’s (or JLD’s) serialization, it is even more important to know whether something better can be done with it, or whether the problem should be avoided by not using it, and then what to use instead.
I recall that pandas has a function to store dataframes. I’m still curious whether this is a problem in other languages. Should DataFrames.jl have such a function that would just work?