Convert DataFrames 1.3 DataFrame to DataFrames 1.4

I serialized a DataFrame from DataFrames.jl@1.3. Can I convert it to be compatible with DataFrames.jl@1.4 so I can load it in an environment with the new version?

One option is to read it in and write it back out as an Arrow file, then read that file with the newer DataFrames.

This dataframe has complicated column types that aren’t natively supported by Arrow, so I’d have to write a lot of extra code to make that work.

Arrow should support self-describing structs: the file carries metadata about them, so as long as your struct definition can be loaded again, you can read the same Arrow file back.

Do the following:

  • convert data frame using Tables.columntable
  • serialize produced NamedTuple
  • update to DataFrames.jl 1.4
  • deserialize produced NamedTuple
  • convert NamedTuple to a DataFrame
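
To see why the steps above work: Tables.columntable returns a plain NamedTuple with one vector per column, so the serialized file presumably no longer depends on how a given DataFrames.jl version lays out its DataFrame type. Here is a minimal sketch using only the Serialization standard library. The hand-written NamedTuple stands in for the output of Tables.columntable(df); the column names and the temporary file location are made up for illustration.

```julia
using Serialization

# Stand-in for `Tables.columntable(df)`: a NamedTuple with one vector per column.
cols = (id = [1, 2, 3], name = ["a", "b", "c"], score = [0.5, missing, 1.5])

# Session with the old DataFrames.jl: write the NamedTuple to disk.
path = joinpath(mktempdir(), "df.bin")
serialize(path, cols)

# Session with the new DataFrames.jl: read the NamedTuple back.
restored = deserialize(path)

# In a real session you would finish with `df = DataFrame(restored)`.
println(keys(restored))
```

Columns holding custom element types still need their type definitions loaded in both sessions, and both sessions should run the same Julia version, since the Serialization format is not guaranteed to be stable across Julia releases.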

@bkamins is there a programmatic way to do this, if I have a couple of dozen JLD files based on DataFrames 1.3?

An example.

Under DataFrames.jl 1.3 for all files you have

  1. read filename.jld JLD file to df data frame
  2. run serialize("filename.bin", Tables.columntable(df))

Now update to DataFrames.jl 1.4 and for all files you have:

  1. run df = DataFrame(deserialize("filename.bin"))
  2. write df as JLD file
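
For a couple of dozen files, the first half can be wrapped in a loop. This is a sketch under assumptions: dump_all is a hypothetical helper name, and load_table is whatever function turns one file into a column table, for example `f -> Tables.columntable(load(f, "df"))` if each data frame was saved under the name "df".

```julia
using Serialization

# Hypothetical helper: for every file in `dir` whose name ends in `ext`,
# call `load_table(file)` and serialize the result next to it as "<name>.bin".
# Returns the paths of the files it wrote.
function dump_all(load_table, dir::AbstractString; ext::AbstractString = ".jld")
    written = String[]
    for file in readdir(dir; join = true)
        endswith(file, ext) || continue          # skip unrelated files
        out = first(splitext(file)) * ".bin"     # "x.jld" -> "x.bin"
        serialize(out, load_table(file))
        push!(written, out)
    end
    return written
end
```

Under DataFrames.jl 1.3 you would call it once on the directory holding the JLD files; after updating and restarting Julia, a similar loop over the ".bin" files with `DataFrame(deserialize(file))` gives back the data frames to save again.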

Thanks. Do I need to restart Julia after/before updating the DataFrames pkg? I think so.
EDIT: I note that I disabled Revise, as it gave me tons of complaints.

Yes, you need to restart Julia.

I accept that this is your solution, but it (like converting to Arrow) seems very inconvenient, even when done programmatically. Couldn’t DataFrames.jl 1.4 (or 1.5?) do the conversion for you when you open older data (by default, or as an opt-in if that is problematic)? What do other languages do, e.g. pandas (or pickle)? Do they just support data saved by older versions?


Along the lines of your idea: I keep thinking that a smart person could start up multiple Julia sessions that talk to each other (the most primitive variant would be through binary files on disk), one using DataFrames 1.3 and the other using DataFrames 1.4.


What you ask for cannot be done in DataFrames.jl as it is not DataFrames.jl that “opens” the stored data. It is JLD.jl in this case that would need to add such an extension (which is not very likely I assume).

Indeed the process is a bit cumbersome, but it is a one-time action. Here are scripts that do everything that is needed. I assume that you want to convert "old.jld" to "new.jld" (not tested, I have written it from memory).

main.sh

# run from the directory that contains old.jld, old.jl and new.jl
mkdir -p old new
julia old.jl
julia new.jl

old.jl

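using Pkg
Pkg.activate("old") # separate environment pinned to the old DataFrames.jl
Pkg.add(name="DataFrames", version="1.3.6")
Pkg.add(name="JLD", version="XXXX") # JLD version you used to save data
Pkg.add("Tables") # so that Tables.columntable can be called by name
using DataFrames
using JLD
using Serialization
using Tables
df = load("old.jld", "df") # assumes the data frame was saved under the name "df"
serialize("df.bin", Tables.columntable(df)) # store a plain NamedTuple of columns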

new.jl

using Pkg
Pkg.activate("new") # separate environment with the new DataFrames.jl
Pkg.add(name="DataFrames", version="1.4.2")
Pkg.add(name="JLD", version="XXXX") # JLD version you want to use to save data with
using DataFrames
using JLD
using Serialization
df = DataFrame(deserialize("df.bin")) # or whatever name you want to use
save("new.jld", "df", df) # save needs both the variable name and the value

You really should be using Arrow.jl, so that you just get a DataFrame back every time.


I assume @bernhard uses JLD because of custom columns that Arrow.jl cannot represent properly.


OK, I see it wasn’t strictly a problem with DF. I had assumed from “I serialized a DataFrame from DataFrames.jl@1.3” that some DF function was used. It’s a one-time thing now, until maybe a later version. But if it’s a problem with Julia’s (or JLD’s) serialization, it’s even more important to know whether something better can be done about it, or how to avoid the problem by not using it, and what to use instead.

I recall that pandas has a function to store dataframes. I’m still curious whether this is a problem in other languages. Should DF have such a function that would just work?