Saving a DataFrame containing vectors

I am trying to save a DataFrame where one of the columns (say “e_list”) contains vectors of length 40.
I tried the following packages, and none of them seem to work.

  • CSV
    When the .csv file written by CSV is read back, the vectors come back as strings.
  • JLD
    JLD takes forever to open the saved file (it never finished opening).
  • HDF5
    Will not let me save the DataFrame.
  • Parquet2
    Saves the DataFrame, but when reading it back the vectors come back as Union{AbstractDict{String, Any}, Vector{Any}}. This makes re-saving the DataFrame impossible, since Union{AbstractDict{String, Any}, Vector{Any}} is not a type Parquet2 can write.

What do you think would be the right way to do this?
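A minimal sketch of the kind of frame I'm working with (column names are just for illustration):

    using DataFrames

    # A toy version of the data: one id column plus a column of length-40 vectors.
    df = DataFrame(id = 1:3, e_list = [rand(40) for _ in 1:3])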

Did you try just serializing/deserializing them?
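If that means the Serialization standard library (my assumption), a minimal round trip on the toy frame above would look like:

    using Serialization, DataFrames

    df = DataFrame(id = 1:3, e_list = [rand(40) for _ in 1:3])

    serialize("df.jls", df)        # write the whole DataFrame to disk
    df2 = deserialize("df.jls")    # vectors come back as Vector{Float64}

    # Caveat: the serialized format is tied to the Julia version and package
    # layout, so it is better for caching than for long-term archival.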

Did you try JLD2? To be honest, I didn’t realize JLD was still maintained.
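For reference, a minimal JLD2 round trip (using the same toy frame as above) would be:

    using JLD2, DataFrames

    df = DataFrame(id = 1:3, e_list = [rand(40) for _ in 1:3])

    jldsave("df.jld2"; df)          # the keyword name becomes the dataset key
    df2 = load("df.jld2", "df")     # read the DataFrame back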

Yeah, CSV is not a particularly flexible format. That said, you could try calling Meta.parse() on the strings after they’re loaded.
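Something along these lines (note that Meta.parse only produces an expression, so you still need eval, which is only safe on data you trust; the file name is just a placeholder):

    using CSV, DataFrames

    df = CSV.read("data.csv", DataFrame)

    # Cells come back as strings like "[0.1, 0.2, ...]"; parse and evaluate them.
    df.e_list = [eval(Meta.parse(s)) for s in df.e_list]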

The option I would probably use, if your vectors are all the same length, is to stack / unstack them. That is, if you have 5 vectors of length 10, expand that to 50 rows, adding a column for vector_num or something similar.
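A sketch of that idea, using DataFrames' flatten to build the long format and groupby/combine to put the vectors back together (column names here are just examples):

    using DataFrames, CSV

    df = DataFrame(id = 1:3, e_list = [rand(40) for _ in 1:3])

    # Long format: one row per element, with a column recording the position.
    long = flatten(df, :e_list)
    long.vector_idx = repeat(1:40, nrow(df))
    CSV.write("long.csv", long)

    # Reassemble the vectors (sort by vector_idx first if row order isn't guaranteed).
    restored = combine(groupby(long, :id), :e_list => (v -> [collect(v)]) => :e_list)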

You could open an issue for this - I know ExpandingMan is trying hard to make this package functional.

You might also try Arrow.jl, though I don’t know whether its support for in-cell vector types is any better.

It does support arbitrarily nested vectors (at least all the variants I have tried).

I am patching Parquet2 so that it will try bson/json for arrays with eltype Union{AbstractDict,AbstractVector}. However, Parquet2.jl does not truly support nested data types, and defaulting to json/bson is a bit of an undesirable “hack”. In fact, at some point I’m going to have to change it so that this doesn’t happen by default (though there will be an option to restore the current behavior). In the meantime, I agree that it makes no sense for it to refuse to write its own output.

I strongly suggest using Arrow.jl if possible, as Arrow supports arbitrarily nested data types in its own binary format and allows type-stable reading and writing, which Parquet2.jl can’t do unless and until support for nested data structures is added.
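For example, a minimal Arrow round trip (toy frame again; the columns read back are Arrow-backed views, so copy them if you need plain Vectors):

    using Arrow, DataFrames

    df = DataFrame(id = 1:3, e_list = [rand(40) for _ in 1:3])

    Arrow.write("df.arrow", df)               # vector column maps to an Arrow list type
    df2 = DataFrame(Arrow.Table("df.arrow"))  # e_list comes back as a list-style column

    # Materialize plain Julia vectors if needed:
    df2.e_list = [collect(v) for v in df2.e_list]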
