I have a DataFrame with a column of 3x3 Float32 matrices. I would like to save this to a file and later read it back, preserving this column. I tried this round trip with several tools:
CSV => converts to String
JSONTables => converts to Vector{Any} containing Vector{Any}
Arrow => converts to Vector{Float32}, not bad!
Avro => "ArgumentError: type does not have a definite number of fields", this seems like I am probably using it wrong.
Parquet2 => "ArgumentError: type Matrix{Float32} does not have a corresponding parquet type"
Is there something else I could try that knows how to preserve matrices? If not, I will probably use CSV, with a transform before saving that unpacks each matrix into separate columns and a corresponding re-pack on load.
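The unpack/re-pack fallback mentioned above could look roughly like the sketch below. The helper names `unpack_matrices` and `repack_matrices` and the `m1`..`m9` column names are made up for illustration; it assumes every entry is a 3x3 Float32 matrix and flattens in Julia's column-major order.

```julia
using DataFrames

# Hypothetical helper: replace the matrix column with 9 scalar columns m1..m9.
# Assumes every entry of `col` is a 3x3 matrix; m[k] uses column-major linear indexing.
function unpack_matrices(df::DataFrame; col::Symbol = :matrix, n::Int = 9)
    out = select(df, Not(col))            # keep any other columns as they are
    for k in 1:n
        out[!, Symbol("m", k)] = [m[k] for m in df[!, col]]
    end
    return out
end

# Hypothetical helper: rebuild the matrix column from the scalar columns m1..m9.
function repack_matrices(flat::DataFrame; col::Symbol = :matrix, dims = (3, 3))
    flat_names = [Symbol("m", k) for k in 1:prod(dims)]
    out = select(flat, Not(flat_names))   # keep any other columns as they are
    out[!, col] = [reshape(Float32[row[c] for c in flat_names], dims)
                   for row in eachrow(flat)]
    return out
end

df = DataFrame(matrix = [rand(Float32, 3, 3) for _ in 1:5])
flat = unpack_matrices(df)     # 5 rows, 9 Float32 columns: safe to write as plain CSV
back = repack_matrices(flat)   # column is a Vector{Matrix{Float32}} again
```

The flattened frame contains only scalar columns, so it can go through CSV.write and CSV.read unchanged. On load the columns would most likely come back as Float64 unless the column types are specified, which is why the re-pack step converts to Float32 explicitly.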
Serialize it with the Serialization standard library:
using DataFrames, Random
num_rows = 5
data = [rand(Float32, 3, 3) for _ in 1:num_rows]
df = DataFrame(matrix = data)
# Save and reload the DataFrame, preserving the type of the "matrix" column (Vector{Matrix{Float32}})
using Serialization
fname = "df_serialized.bin"
# Serialize the DataFrame to preserve column types
open(fname, "w") do io
    serialize(io, df)
end
# To read back (as an example):
df2 = open(fname, "r") do io
    deserialize(io)
end
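To confirm that the round trip really preserves the column, a quick check along these lines may help. This is a minimal sketch that writes to a temporary path rather than `df_serialized.bin` and uses the filename methods of `serialize` and `deserialize`. Note that the Serialization format is not guaranteed to be readable across different Julia versions, so it suits short-term storage better than archival.

```julia
using DataFrames, Serialization

df = DataFrame(matrix = [rand(Float32, 3, 3) for _ in 1:5])

path = tempname()            # temporary file, chosen here only for the demonstration
serialize(path, df)          # filename method: opens, writes and closes the file
df2 = deserialize(path)      # DataFrames must be loaded in the reading session

# Values and element type both survive the round trip.
@assert df2.matrix == df.matrix
@assert eltype(df2.matrix) == Matrix{Float32}
```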
julia> df
5×1 DataFrame
 Row │ matrix
     │ Array…
─────┼───────────────────────────────────
   1 │ Float32[0.606444 0.947109 0.4793…
   2 │ Float32[0.349336 0.322748 0.6130…
   3 │ Float32[0.595545 0.502662 0.4875…
   4 │ Float32[0.442332 0.81498 0.42264…
   5 │ Float32[0.672831 0.567095 0.7454…
julia> df2
5×1 DataFrame
 Row │ matrix
     │ Array…
─────┼───────────────────────────────────
   1 │ Float32[0.606444 0.947109 0.4793…
   2 │ Float32[0.349336 0.322748 0.6130…
   3 │ Float32[0.595545 0.502662 0.4875…
   4 │ Float32[0.442332 0.81498 0.42264…
   5 │ Float32[0.672831 0.567095 0.7454…