Save Dataframe in file and read it again

Hello,

right now I am trying to save a Dataframe into a file and read it again. For this I used the CSV package. My dataframe has two columns where each entry is a vector. Here is an small example:

using DataFrames
using CSV

x = Vector{Array{Float64, 1}}()
y = Vector{Array{Float64, 1}}()
push!(x, [1,2,3])
push!(x, [4,5,6])
push!(y, [100,101,102])
push!(y, [103,104,105])

df = DataFrame(x=x, y=y)
CSV.write("export_df.csv", df)

test = CSV.read("export_df.csv")
println(typeof(df["x"][1]))
println(typeof(test["x"][1]))

The first output returns an Array{Float64,1} type, while type of test[“x”][1] returns a string. Can someone give me an advise how to fix this, such that I get an Array again?

If your columns are Vector of Arrays then I think you would need something like JLSO.jl, JLD.jl or JLD2.jl or just serialize it

using DataFrames
x = Vector{Array{Float64, 1}}()
y = Vector{Array{Float64, 1}}()
push!(x, [1,2,3])
push!(x, [4,5,6])
push!(y, [100,101,102])
push!(y, [103,104,105])

df = DataFrame(x=x, y=y)

using Serialization
serialize("ok.jls", df)
df2 = deserialize("ok.jls")

df2 == df # true

But if you actually want to store just Vector{Float64} etc then you can try the JDF.jl package? Other options are Parquet.jl and Feather.jl.

2 Likes

cc @quinnj on this. Reading through the docs CSV.File (which will replace CSV.read eventually), it looks like the type keyword you would use to specify types only works for small concrete types like Float64, Int etc.

I also think it’s tough to parse the results from test into an array.

My recommendation is to use DataFrame’s flatten function to save the data in “long” format and then use combine to put the result back into vectors after you read it in.

Presumably, the CSV format isn’t really designed to handle this. Serialization is a better option here IMO.

I would agree. The flatten option is pretty tough to implement and hard to generalize.

ulia> function save_to_csv(df)
           df.rownum = 1:nrow(df)
           long = stack(df, [:x, :y])
           longer = flatten(long, :value)
           CSV.write("out.csv", longer)
           return longer
       end

julia> function read_from_csv(file)
       df = CSV.File(file) |> DataFrame
       combine(groupby(df, "rownum")) do sdf
           (x = [sdf.value[sdf.variable .== "x"]], 
            y = [sdf.value[sdf.variable .== "y"]])
       end
       end