Save Dataframe in file and read it again

FrootLoops · May 28, 2020, 1:22pm

Hello,

right now I am trying to save a Dataframe into a file and read it again. For this I used the CSV package. My dataframe has two columns where each entry is a vector. Here is an small example:

using DataFrames
using CSV

x = Vector{Array{Float64, 1}}()
y = Vector{Array{Float64, 1}}()
push!(x, [1,2,3])
push!(x, [4,5,6])
push!(y, [100,101,102])
push!(y, [103,104,105])

df = DataFrame(x=x, y=y)
CSV.write("export_df.csv", df)

test = CSV.read("export_df.csv")
println(typeof(df["x"][1]))
println(typeof(test["x"][1]))

The first output returns an Array{Float64,1} type, while type of test[“x”][1] returns a string. Can someone give me an advise how to fix this, such that I get an Array again?

xiaodai · May 28, 2020, 1:30pm

If your columns are Vector of Arrays then I think you would need something like JLSO.jl, JLD.jl or JLD2.jl or just serialize it

using DataFrames
x = Vector{Array{Float64, 1}}()
y = Vector{Array{Float64, 1}}()
push!(x, [1,2,3])
push!(x, [4,5,6])
push!(y, [100,101,102])
push!(y, [103,104,105])

df = DataFrame(x=x, y=y)

using Serialization
serialize("ok.jls", df)
df2 = deserialize("ok.jls")

df2 == df # true

But if you actually want to store just Vector{Float64} etc then you can try the JDF.jl package? Other options are Parquet.jl and Feather.jl.

pdeffebach · May 28, 2020, 1:32pm

cc @quinnj on this. Reading through the docs CSV.File (which will replace CSV.read eventually), it looks like the type keyword you would use to specify types only works for small concrete types like Float64, Int etc.

I also think it’s tough to parse the results from test into an array.

My recommendation is to use DataFrame’s flatten function to save the data in “long” format and then use combine to put the result back into vectors after you read it in.

xiaodai · May 28, 2020, 1:35pm

Presumably, the CSV format isn’t really designed to handle this. Serialization is a better option here IMO.

pdeffebach · May 28, 2020, 1:45pm

I would agree. The flatten option is pretty tough to implement and hard to generalize.

ulia> function save_to_csv(df)
           df.rownum = 1:nrow(df)
           long = stack(df, [:x, :y])
           longer = flatten(long, :value)
           CSV.write("out.csv", longer)
           return longer
       end

julia> function read_from_csv(file)
       df = CSV.File(file) |> DataFrame
       combine(groupby(df, "rownum")) do sdf
           (x = [sdf.value[sdf.variable .== "x"]], 
            y = [sdf.value[sdf.variable .== "y"]])
       end
       end

Topic		Replies	Views
Issues reading CSV file with array elements General Usage dataframes , csv	4	1779	September 6, 2021
Outputing/Inputing vectors in DataFrames General Usage dataframes , csv	2	468	January 17, 2023
DataFrames/CSV: how to read vectors from *.csv? General Usage	9	2848	March 26, 2021
Saving a DataFrame containing vectors Data question	4	294	May 27, 2023
[ANN] JDF.jl - Experimental Julia DataFrames serialization format Package Announcements	3	1428	January 19, 2020

Save Dataframe in file and read it again

Related topics