mike_k
March 24, 2021, 5:19pm
Dear folks,
unfortunately, I'm failing to read vectors back from a *.csv file. Here is an MWE:
using CSV, DataFrames
df = DataFrame( A=[[1,2,3], [4,5,6]] )
file = "my_test.csv"
CSV.write(file, df, delim=';')
new_df = DataFrame( CSV.File(file; delim=';' ) )
The problem is that column A in new_df is of type String:
julia> new_df
2×1 DataFrame
Row │ A
│ String
─────┼───────────
1 │ [1, 2, 3]
2 │ [4, 5, 6]
I also tried:
CSV.File( file; types=Dict(:A => Vector{Int}) )
and
CSV.File( file; types=Dict(:A => Int[]) )
but neither works. Do you have any ideas? Thanks in advance!
There currently isn’t a built-in way to do this with CSV.jl. Take a look at this issue comment for a suggested approach using a custom type.
Or, perhaps more simply, you could parse the data after the fact:
# read this inside out - strip off brackets, split by comma, then parse each sub-array to ints
julia> new_df.A_parsed = map(split.(strip.(new_df.A, Ref(['[', ']'])), ',')) do nums
parse.(Int64, nums)
end
2-element Array{Array{Int64,1},1}:
[1, 2, 3]
[4, 5, 6]
julia> new_df
2×2 DataFrame
Row │ A A_parsed
│ String Array…
─────┼──────────────────────
1 │ [1, 2, 3] [1, 2, 3]
2 │ [4, 5, 6] [4, 5, 6]
I do not know of any CSV library that would do this for you. Is this usage common in other languages?
mike_k
March 24, 2021, 7:28pm
I don’t know, but I think it would be a great feature. At the very least, a given CSV package should be able to “understand” the column types it wrote itself.
Edit: comment removed, didn’t read OP closely enough.
Not sure I would generally expect this to work, however.
CSV is inevitably a lossy format, so you might also consider Arrow.jl or some other binary format that can round-trip this data:
using Arrow
Arrow.write("test.arrow", df)
julia> DataFrame(Arrow.Table("test.arrow"))
2×1 DataFrame
Row │ A
│ Array…
─────┼───────────
1 │ [1, 2, 3]
2 │ [4, 5, 6]
Another way is:
df2 = CSV.File(file, delim=';')
eval.(Meta.parse.(df2.A))
but I don’t know if this is passé. (Note that eval-ing file contents is only advisable for data you trust.)
+1 to this. The kinds of places where it’s common to have columns of type array or map have all moved away from storing their data as CSVs. At best, you can resurrect the old Hive conventions with explicit collection-item delimiters: CREATE HIVEFORMAT TABLE - Spark 3.3.0 Documentation
cgeoga
March 25, 2021, 2:21pm
You could also just use Serialization.serialize and Serialization.deserialize from the standard library. From the docstrings, it looks like they don’t promise compatibility across Julia versions, and you can’t open the serialized files in a text editor to look at what’s inside. But for use cases where you don’t want heavy machinery like JLD/HDF5/Arrow/etc., that’s a perfectly convenient way to store an object with a fancy type.
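A minimal sketch of that round trip (the file name my_test.jls is just for illustration):

```julia
using Serialization, DataFrames

df = DataFrame(A = [[1, 2, 3], [4, 5, 6]])

# serialize writes an opaque binary representation of any Julia object
serialize("my_test.jls", df)

# deserialize reconstructs it, vector-typed column and all
df2 = deserialize("my_test.jls")
```

Here df2.A comes back as a Vector{Vector{Int64}}, no string parsing needed.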
In addition to the great suggestions above, you could also reformat as “tidy” data, e.g.
6×2 DataFrame
│ Row │ a │ index │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 4 │ 1 │
│ 5 │ 5 │ 2 │
│ 6 │ 6 │ 3 │
which CSV should handle fine.
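One way to produce and round-trip that shape, a sketch using DataFrames’ flatten plus groupby/combine (the :id column and the file name are just for illustration):

```julia
using CSV, DataFrames

df = DataFrame(A = [[1, 2, 3], [4, 5, 6]])

# tag each row, then expand each vector into one row per element
df.id = 1:nrow(df)
tidy = flatten(df, :A)  # 6×2 DataFrame with columns :A and :id

CSV.write("my_test_tidy.csv", tidy; delim=';')

# round trip: read back and regroup the elements into vectors
back = DataFrame(CSV.File("my_test_tidy.csv"; delim=';'))
restored = combine(groupby(back, :id), :A => (v -> [collect(v)]) => :A)
```

After the regroup, restored.A is again a vector of Int vectors; groupby keeps groups in order of first appearance, so the original row order is preserved.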