DataFrames/CSV: how to read vectors from *.csv?

Dear folks,
unfortunately, I fail to read vectors from a *.csv file. Here is an MWE:

using CSV, DataFrames

df = DataFrame( A=[[1,2,3], [4,5,6]] )
file = "my_test.csv"
CSV.write(file, df, delim=';')

new_df = DataFrame( CSV.File(file; delim=';' ) )

The problem is that column A in new_df is of type String:

julia> new_df
2×1 DataFrame
 Row │ A         
     │ String    
─────┼───────────
   1 │ [1, 2, 3]
   2 │ [4, 5, 6]

I also tried:

CSV.File( file; types=Dict(:A => Vector{Int}) )

and

CSV.File( file; types=Dict(:A => Int[]) )

but it does not work. Do you have any ideas? Thanks in advance!


There currently isn’t a built-in way to do this with CSV.jl. Take a look at this issue comment for a suggested approach using a custom type.

Or, perhaps more simply, you could parse the data after the fact:

# read this inside out - strip off brackets, split by comma, then parse each sub-array to ints
julia> new_df.A_parsed = map(split.(strip.(new_df.A, Ref(['[', ']'])), ',')) do nums
           parse.(Int64, nums)
       end
2-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [4, 5, 6]

julia> new_df
2×2 DataFrame
 Row │ A          A_parsed
     │ String     Array…
─────┼──────────────────────
   1 │ [1, 2, 3]  [1, 2, 3]
   2 │ [4, 5, 6]  [4, 5, 6]

I do not know of any CSV library that would do this for you. Is this usage common in other languages?

I don’t know. But I think it would be a great feature. At the very least, a CSV package should be able to “understand” the column types it wrote itself.

Edit: comment removed, didn’t read OP closely enough.

Not sure I would generally expect this to work, however.


CSV is inevitably a lossy format - you might also consider Arrow.jl or some other binary format that can roundtrip this data.

using Arrow
Arrow.write("test.arrow", df)

julia> DataFrame(Arrow.Table("test.arrow"))
2×1 DataFrame
 Row │ A
     │ Array…
─────┼───────────
   1 │ [1, 2, 3]
   2 │ [4, 5, 6]

Another way is:

df2 = CSV.File(file, delim=';')
eval.(Meta.parse.(df2.A))

but don’t know if this is passé.
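One caveat with the eval approach: it runs whatever expression happens to be in the cell, so it is only reasonable for files you wrote yourself. A minimal sketch of a stricter variant, which still uses Meta.parse but only accepts a literal vector of integers and never calls eval, could look like this. The name `literal_int_vector` is made up for illustration; it is not part of CSV.jl or Base.

```julia
# Hypothetical helper: parse a string like "[1, 2, 3]" without evaluating it.
function literal_int_vector(s::AbstractString)
    ex = Meta.parse(s)
    # A vector literal parses to an Expr with head :vect; reject anything else.
    ex isa Expr && ex.head === :vect ||
        throw(ArgumentError("not a vector literal: $s"))
    # Every element must already be an integer literal (no calls, no symbols).
    all(a -> a isa Int, ex.args) ||
        throw(ArgumentError("non-integer element in: $s"))
    return Int[a for a in ex.args]
end

parsed = literal_int_vector.(["[1, 2, 3]", "[4, 5, 6]"])
```

With the data frame from above, this would be applied as `literal_int_vector.(df2.A)`.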


+1 to this. The kinds of places where it’s common to have columns of type array or map have all moved away from storing their data as CSVs. At best, you can resurrect the old Hive conventions with explicit collection item delimiters: CREATE HIVEFORMAT TABLE - Spark 3.3.0 Documentation
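To make the item-delimiter idea concrete in Julia terms, here is a rough sketch that stores each vector as one cell with its own inner delimiter and splits it again after reading. The `|` delimiter and the helper names `encode_items` and `decode_items` are my own choices for illustration; neither CSV.jl nor Hive provides them.

```julia
# Hypothetical helpers: store each vector as a single CSV cell such as "1|2|3".
encode_items(v::AbstractVector{<:Integer}; item_delim = '|') = join(v, item_delim)

function decode_items(s::AbstractString; item_delim = '|')
    isempty(s) && return Int[]               # an empty string stands for an empty vector
    return parse.(Int, split(s, item_delim))
end

cells = encode_items.([[1, 2, 3], [4, 5, 6]])   # plain strings, one per row
vectors = decode_items.(cells)                  # back to Vector{Vector{Int}}
```

The inner delimiter has to differ from the column delimiter of the file (`;` in the MWE), otherwise the cells would need quoting.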


You could also just use Serialization.serialize and Serialization.deserialize from the standard library. From the docstrings it looks like they don’t promise compatibility across Julia versions, and you can’t open the serialized files in a text editor and look at what’s inside. But I think for use cases where you don’t want to use heavy machinery like JLD/HDF5/Arrow/etc, that’s a perfectly convenient way to store an object with a fancy type.
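For reference, a minimal round trip with the Serialization standard library might look like the following. It uses a plain vector of vectors to stay dependency-free; I would expect a DataFrame to go through the same two calls, but I have not checked that here. The file name comes from `tempname()` and is just a placeholder.

```julia
using Serialization

data = [[1, 2, 3], [4, 5, 6]]

path = tempname() * ".jls"     # placeholder path; any writable file name works
serialize(path, data)          # binary file, not human-readable

restored = deserialize(path)   # element types survive the round trip
```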


In addition to the great suggestions above, you could also reformat as “tidy” data, e.g.

6×2 DataFrame
│ Row │ a     │ index │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 1     │
│ 5   │ 5     │ 2     │
│ 6   │ 6     │ 3     │

which CSV should handle fine.
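A sketch of how one might get from the nested column to such a tidy layout and back with DataFrames.jl. The extra `id` column is my own addition (it is not in the table above); it records which original row each element came from, so the vectors can be rebuilt after reading the CSV.

```julia
using DataFrames

df = DataFrame(A = [[1, 2, 3], [4, 5, 6]])
df.id = collect(1:nrow(df))          # assumption: an id per original row

# One row per vector element; the id is repeated for each element.
tidy = flatten(df, :A)

# Position of each element inside its original vector.
tidy = transform(groupby(tidy, :id), :A => (x -> 1:length(x)) => :index)

# After a CSV round trip, collect the elements of each id back into a vector.
nested = combine(groupby(tidy, :id), :A => (x -> [collect(x)]) => :A)
```

Here `tidy` only contains integer columns, so `CSV.write` and `CSV.File` should round-trip it without any type hints.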
