mike_k
March 24, 2021, 5:19pm
Dear folks,
unfortunately, I'm failing to read vectors back from a *.csv file. Here is an MWE:
using CSV, DataFrames
df = DataFrame( A=[[1,2,3], [4,5,6]] )
file = "my_test.csv"
CSV.write(file, df, delim=';')
new_df = DataFrame( CSV.File(file; delim=';' ) )
The problem is that column A in new_df is of type String:
julia> new_df
2×1 DataFrame
Row │ A
│ String
─────┼───────────
1 │ [1, 2, 3]
2 │ [4, 5, 6]
I also tried:
CSV.File( file; types=Dict(:A => Vector{Int}) )
and
CSV.File( file; types=Dict(:A => Int[]) )
but neither works. Do you have any ideas? Thanks in advance!
There currently isn’t a built-in way to do this with CSV.jl. Take a look at this issue comment for a suggested approach using a custom type.
Or, perhaps more simply, you could parse the data after the fact:
# read this inside out - strip off brackets, split by comma, then parse each sub-array to ints
julia> new_df.A_parsed = map(split.(strip.(new_df.A, Ref(['[', ']'])), ',')) do nums
parse.(Int64, nums)
end
2-element Array{Array{Int64,1},1}:
[1, 2, 3]
[4, 5, 6]
julia> new_df
2×2 DataFrame
Row │ A A_parsed
│ String Array…
─────┼──────────────────────
1 │ [1, 2, 3] [1, 2, 3]
2 │ [4, 5, 6] [4, 5, 6]
I do not know of any CSV library that would do this for you. Is this usage common in other languages?
mike_k
March 24, 2021, 7:28pm
I don’t know, but I think it would be a great feature. At the very least, a given CSV package should be able to “understand” the column types it wrote itself.
Edit: comment removed, didn’t read OP closely enough.
Not sure I would generally expect this to work, however.
CSV is inevitably a lossy format, so you might also consider Arrow.jl or some other binary format that can round-trip this data:
using Arrow
Arrow.write("test.arrow", df)
julia> DataFrame(Arrow.Table("test.arrow"))
2×1 DataFrame
Row │ A
│ Array…
─────┼───────────
1 │ [1, 2, 3]
2 │ [4, 5, 6]
Another way is:
df2 = CSV.File(file, delim=';')
eval.(Meta.parse.(df2.A))
but I don’t know if this is passé. (Note that eval-ing file contents is only advisable for data you trust.)
+1 to this. The kinds of places where it’s common to have columns of type array or map have all moved away from storing their data as CSVs. At best, you can resurrect the old Hive conventions with explicit collection-item delimiters: CREATE HIVEFORMAT TABLE - Spark 3.3.0 Documentation
cgeoga
March 25, 2021, 2:21pm
You could also just use Serialization.serialize and Serialization.deserialize from the standard library. From the docstrings, it looks like they don’t promise compatibility across Julia versions, and you can’t open the serialized files in a text editor to look at what’s inside. But for use cases where you don’t want heavy machinery like JLD/HDF5/Arrow/etc., that’s a perfectly convenient way to store an object with a fancy type.
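A minimal sketch of that round trip (the file name my_test.jls is just for illustration):

```julia
using Serialization, DataFrames

df = DataFrame(A = [[1, 2, 3], [4, 5, 6]])

# serialize writes an opaque binary representation of any Julia object
serialize("my_test.jls", df)

# deserialize reconstructs it, vector-typed column and all
df2 = deserialize("my_test.jls")
```

Here df2.A comes back as a Vector{Vector{Int64}}, no string parsing needed.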
In addition to the great suggestions above, you could also reformat as “tidy” data, e.g.
6×2 DataFrame
│ Row │ a │ index │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 4 │ 1 │
│ 5 │ 5 │ 2 │
│ 6 │ 6 │ 3 │
which CSV should handle fine.
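One way to produce and round-trip that shape, a sketch using DataFrames’ flatten plus groupby/combine (the :id column and the file name are just for illustration):

```julia
using CSV, DataFrames

df = DataFrame(A = [[1, 2, 3], [4, 5, 6]])

# tag each row, then expand each vector into one row per element
df.id = 1:nrow(df)
tidy = flatten(df, :A)  # 6×2 DataFrame with columns :A and :id

CSV.write("my_test_tidy.csv", tidy; delim=';')

# round trip: read back and regroup the elements into vectors
back = DataFrame(CSV.File("my_test_tidy.csv"; delim=';'))
restored = combine(groupby(back, :id), :A => (v -> [collect(v)]) => :A)
```

After the regroup, restored.A is again a vector of Int vectors; groupby keeps groups in order of first appearance, so the original row order is preserved.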