CSV.jl "corrupts" data when a field is very large

I have a CSV (from here) where one field is a MULTIPOLYGON that can be very large, and when I import it with CSV.jl (even a smaller 10-row version) I get a strange corruption:

julia> data    = CSV.read("MYRIAD-HES/test.csv",DataFrames.DataFrame;delim=',',quotechar='\"')
9×8 DataFrame
 Row │ Event    Hazard        code      starttime   endtime     Intensity  Unit          Geometry                          
     │ String7  String15      String15  Date        Date        Float64    String15      String                            
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ event0   heatwave      hw31      2004-01-04  2004-01-06  294.0      Kelvin        MULTIPOLYGON (((10.992 11.493, 1…
   2 │ event0   wildfire      wf481347  2004-01-06  2004-01-17    8.46256  Area          POLYGON ((11.161 10.799, 11.157 …
   3 │ event1   heatwave      hw18      2004-01-04  2004-01-06  295.0      Kelvin        POLYGON ((4.497 13.49, 4.497 12.…
   4 │ event1   wildfire      wf479859  2004-01-05  2004-01-17  104.733    Area          POLYGON ((4.388 13.44, 4.383 13.…
   5 │ event2   coldwave      cw56      2004-01-04  2004-01-11  228.0      Kelvin        MULTIPOLYGON (((-69.951 47.182, …
   6 │ event2   flood         fl0       2004-01-04  2004-01-16    1.0      DFO severity  LTIPOLYGON (((-91.514 34.569, -9…
   7 │ event3   flood         fl0       2004-01-04  2004-01-16    1.0      DFO severity  MULTIPOLYGON (((-91.514 34.569, …
   8 │ event3   extreme wind  ew15      2004-01-07  2004-01-07   19.0      m/s           POLYGON ((-83.442 44.935, -83.44…
   9 │ event4   flood         fl0       2004-01-04  2004-01-16    1.0      DFO severity  MULTIPOLYGON (((-91.514 34.569, …

julia> data[6,"Geometry"]
"LTIPOLYGON (((-91.514 34.569, -91.514 34.567, -91.516 34.567, -91.516 34.569, -91.514 34.569)), ((-91.518 34.569, -91.518 34.572, -91.516 34.572, -91.516 34.569, -91.518 34.569)), ((-91.505 34.576, -91.507 34.576, -91.507 34.578, -91.505 34.578, -91.505 34.576)), ((-91.502 34.581, -91.5 34.581, -91.5 34.578, -91.505 34.578, -91.505 34.583, -91.507 34.583, -91.507 34.585, -91.509 34.585, -91.509 34.587, -91.507 34.587, -91.507 34.59, -91.505 34.59, -91.505 34.587, -91.502 34.587, -91.502 34.581)), ((-91.493 34.635, -91.487 34.635, -91.487 34.632, -91.505 34.632, -91.505 34.635, -91.502 34.635, -91.502 34.637, -91.493 34.637, -91.493 34.635)), ((-91.498 34.655, -91.498 34.653, -91.5 34.653, -91.5 34.655, -91.498 34.655)), ((-91.502 34.659, -91.502 34.657, -91.505 34.657, -91.505 34.659, -91.502 34.659)), ((-91" ⋯ 429049 bytes ⋯ "8.488, -89.479 38.491, -89.474 38.491, -89.474 38.493, -89.477 38.493, -89.477 38.495, -89.472 38.495, -89.472 38.497, -89.474 38.497, -89.474 38.5, -89.468 38.5, -89.468 38.502, -89.461 38.502, -89.461 38.5, -89.454 38.5, -89.454 38.497, -89.45 38.497, -89.45 38.495, -89.461 38.495, -89.461 38.493, -89.463 38.493, -89.463 38.488, -89.477 38.488, -89.477 38.486, -89.47 38.486, -89.47 38.484, -89.477 38.484, -89.477 38.479, -89.479 38.479, -89.479 38.482)), ((-89.486 38.488, -89.486 38.491, -89.481 38.491, -89.481 38.488, -89.486 38.488)), ((-89.593 38.493, -89.596 38.493, -89.596 38.495, -89.593 38.495, -89.593 38.493)), ((-89.445 38.495, -89.447 38.495, -89.447 38.497, -89.445 38.497, -89.445 38.495)), ((-89.418 38.5, -89.418 38.497, -89.421 38.497, -89.421 38.5, -89.418 38.5)), ((-89.829 38.504, -89.834 3"

Note the “LTIPOLYGON” instead of “MULTIPOLYGON” in row 6.
I explored the input CSV file and it seems correct.

Is there another CSV import library? Or should I read the file manually?

Please file an issue!

https://github.com/JuliaData/CSV.jl/issues/new

done:

(ps: it seems to work with CSVFiles.jl)

Yes. For example DLMReader.jl (sl-solution/DLMReader.jl), a high-performance delimited-file reader and writer for Julia, often used together with InMemoryDatasets.jl (sl-solution/InMemoryDatasets.jl), a multithreaded package for working with tabular data in Julia.

Here’s another possible option:

Also, have you tried using CSV with ntasks=1? Multi-threading in CSV sometimes has issues (e.g. with a \n character in a quoted string), and working single-threaded can avoid these.
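Something like the following is a minimal sketch of that suggestion. It does not use the original file: the temporary CSV, the `big` string and the `bad` check are made up for illustration, and whether single-threaded parsing actually avoids the corruption on the real data has not been confirmed in this thread. The idea is to pass `ntasks=1` to `CSV.read` and then check that every geometry still starts with a known WKT type name.

```julia
using CSV, DataFrames

# Hypothetical stand-in for the real file: a small CSV whose second row
# holds one very long quoted geometry field (several hundred kilobytes).
path = tempname() * ".csv"
big = "MULTIPOLYGON (((" * join(("-91.514 34.569" for _ in 1:50_000), ", ") * ")))"
open(path, "w") do io
    println(io, "Event,Geometry")
    println(io, "event0,\"POLYGON ((4.497 13.49, 4.497 12.5))\"")
    println(io, "event1,\"", big, "\"")
end

# ntasks=1 asks CSV.jl to parse the file with a single task (no chunked,
# multi-threaded parsing).
data = CSV.read(path, DataFrame; delim=',', quotechar='"', ntasks=1)

# Sanity check: collect the row indices whose geometry does not start with
# one of the type names seen in this dataset (a truncated "LTIPOLYGON"
# would be caught here).
bad = findall(g -> !(startswith(g, "POLYGON") || startswith(g, "MULTIPOLYGON")),
              data.Geometry)
```

If `bad` is non-empty on the real file, the listed rows are the ones to compare against the raw text.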

Try the GeoDataFrames package.

Yeah, the fundamental problem here is that you are trying to read what is essentially a shapefile as a .csv. Maybe CSV.jl should handle this case, but it's definitely never going to be the right tool for the job.