CSV.jl "corrupts" data when a field is very large

sylvaticus · October 29, 2025, 1:31pm

I have a CSV file (from here) in which one field is a MULTIPOLYGON that can be very large. When I import it with CSV.jl (even a reduced 10-row version), I get a strange corruption:

julia> data    = CSV.read("MYRIAD-HES/test.csv",DataFrames.DataFrame;delim=',',quotechar='\"')
9×8 DataFrame
 Row │ Event    Hazard        code      starttime   endtime     Intensity  Unit          Geometry                          
     │ String7  String15      String15  Date        Date        Float64    String15      String                            
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ event0   heatwave      hw31      2004-01-04  2004-01-06  294.0      Kelvin        MULTIPOLYGON (((10.992 11.493, 1…
   2 │ event0   wildfire      wf481347  2004-01-06  2004-01-17    8.46256  Area          POLYGON ((11.161 10.799, 11.157 …
   3 │ event1   heatwave      hw18      2004-01-04  2004-01-06  295.0      Kelvin        POLYGON ((4.497 13.49, 4.497 12.…
   4 │ event1   wildfire      wf479859  2004-01-05  2004-01-17  104.733    Area          POLYGON ((4.388 13.44, 4.383 13.…
   5 │ event2   coldwave      cw56      2004-01-04  2004-01-11  228.0      Kelvin        MULTIPOLYGON (((-69.951 47.182, …
   6 │ event2   flood         fl0       2004-01-04  2004-01-16    1.0      DFO severity  LTIPOLYGON (((-91.514 34.569, -9…
   7 │ event3   flood         fl0       2004-01-04  2004-01-16    1.0      DFO severity  MULTIPOLYGON (((-91.514 34.569, …
   8 │ event3   extreme wind  ew15      2004-01-07  2004-01-07   19.0      m/s           POLYGON ((-83.442 44.935, -83.44…
   9 │ event4   flood         fl0       2004-01-04  2004-01-16    1.0      DFO severity  MULTIPOLYGON (((-91.514 34.569, …

julia> data[6,"Geometry"]
"LTIPOLYGON (((-91.514 34.569, -91.514 34.567, -91.516 34.567, -91.516 34.569, -91.514 34.569)), ((-91.518 34.569, -91.518 34.572, -91.516 34.572, -91.516 34.569, -91.518 34.569)), ((-91.505 34.576, -91.507 34.576, -91.507 34.578, -91.505 34.578, -91.505 34.576)), ((-91.502 34.581, -91.5 34.581, -91.5 34.578, -91.505 34.578, -91.505 34.583, -91.507 34.583, -91.507 34.585, -91.509 34.585, -91.509 34.587, -91.507 34.587, -91.507 34.59, -91.505 34.59, -91.505 34.587, -91.502 34.587, -91.502 34.581)), ((-91.493 34.635, -91.487 34.635, -91.487 34.632, -91.505 34.632, -91.505 34.635, -91.502 34.635, -91.502 34.637, -91.493 34.637, -91.493 34.635)), ((-91.498 34.655, -91.498 34.653, -91.5 34.653, -91.5 34.655, -91.498 34.655)), ((-91.502 34.659, -91.502 34.657, -91.505 34.657, -91.505 34.659, -91.502 34.659)), ((-91" ⋯ 429049 bytes ⋯ "8.488, -89.479 38.491, -89.474 38.491, -89.474 38.493, -89.477 38.493, -89.477 38.495, -89.472 38.495, -89.472 38.497, -89.474 38.497, -89.474 38.5, -89.468 38.5, -89.468 38.502, -89.461 38.502, -89.461 38.5, -89.454 38.5, -89.454 38.497, -89.45 38.497, -89.45 38.495, -89.461 38.495, -89.461 38.493, -89.463 38.493, -89.463 38.488, -89.477 38.488, -89.477 38.486, -89.47 38.486, -89.47 38.484, -89.477 38.484, -89.477 38.479, -89.479 38.479, -89.479 38.482)), ((-89.486 38.488, -89.486 38.491, -89.481 38.491, -89.481 38.488, -89.486 38.488)), ((-89.593 38.493, -89.596 38.493, -89.596 38.495, -89.593 38.495, -89.593 38.493)), ((-89.445 38.495, -89.447 38.495, -89.447 38.497, -89.445 38.497, -89.445 38.495)), ((-89.418 38.5, -89.418 38.497, -89.421 38.497, -89.421 38.5, -89.418 38.5)), ((-89.829 38.504, -89.834 3"

Note the “LTIPOLYGON” instead of “MULTIPOLYGON” in row 6.
I explored the input CSV file and it seems correct.

Is there another CSV import library? Or should I read the file manually?
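
For the "read the file manually" route, a minimal sketch using only Base Julia could look like the following. It assumes the geometry is the last field, that fields are double-quoted, and that no field contains an embedded newline or an escaped quote; none of this is confirmed for the MYRIAD-HES file. `split_quoted_line` and `suspicious_rows` are hypothetical helper names, not part of any package.

```julia
# Hypothetical helper: split one CSV line on delimiters that sit outside quotes.
# Assumes no embedded newlines and no escaped ("") quotes inside fields.
function split_quoted_line(line::AbstractString; delim::Char=',', quotechar::Char='"')
    fields = String[]
    buf = IOBuffer()
    inquotes = false
    for c in line
        if c == quotechar
            inquotes = !inquotes              # toggle state, drop the quote itself
        elseif c == delim && !inquotes
            push!(fields, String(take!(buf))) # field boundary
        else
            print(buf, c)
        end
    end
    push!(fields, String(take!(buf)))         # last field on the line
    return fields
end

# Hypothetical helper: return the data-row numbers whose last field does not
# start with one of the expected WKT type names.
function suspicious_rows(path::AbstractString; prefixes=("MULTIPOLYGON", "POLYGON"))
    bad = Int[]
    for (i, line) in enumerate(eachline(path))
        i == 1 && continue                    # skip the header line
        geom = last(split_quoted_line(line))
        any(p -> startswith(geom, p), prefixes) || push!(bad, i - 1)
    end
    return bad
end
```

Calling `suspicious_rows("MYRIAD-HES/test.csv")` on the raw file and comparing the result with the same prefix check on the imported `Geometry` column would show whether the truncation is already in the file or is introduced by the parser.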

mbauman · October 29, 2025, 1:41pm

Please file an issue!

https://github.com/JuliaData/CSV.jl/issues/new

sylvaticus · October 29, 2025, 1:51pm

done:

github.com/JuliaData/CSV.jl

CSV.jl "corrupts" data when a field is very large

opened 01:50PM - 29 Oct 25 UTC

sylvaticus

I have a CSV file (43GB original version from [here](https://zenodo.org/records/…8269680/files/MYRIAD-HES.zip?download=1) and [here](https://nc.beta-lorraine.fr/s/nfxfykaHo9QRykT/download) a reduced 10MB, 10 rows version, produced with `head -n10`) where one field is a MULTIPOLYGON that can be very large, and when I import it with CSV.jl (even a smaller 10 rows version) I get a strange corruption. The REPL session is the same as in the first post of this thread: `CSV.read("MYRIAD-HES/test.csv",DataFrames.DataFrame;delim=',',quotechar='\"')` returns a 9×8 DataFrame, and `data[6,"Geometry"]` starts with "LTIPOLYGON (((-91.514 34.569, …".

Note the "LTIPOLYGON" instead of "MULTIPOLYGON". I explored the input CSV file and it seems correct. Also, it works with CSVFiles.jl.

(ps: it seems to work with CSVFiles.jl)

ufechner7 · October 29, 2025, 1:53pm

Yes. For example DLMReader.jl (sl-solution/DLMReader.jl on GitHub), a high-performance delimited-file reader and writer for Julia, often used together with InMemoryDatasets.jl (sl-solution/InMemoryDatasets.jl on GitHub), a multithreaded package for working with tabular data in Julia.

TimG · October 29, 2025, 2:11pm

Here’s another possible option:

Also, have you tried using CSV.jl with ntasks=1? Multi-threaded parsing in CSV.jl sometimes has issues (e.g. with a \n character in a quoted string), and working single-threaded can avoid these.
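
A minimal sketch of that suggestion, assuming a recent CSV.jl version where `CSV.read` accepts the `ntasks` keyword. Whether it actually avoids the truncated "LTIPOLYGON" field has not been confirmed in this thread. `read_single_task` and `geometry_prefixes_ok` are hypothetical helper names.

```julia
using CSV, DataFrames

# Hypothetical wrapper: parse the file in a single task, so that CSV.jl
# does not split the input into chunks that are parsed concurrently.
function read_single_task(path::AbstractString)
    return CSV.read(path, DataFrame; delim=',', quotechar='"', ntasks=1)
end

# Hypothetical check: every geometry should start with a known WKT type name.
function geometry_prefixes_ok(df::DataFrame; col::Symbol=:Geometry)
    return all(g -> startswith(g, "MULTIPOLYGON") || startswith(g, "POLYGON"), df[!, col])
end
```

With the file from the first post this would be used as `data = read_single_task("MYRIAD-HES/test.csv")` followed by `geometry_prefixes_ok(data)`.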

technocrat · October 29, 2025, 9:22pm

Try the GeoDataFrames package.

pdeffebach · October 30, 2025, 1:06pm

Yeah, the fundamental problem here is that you are trying to read what is essentially a shapefile as a .csv. Maybe CSV.jl should handle this case, but it's definitely never going to be the right tool for the job.
