Data loss with repeated CSV.jl import and export operations

essenciary · November 26, 2018, 4:08pm

I have this problem with writing and reading tabular data with CSV.jl – and I can’t make any sense of its behaviour.

I have a dataset loaded in memory: 69795×13 DataFrames.DataFrame

Then I write it to file:

julia> CSV.write("data/test/t1.csv", t1)
"data/test/t1.csv"

When I read it back, I get about half of the rows:

julia> t2 = CSV.read("data/test/t1.csv")
35232×13 DataFrames.DataFrame

OK, that I can understand. The data is weird, so CSV.jl writes it with the defaults (separator, escaping, etc.), and when it reads it back, some rows fail.

But if I write these to file:

julia> CSV.write("data/test/t2.csv", t2)
"data/test/t2.csv"

When I read them back, only 10K come in:

julia> t3 = CSV.read("data/test/t2.csv")
10628×13 DataFrames.DataFrame

Which I can’t understand. If it has successfully loaded 35K rows surely it is expected to be able to write them back and then load them back again.

pdeffebach · November 26, 2018, 4:37pm

You can try CSV.validate to get a better sense of what is causing the problems.

davidanthoff · November 26, 2018, 5:07pm

Could you also try CSVFiles.jl? I’m really curious how it fares with this.

Also, any chance you could post the file? Maybe save the original as a feather file? Or post t1.csv?

essenciary · November 26, 2018, 7:17pm

Cool, I’ll give CSVFiles a try, thank you!

Here is the original file (which gets loaded as the 69795×13 DataFrame). If you keep importing and exporting starting from this, you should end up with very few rows pretty fast.

https://www.dropbox.com/s/s71ciycci7wu6j1/top_ratings.csv?dl=0

davidanthoff · November 26, 2018, 7:54pm

So the initial read of that file seems to be correct with CSVFiles.jl, as far as I can tell (no rows dropped etc.).

But the write/read round trip seems to mess something up in strings with quotes in them… So we probably need a bug fix for that. The data you provided is more than enough to fix it. Thanks!

essenciary · November 27, 2018, 8:45am

You’re welcome!

If you need more test data, the data comes from the Book-Crossing Dataset.

ImreSamu · November 27, 2018, 5:01pm

Based on your test data, I have created a minimal example.

Probably an escaping problem (for example: "Tres Mosqueteros, Los: Adaptacic\"n"):

"ISBN";"Book-Title"
"9500286327";"Tres Mosqueteros, Los: Adaptacic\"n"
"0671727680";"Romeo and Juliet"
"0385333757";"Losing Julia"

------ code ------


# tested with: julia 1.0.1  + [336ed68f] CSV v0.4.3
using CSV

# Create test file 
books=""""ISBN";"Book-Title"
"9500286327";"Tres Mosqueteros, Los: Adaptacic\\\"n"
"0671727680";"Romeo and Juliet"
"0385333757";"Losing Julia"
"""
open("x0.csv", "w") do f
    write(f, books)
end
run(`cat x0.csv`)

# Simple test
x1=CSV.read("x0.csv"     ; delim=';' ,quotechar='"' ,escapechar='\\', normalizenames=true )
CSV.write(  "x1.csv",  x1; delim=';' ,quotechar='"' ,escapechar='\\' )
x2=CSV.read("x1.csv"     ; delim=';' ,quotechar='"' ,escapechar='\\', normalizenames=true )

------ log ------

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.1 (2018-09-29)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> # tested with: julia 1.0.1  + [336ed68f] CSV v0.4.3
       using CSV

julia> # Create test file 
       books=""""ISBN";"Book-Title"
       "9500286327";"Tres Mosqueteros, Los: Adaptacic\\\"n"
       "0671727680";"Romeo and Juliet"
       "0385333757";"Losing Julia"
       """
"\"ISBN\";\"Book-Title\"\n\"9500286327\";\"Tres Mosqueteros, Los: Adaptacic\\\"n\"\n\"0671727680\";\"Romeo and Juliet\"\n\"0385333757\";\"Losing Julia\"\n"

julia> open("x0.csv", "w") do f
           write(f, books)
       end
131

julia> run(`cat x0.csv`)
"ISBN";"Book-Title"
"9500286327";"Tres Mosqueteros, Los: Adaptacic\"n"
"0671727680";"Romeo and Juliet"
"0385333757";"Losing Julia"
Process(`cat x0.csv`, ProcessExited(0))

julia> # Simple test
       x1=CSV.read("x0.csv"     ; delim=';' ,quotechar='"' ,escapechar='\\', normalizenames=true )
3×2 DataFrames.DataFrame
│ Row │ ISBN       │ Book_Title                         │
│     │ Int64⍰     │ Union{Missing, String}             │
├─────┼────────────┼────────────────────────────────────┤
│ 1   │ 9500286327 │ Tres Mosqueteros, Los: Adaptacic"n │
│ 2   │ 671727680  │ Romeo and Juliet                   │
│ 3   │ 385333757  │ Losing Julia                       │

julia> CSV.write(  "x1.csv",  x1; delim=';' ,quotechar='"' ,escapechar='\\' )
"x1.csv"

julia> x2=CSV.read("x1.csv"     ; delim=';' ,quotechar='"' ,escapechar='\\', normalizenames=true )
1×2 DataFrames.DataFrame
│ Row │ ISBN       │ Book_Title                         │
│     │ Int64⍰     │ Union{Missing, String}             │
├─────┼────────────┼────────────────────────────────────┤
│ 1   │ 9500286327 │ Tres Mosqueteros, Los: Adaptacic"n │

julia>

ImreSamu · November 27, 2018, 5:41pm

I have created a GitHub issue: https://github.com/JuliaData/CSV.jl/issues/357

quinnj · November 27, 2018, 7:01pm

Thanks for the detailed report @ImreSamu! That kind of preliminary investigation and steps to reproduce are so, so wonderful and make debugging/fixing so much nicer. I’ve put up a fix for CSV.jl in pull request #358 on JuliaData/CSV.jl: "Ensure we always quote a field if it needs escaping, fixes #357".
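For readers following along, the rule named in the pull request title can be sketched in plain Julia. This is only an illustration under stated assumptions, not CSV.jl's actual code: `write_field` is a hypothetical helper, and the choice of which characters force quoting (delimiter, quote character, escape character, line breaks) is an assumption made for the sketch.

```julia
# Hypothetical helper (NOT CSV.jl's implementation): render one string field.
# Assumption: a field must be wrapped in quotes whenever it contains the
# delimiter, the quote character, the escape character, or a line break.
function write_field(value::AbstractString;
                     delim::Char=';', quotechar::Char='"', escapechar::Char='\\')
    needs_quoting = any(c -> c == delim || c == quotechar || c == escapechar ||
                             c == '\n' || c == '\r', value)
    # Plain fields are written unchanged.
    needs_quoting || return String(value)

    buf = IOBuffer()
    print(buf, quotechar)
    for c in value
        # Prefix embedded quote/escape characters so a reader can undo them.
        if c == quotechar || c == escapechar
            print(buf, escapechar)
        end
        print(buf, c)
    end
    print(buf, quotechar)
    return String(take!(buf))
end

println(write_field("Romeo and Juliet"))                     # Romeo and Juliet
println(write_field("Tres Mosqueteros, Los: Adaptacic\"n"))  # "Tres Mosqueteros, Los: Adaptacic\"n"
```

The point of the sketch is that escaping and quoting go together: an escaped quote inside an unquoted field is ambiguous for the reader, which is consistent with the rows that went missing after the round trip.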

ImreSamu · November 27, 2018, 7:42pm

Thank you for the fix!
I tested [336ed68f] CSV v0.4.1 #jq/357 (https://github.com/JuliaData/CSV.jl.git) and it looks OK (with my example).

ImreSamu · November 28, 2018, 7:23am

@quinnj :

Just a note: the CSV.read documentation is probably wrong about the default escapechar value:

CSV.read( ; escapechar='\\' )

because my test code does not work without this parameter.

julia> x1qe =CSV.read("x0.csv" ; delim=';' ,quotechar='"' ,escapechar='\\' )  # expected=3 : result=3
3×2 DataFrame
│ Row │ ISBN       │ Book-Title                         │
│     │ Int64⍰     │ Union{Missing, String}             │
├─────┼────────────┼────────────────────────────────────┤
│ 1   │ 9500286327 │ Tres Mosqueteros, Los: Adaptacic"n │
│ 2   │ 671727680  │ Romeo and Juliet                   │
│ 3   │ 385333757  │ Losing Julia                       │

julia> x1q_ =CSV.read("x0.csv" ; delim=';' ,quotechar='"'                  )  # expected=3 : result=1  !!
warning: failed parsing String on row=1, col=2, error=INVALID: OK, QUOTED, NEWLINE, INVALID_DELIMITER
1×2 DataFrame
│ Row │ ISBN       │ Book-Title                         │
│     │ Int64⍰     │ Union{Missing, String}             │
├─────┼────────────┼────────────────────────────────────┤
│ 1   │ 9500286327 │ Tres Mosqueteros, Los: Adaptacic\\ │
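The difference between the two reads above can be reproduced with a small plain-Julia sketch. It assumes (this is not confirmed in the thread) that without the keyword the reader treats the quote character itself as the escape character, so only a doubled quote counts as an escaped quote. `read_quoted` is a hypothetical helper written for this illustration, not part of CSV.jl.

```julia
# Hypothetical helper (NOT CSV.jl's parser): read one quoted field that starts
# at the opening quote. Returns the field text and whatever follows the
# closing quote on the line.
function read_quoted(line::AbstractString; quotechar::Char='"', escapechar::Char='"')
    chars = collect(line)
    (!isempty(chars) && chars[1] == quotechar) ||
        error("field must start with the quote character")
    buf = IOBuffer()
    i = 2
    while i <= length(chars)
        c = chars[i]
        if c == escapechar && i < length(chars) &&
           (chars[i + 1] == quotechar || chars[i + 1] == escapechar)
            # Escape sequence: keep the following character as literal data.
            print(buf, chars[i + 1])
            i += 2
        elseif c == quotechar
            # Unescaped quote: the field ends here.
            return String(take!(buf)), String(chars[i + 1:end])
        else
            print(buf, c)
            i += 1
        end
    end
    error("unterminated quoted field")
end

line = "\"Adaptacic\\\"n\";next"   # the text: "Adaptacic\"n";next

# Backslash as escape: the whole title is one field, followed by the delimiter.
println(read_quoted(line; escapechar='\\'))   # ("Adaptacic\"n", ";next")

# Quote as escape: the field ends right after the backslash, and the next
# character is 'n' instead of the delimiter.
println(read_quoted(line; escapechar='"'))    # ("Adaptacic\\", "n\";next")
```

Under that assumption, the second call matches the log: the field is cut off after the backslash and the parser then finds `n` where it expects a delimiter, which fits the INVALID_DELIMITER warning.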
