Data loss with repeated CSV.jl import and export operations


#1

I have this problem with writing and reading tabular data with CSV.jl – and I can’t make any sense of its behaviour.

I have a dataset loaded in memory: 69795×13 DataFrames.DataFrame

Then I write it to file:

julia> CSV.write("data/test/t1.csv", t1)
"data/test/t1.csv"

When I read it back, I get about half of the rows:

julia> t2 = CSV.read("data/test/t1.csv")
35232×13 DataFrames.DataFrame

Ok - that I can understand. The data is weird so CSV.jl writes it as default (separator, escaping, etc) and when it reads it back, some rows fail.

But if I write these to file:

julia> CSV.write("data/test/t2.csv", t2)
"data/test/t2.csv"

When I read them back, only 10K come in:

julia> t3 = CSV.read("data/test/t2.csv")
10628×13 DataFrames.DataFrame.

Which I can’t understand. If it has successfully loaded 35K rows surely it is expected to be able to write them back and then load them back again.


#2

You can try CSV.validate to get a better sense of what is causing problems


#3

Could you also try CSVFiles.jl? I’m really curious how it fares with this.

Also, any chance you could post the file? Maybe save the original as a feather file? Or post t1.csv?


#4

Cool, I’ll give CSVFiles it a try, thank you!

Here is the original file (which gets loaded as the 69795x13 DataFrame). If you keep importing and exporting starting from this you should end up with not much pretty fast.


#5

So the initial read of that file seems to be correct with CSVFiles.jl, as far as I can tell (no rows dropped etc.).

But the write/read round trip seems to mess something up in strings with quotes in them… So we probably need a bug fix for that. But the data you provided is more than enough to fix that :slight_smile: Thanks!


#6

You’re welcome! :slight_smile:

If you need more test data, the dataset is using this: http://www2.informatik.uni-freiburg.de/~cziegler/BX/


#7

based on your test data - I have created a minimal example :

Probably an escaping problem ( like: "Tres Mosqueteros, Los: Adaptacic\"n" )

"ISBN";"Book-Title"
"9500286327";"Tres Mosqueteros, Los: Adaptacic\"n"
"0671727680";"Romeo and Juliet"
"0385333757";"Losing Julia"

------ code ------


# tested with: julia 1.0.1  + [336ed68f] CSV v0.4.3
using CSV

# Create test file 
books=""""ISBN";"Book-Title"
"9500286327";"Tres Mosqueteros, Los: Adaptacic\\\"n"
"0671727680";"Romeo and Juliet"
"0385333757";"Losing Julia"
"""
open("x0.csv", "w") do f
    write(f, books)
end
run(`cat x0.csv`)

# Simple test
x1=CSV.read("x0.csv"     ; delim=';' ,quotechar='"' ,escapechar='\\', normalizenames=true )
CSV.write(  "x1.csv",  x1; delim=';' ,quotechar='"' ,escapechar='\\' )
x2=CSV.read("x1.csv"     ; delim=';' ,quotechar='"' ,escapechar='\\', normalizenames=true )

** ------- log -------- **

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.1 (2018-09-29)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> # tested with: julia 1.0.1  + [336ed68f] CSV v0.4.3
       using CSV

julia> # Create test file 
       books=""""ISBN";"Book-Title"
       "9500286327";"Tres Mosqueteros, Los: Adaptacic\\\"n"
       "0671727680";"Romeo and Juliet"
       "0385333757";"Losing Julia"
       """
"\"ISBN\";\"Book-Title\"\n\"9500286327\";\"Tres Mosqueteros, Los: Adaptacic\\\"n\"\n\"0671727680\";\"Romeo and Juliet\"\n\"0385333757\";\"Losing Julia\"\n"

julia> open("x0.csv", "w") do f
           write(f, books)
       end
131

julia> run(`cat x0.csv`)
"ISBN";"Book-Title"
"9500286327";"Tres Mosqueteros, Los: Adaptacic\"n"
"0671727680";"Romeo and Juliet"
"0385333757";"Losing Julia"
Process(`cat x0.csv`, ProcessExited(0))

julia> # Simple test
       x1=CSV.read("x0.csv"     ; delim=';' ,quotechar='"' ,escapechar='\\', normalizenames=true )
3×2 DataFrames.DataFrame
│ Row │ ISBN       │ Book_Title                         │
│     │ Int64⍰     │ Union{Missing, String}             │
├─────┼────────────┼────────────────────────────────────┤
│ 1   │ 9500286327 │ Tres Mosqueteros, Los: Adaptacic"n │
│ 2   │ 671727680  │ Romeo and Juliet                   │
│ 3   │ 385333757  │ Losing Julia                       │

julia> CSV.write(  "x1.csv",  x1; delim=';' ,quotechar='"' ,escapechar='\\' )
"x1.csv"

julia> x2=CSV.read("x1.csv"     ; delim=';' ,quotechar='"' ,escapechar='\\', normalizenames=true )
1×2 DataFrames.DataFrame
│ Row │ ISBN       │ Book_Title                         │
│     │ Int64⍰     │ Union{Missing, String}             │
├─────┼────────────┼────────────────────────────────────┤
│ 1   │ 9500286327 │ Tres Mosqueteros, Los: Adaptacic"n │

julia> 


#8

I have created a github issue : https://github.com/JuliaData/CSV.jl/issues/357


#9

Thanks for the detailed report @ImreSamu! That kind of preliminary investigation and steps to reproduce are so, so wonderful and make debugging/fixing so much nicer. I’ve put up a fix for CSV.jl here: https://github.com/JuliaData/CSV.jl/pull/358.


#10

Thank you for the fix!
I have tested : [336ed68f] CSV v0.4.1 #jq/357 (https://github.com/JuliaData/CSV.jl.git)
and looks OK ( with my example ) :slight_smile:


#11

@quinnj :

just a note: Probably the CSV.read Doc is not valid for the default escapechar value:

  • CSV.read( ; escapechar='\\' )

because without this parameter - my test code is not working.

julia> x1qe =CSV.read("x0.csv" ; delim=';' ,quotechar='"' ,escapechar='\\' )  # expected=3 : result=3
3×2 DataFrame
│ Row │ ISBN       │ Book-Title                         │
│     │ Int64⍰     │ Union{Missing, String}             │
├─────┼────────────┼────────────────────────────────────┤
│ 1   │ 9500286327 │ Tres Mosqueteros, Los: Adaptacic"n │
│ 2   │ 671727680  │ Romeo and Juliet                   │
│ 3   │ 385333757  │ Losing Julia                       │

julia> x1q_ =CSV.read("x0.csv" ; delim=';' ,quotechar='"'                  )  # expected=3 : result=1  !!
warning: failed parsing String on row=1, col=2, error=INVALID: OK, QUOTED, NEWLINE, INVALID_DELIMITER
1×2 DataFrame
│ Row │ ISBN       │ Book-Title                         │
│     │ Int64⍰     │ Union{Missing, String}             │
├─────┼────────────┼────────────────────────────────────┤
│ 1   │ 9500286327 │ Tres Mosqueteros, Los: Adaptacic\\ │