Problem related to CSV.read() and strings

IljaK91 · March 8, 2018, 1:37pm

I had a possibly related problem in the past.

I want to read a large data set through CSV.read(). The data set contains comments from on online forum that I scraped using Python and saved it as JSON. In Python I converted it to a unicode-encoded csv and try to read it now using CSV.read(). You can find the file here (beware, it has 700 MB). This is the code that I am using:

# Define some inline functions that allow us to  read comments properly
escape_double_quote(s::String) = replace(s, "\"\"", "\\\"")
escape_back_quote(s::String) = replace(s, "\\\"", "\\\\\"");
esca(s::String) = escape_double_quote(escape_back_quote(s));

# Read data from the Speculation subforum, this does not work at the moment
f                  = open("output_speculation_unicode.csv")
cleaned_file       = IOBuffer(readstring(f))
df_speculation_raw = CSV.read(cleaned_file, DataFrame, rows_for_type_detect = 200000)

The first part relates to problems that were mentioned in this thread.

However, when I run this code, this is the preview that I receive:

grafik

As you can see, something gets mixed up on the way and instead of seeing the authors name and the comment id, I see random dates in those columns. My guess is that something is mixed up in the comments with the quotation marks and I need to add even more cases of strings that I want to replace.

If anyone has an idea what I could do, I’d be very grateful!

davidanthoff · March 8, 2018, 5:24pm

 using CSVFiles, FileIO, DataFrames

df = load("output_speculation_unicode.csv") |> DataFrame

just seems to work for me on that file, without any hacks.

ExpandingMan · March 8, 2018, 5:41pm

In light of @davidanthoff’s comment, try to make sure that you are using up-to-date versions of both DataFrames and CSV. Annoyingly, the current package manager sometimes makes that rather difficult.

IljaK91 · March 9, 2018, 11:45am

That works, thanks a lot!

@ExpandingMan: I was using CSV.jl in version 0.2.2 and DataFrames.jl in version 0.11.5, so I think I am up-to-date.

nalimilan · March 9, 2018, 9:41pm

Interesting. Would you be able to identify a subset of rows which reproduces the problem with CSV.read? That would be helpful to find a fix. I guess it could be related to quoting issues.

Liso · March 10, 2018, 12:43am

I suspect that there is still problem with unwanted escaping (as I described here):

julia> CSV.readsplitline(IOBuffer("104652,Thanks  \\,a"))
2-element Array{CSV.RawField,1}:
 CSV.RawField("104652", false)      
 CSV.RawField("Thanks  \\,a", false)

nalimilan · March 10, 2018, 10:22am

@Liso AFAICT this behavior is correct. What is confusing is that \\ actually represents a single backquote in the string, it’s doubled because that’s how you can type it in Julia:

julia> "\\"[1]
'\\': ASCII/Unicode U+005c (category Po: Punctuation, other)

julia> print("\\")
\

Liso · March 10, 2018, 10:15pm

Please look at Super CSV – What is CSV?

6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

or

Please note that the sometimes-used convention of escaping double-quotes as \" (instead of "") is not supported.

Although there is possibility to implement different dialects too (for inspiration look at csv — CSV File Reading and Writing — Python 3.12.0 documentation ), de facto standard rfc-4180 define different behavior than CSV.jl implemements!

CSVFiles, FileIO are more standard in this case (although work with dataformats or “magic” bytes is really heavy):

julia> open("/tmp/tst.csv", "w") do f write(f, IOBuffer("a,b,c\r\n104652,\"Thanks  \\\",a\r\n")) end

julia> load("/tmp/tst.csv") |> DataFrame
1×3 DataFrames.DataFrame
│ Row │ a      │ b           │ c   │
├─────┼────────┼─────────────┼─────┤
│ 1   │ 104652 │ "Thanks  \\" │ "a" │

)

nalimilan · March 11, 2018, 11:20am

Ah, OK, that’s not strictly speaking a bug then, just a debatable default. See this issue.

Topic		Replies	Views
Fatal error while reading in messy data using DataFrames, CSV Data dataframes , csv	6	629	May 25, 2021
Data loss with repeated CSV.jl import and export operations General Usage	10	757	November 28, 2018
CSV.jl writing quoted strings General Usage question , csv	14	96	December 19, 2024
Encoding, CSV.read / write General Usage package	5	2275	August 17, 2018
Unicode related error when reading a .csv General Usage	21	6141	November 25, 2017

Problem related to CSV.read() and strings

Related topics