Problem related to CSV.read() and strings

I had a possibly related problem in the past.

I want to read a large data set through CSV.read(). The data set contains comments from on online forum that I scraped using Python and saved it as JSON. In Python I converted it to a unicode-encoded csv and try to read it now using CSV.read(). You can find the file here (beware, it has 700 MB). This is the code that I am using:

# Define some inline functions that allow us to  read comments properly
escape_double_quote(s::String) = replace(s, "\"\"", "\\\"")
escape_back_quote(s::String) = replace(s, "\\\"", "\\\\\"");
esca(s::String) = escape_double_quote(escape_back_quote(s));

# Read data from the Speculation subforum, this does not work at the moment
f                  = open("output_speculation_unicode.csv")
cleaned_file       = IOBuffer(readstring(f))
df_speculation_raw = CSV.read(cleaned_file, DataFrame, rows_for_type_detect = 200000)

The first part relates to problems that were mentioned in this thread.

However, when I run this code, this is the preview that I receive:

grafik

As you can see, something gets mixed up on the way and instead of seeing the authors name and the comment id, I see random dates in those columns. My guess is that something is mixed up in the comments with the quotation marks and I need to add even more cases of strings that I want to replace.

If anyone has an idea what I could do, I’d be very grateful!

 using CSVFiles, FileIO, DataFrames

df = load("output_speculation_unicode.csv") |> DataFrame

just seems to work for me on that file, without any hacks.

1 Like

In light of @davidanthoff’s comment, try to make sure that you are using up-to-date versions of both DataFrames and CSV. Annoyingly, the current package manager sometimes makes that rather difficult.

2 Likes

That works, thanks a lot!

@ExpandingMan: I was using CSV.jl in version 0.2.2 and DataFrames.jl in version 0.11.5, so I think I am up-to-date.

Interesting. Would you be able to identify a subset of rows which reproduces the problem with CSV.read? That would be helpful to find a fix. I guess it could be related to quoting issues.

I suspect that there is still problem with unwanted escaping (as I described here):

julia> CSV.readsplitline(IOBuffer("104652,Thanks  \\,a"))
2-element Array{CSV.RawField,1}:
 CSV.RawField("104652", false)      
 CSV.RawField("Thanks  \\,a", false)

@Liso AFAICT this behavior is correct. What is confusing is that \\ actually represents a single backquote in the string, it’s doubled because that’s how you can type it in Julia:

julia> "\\"[1]
'\\': ASCII/Unicode U+005c (category Po: Punctuation, other)

julia> print("\\")
\
1 Like

Please look at Super CSV – What is CSV?

6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

or

Please note that the sometimes-used convention of escaping double-quotes as \" (instead of "") is not supported.

Although there is possibility to implement different dialects too (for inspiration look at csv β€” CSV File Reading and Writing β€” Python 3.12.0 documentation ), de facto standard rfc-4180 define different behavior than CSV.jl implemements!

CSVFiles, FileIO are more standard in this case (although work with dataformats or β€œmagic” bytes is really heavy):

julia> open("/tmp/tst.csv", "w") do f write(f, IOBuffer("a,b,c\r\n104652,\"Thanks  \\\",a\r\n")) end

julia> load("/tmp/tst.csv") |> DataFrame
1Γ—3 DataFrames.DataFrame
β”‚ Row β”‚ a      β”‚ b           β”‚ c   β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 104652 β”‚ "Thanks  \\" β”‚ "a" β”‚

)

Ah, OK, that’s not strictly speaking a bug then, just a debatable default. See this issue.

2 Likes