I had a possibly related problem in the past.
I want to read a large data set with CSV.read(). The data set contains comments from an online forum that I scraped using Python and saved as JSON. In Python I converted it to a Unicode-encoded CSV file, which I am now trying to read with CSV.read(). You can find the file here (beware, it is 700 MB). This is the code that I am using:
    # Define some inline functions that allow us to read comments properly
    escape_double_quote(s::String) = replace(s, "\"\"", "\\\"")
    escape_back_quote(s::String) = replace(s, "\\\"", "\\\\\"");
    esca(s::String) = escape_double_quote(escape_back_quote(s));

    # Read data from the Speculation subforum, this does not work at the moment
    f = open("output_speculation_unicode.csv")
    cleaned_file = IOBuffer(readstring(f))
    df_speculation_raw = CSV.read(cleaned_file, DataFrame, rows_for_type_detect = 200000)
The first part relates to problems that were mentioned in this thread.
However, when I run this code, this is the preview that I receive:
As you can see, something gets mixed up along the way: instead of the author's name and the comment ID, I see random dates in those columns. My guess is that the quotation marks inside the comments confuse the parser and that I need to add even more string replacements.
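One way to test this guess before adding more replacements would be to go back to Python, where the file was written, and count how many fields each record has. The following is only a minimal sketch: it assumes the file is UTF-8 encoded, has a header row, and uses standard double-quote escaping. `field_count_report` is a hypothetical helper name I made up for illustration, not part of any library.

```python
import csv
from collections import Counter


def field_count_report(path, max_examples=5):
    """Count fields per CSV record and keep a few records that do not match the header.

    Returns (number of header fields, Counter of field counts, example bad records).
    """
    counts = Counter()
    examples = []
    # newline="" lets the csv module handle line breaks inside quoted fields itself
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        for row in reader:
            counts[len(row)] += 1
            if len(row) != len(header) and len(examples) < max_examples:
                # reader.line_num is the physical line where this record ended
                examples.append((reader.line_num, row))
    return len(header), counts, examples


# Hypothetical usage with the file name from the post:
# n_fields, counts, examples = field_count_report("output_speculation_unicode.csv")
# print(n_fields, counts.most_common(), examples)
```

If every record has the same number of fields as the header, the quoting in the file is probably consistent and the problem lies in how the file is parsed on the Julia side. If some records have more or fewer fields, the listed line numbers show which comments break the quoting.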
If anyone has an idea what I could do, I’d be very grateful!