I want to read a large data set with CSV.read(). The data set contains comments from an online forum that I scraped using Python and saved as JSON. In Python I converted it to a Unicode-encoded CSV, which I am now trying to read with CSV.read(). You can find the file here (beware, it is 700 MB). This is the code that I am using:
# Define some inline functions that allow us to read comments properly
escape_double_quote(s::String) = replace(s, "\"\"", "\\\"")
escape_back_quote(s::String) = replace(s, "\\\"", "\\\\\"");
esca(s::String) = escape_double_quote(escape_back_quote(s));
# Read data from the Speculation subforum, this does not work at the moment
f = open("output_speculation_unicode.csv")
cleaned_file = IOBuffer(readstring(f))
df_speculation_raw = CSV.read(cleaned_file, DataFrame, rows_for_type_detect = 200000)
The first part relates to problems that were mentioned in this thread.
However, when I run this code, this is the preview that I receive:
As you can see, something goes wrong along the way: instead of the author's name and the comment id, I see random dates in those columns. My guess is that the quotation marks inside the comments confuse the parser and that I need to add even more string replacements to cover the remaining cases.
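To make the quoting guess easier to check, here is a minimal sketch of what the two replacements do to individual fields. It assumes Julia 1.x, where `replace` takes a pair (`old => new`) instead of the three-argument form used above. The sample strings are made up for illustration and are not taken from the actual file.

```julia
# Same replacements as the helpers above, written in Julia 1.x pair syntax.
escape_double_quote(s::AbstractString) = replace(s, "\"\"" => "\\\"")
escape_back_quote(s::AbstractString) = replace(s, "\\\"" => "\\\\\"")
esca(s::AbstractString) = escape_double_quote(escape_back_quote(s))

# A doubled quote inside a field becomes a backslash-escaped quote.
println(esca("\"He said \"\"hi\"\"\""))   # "He said \"hi\""

# A backslash directly before the closing quote gets doubled.
println(esca("\"Thanks \\\""))            # "Thanks \\"

# Caveat: an empty quoted field is also two consecutive quotes,
# so it is rewritten as well, which changes the structure of the row.
println(esca("1,\"\",x"))                 # 1,\",x
```

The last case may be worth checking against the real data: if the file contains empty quoted fields, a plain text replacement of `""` would touch them too. Whether that is what happens in the 700 MB file is not something this sketch can confirm.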
If anyone has an idea what I could do, I'd be very grateful!
In light of @davidanthoff's comment, try to make sure that you are using up-to-date versions of both DataFrames and CSV. Annoyingly, the current package manager sometimes makes that rather difficult.
Interesting. Would you be able to identify a subset of rows which reproduces the problem with CSV.read? That would be helpful to find a fix. I guess it could be related to quoting issues.
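One way to look for such a subset is to scan the file for the pattern suspected here, a backslash directly followed by a double quote. The sketch below is only a heuristic and assumes Julia 1.x. The function name `suspicious_lines` is made up for illustration, and it treats every physical line as one record, which will not hold if comments contain embedded newlines.

```julia
# Collect the numbers of all lines that contain a backslash
# directly followed by a double quote.
function suspicious_lines(io::IO)
    hits = Int[]
    for (n, line) in enumerate(eachline(io))
        occursin("\\\"", line) && push!(hits, n)
    end
    return hits
end

# Small in-memory sample; only the third line contains the pattern.
sample = IOBuffer("a,b,c\n1,\"fine\",x\n104652,\"Thanks \\\",a\n")
println(suspicious_lines(sample))  # [3]
```

For the real file, `open(suspicious_lines, "output_speculation_unicode.csv")` would return the candidate line numbers, and a handful of those lines plus the header could then be saved as a small test file.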
@Liso AFAICT this behavior is correct. What is confusing is that \\ actually represents a single backslash in the string; it's doubled because that's how you have to type it in a Julia string literal.
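A short illustration of the point about the doubled backslash, using only Base Julia. The string is the field value from the example in this thread; nothing here depends on CSV or DataFrames.

```julia
s = "Thanks \\"          # typed with two backslashes in the literal

println(length(s))       # 8: the seven characters of "Thanks " plus one backslash
println(s)               # prints the raw contents: Thanks \
println(repr(s))         # prints the escaped literal form: "Thanks \\"
println(s[end] == '\\')  # true
```

So a value displayed as "Thanks \\" in a table or at the REPL holds just one backslash; the display uses the same escaping as the literal.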
CSVFiles and FileIO are more standard in this case (although their handling of data formats via "magic" bytes is really heavy):
julia> using FileIO, CSVFiles, DataFrames

julia> open("/tmp/tst.csv", "w") do f
           write(f, IOBuffer("a,b,c\r\n104652,\"Thanks \\\",a\r\n"))
       end

julia> load("/tmp/tst.csv") |> DataFrame
1×3 DataFrames.DataFrame
│ Row │ a      │ b           │ c   │
├─────┼────────┼─────────────┼─────┤
│ 1   │ 104652 │ "Thanks \\" │ "a" │