Fatal error while reading in messy data using DataFrames, CSV

Hi, while reading in a large, messy dataset, I got the message:

using DataFrames, CSV, Random, Tables
GBIF_chordate = CSV.read("occurrence.txt", DataFrame; header = true);
**ERROR:** thread = 1 fatal error, encountered an invalidly quoted field while parsing around row = 16331414, col = 125: ""Arthurs Ck mouth, Gulf of Car

Totally understandable that there would be some impossible syntaxes for CSV to read in, and I can probably fix it with grep or sed… but are there ways around this? Are there other impossible character combos I should screen for when reading in messy datasets? Thanks in advance!

For what it’s worth, I think my packages are up to date:

pkg> status
      Status `~/.julia/environments/v1.6/Project.toml`
  [336ed68f] CSV v0.8.4
  [a93c6f00] DataFrames v1.1.1
  [bd369af6] Tables v1.4.2
  [9a3f8284] Random

I’m guessing this won’t require an example dataset, since it’s an I/O problem that seems pretty easy to diagnose, but if anyone just wants a ~25GB dataset to play around with, you can download it here :nerd_face: https://www.gbif.org/occurrence/download/0274460-200613084148143

Maybe play around with quoting options here?

You can also set it to read in just a few rows, which can help you pin down the error.
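For example (a sketch using CSV.jl's `limit` keyword; the file name comes from the post above, and 1_000 is an arbitrary cutoff):

```julia
using CSV, DataFrames

# Parse only the first 1_000 rows as a quick sanity check before
# committing to the full ~25GB file:
preview = CSV.read("occurrence.txt", DataFrame; header = true, limit = 1_000)
```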


Thanks, yeah, the fatal error occurred on row = 16331414, so it would take a while to track down reading row by row haha

The issue is it’s a big file that has lots of old museum specimen data in it, so there are columns specifically for decimal degree latitude longitude, but elsewhere in the row there is verbatim data from specimen labels which often includes lat/long in formats like this: Southern Ocean (65° 19’ 30" S, 95° 27’ E)

So there are just lots of problematic open quote situations…

I just used this sed example to remove all the quotes from the file, and it seems to have worked:
https://unix.stackexchange.com/questions/236154/shell-script-to-remove-double-quotes-within-the-column-value
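A minimal version of that sed approach (a sketch: the demo input is invented, but the file names match the thread; this strips *every* double quote, which is only safe if no field legitimately needs quoting):

```shell
# Demo input with a stray unclosed quote, standing in for occurrence.txt
printf 'id\tlabel\n1\t"Arthurs Ck mouth\n' > occurrence.txt

# Strip every double-quote character so CSV.jl never sees a stray
# open quote; write a new file rather than editing in place.
sed 's/"//g' occurrence.txt > occurrenceNoQuotes.txt
```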

julia> using DataFrames, CSV, Tables, Random

julia> GBIF_chordate = CSV.read("occurrenceNoQuotes.txt", DataFrame; header = true);

julia> size(GBIF_chordate)
(16343123, 250)

Yeah, the only fatal error in CSV.jl is an invalidly quoted cell like that: if it starts parsing a cell and encounters an “open quote character”, it needs to find a matching “close quote character”, and if it doesn’t, it hard-bails. So you probably have a cell somewhere that starts with a '"' character purely by coincidence; because CSV.jl defaults to quotechar='"', it assumes the cell is quoted, then never finds a closing quote and panics. One potential solution, if you know you don’t have any quoted fields in your file, is to pass quotechar='\0' or similar, i.e. set quotechar to a character that never appears in the file.
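A minimal reproduction of both the failure and the workaround (a sketch: the sample data is invented, and tab-delimited like GBIF's occurrence.txt, but `quotechar` is the real CSV.jl keyword):

```julia
using CSV, DataFrames

# A tab-delimited cell that begins with '"' but never closes it,
# like verbatim specimen-label text:
data = "id\tlabel\n1\t\"Arthurs Ck mouth\n2\tplain text\n"

# CSV.read(IOBuffer(data), DataFrame; delim = '\t')
# → fatal error: invalidly quoted field

# Disable quote handling by pointing quotechar at a byte that
# never occurs in the file:
df = CSV.read(IOBuffer(data), DataFrame; delim = '\t', quotechar = '\0')
```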


Thanks very much! I had just tried quotechar = false to see if it would work, but apparently that’s not valid… Might be a useful option? In these big, messy files it can take some trial and error to find a character combo that never occurs.

Ah, you can’t use multiple characters?

julia> @time GBIF_protozoaNoEscape = CSV.read("occurrence.txt", DataFrame; header = true, quotechar = '{}');
ERROR: syntax: character literal contains multiple characters
Stacktrace:
 [1] top-level scope
   @ none:1

edit: but I should have just tried your original suggestion! Oops!

julia> @time GBIF_protozoaNoEscape = CSV.read("occurrence.txt", DataFrame; header = true, quotechar = '\0');
 21.961506 seconds (17.54 M allocations: 1.958 GiB, 27.29% gc time)

Sorry for the slow response, but note that you can pass separate openquotechar='{' and closequotechar='}' arguments if your cells are actually quoted like {quoted \n string}.
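For instance (a sketch with invented data; `openquotechar` and `closequotechar` are the CSV.jl keywords named above):

```julia
using CSV, DataFrames

# Brace-quoted field containing an embedded tab, which stays inside
# the cell because { and } delimit it:
data = "id\tnote\n1\t{a quoted\tfield}\n"
df = CSV.read(IOBuffer(data), DataFrame;
              delim = '\t', openquotechar = '{', closequotechar = '}')
```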
