Fatal error while reading in messy data using DataFrames, CSV

Hi, while reading in a large, messy dataset, I got the message:

using DataFrames, CSV, Random, Tables
GBIF_chordate = CSV.read("occurrence.txt", DataFrame; header = true);
**ERROR:** thread = 1 fatal error, encountered an invalidly quoted field while parsing around row = 16331414, col = 125: ""Arthurs Ck mouth, Gulf of Car

Totally understandable that there would be some impossible syntaxes for CSV to read in, and I can probably fix it with grep or sed… but are there ways around this? Are there other impossible character combos I should screen for when reading in messy datasets? Thanks in advance!

For what it’s worth, I think my packages are up to date:

pkg> status
      Status `~/.julia/environments/v1.6/Project.toml`
  [336ed68f] CSV v0.8.4
  [a93c6f00] DataFrames v1.1.1
  [bd369af6] Tables v1.4.2
  [9a3f8284] Random

I’m guessing this won’t require an example dataset, since it’s an I/O problem that seems pretty easy to diagnose, but if anyone just wants a ~25GB dataset to play around with, you can download it here :nerd_face: https://www.gbif.org/occurrence/download/0274460-200613084148143

Maybe play around with quoting options here?

You can also set it to read in just a few rows, which can help you pin down the error.
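For example (a sketch using CSV.jl's `limit` keyword; the file name comes from the post above, and 1_000 is an arbitrary cutoff):

```julia
using CSV, DataFrames

# Parse only the first 1_000 rows as a quick sanity check before
# committing to the full ~25GB file:
preview = CSV.read("occurrence.txt", DataFrame; header = true, limit = 1_000)
```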


Thanks, yeah, the fatal error occurred on row = 16331414, so it would take a while to track down reading row by row haha

The issue is it’s a big file that has lots of old museum specimen data in it, so there are columns specifically for decimal degree latitude longitude, but elsewhere in the row there is verbatim data from specimen labels which often includes lat/long in formats like this: Southern Ocean (65° 19’ 30" S, 95° 27’ E)

So there are just lots of problematic open quote situations…

I just used this sed example to remove all the quotes from the file, and it seems to have worked:
https://unix.stackexchange.com/questions/236154/shell-script-to-remove-double-quotes-within-the-column-value
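A minimal version of that sed approach (a sketch: the demo input is invented, but the file names match the thread; this strips *every* double quote, which is only safe if no field legitimately needs quoting):

```shell
# Demo input with a stray unclosed quote, standing in for occurrence.txt
printf 'id\tlabel\n1\t"Arthurs Ck mouth\n' > occurrence.txt

# Strip every double-quote character so CSV.jl never sees a stray
# open quote; write a new file rather than editing in place.
sed 's/"//g' occurrence.txt > occurrenceNoQuotes.txt
```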

julia> using DataFrames, CSV, Tables, Random

julia> GBIF_chordate = CSV.read("occurrenceNoQuotes.txt", DataFrame; header = true);

julia> size(GBIF_chordate)
(16343123, 250)

Yeah, the only fatal error in CSV.jl is an invalidly quoted cell like that: if it starts parsing a cell and encounters an “open quote character”, it needs to find a matching “close quote character”, and if it doesn’t, it hard-bails. So you probably have a cell somewhere that starts with a '"' character purely by coincidence; because CSV.jl defaults to quotechar='"', it assumes the cell is quoted, then never finds a closing quote and panics. One potential solution, if you know you don’t have any quoted fields in your file, is to pass quotechar='\0' or similar, i.e. set quotechar to a character that never appears in the file.
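A minimal reproduction of both the failure and the workaround (a sketch: the sample data is invented, and tab-delimited like GBIF's occurrence.txt, but `quotechar` is the real CSV.jl keyword):

```julia
using CSV, DataFrames

# A tab-delimited cell that begins with '"' but never closes it,
# like verbatim specimen-label text:
data = "id\tlabel\n1\t\"Arthurs Ck mouth\n2\tplain text\n"

# CSV.read(IOBuffer(data), DataFrame; delim = '\t')
# → fatal error: invalidly quoted field

# Disable quote handling by pointing quotechar at a byte that
# never occurs in the file:
df = CSV.read(IOBuffer(data), DataFrame; delim = '\t', quotechar = '\0')
```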


Thanks very much! I had just tried quotechar = false to see if it would work, but apparently that’s not valid… Might be a useful option? In these big, messy files it can take some trial and error to find a character combo that never occurs.

Ah, you can’t use multiple characters?

julia> @time GBIF_protozoaNoEscape = CSV.read("occurrence.txt", DataFrame; header = true, quotechar = '{}');
ERROR: syntax: character literal contains multiple characters
Stacktrace:
 [1] top-level scope
   @ none:1

edit: but I should have just tried your original suggestion! Oops!

julia> @time GBIF_protozoaNoEscape = CSV.read("occurrence.txt", DataFrame; header = true, quotechar = '\0');
 21.961506 seconds (17.54 M allocations: 1.958 GiB, 27.29% gc time)

Sorry for the slow response, but note that you can pass separate openquotechar='{' and closequotechar='}' arguments if your cells are actually quoted like {quoted \n string}.
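For instance (a sketch with invented data; `openquotechar` and `closequotechar` are the CSV.jl keywords named above):

```julia
using CSV, DataFrames

# Brace-quoted field containing an embedded tab, which stays inside
# the cell because { and } delimit it:
data = "id\tnote\n1\t{a quoted\tfield}\n"
df = CSV.read(IOBuffer(data), DataFrame;
              delim = '\t', openquotechar = '{', closequotechar = '}')
```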
