Hi, while reading in a large, messy dataset, I got the message:
using DataFrames, CSV, Random, Tables
GBIF_chordate = CSV.read("occurrence.txt", DataFrame; header = true);
**ERROR:** thread = 1 fatal error, encountered an invalidly quoted field while parsing around row = 16331414, col = 125: ""Arthurs Ck mouth, Gulf of Car
Totally understandable that there would be some impossible syntaxes for CSV to read in, and I can probably fix it with grep or sed… but are there ways around this? Are there other impossible character combos I should screen for when reading in messy datasets? Thanks in advance!
For what it’s worth, I think my packages are up to date
pkg> status
Status `~/.julia/environments/v1.6/Project.toml`
[336ed68f] CSV v0.8.4
[a93c6f00] DataFrames v1.1.1
[bd369af6] Tables v1.4.2
[9a3f8284] Random
I’m guessing this won’t require an example dataset, since it’s an I/O problem that seems pretty easy to diagnose, but if anyone just wants a ~25GB dataset to play around with, you can download it here https://www.gbif.org/occurrence/download/0274460-200613084148143
Maybe play around with the quoting options in CSV.jl? You can also set it to read in just a few rows, which can help you pin down the error.
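For example, something like this (just a sketch — the limit value is arbitrary, and exact keyword names can vary a bit between CSV.jl versions):
using CSV, DataFrames

# Parse only the first 1,000 data rows as a quick sanity check;
# raise the limit (or bisect) to home in on the offending row.
preview = CSV.read("occurrence.txt", DataFrame; header = true, limit = 1000)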
Thanks, yeah, the fatal error occurred on row = 16331414, so it would take a while to track down reading row by row haha
The issue is that it’s a big file with lots of old museum specimen data in it: there are columns specifically for decimal-degree latitude/longitude, but elsewhere in the row there is verbatim data from specimen labels, which often includes lat/long in formats like this: Southern Ocean (65° 19’ 30" S, 95° 27’ E)
So there are just lots of problematic open-quote situations…
I just used this sed example to remove all the quotes from the file, and it seems to have worked:
https://unix.stackexchange.com/questions/236154/shell-script-to-remove-double-quotes-within-the-column-value
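(For the record, the sed approach just strips every double-quote character from the raw file before parsing; a rough Julia equivalent, writing to the output filename I used below, would be something like this:)
# Stream the raw file and drop every '"' character, writing a cleaned copy.
open("occurrenceNoQuotes.txt", "w") do out
    for line in eachline("occurrence.txt")
        println(out, replace(line, '"' => ""))
    end
end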
julia> using DataFrames, CSV, Tables, Random
julia> GBIF_chordate = CSV.read("occurrenceNoQuotes.txt", DataFrame; header = true);
julia> size(GBIF_chordate)
(16343123, 250)
Yeah, the only fatal error for CSV.jl is when you have an invalidly quoted cell like that: if it starts parsing a cell and encounters an “open quote character”, it needs to find a matching “close quote character”, and if it doesn’t, it hard bails. So you perhaps have a cell somewhere that starts with a '"' character, purely by coincidence, and because CSV.jl defaults to quotechar='"', it thinks the cell will be quoted, but then doesn’t find a closing quote and panics. One potential solution, if you know you don’t have any quoted fields in your file, is to do quotechar='\0' or something, i.e. set quotechar to something that it won’t find at all in the file.
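To make that concrete with a tiny invented example (made-up tab-delimited data mimicking the stray-quote situation):
using CSV, DataFrames

# One cell happens to start with '"' but never closes it, which is what
# trips the default quote handling on the real file.
data = """
id\tverbatimLocality
1\tSomewhere sensible
2\t"Arthurs Ck mouth, Gulf of Car
3\tAnother locality
"""

# Disabling quoting lets the stray quote through as ordinary text.
df = CSV.read(IOBuffer(data), DataFrame; delim = '\t', quotechar = '\0')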
Thanks very much! I had just tried quotechar = false to see if it would work, but apparently that’s not valid… Might be a useful option? In these big messy files, it might take some trial and error to find a character combo that never occurs.
ah, you can’t do multiple characters?
julia> @time GBIF_protozoaNoEscape = CSV.read("occurrence.txt", DataFrame; header = true, quotechar = '{}');
ERROR: syntax: character literal contains multiple characters
Stacktrace:
[1] top-level scope
@ none:1
edit: but I should have just tried your original suggestion! Oops!
julia> @time GBIF_protozoaNoEscape = CSV.read("occurrence.txt", DataFrame; header = true, quotechar = '\0');
21.961506 seconds (17.54 M allocations: 1.958 GiB, 27.29% gc time)
Sorry for the slow response, but note that you can pass separate openquotechar='{' and closequotechar='}' if your cells are actually quoted like {quoted \n string}.
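A quick invented example of what that would look like:
using CSV, DataFrames

# Fields wrapped in braces instead of double quotes, including one field
# that spans a newline.
data = """
id\tnote
1\t{quoted
string}
2\tplain
"""

df = CSV.read(IOBuffer(data), DataFrame;
              delim = '\t',
              openquotechar = '{',
              closequotechar = '}')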