I have a CSV file which has all of
"" (empty string),
" " (1x whitespace), and " " (2x whitespace) as missing values (horrible, I know). I tried tackling it with
CSV.File(file, missingstrings = ["", " ", " "]) but apparently whitespace characters are not allowed as I get the error sentinel value isn’t allowed to start with wh1 or wh2 characters.
Is there a workaround for that or do I just go over the file afterwards and replace the additional missings?
cc @quinnj this seems like a bug in CSV
Ah, the issue here is the 2x space missingstring; by default, CSV.jl uses a single space to remove leading/trailing whitespace around numbers when parsing, and if a single space is used as
missingstring, then it doesn’t strip the leading/trailing whitespace. But in the case of 2x space, it’s not switching to not stripping whitespace, but also realizes there will be a conflict (i.e. it will try to strip whitespace, but hte sentinel value is also whitespace). Let me think of a better way we can handle this in CSV.jl. Maybe we just need to allow manually setting the “whitespace” characters to strip, because then you coudl set them to
0x00 and not get hte conflict.
Great, thanks! I also opened an issue.
If you’re allowed to modify the csv, you can find and replace with Ctrl + F. Find
," ", and replace by
,"",. As a workaround
Sure, but I’d rather not edit files manually. Also they’re 8GB each so a manual find and replace is probably not the best idea
As a workaround I can for now do a
replace!(col, " " => missing) after reading the file, but it would probably be more efficient / elegant if this could be handled in the initial parsing step.