Whitespace as missingstring in CSV.File()

I have a CSV file which has all of "" (empty string), " " (1x whitespace), and " " (2x whitespace) as missing values (horrible, I know). I tried tackling it with CSV.File(file, missingstrings = ["", " ", " "]) but apparently whitespace characters are not allowed as I get the error sentinel value isn’t allowed to start with wh1 or wh2 characters.

Is there a workaround for that or do I just go over the file afterwards and replace the additional missings?

cc @quinnj this seems like a bug in CSV

Ah, the issue here is the 2x space missingstring; by default, CSV.jl uses a single space to remove leading/trailing whitespace around numbers when parsing, and if a single space is used as missingstring, then it doesn’t strip the leading/trailing whitespace. But in the case of 2x space, it’s not switching to not stripping whitespace, but also realizes there will be a conflict (i.e. it will try to strip whitespace, but hte sentinel value is also whitespace). Let me think of a better way we can handle this in CSV.jl. Maybe we just need to allow manually setting the “whitespace” characters to strip, because then you coudl set them to 0x00 and not get hte conflict.

Great, thanks! I also opened an issue.

If you’re allowed to modify the csv, you can find and replace with Ctrl + F. Find ," ", and replace by ,"",. As a workaround

Sure, but I’d rather not edit files manually. Also they’re 8GB each so a manual find and replace is probably not the best idea :slight_smile:

As a workaround I can for now do a replace!(col, " " => missing) after reading the file, but it would probably be more efficient / elegant if this could be handled in the initial parsing step.