I need to parse some data out of PDF files that have been converted (via OCR) to text. Unfortunately, the app generates bad UTF-8. I will go back and see if I can get it to output ASCII, which could solve my problem (and waste lots of time, again).
In the meantime, I have identified the “bad” characters with
as = [isascii(tt[i]) for i in eachindex(tt)]; badones = find(!as);
So, now badones contains the indexes of the “bad” characters from the string tt. There are 297 out of 360486 characters.
For my parsing code to work, I need to get rid of the bad characters. I see no easy way to do it.
I can certainly go to each starting index of a character, find its length in bytes, and replace all of those bytes. But, I seem unable to broadcast into the string:
tt[9823:9826] .= '?' # for a 4 byte utf-8 character replaced with 4 ?s
That doesn’t work. I can’t do it one byte at a time either, because that means indexing into the middle of a character, which is an invalid index. I probably need a new approach, but I can’t see how to get out of the catch-22.
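One possible way around the immutability, sketched under the assumption of Julia 1.x (where `codeunits` is available; the snippets above were written for an older release), is to copy the string's bytes into a mutable vector, overwrite every byte at or above 0x80, and rebuild the string. The helper name `replace_nonascii_bytes` is made up for illustration, and the filler is assumed to be an ASCII character. Because each byte of a multi-byte character becomes its own filler, all byte offsets stay the same:

```julia
# Hypothetical helper: replace every non-ASCII byte with a filler byte.
# It never decodes characters, so invalid UTF-8 is not a problem.
# Assumes `filler` is an ASCII character.
function replace_nonascii_bytes(s::AbstractString, filler::Char='?')
    bytes = collect(codeunits(s))            # mutable copy of the raw bytes
    bytes[bytes .>= 0x80] .= UInt8(filler)   # ASCII bytes are 0x00-0x7f
    return String(bytes)
end

# 'a', an invalid byte, 'b', a 2-byte 'é', 'c'
sample = String(UInt8[0x61, 0xff, 0x62, 0xc3, 0xa9, 0x63])
cleaned = replace_nonascii_bytes(sample)
```

The result is plain ASCII with the same number of bytes as the input, so any byte indices collected earlier still point at the same positions.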
I tried a simple ascii(tt), which is supposed to throw an error that reports the bad byte index, but it doesn’t. It just errors out without reporting any byte index.
I tried String(tt), which gives me an array of bytes but points out that the string contains invalid UTF-8 data. Yes, I know; that’s what I am trying to get rid of. But it seems impossible to touch the bad bytes by indexing.
I could try UTF8proc.normalize_string but I totally don’t understand what arguments will set the string to rights. Actually, I tried it with just :NFC to give it a shot, but it errors out telling me I have an invalid UTF-8 string.
This is just circular: only about 0.08% of the characters are bad (297 of 360486), and there appears to be no programmatic way to purge or convert them. I could build a new string by walking around them, since I think I have their locations, maybe.
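Building a new string while skipping or substituting the bad characters might look like the sketch below. It again assumes Julia 1.x, where iterating a string yields malformed data as `Char` values for which `isascii` returns false instead of throwing; whether the older release behaves the same way is not something I have checked.

```julia
# 'a', an invalid byte, 'b', a 2-byte 'é', 'c'
sample = String(UInt8[0x61, 0xff, 0x62, 0xc3, 0xa9, 0x63])

# Drop every non-ASCII character (later byte offsets shift left).
dropped = filter(isascii, sample)

# Or keep one placeholder per bad character instead of dropping it.
masked = map(c -> isascii(c) ? c : '?', sample)
```

Both variants work per character rather than per byte, so a multi-byte character collapses to nothing or to a single '?', and the result is shorter in bytes than the input.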
I realize none of this is Julia’s fault. I’ve read many of the postings while trying to find solutions, and I realize that we are already in a much improved state. Still, there is some way to go before some kind of bulk purging/replacing is easy.
Let me know if there are any suggestions!