Purging utf-8 bad characters


#1

I need to parse some data out of PDF files that have been converted to text by OCR. Unfortunately, the app generates bad UTF-8. I will go back and see if I can get it to output ASCII encoding, which could solve my problem (and waste lots of time, again).

In the meantime, I have identified the “bad” characters with

as = [isascii(tt[i]) for i in eachindex(tt)];
badones = find(.!as);  # elementwise negation of the Bool array

So, now badones contains the indexes of the “bad” characters from the string tt. There are 297 out of 360486 characters.
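
For reference, a sketch of the same search on a current Julia (1.x), where find has been replaced by findall and a comprehension does the job directly. The helper name nonascii_indices and the sample string are made up for illustration.

```julia
# Hypothetical helper: byte indices at which a non-ASCII character starts.
# eachindex on a String yields only valid character start indices.
nonascii_indices(s::AbstractString) = [i for i in eachindex(s) if !isascii(s[i])]

tt = "H∃llø, woℝld."            # made-up sample text
badones = nonascii_indices(tt)
println(badones)                 # byte offsets, not character counts
```

The indices are byte offsets into the string, which is why stepping from one to the next with plain integer arithmetic can land inside a multi-byte character.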

For my parsing code to work, I need to get rid of the bad characters. I see no easy way to do it.

I can certainly go to each starting index of a character, find its length in bytes, and replace all of those bytes. But, I seem unable to broadcast into the string:

tt[9823:9826] .= '?'  # for a 4 byte utf-8 character replaced with 4 ?s

That doesn’t work. I can’t do it one byte at a time either, because I would be indexing into the middle of a character. I probably need a new approach, but I can’t see how to get out of the catch-22.
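
Julia strings are immutable, so no in-place assignment will work; the usual way out is to build a new string. A minimal sketch, assuming Julia 1.x. The helper name mask_nonascii is hypothetical, and it pads each replaced character to its byte width so that byte offsets in the result line up with the original.

```julia
# Hypothetical helper: replace every non-ASCII character with `sub`,
# repeated once per byte of the original character, so the result has
# the same number of bytes as the input (provided `sub` is ASCII).
function mask_nonascii(s::AbstractString; sub::Char = '?')
    io = IOBuffer()
    for c in s                          # iterates whole characters, never partial bytes
        if isascii(c)
            print(io, c)
        else
            print(io, repeat(string(sub), ncodeunits(c)))
        end
    end
    return String(take!(io))
end

masked = mask_nonascii("H∃llø, woℝld.")
println(masked)
```

If the byte length does not need to be preserved, filtering or a one-for-one character replacement is simpler.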

I tried simple ascii(tt), which is supposed to report the index of the bad byte in its error, but it doesn’t. It just errors out without giving a byte index.

Tried String(tt), which gives me an array of bytes, but points out that the string contains invalid UTF-8 data. Yah, I know–that’s what I am trying to get rid of. But, it seems impossible to touch the bad bytes by indexing.

I could try UTF8proc.normalize_string but I totally don’t understand what arguments will set the string to rights. Actually, I tried it with just :NFC to give it a shot, but it errors out telling me I have an invalid UTF-8 string.

This is just circular: about 0.08% of the characters are bad (297 of 360486) and there appears to be no programmatic way to purge or convert them. I could build a new string by walking around them, since I think I have their locations. Maybe.

I realize none of this is Julia’s fault and I’ve read many of the postings trying to find solutions and realize that we are already in a much improved state. Still some ways to go to do some kind of bulk purging/replacing.

Let me know if there are any suggestions!

Thanks,
Lewis


#2

Clarification: the actual problem has nothing to do with bad utf8 code points in the text I am parsing.

When I filtered on isvalid rather than isascii, all the characters are valid utf8.

So, the actual problem is that 297 of the characters have no ascii interpretation. I was applying lowercase() to the entire string because the various files are inconsistent in how case is being used. But, lowercase() errors out on some of those characters.

I read the long discussion thread on how to deal with this and whether an error should be generated or some arbitrary other character should stand in for the ones that could not be converted. That discussion went thoughtfully back and forth. I thought the resolution was to make it work, but throwing an error was apparently the final conclusion.

I can easily work around this by lowercasing smaller known safe chunks of the text.

I can see that the function would certainly be slower in a try/catch clause. Don’t know if it would be worth a “safe” method as in:

lowercase(mystring; safe::Bool=false, substitute='~')

There is no obvious substitute character as all punctuation is meaningful these days and one would want the return string to remain the same length as the input.
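
A rough sketch of what such a "safe" variant could look like as a user-level function, assuming Julia 1.x. The name safe_lowercase and the '~' default are hypothetical, not part of Base. It checks each character with isvalid and substitutes the invalid ones rather than passing them to lowercase.

```julia
# Hypothetical `safe_lowercase`: lowercase valid characters and swap any
# invalid one for `substitute` instead of handing it to lowercase.
# The result has the same number of characters as the input; the byte
# length can still differ.
function safe_lowercase(s::AbstractString; substitute::Char = '~')
    return map(c -> isvalid(c) ? lowercase(c) : substitute, s)
end

s = "Foo \x83 BAR"                  # contains one invalid byte, 0x83
println(safe_lowercase(s))
```

No try/catch is needed here, since the validity test happens before the case conversion.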

Generally, I shouldn’t have an indexing problem as I use indices returned by searchindex. If I need to loop over the characters I can use eachindex and nextind.


#3
julia> filter(isascii, "H∃llø, woℝld.")
"Hll, wold."

#4

It would be easier to give advice if you could post a short snippet of string that illustrates the problem you are having.


#5

What version of Julia are you using?
If the string is valid utf8, then lowercase should work on all of the characters.


#6

Here are a couple of examples. I think from the stacktrace the problem is indexing into a multi-byte character rather than lowercase’ing it. Both examples involve using contains(). I would have thought contains was able to walk characters without breaking one.

85       if contains(lowercase(line), "continue")
86                            continue

output:

ERROR: UnicodeError: invalid character index
Stacktrace:
 [1] slow_utf8_next(::Ptr{UInt8}, ::UInt8, ::Int64, ::Int64) at ./strings/string.jl:172
 [2] next at ./strings/string.jl:204 [inlined]
 [3] map(::Base.#lowercase, ::String) at ./strings/basic.jl:481
 [4] #parsedata#1(::Bool, ::Function, ::String, ::Int64) at /Users/lewis/Dropbox/American Campaign/Data_Analysis/parsedata.jl:85
 [5] (::#kw##parsedata)(::Array{Any,1}, ::#parsedata, ::String, ::Int64) at ./<missing>:0

here is the value of line:

7. Andre?Carson, Democrat...........................................................................172,650 

the character at the question mark position is:

'�': Unicode U+fffd (category So: Symbol, other)

Julia thinks that character is valid:

julia> isvalid(bad[1])  # pasted the string into the str variable bad
true

Here is another example:

failed on this code line:

elseif contains(lowercase(line), "for ") 

output:

Rick Biondi, Libertarian............................................................................10,137 
ERROR: UnicodeError: invalid character index
Stacktrace:
 [1] slow_utf8_next(::Ptr{UInt8}, ::UInt8, ::Int64, ::Int64) at ./strings/string.jl:172
 [2] next at ./strings/string.jl:204 [inlined]
 [3] map(::Base.#lowercase, ::String) at ./strings/basic.jl:481
 [4] #parsedata#12(::Bool, ::Function, ::String, ::Int64) at /Users/lewis/Dropbox/American Campaign/Data_Analysis/parsedata.jl:98
 [5] (::#kw##parsedata)(::Array{Any,1}, ::#parsedata, ::String, ::Int64) at ./<missing>:0

This 2nd one is a bit mystifying: there don’t appear to be any non-ascii code points in it.

I am mostly working around this using other approaches. But, as I data-wrangle through these horrible files I keep peeling the onion to find some new gotchas. And these are the “clean” files.

Thanks for all your help. These “mess” questions are also sort of hard to parse.


#7

That U+FFFD is not actually what’s in the string data. An invalid character is typically displayed in the terminal by substituting U+FFFD. Most likely the actual data contains invalid UTF-8, and that’s why lowercase is failing.

For example, in Julia 0.6:

julia> "foo \x83 bar"
ERROR: syntax: invalid UTF-8 sequence

julia> lowercase(String(UInt8[0x66, 0x6f, 0x6f, 0x20, 0x83, 0x20, 0x62, 0x61, 0x72]))
ERROR: UnicodeError: invalid character index

Julia 0.7 allows you to enter and iterate over invalid UTF-8 data, and lets you filter out invalid characters, but still will fail (with a more comprehensible error) on lowercase:

julia> s = "foo \x83 bar"
"foo \x83 bar"

julia> isvalid(s)
false

julia> lowercase(s)
ERROR: Base.InvalidCharError{Char}('\x83')

julia> filter(isvalid, s)
"foo  bar"

Most likely, your string data is valid but is just not UTF-8 encoded, and you should figure out what encoding it is (probably Windows-1252) and convert it to UTF-8 using a tool like iconv or StringEncodings.jl.
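
As a rough illustration of why re-decoding helps, here is a Base-only sketch for Julia 1.x. It treats each byte as its own code point, which is ISO-8859-1 (Latin-1) rather than Windows-1252; the two agree except for bytes 0x80 to 0x9F, so for real Windows-1252 data a converter such as iconv or StringEncodings.jl is the right tool. The function name and the sample bytes (a 0xA0 byte between two words) are assumptions for illustration, not data taken from the file in question.

```julia
# Hypothetical helper: decode ISO-8859-1 (Latin-1) bytes into a UTF-8 String.
# Latin-1 maps byte value n directly to code point U+00nn, so Char(b) suffices.
# NOTE: this is NOT Windows-1252; bytes 0x80-0x9F differ between the two.
latin1_to_utf8(bytes::AbstractVector{UInt8}) = String(map(Char, bytes))

raw = UInt8[0x41, 0x6e, 0x64, 0x72, 0x65, 0xa0, 0x43]   # "Andre", 0xA0, "C"
println(isvalid(String, raw))       # a lone 0xA0 byte is not valid UTF-8
fixed = latin1_to_utf8(raw)
println(isvalid(fixed))             # after decoding, the string is valid UTF-8
println(lowercase(fixed))
```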


#8

Can I pull out the actual bytes with Vector{UInt8}(str) ?

I’ll look at the ocr software and see if I can regenerate with utf-8 encoding–or maybe it just has a bug because the rest of that file appears to be valid utf-8.

Looks like I can use StringEncodings.jl to force the conversion to UTF-8 in bulk or by line (which is how I do the parsing). That could solve pretty much the whole issue and let me work without workarounds. It’s all been an education.

Final question: can you tell from the stack trace whether it’s contains() or lowercase() that is failing?


#9

Yes.

UTF-8 and Windows-1252 are the same for ASCII characters. So if most of the file is ASCII it would appear to be mostly okay. That’s why I suggested it is probably Windows-1252: this is the most common non-UTF8 encoding that is ASCII compatible.

No need to regenerate the file, since there are lots of tools to convert encodings (once you know what encoding it is using).


#10

Loading the file with StringEncodings.open(readstring, <filename>, enc"WINDOWS-1252") solves the problem. Of course, there is no easy way to test an encoding except by trying. The parser usually crashes if it assumes UTF-8 for a file encoded differently, so that’s how I’ll know.
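
One cheap check before parsing, sketched for Julia 1.x: data that is not valid UTF-8 can at least be detected up front with isvalid, even though that says nothing about which encoding it actually uses. The helper name invalid_indices is hypothetical and the bytes are just the sample from earlier in the thread.

```julia
# Hypothetical helper: byte indices of characters that are not valid UTF-8.
# Indexing a String in Julia 1.x returns an invalid Char for bad bytes
# rather than throwing, so they can be located with isvalid.
invalid_indices(s::String) = [i for i in eachindex(s) if !isvalid(s[i])]

# Same bytes as the "foo \x83 bar" example earlier in the thread.
bytes = UInt8[0x66, 0x6f, 0x6f, 0x20, 0x83, 0x20, 0x62, 0x61, 0x72]
println(isvalid(String, bytes))     # whole-buffer validity check
s = String(copy(bytes))             # copy, because String(bytes) takes ownership
println(invalid_indices(s))
println(codeunits(s)[5])            # the raw byte at a given position
```

For a file, the byte vector could come from read(path), with the conversion step applied only when the validity check fails.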

I’ll probably go back and either re-scan or convert, as you suggest.

I remember this from the old days when Windows had “code pages”. UTF-8 was supposed to come to the rescue. Almost…

Very helpful. Thanks.


#11

Trivial to fix in Visual Studio Code. By default it loads files as UTF-8, but you can reopen a file with a different encoding, and it tends to guess the right encoding pretty well; it has already loaded the entire file, after all. Then it’s trivial to save with a new encoding. About 4 seconds per round trip…

I am sure Sublime Text can do as well…

Case closed.