Bug in isvalid with an overlong UTF-8 encoded vector or string

The code (at strings/string.jl:168 on version 1.0) correctly returns false for 2-byte and 4-byte overlong sequences, but incorrectly returns true for 3-byte overlong sequences.

julia> isvalid(String, UInt8[0xf0,128,128,128])
false

julia> isvalid(String, UInt8[0xe0,128,128])
true

julia> isvalid(String, UInt8[0xc0,128])
false

Thanks for the bug report. Issue filed: invalid bug for three-byte characters · Issue #29311 · JuliaLang/julia · GitHub.

If nobody beats me to it, I’ll fix it (but will need somebody to create the PR).
Thanks for your response!
Was good to see you all at the Meetup this week!

If somebody wants to pick it up, I have a fix (with tests!) on my fork:

Now, I have to hang my head in shame: when I fixed some other bugs in UTF-8 validation 3 years ago (such as detecting UTF-16 surrogates present in UTF-8), I missed a check for the overlong 3-byte case.
https://github.com/JuliaLang/julia/issues/11141
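
For anyone following along, here is a rough sketch of what the missing check amounts to. This is not the code in Base or the fix on the fork; `is_overlong_3byte` is a hypothetical helper written only for illustration. A 3-byte sequence is overlong when its payload decodes to a code point below U+0800, which would fit in 1 or 2 bytes.

```julia
# Hypothetical helper, not the code in Base: reports whether a 3-byte UTF-8
# sequence is overlong, i.e. encodes a code point below U+0800.
function is_overlong_3byte(b1::UInt8, b2::UInt8, b3::UInt8)
    # The bytes must have the shape 1110xxxx 10xxxxxx 10xxxxxx.
    (b1 & 0xf0) == 0xe0 || return false
    (b2 & 0xc0) == 0x80 || return false
    (b3 & 0xc0) == 0x80 || return false
    # Reassemble the 16 payload bits into a code point.
    cp = (UInt32(b1 & 0x0f) << 12) | (UInt32(b2 & 0x3f) << 6) | UInt32(b3 & 0x3f)
    return cp < 0x0800
end

is_overlong_3byte(0xe0, 0x80, 0x80)  # true:  payload is U+0000
is_overlong_3byte(0xe0, 0xa0, 0x80)  # false: U+0800, the smallest valid 3-byte value
is_overlong_3byte(0xe2, 0x82, 0xac)  # false: U+20AC
```

Equivalently, a validator can reject a lead byte of 0xe0 whenever the second byte is below 0xa0, which avoids decoding the whole sequence.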

Hopefully some nice person can pull this fix in!

Thanks,
Scott
