The code (at strings/string.jl:168 in version 1.0) correctly returns false for 2-byte and 4-byte overlong sequences, but wrongly returns true for 3-byte overlong sequences:
julia> isvalid(String, UInt8[0xf0,128,128,128])
false
julia> isvalid(String, UInt8[0xe0,128,128])
true
julia> isvalid(String, UInt8[0xc0,128])
false
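To make it clearer why the middle case is a bug, here is a minimal sketch that decodes the payload bits of each sequence and compares the result with the smallest code point that actually needs that many bytes. The names `raw_codepoint` and `is_overlong` are made up for illustration; they are not part of Base and not how `isvalid` is implemented.

```julia
# Hypothetical helper (not in Base): decode the payload bits of a single
# 2-, 3-, or 4-byte UTF-8 sequence, with no validation at all.
function raw_codepoint(bytes::Vector{UInt8})
    n = length(bytes)
    # The lead byte carries 5, 4, or 3 payload bits for 2, 3, or 4 bytes.
    cp = UInt32(bytes[1]) & (0xff >> (n + 1))
    for b in bytes[2:end]
        # Each continuation byte contributes its low 6 bits.
        cp = (cp << 6) | (UInt32(b) & 0x3f)
    end
    return cp
end

# A sequence is overlong if its code point would fit in fewer bytes.
function is_overlong(bytes::Vector{UInt8})
    # Smallest code point that needs 2, 3, or 4 bytes respectively.
    min_cp = (0x80, 0x800, 0x10000)[length(bytes) - 1]
    return raw_codepoint(bytes) < min_cp
end

is_overlong(UInt8[0xc0, 128])            # true
is_overlong(UInt8[0xe0, 128, 128])       # true, so isvalid should return false
is_overlong(UInt8[0xf0, 128, 128, 128])  # true
```

All three inputs decode to U+0000, which fits in a single byte, so all three should be rejected.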
If nobody beats me to it, I’ll fix it (but will need somebody to create the PR).
Thanks for your response!
Was good to see you all at the Meetup this week!
If somebody wants to pick it up, I have a fix (with tests!) on my fork:
Now I have to hang my head in shame, because when I fixed some other bugs in UTF-8 validation 3 years ago
(such as detecting UTF-16 surrogates encoded in UTF-8), I missed one check: the overlong 3-byte case.
https://github.com/JuliaLang/julia/issues/11141
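For anyone following along, both 3-byte problems mentioned here (surrogates and overlong forms) come down to the allowed range of the second byte. The following is only a rough sketch of that rule as given in the Unicode standard's table of well-formed UTF-8 byte sequences; `valid_3byte` is a hypothetical name and this is not the code from the fork.

```julia
# Hypothetical helper (not in Base): is (b1, b2, b3) one well-formed
# 3-byte UTF-8 sequence?
function valid_3byte(b1::UInt8, b2::UInt8, b3::UInt8)
    0xe0 <= b1 <= 0xef || return false
    if b1 == 0xe0
        # Second byte below 0xa0 would encode U+0000..U+07FF (overlong).
        lo, hi = 0xa0, 0xbf
    elseif b1 == 0xed
        # Second byte above 0x9f would encode surrogates U+D800..U+DFFF.
        lo, hi = 0x80, 0x9f
    else
        lo, hi = 0x80, 0xbf
    end
    return lo <= b2 <= hi && 0x80 <= b3 <= 0xbf
end

valid_3byte(0xe0, 0x80, 0x80)  # false: overlong
valid_3byte(0xed, 0xa0, 0x80)  # false: surrogate U+D800
valid_3byte(0xe2, 0x82, 0xac)  # true: U+20AC
```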
Hopefully some nice person can pull this fix in!
Thanks,
Scott