The code (at strings/string.jl:168 in version 1.0) correctly returns false for 2-byte and 4-byte overlong sequences, but is broken for 3-byte overlong sequences (it reports them as valid).
julia> isvalid(String, UInt8[0xf0,128,128,128])
julia> isvalid(String, UInt8[0xe0,128,128])
julia> isvalid(String, UInt8[0xc0,128])
If nobody beats me to it, I’ll fix it (but will need somebody to create the PR).
Thanks for your response!
Was good to see you all at the Meetup this week!
If somebody wants to pick it up, I have a fix (with tests!) on my fork:
Now I have to hang my head in shame: when I fixed some other bugs in UTF-8 validation 3 years ago
(such as detecting UTF-16 surrogates present in UTF-8), I missed one check, the one for the overlong 3-byte case.
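To show how the surrogate check and the overlong check sit side by side for 3-byte sequences, here is a sketch based on the well-formed byte ranges in the Unicode Standard. This is not the code in Base and not the fix on my fork. `valid_3byte` is a hypothetical name used only for illustration.

```julia
# Hypothetical helper: validate a single 3-byte UTF-8 sequence by byte ranges.
function valid_3byte(b1::UInt8, b2::UInt8, b3::UInt8)
    0xe0 <= b1 <= 0xef || return false   # must be a 3-byte lead byte
    0x80 <= b3 <= 0xbf || return false   # third byte is a plain continuation byte
    if b1 == 0xe0
        # A second byte below 0xa0 would encode a code point under U+0800 (overlong).
        return 0xa0 <= b2 <= 0xbf
    elseif b1 == 0xed
        # A second byte above 0x9f would encode U+D800..U+DFFF (UTF-16 surrogates).
        return 0x80 <= b2 <= 0x9f
    else
        return 0x80 <= b2 <= 0xbf
    end
end

valid_3byte(0xe0, 0x80, 0x80)  # overlong, should be rejected
valid_3byte(0xed, 0xa0, 0x80)  # surrogate U+D800, should be rejected
valid_3byte(0xe2, 0x82, 0xac)  # U+20AC, should be accepted
```

The `0xe0` branch is the kind of check that was missing: the surrogate case restricts the second byte after `0xed`, and the overlong case needs the matching restriction after `0xe0`.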
Hopefully some nice person can pull this fix in!