Valid chars

question
strings

#1

Disclaimer: I’m totally abusing the fact that there are people here who really know what they’re doing when it comes to strings; this has little to do with Julia.

When working with databases that store strings, sometimes the string might include a character that is invalid, that is, it fails isvalid. I have some control over the input to these databases. So I could limit the user’s input to exclude certain characters, all in the interest of keeping the data valid for the modest cost of user expressiveness. How would you best limit a user’s input to guarantee the string is valid?

Thanks!


#2

Not sure what you are asking about, since isvalid tests bytes in a string from a particular offset. Valid strings can fail for isvalid from certain positions, as shown in the docstring (?isvalid).
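For reference, a minimal sketch of the two different things isvalid can check. The string is my own example, not data from this thread:

```julia
s = "αβγ"        # each Greek letter occupies 2 bytes in UTF-8

isvalid(s, 1)    # true:  byte 1 is the start of the character 'α'
isvalid(s, 2)    # false: byte 2 is the second byte of 'α', not a character start
isvalid(s)       # true:  the string as a whole is valid UTF-8
```

So a false from the two-argument, index-based form says nothing about whether the string as a whole is well-formed.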


#3

I’m trying to build some kind of MWE to show you what I mean, but it’s kind of hard. Anyway, one of the strings includes a degree symbol that gets read as ‘\xb0’. This seems to be an invalid character.

It is entirely possible that I’m misusing isvalid here!
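A small sketch that might serve as the MWE. The strings below are made up for illustration; the only thing taken from the real data is the lone byte 0xb0:

```julia
bad  = "25\xb0C"      # a lone byte 0xb0, which is not a complete UTF-8 sequence
good = "25\u00b0C"    # the degree sign U+00B0, properly encoded

isvalid(bad)          # false
isvalid(good)         # true
all(isvalid, bad)     # false: the character built from the lone 0xb0 is invalid

# In valid UTF-8 the degree sign is the two bytes 0xc2 0xb0.
Vector{UInt8}(good) == UInt8[0x32, 0x35, 0xc2, 0xb0, 0x43]   # true
```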


#4

Here is an example of an invalid piece of text.

To make it even simpler, here’s a piece of code to download, read, and test the example:

link = "https://www.dropbox.com/s/10b8yodzm8eq3kv/tmp.json?dl=0"
run(`wget -O tmp.json $link`)
txt = read("tmp.json", String)
@assert all(isvalid, txt) # this fails

#5

That just looks like malformed UTF-8, i.e. the file is not validly UTF-8 encoded. It looks something like Latin-1, if the 0xb0 is ° (the degree sign).

It would be best to establish the valid format (whoever generated the file should know), or failing that, convert heuristically and/or reject invalid strings.
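A rough sketch of both options, assuming the fallback encoding really is Latin-1 (that is a guess; Windows-1252, for instance, differs in the 0x80 to 0x9f range). The helper names strict_utf8 and utf8_or_latin1 are hypothetical, not part of Base:

```julia
# Hypothetical helper: accept only valid UTF-8, otherwise reject with an error.
function strict_utf8(bytes::Vector{UInt8})
    isvalid(String, bytes) || throw(ArgumentError("input is not valid UTF-8"))
    return String(copy(bytes))   # copy, because String(bytes) takes over the vector
end

# Hypothetical helper: keep valid UTF-8 as is, otherwise decode as Latin-1.
# In Latin-1 every byte value equals the Unicode code point of its character.
function utf8_or_latin1(bytes::Vector{UInt8})
    isvalid(String, bytes) && return String(copy(bytes))
    return join(Char(b) for b in bytes)
end

raw = UInt8[0x32, 0x35, 0xb0, 0x43]   # "25°C" encoded as Latin-1
fixed = utf8_or_latin1(raw)           # "25°C", now valid UTF-8
```

The heuristic can guess wrong silently, which is why knowing the real source encoding is preferable to converting after the fact.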


#6

Great! So in an ideal world, the validity of the UTF-8 encoding of each string would be checked before it is accepted into the database. This sounds straightforward, but I feel it might be more complicated than that… Any pitfalls I should be warned about? And btw, in Julia, is there any way to test the validity of a string’s UTF-8 encoding?

Right now I do the latter (with isvalid):


#7

In Julia, strings are stored as UTF-8, so isvalid(String, s), where s is read in by e.g. s = readline(<path>), checks whether the string that was read is valid UTF-8. You could also read the data in as a Vector{UInt8} and call isvalid(String, s) on that, going by the help page.

help?> isvalid
#...
isvalid(T, value) -> Bool

Returns true if the given value is valid for that type. 
Types currently can be either AbstractChar or String. 
Values for AbstractChar can be of type AbstractChar or UInt32. 
Values for String can be of that type, or Vector{UInt8}.
#...

julia> s = "Hello"
"Hello"

julia> UInt8[s...]
5-element Array{UInt8,1}:
 0x48
 0x65
 0x6c
 0x6c
 0x6f

julia> isvalid(String, ans)
true
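One caveat, sketched below with a made-up string: converting characters one by one to UInt8 yields code points, which match the UTF-8 bytes only for ASCII text. To get the actual bytes, Vector{UInt8}(s) or reading the file with read(path) seems safer. The temporary file just stands in for a real input file:

```julia
s = "25°C"

utf8_bytes  = Vector{UInt8}(s)          # the string's actual UTF-8 bytes
code_points = [UInt8(c) for c in s]     # one code point per character

isvalid(String, utf8_bytes)             # true
isvalid(String, code_points)            # false: '°' became the lone byte 0xb0

# Reading a file as raw bytes and validating before building a String.
path, io = mktemp()
write(io, UInt8[0x32, 0x35, 0xb0, 0x43])   # Latin-1 encoded "25°C"
close(io)

file_bytes = read(path)                 # Vector{UInt8}
file_ok = isvalid(String, file_bytes)   # false
rm(path)
```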