How to clean strings?

I have two strings that look exactly the same to a human, but Julia treats them as two different strings.

test1 = raw"C:\Program Files\MATLAB\R2019a"
test2 = raw"C:\Program Files\MATLAB\R2019a"

test1 == test2  # Julia says this is false

It turns out that Julia sees the space as a different character:

julia> test1[11]
' ': Unicode U+00A0 (category Zs: Separator, space)

julia> test2[11]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

The reason why this is a problem for me is the following (Julia doesn’t think test1 is a valid path):

julia> isdir(test1)
false

julia> isdir(test2)
true

I'm not sure how I ended up with this different space character; I think it’s from copying and pasting from OneNote. What is a good practice to avoid this problem in the future? Maybe a setting in VS Code so I can “see” that the space character is different, or a clean function to convert all chars to ASCII/Unicode?

The first is a no-break space (U+00A0). Maybe you could use the isspace function to replace all whitespace characters with ' ' (U+0020).
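A minimal sketch of the isspace idea, assuming every Unicode whitespace character should become an ordinary space. Here test1 is built with an explicit "\u00a0" escape so the no-break space is unambiguous. Note that isspace also matches tabs and newlines, so those would be converted as well.

```julia
# Build the problem string explicitly: "\u00a0" is the no-break space.
test1 = "C:\\Program\u00a0Files\\MATLAB\\R2019a"
test2 = raw"C:\Program Files\MATLAB\R2019a"

# replace accepts a character predicate as the pattern, so every character
# for which isspace returns true is swapped for an ordinary space (U+0020).
cleaned = replace(test1, isspace => ' ')

cleaned == test2
```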

Or, of course, a regex.



Hmmm… maybe the question from Brad_Carman was more general. It has happened to me as well that a string caused problems because it contained characters that are almost invisible or that render the same as others; for example, “-” has another, very similar-looking character…

But I don’t know how to solve this… a “sanitise” function?

You can use a regex with replace to convert any whitespace character to a single, unified space character.
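A rough sketch of the regex approach. The helper name unify_spaces is made up for illustration and is not part of Base. It uses the Unicode property class \p{Zs} (space separators), which covers both U+0020 and U+00A0 but leaves tabs and newlines untouched; whether that narrower match is what you want depends on your input.

```julia
# Hypothetical helper: map every Unicode "space separator" (category Zs)
# to an ordinary ASCII space. Tabs and newlines are not in Zs and are kept.
unify_spaces(s::AbstractString) = replace(s, r"\p{Zs}" => " ")

path = "C:\\Program\u00a0Files"
fixed = unify_spaces(path)

fixed == "C:\\Program Files"
```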

You could use NFKC normalization, which removes some confusable characters but not necessarily all. It works in this case because a non-breaking space normalizes to an ordinary space:

julia> import Unicode

julia> Unicode.normalize(test1, :NFKC) == Unicode.normalize(test2, :NFKC)
true

However, be aware that NFKC normalization will also treat some characters as equivalent even though they are visually distinct:

julia> Unicode.normalize("𝐴ᶜ𝐇𝕆𝓞", :NFKC)
"AcHOO"

There is also NFC normalization, which will only treat characters as the same if they are visually and semantically identical — typically this is to canonicalize combining characters.
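To illustrate the difference, here is a small sketch comparing the two forms. The variable names are just for this example. NFC merges a base letter and a combining accent into the precomposed character, but it leaves the no-break space alone, so NFC by itself would not fix the path problem above.

```julia
import Unicode

# "é" as one precomposed code point vs. "e" followed by a combining acute accent.
precomposed = "\u00e9"
combining   = "e\u0301"

raw_equal = precomposed == combining                            # different code points
nfc_equal = Unicode.normalize(combining, :NFC) == precomposed   # NFC composes them

# The no-break space is only a compatibility equivalent of U+0020,
# so NFC keeps it while NFKC maps it to an ordinary space.
nbsp_nfc  = Unicode.normalize("\u00a0", :NFC) == " "
nbsp_nfkc = Unicode.normalize("\u00a0", :NFKC) == " "

(raw_equal, nfc_equal, nbsp_nfc, nbsp_nfkc)
```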