How to clean strings?

Brad_Carman · September 1, 2020, 7:08pm

I have two strings that look exactly the same to a human, but Julia sees two different strings.

test1 = raw"C:\Program Files\MATLAB\R2019a"
test2 = raw"C:\Program Files\MATLAB\R2019a"

test1 == test2 #julia says this is false

It turns out that Julia sees the space as a different character:

julia> test1[11]
' ': Unicode U+00A0 (category Zs: Separator, space)

julia> test2[11]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

The reason why this is a problem for me is the following (Julia doesn’t think test1 is a valid path):

julia> isdir(test1)
false

julia> isdir(test2)
true

I'm not sure how I ended up with this different space character; I think it's from copying and pasting from OneNote. What is a good practice to avoid this problem in the future? Maybe a setting in VS Code so I can “see” that the space character is different? Maybe a clean function to convert all chars to ASCII/Unicode?
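One way to "see" the odd character without relying on the editor is to list every non-ASCII character along with its index. This is only a sketch: `nonascii` is a hypothetical helper name, and the path below is a shortened stand-in for `test1` with the no-break space written as an explicit escape.

```julia
# Hypothetical helper: list every non-ASCII character with its string index.
nonascii(s::AbstractString) = [(i, c) for (i, c) in pairs(s) if !isascii(c)]

# Stand-in for test1; \u00a0 is the no-break space.
path = "C:\\Program\u00a0Files"

isascii(path)     # false: at least one character is outside ASCII
nonascii(path)    # one entry: index 11, the U+00A0 character
```

An empty result means the string is plain ASCII, so any remaining mismatch must come from something else.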

danielw2904 · September 1, 2020, 7:38pm

The first is a no-break space (U+00A0). Maybe you could use the isspace function to replace all spaces with ' ' (0x20).

Edit: or, of course, a regex.

See Strings · The Julia Language
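The `isspace` idea above might look roughly like this. `unify_spaces` is a hypothetical name, and the strings are shortened stand-ins for the paths in the question.

```julia
# Hypothetical helper: map every character for which isspace is true to U+0020.
unify_spaces(s::AbstractString) = replace(s, isspace => ' ')

a = "Program\u00a0Files"   # contains a no-break space
b = "Program Files"        # contains an ordinary space

a == b                     # false
unify_spaces(a) == b       # true
```

Note that `isspace` is also true for tabs and newlines, so this sketch turns those into ordinary spaces as well. That is probably fine for paths but may not be for multi-line text.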

sylvaticus · September 1, 2020, 8:33pm

Hmmm… maybe the question from Brad_Carman was more general… it has happened to me as well that I had problems when a string contained characters that are almost invisible or rendered the same way as others; for example, “-” has another, very similar-looking character…

But I don’t know how to solve this… a “sanitise” function?

jling · September 1, 2020, 8:52pm

You can use a regex with replace to convert any whitespace character to some unified space character, maybe.
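A minimal sketch of the regex approach, assuming only space-like characters should be unified. The pattern `\p{Zs}` (the Unicode "space separator" category shown in the REPL output in the question) is a choice made for this example, and `unify_zs` is a hypothetical name.

```julia
# Hypothetical helper: replace every character in Unicode category Zs
# (ordinary space, no-break space, en space, ...) with an ordinary space.
unify_zs(s::AbstractString) = replace(s, r"\p{Zs}" => " ")

a = "Program\u00a0Files"   # contains a no-break space
b = "Program Files"        # contains an ordinary space

unify_zs(a) == b           # true
```

Unlike a broader whitespace pattern, `\p{Zs}` leaves tabs and newlines untouched, since those are control characters rather than space separators.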

stevengj · September 1, 2020, 11:20pm

You could use NFKC normalization, which removes some confusable characters but not necessarily all. It works in this case because a non-breaking space normalizes to an ordinary space:

julia> import Unicode

julia> Unicode.normalize(test1, :NFKC) == Unicode.normalize(test2, :NFKC)
true

However, be aware that NFKC normalization will also treat some characters as equivalent even though they are visually distinct:

julia> Unicode.normalize("𝐴ᶜ𝐇𝕆𝓞", :NFKC)
"AcHOO"

There is also NFC normalization, which will only treat characters as the same if they are visually and semantically identical — typically this is to canonicalize combining characters.
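To illustrate the difference, here is a small sketch of what NFC does and does not unify. The example strings are made up for illustration and are not from the thread.

```julia
import Unicode

decomposed = "e\u0301"   # 'e' followed by a combining acute accent
composed   = "\u00e9"    # precomposed 'é'

decomposed == composed                               # false
Unicode.normalize(decomposed, :NFC) == composed      # true

# NFC leaves the no-break space alone, so it would not fix the path
# from the question; NFKC is the form that folds it to U+0020.
Unicode.normalize("a\u00a0b", :NFC) == "a\u00a0b"    # true
Unicode.normalize("a\u00a0b", :NFKC) == "a b"        # true
```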

xiaodai · September 2, 2020, 12:32am
