UTF-8 to ASCII and Reverse


#1

I could program this, but I am wondering if this is already in Julia: an encoder/decoder from UTF-8 to ASCII. Something like:

julia> x= "\u00a5"
"¥"

julia> encodeutf8toascii(x)
"\\00a5"

julia> encodeascii2utf8(ans)
"¥"

PS: and thanks for all the help everyone, here and earlier.


#2

https://docs.julialang.org/en/stable/stdlib/strings/#Base.unescape_string


#3

nope.

julia> x= "\u00a5"
"¥"

julia> unescape_string(x)
"¥"

julia> escape_string(x)
"¥"

#4

unescape_string at least goes in one direction:

julia> "\\u00a5"
"\\u00a5"

julia> unescape_string("\\u00a5")
"¥"

To go in the opposite direction, you’re correct that you can’t use escape_string. You’d have to write your own function, something like:

function escape_unicode(s::AbstractString)
    buf = IOBuffer()
    for c in s
        if isascii(c)
            print(buf, c)
        else
            i = UInt32(c)
            if i < 0x10000
                print(buf, "\\u", hex(i, 4))
            else
                print(buf, "\\U", hex(i, 6))
            end
        end
    end
    return String(take!(buf))
end

Why do you need this, however? If it is because you want to embed Unicode data into an ASCII (7-bit) stream, base64 encoding is a much more standard way to do this (and is implemented in Base).
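
To illustrate the base64 route, here is a minimal sketch. It assumes Julia 1.0 or later, where these functions live in the Base64 standard library and need a `using Base64` (in the Julia version this thread was written against they were available directly from Base). The variable names are only for illustration.

```julia
using Base64

s = "¥"

# Encode the UTF-8 bytes of `s` as pure-ASCII base64 text.
encoded = base64encode(s)

# base64decode returns a Vector{UInt8}; wrap it in String to get the text back.
decoded = String(base64decode(encoded))

println(isascii(encoded))   # the encoded form is safe for a 7-bit stream
println(decoded == s)       # the round trip restores the original string
```

Unlike the `\u` escapes, the base64 form is not human-readable, so which one fits better depends on whether a person needs to read the ASCII text.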


#5

I wanted unescape_string, but then I got curious: the natural inverse, escape_string, did not work. Thanks, Steven.


#6

In stevengj’s escape_unicode, you should change the padding in the second hex() call from 6 to 8. This will give you:

julia> s = escape_unicode("𠱸1")
"\\U00020c781"

julia> unescape_string(s)
"𠱸1"

With a padding of 6, unescape_string() fails with an invalid Unicode escape sequence error: it reads up to 8 hex digits after \U, so it consumes the trailing 1 as part of the escape.
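
For anyone trying this on Julia 1.0 or later: hex() no longer exists there, and string(n, base=16, pad=k) is its replacement. Below is a sketch of the same escape_unicode function with that substitution and with the padding of 8 for the \U case. It is one possible adaptation, not an official API.

```julia
# Escape every non-ASCII character as \uXXXX (up to U+FFFF) or \UXXXXXXXX.
function escape_unicode(s::AbstractString)
    buf = IOBuffer()
    for c in s
        if isascii(c)
            print(buf, c)
        else
            i = UInt32(c)  # code point; throws for malformed characters
            if i < 0x10000
                print(buf, "\\u", string(i, base=16, pad=4))
            else
                # Pad to 8 digits so a following hex digit is not absorbed
                # into the escape when unescape_string parses it.
                print(buf, "\\U", string(i, base=16, pad=8))
            end
        end
    end
    return String(take!(buf))
end

escaped = escape_unicode("𠱸1")
restored = unescape_string(escaped)
println(restored == "𠱸1")
```

Note that this sketch passes ASCII characters through unchanged, including backslashes, so a string that already contains a literal backslash will not necessarily survive the round trip through unescape_string.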