Getting Char bytes without allocation

How can I get the bytes in a UTF-8 character like 'é' without allocating?

My use case is a for loop on all characters in a string. The loop must sometimes access the bytes directly:

for c in str
    # Do some work

    if some_condition
        # process each byte in `c`
    end
end

This seems surprisingly difficult… Here are the solutions I found but they both allocate:

c = 'é'

# First method
codeunits(string(c))

# Second method (lets you preallocate and reuse the buffer)
buf = IOBuffer()
write(buf, c)
take!(buf)

Here’s an implementation which seems to work, based on the code in substring.jl in the JuliaLang/julia repository on GitHub (at commit d753a0b6a78d94d08c17b2c227916700e3359e4a):

julia> struct CodeUnitIterator{C <: AbstractChar}
         char::C
       end

julia> Base.eltype(::Type{CodeUnitIterator}) = UInt8

julia> Base.length(c::CodeUnitIterator) = ncodeunits(c.char)

julia> function Base.iterate(iter::CodeUnitIterator{Char}, state=1)
         n = ncodeunits(iter.char)
         if state > n
           return nothing
         end
         x = bswap(reinterpret(UInt32, iter.char))
         x >>= 8 * (state - 1)
         return (x % UInt8, state + 1)
       end

julia> collect(CodeUnitIterator('é'))
2-element Array{Any,1}:
 0xc3
 0xa9

julia> codeunits(string('é'))
2-element Base.CodeUnits{UInt8,String}:
 0xc3
 0xa9

You wouldn’t want to use collect in your code–I just did that to show the result. You should be able to do something like:

for byte in CodeUnitIterator(c)
  do_stuff_with(byte)
end

without any allocations.

Thanks @rdeits, this works!

I also just realized that in my particular case (having access to the string that contains the character c at position i), I can use codeunit(str, i+j) with j between 0 and ncodeunits(c)-1.
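For illustration, here is a minimal sketch of that idea. The function name count_continuation_bytes is made up for this example, and it assumes str is a String holding valid UTF-8:

```julia
# Hypothetical example: count UTF-8 continuation bytes (bit pattern 10xxxxxx)
# by reading the bytes of each character straight from the parent string.
function count_continuation_bytes(str::String)
    n = 0
    for i in eachindex(str)            # i is the byte index where a character starts
        c = str[i]
        for j in 0:ncodeunits(c)-1     # byte offsets within the character
            b = codeunit(str, i + j)   # a UInt8, read without building a new string
            if (b & 0xc0) == 0x80
                n += 1
            end
        end
    end
    return n
end
```

With this, count_continuation_bytes("aé€") returns 3, since 'a' has no continuation byte, 'é' has one and '€' has two.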

With less finesse, but also iterable without allocation:

bytesof(c::Char) = bytesof(Val(ncodeunits(c)), c)
function bytesof(n::Val{1}, c::Char)
   bits   = reinterpret(UInt32, c)
   (bits >> 24 % UInt8,)
end
function bytesof(n::Val{2}, c::Char)
  bits   = reinterpret(UInt32, c)
  (bits >> 24 % UInt8, bits >> 16 % UInt8)
end

julia> achar = 'é'
julia> bytesof(achar)
(0xc3, 0xa9)
julia> @allocated bytesof(achar)
0
julia> for b in bytesof(achar)
          println(bitstring(b))
       end
11000011
10101001

Could you do something like:

str = "like 'é' without allocating?"

function test(str)
    for i in eachindex(str)
        ch = str[i]
        println("ch => $ch")
    end
end

@JeffreySarnoff thanks, very readable (though it will take 2 more methods (Val{3} and Val{4}) to cover all UTF-8 characters).
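For completeness, a sketch of what the full set of four methods might look like, following the same pattern. The Val{3} and Val{4} methods are my extrapolation, not code from the post above, and since the dispatch depends on a runtime value it is worth checking with @allocated on your Julia version:

```julia
# Sketch: one method per UTF-8 length (1 to 4 bytes), dispatched through Val.
# A Char stores its UTF-8 bytes left-aligned in a UInt32, hence the shifts.
bytesof(c::Char) = bytesof(Val(ncodeunits(c)), c)

function bytesof(::Val{1}, c::Char)
    bits = reinterpret(UInt32, c)
    ((bits >> 24) % UInt8,)
end
function bytesof(::Val{2}, c::Char)
    bits = reinterpret(UInt32, c)
    ((bits >> 24) % UInt8, (bits >> 16) % UInt8)
end
function bytesof(::Val{3}, c::Char)
    bits = reinterpret(UInt32, c)
    ((bits >> 24) % UInt8, (bits >> 16) % UInt8, (bits >> 8) % UInt8)
end
function bytesof(::Val{4}, c::Char)
    bits = reinterpret(UInt32, c)
    ((bits >> 24) % UInt8, (bits >> 16) % UInt8, (bits >> 8) % UInt8, bits % UInt8)
end
```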

@pixel27 this gives each character in a string, but I need each byte in a single character (UTF-8 characters take between 1 and 4 bytes).

Maybe it would make sense to add a method codeunit(c::Char, i::Integer) in Base (and leave it optional to implement codeunit(::AbstractChar, ::Integer) for other character types)…
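A rough sketch of what such a method could look like. I've named it char_codeunit here (a hypothetical name) rather than extending Base.codeunit, so this is only an illustration of the idea and not an existing Base API:

```julia
# Hypothetical helper: return the i-th UTF-8 code unit of a Char (1-based).
# Relies on Char storing its UTF-8 bytes left-aligned in a UInt32.
function char_codeunit(c::Char, i::Integer)
    1 <= i <= ncodeunits(c) || throw(BoundsError(c, i))
    bits = reinterpret(UInt32, c)
    return (bits >> (8 * (4 - i))) % UInt8
end
```

For example, char_codeunit('é', 1) gives 0xc3 and char_codeunit('é', 2) gives 0xa9.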

Since Base already implements codeunits for strings, I think it would make sense to add it for Char as well. Would you mind opening an issue or even a PR with your solution?

(Also note that you probably meant to overload Base.eltype(::Type{<:CodeUnitIterator}) = UInt8 instead of Base.eltype(::Type{CodeUnitIterator}) = UInt8, which would fix collect accidentally returning an array with element type Any here.)
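To make that concrete, here is the same iterator as a self-contained sketch with only the eltype method changed. I haven't altered the iteration logic from the post above:

```julia
struct CodeUnitIterator{C <: AbstractChar}
    char::C
end

# `<:` makes this apply to every parameterization, e.g. CodeUnitIterator{Char},
# not just the unparameterized type CodeUnitIterator.
Base.eltype(::Type{<:CodeUnitIterator}) = UInt8

Base.length(c::CodeUnitIterator) = ncodeunits(c.char)

function Base.iterate(iter::CodeUnitIterator{Char}, state=1)
    state > ncodeunits(iter.char) && return nothing
    # After bswap, the first UTF-8 byte sits in the lowest 8 bits.
    x = bswap(reinterpret(UInt32, iter.char))
    x >>= 8 * (state - 1)
    return (x % UInt8, state + 1)
end
```

With this definition, collect(CodeUnitIterator('é')) should produce a vector with element type UInt8 instead of Any.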
