Read file directly into Vector{Char}

I want to read a UTF8 encoded file into Vector{Char}.

One way to do this is:

data = collect(Char, read("input.txt"))

However, this will perform a copy, first reading in the contents of the file, and then copying it out to the final vector.

Manually, this would look like:

read(fname, ::Type{Vector{Char}}) = open(fname) do fhandle
    data = Char[]
    while !eof(fhandle)
        push!(data, read(fhandle, Char))
    end
    return data
end

But I feel like there should be an easier way to do this…?

Not sure why you want a Vector{Char} in particular - you can iterate over strings and get each char out:

julia> for c in "foobar"
           @show c
       end
c = 'f'
c = 'o'
c = 'o'
c = 'b'
c = 'a'
c = 'r'

So I’d usually read the input as a String and use that.

I think the reason reading a Vector{Char} doesn’t already exist in Base is that you have to know the encoding of the file to read a Char correctly. In the worst case, whatever is read needs to be converted to Unicode. Reading as a String enforces that conversion semantically. Also:

help?> Char
search: Char ischardev Cchar skipchars Cuchar Cwchar_t AbstractChar

Char(c::Union{Number,AbstractChar})
[…]
In order to losslessly represent arbitrary byte streams stored in a String, a Char value may store information that cannot be converted to a Unicode codepoint

So in particular, a Char is not just a byte (unlike in e.g. C).


Not sure why you want a Vector{Char} in particular - you can iterate over strings and get each char out:

It makes sense if you’re going to iterate over the same string many times or do complex, repeated indexing into that string. Doing that on a String requires parsing the UTF-8 encoding each time, whereas a one-time conversion to Vector{Char} pays that price just once.
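For what it’s worth, here’s a small sketch of that trade-off: after a one-time collect, access by character position is a plain array lookup, whereas on the String you walk byte indices with nextind:

```julia
s = "αβγδ"                 # 2 bytes per character in UTF-8
v = collect(s)             # one-time O(n) conversion to Vector{Char}
@assert v[3] == 'γ'        # O(1) lookup by character position

# On the String itself, reaching the 3rd character means walking the indices:
i = nextind(s, nextind(s, firstindex(s)))
@assert s[i] == 'γ'
```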

(Parenthetically, it’s always a little annoying to have to justify your use-case on this forum when asking a specific question.)

So in particular, a Char is not just a byte (unlike in e.g. C)

That’s right. The closest thing to a C char is UInt8. A Julia Char is 4 bytes, which is large enough to capture all Unicode code points.
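A quick REPL check confirms the sizes (a sketch):

```julia
@assert sizeof(UInt8) == 1              # the closest analogue of a C char
@assert sizeof(Char) == 4               # a Julia Char is a 4-byte primitive type
@assert codepoint('a') == UInt32(0x61)  # its value is a Unicode code point
```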

Apologies, I didn’t mean to imply that your usecase was bad or that you had to justify it.

Iterating over a string in Julia does not require reindexing/reparsing. Strings are indexed by byte, not by Unicode character index:

julia> a = "αβγ"
"αβγ"

julia> collect(a)
3-element Vector{Char}:
 'α': Unicode U+03B1 (category Ll: Letter, lowercase)
 'β': Unicode U+03B2 (category Ll: Letter, lowercase)
 'γ': Unicode U+03B3 (category Ll: Letter, lowercase)

julia> collect(eachindex(a))
3-element Vector{Int64}:
 1
 3
 5

so indexing into strings is O(1) instead of O(n), which is where my confusion about why a Vector{Char} is better for your usecase comes from (prevind/nextind are usually used for finding the n-th previous/next neighbor). Apologies if that was bad phrasing on my part; I should have asked for clarification.

Here’s one way that builds the vector directly in a single pass through the stream, I think:

open("input.txt") do io
    collect(readeach(io, Char))
end

Do you have a reference for this?

My understanding is that a String wraps a Vector{UInt8}, but since UTF-8 is a variable-length encoding, indexing into the string requires parsing it.

This works, but

read(filename, String) |> collect

is both faster and allocates less.


Yeah, the benchmarks don’t lie. I guess I’ll stick with collect(read(f, String))

julia> readchars(f) = open(f) do io
           collect(readeach(io, Char))
       end
readchars (generic function with 1 method)

julia> @benchmark readchars("input.txt")
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  91.853 μs … 418.225 μs  ┊ GC (min … max): 0.00% … 68.80%
 Time  (median):     95.309 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   96.852 μs ±   7.996 μs  ┊ GC (mean ± σ):  0.08% ±  1.17%

   ▁▄▆▇████▇▇▆▆▅▅▅▄▄▃▂▂▁▁▁▂▁▁▁▁ ▁▁▁                            ▃
  ▆█████████████████████████████████████▇▇▇▆▇▇▇▇▇█▇█████▇▇▇▆▇▅▆▅ █
  91.9 μs       Histogram: log(frequency) by time       116 μs <

 Memory estimate: 122.66 KiB, allocs estimate: 19.

julia> @benchmark read("input.txt", String) |> collect
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.326 μs … 307.844 μs  ┊ GC (min … max): 0.00% … 88.34%
 Time  (median):     6.903 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.443 μs ±   9.049 μs  ┊ GC (mean ± σ):  4.91% ±  3.97%

      ▃▇█▆▂
  ▂▂▄▇██████▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▁▂▁▂▁▂▂▂▂▂▁▁▁▂▂▂▁▁▁▁▂▂▁▁▂ ▃
  6.33 μs         Histogram: frequency by time        11.3 μs <

 Memory estimate: 21.01 KiB, allocs estimate: 14.

I don’t have a reference, but here’s a benchmark:

julia> using Random

julia> str = randstring(1_000_000);

julia> @btime $str[i] setup=(i=rand(1:1000))
  2.900 ns (0 allocations: 0 bytes)
'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)

julia> @btime $str[i] setup=(i=rand(999_001:1_000_000))
  2.900 ns (0 allocations: 0 bytes)
'3': ASCII/Unicode U+0033 (category Nd: Number, decimal digit)

Indeed, that’s quite a difference. With a big 93MB binary file:

julia> @time collect(read(filename, String));
  0.991675 seconds (16 allocations: 438.592 MiB)

julia> @time open(filename) do io
       collect(readeach(io, Char))
       end;
  6.185349 seconds (41.45 M allocations: 1.614 GiB, 0.09% compilation time)

Regarding the complexity of indexing, it’s in the manual:

  • Conceptually, a string is a partial function from indices to characters: for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings.

My original collect(Char, read("input.txt")) is still the fastest.

julia> @benchmark collect(Char, read("input.txt"))
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  3.831 μs … 229.583 μs  ┊ GC (min … max): 0.00% … 85.40%
 Time  (median):     4.508 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.050 μs ±   7.874 μs  ┊ GC (mean ± σ):  7.27% ±  4.61%

       ▃▆█▆▂
  ▂▂▃▅██████▇▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂ ▃
  3.83 μs         Histogram: frequency by time        9.07 μs <

 Memory estimate: 21.01 KiB, allocs estimate: 14.

While true, that parsing already happens when the String is first created. There’s no need to reparse the entire string just for indexing: at most, indexing checks the next three bytes for continuation, if necessary.
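To illustrate (a sketch assuming a UTF-8 String): whether an index is valid depends only on the byte at that position, and continuation bytes are recognizable on their own:

```julia
s = "αβγ"                  # each letter occupies 2 bytes in UTF-8
@assert isvalid(s, 1)       # byte 1 starts the character 'α'
@assert !isvalid(s, 2)      # byte 2 is a continuation byte, not a character start
@assert codeunit(s, 2) & 0xc0 == 0x80  # continuation bytes have the form 0b10xxxxxx
@assert s[1] == 'α'         # decoding inspects at most a few bytes, never the whole string
```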

That’s what is gained from the fact that the indices are byte positions, not “character indices” (which isn’t an accurate term either way… maybe grapheme index? Regardless, indexing into strings is done with the byte position of the underlying Vector{UInt8}). The cost is of course that some indices are “invalid”, in the sense that they don’t point at the start of a UTF-8 character. However, that too doesn’t mean you have to reparse everything; you just have to look for the previous/next UTF-8 start mark:

julia> a = "αβγ"
"αβγ"

julia> prevind(a, 2)
1

julia> nextind(a, 2)
3

It’s of course perfectly valid to expand every UTF-8 character into its full code point via Char, if that trade-off of more space for easier indexing is worth it to you 🤷


Interesting, I didn’t know that. This suggests some indices would be invalid, which is most definitely what I don’t want!

Which reinforces my desire for Vector{Char} since I can guarantee any arbitrary index within the range of its length will be valid.

It’s possible, but it might still not be fastest if you want to index to a Char (or a grapheme cluster).

The Char type is 32-bit (built on UInt32), but Unicode guarantees a code point never needs more than 21 bits, so if you want to hack something you could fit 3 Chars into each UInt64. In very many cases you are dealing only with the BMP (UCS-2 range); then you can fit 4 into a UInt64. (That is what Python does in such a case: it currently keeps strings in one of three in-memory encodings, Latin-1, UCS-2, or UTF-32, the last being what you’re asking for, and the least useful of the three until emojis got popular.)

If you know your UTF-8 string (all of its Chars) is actually an ASCII string (a subset of UTF-8), then you can index into the String just as it is (to a byte), and there will be no invalid indices. You would only have to check once, or not at all if for some reason you already know this reliably. If the string is very probably, but not always, ASCII, you could keep a flag (is_ascii) and use byte indexing when it’s true, O(1), otherwise fall back to O(n), hopefully rarely, and with no copies.
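A sketch of that fast path (the getchar helper and its flag argument are made up here, not from Base): check isascii once, then byte-index whenever the flag holds:

```julia
# Hypothetical helper: O(1) byte indexing when the string is known to be ASCII,
# with an O(n) character-iteration fallback for general UTF-8.
function getchar(s::String, is_ascii::Bool, n::Integer)
    is_ascii && return Char(codeunit(s, n))   # for ASCII, byte n IS character n
    return first(Iterators.drop(s, n - 1))    # otherwise walk n - 1 characters
end

s = "hello"
flag = isascii(s)    # check once, reuse on every lookup
@assert getchar(s, flag, 2) == 'e'

t = "αβγ"
@assert getchar(t, isascii(t), 2) == 'β'
```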

It would be nice if Julia strings actually had such a flag, or an is_potentially_not_ASCII flag that could automatically stay true until proven otherwise. Note that Julia strings are not even guaranteed to be valid UTF-8! That’s the encoding Julia assumes, but when you read from a file the contents could be invalid UTF-8 too. You can still iterate over it, and it’s your responsibility to validate it (there’s a Julia function to do that after input, but the result isn’t stored afterwards in a flag like is_potentially_invalid_UTF8, which might be useful to have).

I (and others) didn’t think indexing to a Char was useful enough to build in, but there is an O(1) way for UTF-8 strings: make an auxiliary vector of byte indexes, O(1)-index into that one (you can likely get away with storing each index as a UInt32), and then use the result to index into the string you had. It will save memory in all cases over your original plan, but you need to index twice.
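That auxiliary index could be as simple as this (a sketch; nthchar is an invented name): record every character’s byte position once, after which each lookup is two O(1) reads:

```julia
s = "αβγδε"
byteidx = UInt32.(collect(eachindex(s)))  # byte position of each character, built once
@assert byteidx == UInt32[1, 3, 5, 7, 9]   # 2-byte characters in UTF-8

nthchar(s, byteidx, n) = s[byteidx[n]]     # one vector lookup, one byte-indexed read
@assert nthchar(s, byteidx, 4) == 'δ'
```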

If you think the memory use is too much for such an index, then it’s viable to store only, say, the byte index of every 64th character. Then you have about 1 bit of overhead per character in your original String, and each lookup lands on one cache line of index information. Yes, you then need to scan at most 64 characters forward, but that’s still O(1), since it’s a constant amount. And you can skip the 3-in-one-UInt64 Char idea. You could even build and enlarge this auxiliary index lazily.
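The sparse variant might look like this (a sketch; sparse_index and nthchar_sparse are invented names): store every k-th byte position and scan at most k − 1 characters forward:

```julia
# Store the byte position of every k-th character (k = 64 in the text above).
function sparse_index(s::String, k::Int)
    out = UInt32[]
    for (n, i) in enumerate(eachindex(s))
        (n - 1) % k == 0 && push!(out, UInt32(i))
    end
    return out
end

# Jump to the nearest stored anchor, then walk forward at most k - 1 characters:
# still O(1) for a fixed k.
function nthchar_sparse(s::String, idx::Vector{UInt32}, k::Int, n::Int)
    i = Int(idx[(n - 1) ÷ k + 1])
    for _ in 1:((n - 1) % k)
        i = nextind(s, i)
    end
    return s[i]
end

s = "αβγδεζηθ"
idx = sparse_index(s, 4)
@assert idx == UInt32[1, 9]                 # byte positions of the 1st and 5th characters
@assert nthchar_sparse(s, idx, 4, 6) == 'ζ'
```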

If you want to index by grapheme cluster (e.g. flag emojis), which can be very long, the idea above is the most practical.
