Read file directly into Vector{Char}

I want to read a UTF8 encoded file into Vector{Char}.

One way to do this is:

data = collect(Char, read("input.txt"))

However, this will perform a copy, first reading in the contents of the file, and then copying it out to the final vector.

Manually, this would look like:

read(fname, ::Type{Vector{Char}}) = open(fname) do fhandle
    data = Char[]
    while !eof(fhandle)
        push!(data, read(fhandle, Char))
    end
    return data
end

But I feel like there should be an easier way to do this…?

Not sure why you want a Vector{Char} in particular - you can iterate over strings and get each char out:

julia> for c in "foobar"
           @show c
       end
c = 'f'
c = 'o'
c = 'o'
c = 'b'
c = 'a'
c = 'r'

So I’d usually read the input as a String and use that.

I think the reason reading a Vector{Char} doesn’t already exist in Base is that you have to know the encoding of the file to read a Char correctly. In the worst case, whatever is read needs to be converted to Unicode. Reading as a String enforces that conversion semantically. Also:

help?> Char
search: Char ischardev Cchar skipchars Cuchar Cwchar_t AbstractChar

Char(c::Union{Number,AbstractChar})
[…]
In order to losslessly represent arbitrary byte streams stored in a String, a Char value may store information that cannot be converted to a Unicode codepoint

So in particular, a Char is not just a byte (unlike in e.g. C).


Not sure why you want a Vector{Char} in particular - you can iterate over strings and get each char out:

It makes sense if you’re going to iterate over the same string many times or do complex, repeated indexing into that string. Doing that on a String requires parsing the UTF-8 encoding each time, whereas a one-time conversion to Vector{Char} pays that price just once.
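For what it’s worth, here’s a small sketch of that trade-off: after a one-time collect, access by character position is a plain array lookup, whereas on the String you walk byte indices with nextind:

```julia
s = "αβγδ"                 # 2 bytes per character in UTF-8
v = collect(s)             # one-time O(n) conversion to Vector{Char}
@assert v[3] == 'γ'        # O(1) lookup by character position

# On the String itself, reaching the 3rd character means walking the indices:
i = nextind(s, nextind(s, firstindex(s)))
@assert s[i] == 'γ'
```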

(Parenthetically, it’s always a little annoying to have to justify your use-case on this forum when asking a specific question.)

So in particular, a Char is not just a byte (unlike in e.g. C)

That’s right. The closest thing to a C char is UInt8. A Julia Char is 4 bytes, which is large enough to capture all Unicode code points.
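A quick REPL check confirms the sizes (a sketch):

```julia
@assert sizeof(UInt8) == 1              # the closest analogue of a C char
@assert sizeof(Char) == 4               # a Julia Char is a 4-byte primitive type
@assert codepoint('a') == UInt32(0x61)  # its value is a Unicode code point
```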

Apologies, I didn’t mean to imply that your usecase was bad or that you had to justify it.

Iterating over a string in Julia does not require reindexing/reparsing. Strings are indexed by byte, not by Unicode character index:

julia> a = "αβγ"
"αβγ"

julia> collect(a)
3-element Vector{Char}:
 'α': Unicode U+03B1 (category Ll: Letter, lowercase)
 'β': Unicode U+03B2 (category Ll: Letter, lowercase)
 'γ': Unicode U+03B3 (category Ll: Letter, lowercase)

julia> collect(eachindex(a))
3-element Vector{Int64}:
 1
 3
 5

so indexing into strings is O(1) instead of O(n), which is where my confusion about why a Vector{Char} is better for your usecase comes from (prevind/nextind are usually used for finding the n-th previous/next neighbor). Apologies if that was bad phrasing on my part; I should have asked for clarification.

Here’s one way that builds the vector directly in a single pass through the stream, I think:

open("input.txt") do io
    collect(readeach(io, Char))
end

Do you have a reference for this?

My understanding is that a String wraps a Vector{UInt8}, but since UTF-8 is a variable-length encoding, indexing into the string requires parsing it.

This works, but

read(filename, String) |> collect

is both faster and allocates less.


Yeah, the benchmarks don’t lie. I guess I’ll stick with collect(read(f, String))

julia> readchars(f) = open(f) do io
           collect(readeach(io, Char))
       end
readchars (generic function with 1 method)

julia> @benchmark readchars("input.txt")
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  91.853 μs … 418.225 μs  ┊ GC (min … max): 0.00% … 68.80%
 Time  (median):     95.309 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   96.852 μs ±   7.996 μs  ┊ GC (mean ± σ):  0.08% ±  1.17%

   ▁▄▆▇████▇▇▆▆▅▅▅▄▄▃▂▂▁▁▁▂▁▁▁▁ ▁▁▁                            ▃
  ▆█████████████████████████████████████▇▇▇▆▇▇▇▇▇█▇█████▇▇▇▆▇▅▆▅ █
  91.9 μs       Histogram: log(frequency) by time       116 μs <

 Memory estimate: 122.66 KiB, allocs estimate: 19.

julia> @benchmark read("input.txt", String) |> collect
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.326 μs … 307.844 μs  ┊ GC (min … max): 0.00% … 88.34%
 Time  (median):     6.903 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.443 μs ±   9.049 μs  ┊ GC (mean ± σ):  4.91% ±  3.97%

      ▃▇█▆▂
  ▂▂▄▇██████▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▁▂▁▂▁▂▂▂▂▂▁▁▁▂▂▂▁▁▁▁▂▂▁▁▂ ▃
  6.33 μs         Histogram: frequency by time        11.3 μs <

 Memory estimate: 21.01 KiB, allocs estimate: 14.

I don’t have a reference, but here’s a benchmark:

julia> using Random

julia> str = randstring(1_000_000);

julia> @btime $str[i] setup=(i=rand(1:1000))
  2.900 ns (0 allocations: 0 bytes)
'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)

julia> @btime $str[i] setup=(i=rand(999_001:1_000_000))
  2.900 ns (0 allocations: 0 bytes)
'3': ASCII/Unicode U+0033 (category Nd: Number, decimal digit)

Indeed, that’s quite a difference. With a big 93MB binary file:

julia> @time collect(read(filename, String));
  0.991675 seconds (16 allocations: 438.592 MiB)

julia> @time open(filename) do io
       collect(readeach(io, Char))
       end;
  6.185349 seconds (41.45 M allocations: 1.614 GiB, 0.09% compilation time)

Regarding the complexity of indexing, it’s in the manual:

  • Conceptually, a string is a partial function from indices to characters: for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings.

My original collect(Char, read("input.txt")) is still the fastest.

julia> @benchmark collect(Char, read("input.txt"))
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  3.831 μs … 229.583 μs  ┊ GC (min … max): 0.00% … 85.40%
 Time  (median):     4.508 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.050 μs ±   7.874 μs  ┊ GC (mean ± σ):  7.27% ±  4.61%

       ▃▆█▆▂
  ▂▂▃▅██████▇▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂ ▃
  3.83 μs         Histogram: frequency by time        9.07 μs <

 Memory estimate: 21.01 KiB, allocs estimate: 14.

While true, that parsing already happens when the String is first created. There’s no need to reparse the entire string just for indexing: at most, indexing checks the next three bytes for continuation, if necessary.
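To illustrate (a sketch assuming a UTF-8 String): whether an index is valid depends only on the byte at that position, and continuation bytes are recognizable on their own:

```julia
s = "αβγ"                  # each letter occupies 2 bytes in UTF-8
@assert isvalid(s, 1)       # byte 1 starts the character 'α'
@assert !isvalid(s, 2)      # byte 2 is a continuation byte, not a character start
@assert codeunit(s, 2) & 0xc0 == 0x80  # continuation bytes have the form 0b10xxxxxx
@assert s[1] == 'α'         # decoding inspects at most a few bytes, never the whole string
```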

That’s what is gained from the fact that the indices are byte positions, not “character indices” (which isn’t an accurate term either way… maybe grapheme index? Regardless, indexing into strings is done with the byte position of the underlying Vector{UInt8}). The cost is of course that some indices are “invalid”, in the sense that they don’t point at the start of a UTF-8 character. However, that too doesn’t mean you have to reparse everything; you just have to look for the previous/next UTF-8 start mark:

julia> a = "αβγ"
"αβγ"

julia> prevind(a, 2)
1

julia> nextind(a, 2)
3

It’s of course perfectly valid to expand every UTF-8 character into its full code point via Char, if that trade-off of more space for easier indexing is worth it to you 🤷


Interesting, I didn’t know that. This suggests some indices would be invalid, which is most definitely what I don’t want!

Which reinforces my desire for Vector{Char} since I can guarantee any arbitrary index within the range of its length will be valid.

It’s possible, but it might still not be fastest if you want to index to a Char (or a grapheme cluster).

The Char type is 32-bit (built on UInt32), but Unicode guarantees a code point never needs more than 21 bits, so if you want to hack something you could fit 3 Chars into each UInt64. In very many cases you are dealing only with the BMP (UCS-2 range); then you can fit 4 into a UInt64. (That is what Python does in such a case: it currently keeps strings in one of three in-memory encodings, Latin-1, UCS-2, or UTF-32, the last being what you’re asking for, and the least useful of the three until emojis got popular.)

If you know your UTF-8 string (all of its Chars) is actually an ASCII string (a subset of UTF-8), then you can index into the String just as it is (to a byte), and there will be no invalid indices. You would only have to check once, or not at all if for some reason you already know this reliably. If the string is very probably, but not always, ASCII, you could keep a flag (is_ascii) and use byte indexing when it’s true, O(1), otherwise fall back to O(n), hopefully rarely, and with no copies.
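A sketch of that fast path (the getchar helper and its flag argument are made up here, not from Base): check isascii once, then byte-index whenever the flag holds:

```julia
# Hypothetical helper: O(1) byte indexing when the string is known to be ASCII,
# with an O(n) character-iteration fallback for general UTF-8.
function getchar(s::String, is_ascii::Bool, n::Integer)
    is_ascii && return Char(codeunit(s, n))   # for ASCII, byte n IS character n
    return first(Iterators.drop(s, n - 1))    # otherwise walk n - 1 characters
end

s = "hello"
flag = isascii(s)    # check once, reuse on every lookup
@assert getchar(s, flag, 2) == 'e'

t = "αβγ"
@assert getchar(t, isascii(t), 2) == 'β'
```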

It would be nice if Julia strings actually had such a flag, or an is_potentially_not_ASCII flag that could automatically stay true until proven otherwise. Note that Julia strings are not even guaranteed to be valid UTF-8! That’s the encoding Julia assumes, but when you read from a file the contents could be invalid UTF-8 too. You can still iterate over it, and it’s your responsibility to validate it (there’s a Julia function to do that after input, but the result isn’t stored afterwards in a flag like is_potentially_invalid_UTF8, which might be useful to have).

I (and others) didn’t think indexing to a Char was useful enough to build in, but there is an O(1) way for UTF-8 strings: make an auxiliary vector of byte indexes, O(1)-index into that one (you can likely get away with storing each index as a UInt32), and then use the result to index into the string you had. It will save memory in all cases over your original plan, but you need to index twice.
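That auxiliary index could be as simple as this (a sketch; nthchar is an invented name): record every character’s byte position once, after which each lookup is two O(1) reads:

```julia
s = "αβγδε"
byteidx = UInt32.(collect(eachindex(s)))  # byte position of each character, built once
@assert byteidx == UInt32[1, 3, 5, 7, 9]   # 2-byte characters in UTF-8

nthchar(s, byteidx, n) = s[byteidx[n]]     # one vector lookup, one byte-indexed read
@assert nthchar(s, byteidx, 4) == 'δ'
```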

If you think the memory use is too much for such an index, then it’s viable to store only, say, the byte index of every 64th character. Then you have about 1 bit of overhead per character in your original String, and each lookup lands on one cache line of index information. Yes, you then need to scan at most 64 characters forward, but that’s still O(1), since it’s a constant amount. And you can skip the 3-in-one-UInt64 Char idea. You could even build and enlarge this auxiliary index lazily.
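The sparse variant might look like this (a sketch; sparse_index and nthchar_sparse are invented names): store every k-th byte position and scan at most k − 1 characters forward:

```julia
# Store the byte position of every k-th character (k = 64 in the text above).
function sparse_index(s::String, k::Int)
    out = UInt32[]
    for (n, i) in enumerate(eachindex(s))
        (n - 1) % k == 0 && push!(out, UInt32(i))
    end
    return out
end

# Jump to the nearest stored anchor, then walk forward at most k - 1 characters:
# still O(1) for a fixed k.
function nthchar_sparse(s::String, idx::Vector{UInt32}, k::Int, n::Int)
    i = Int(idx[(n - 1) ÷ k + 1])
    for _ in 1:((n - 1) % k)
        i = nextind(s, i)
    end
    return s[i]
end

s = "αβγδεζηθ"
idx = sparse_index(s, 4)
@assert idx == UInt32[1, 9]                 # byte positions of the 1st and 5th characters
@assert nthchar_sparse(s, idx, 4, 6) == 'ζ'
```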

If you want to index by grapheme cluster (e.g. flag emojis), which can be very long, the idea above is the most practical.
