Working with ASCII strings

I am processing very large files (~500GB), parsing integers, dates, and similar. I found that writing custom functions for this (along the lines of, eg, tryparsenext_base10) really speeds up my code (2–20x).

I found that further speedups are achieved by working with bytes instead of strings/characters. I can do this because all of my data is ASCII. Most of the speedups come from not calling slow_utf8_next.

I would like to understand what the best way is to implement these things. Should I convert(Vector{UInt8}, ...) and then use getindex, or use the string directly with codeunit?

I also want to use “views” into strings (to minimize allocation). My understanding is that SubString is precisely for this, by then it works using character indices. If I want views into ASCII strings and want to avoid calling slow_utf8_next, should I stick to Vector{UInt8} instead?

How about using the ASCIIString type from https://github.com/JuliaArchive/LegacyStrings.jl.

Where does slow_utf8_next get called in your code?
I am asking because next(s::String, i::Int) should not call slow_utf8_next if your strings contain only ASCII characters (so maybe actually there is a different performance problem that should be solved for ASCII only strings).

Perhaps I am doing something not the right way, but if in v0.6.0 I define

function tryparse_digits(str, start, stop)
    n = 0
    pos = start
    @inbounds while pos ≤ stop
        chr, next_pos = next(str, pos)
        if '0' ≤ chr ≤ '9'
            n = n*10 + (chr - '0')
            pos = next_pos
        else
            return Nullable{Int}()
        end
    end
    Nullable(n)
end

then

julia> @code_warntype tryparse_digits("1980", 1, 4)
Variables:
  #self#::#tryparse_digits
  str::String
  start::Int64
  stop::Int64
  chr::Char
  next_pos::Int64
  #temp#@_7::Int64
  n::Int64
  pos::Int64
  #temp#@_10::Bool
  p::Ptr{UInt8}
  b::UInt8
  #temp#@_13::Tuple{Char,Int64}

Body:
  begin 
      n::Int64 = 0 # line 3:
      pos::Int64 = start::Int64 # line 4:
      $(Expr(:inbounds, true))
      NewvarNode(:(chr::Char))
      NewvarNode(:(next_pos::Int64))
      8: 
      unless (Base.sle_int)(pos::Int64, stop::Int64)::Bool goto 65 # line 5:
      $(Expr(:inbounds, false))
      # meta: location strings/string.jl next 196
      NewvarNode(:(p::Ptr{UInt8}))
      NewvarNode(:(b::UInt8)) # line 197:
      16:  # line 199:
      $(Expr(:inbounds, false))
      # meta: location strings/string.jl pointer 51
      # meta: location pointer.jl unsafe_convert 37
      SSAValue(4) = $(Expr(:foreigncall, :(:jl_value_ptr), Ptr{Void}, svec(Any), :(
str), 0))                                                                         
      SSAValue(3) = (Core.sizeof)(Base.Int)::Int64
      # meta: pop location
      # meta: pop location
      $(Expr(:inbounds, :pop))
      p::Ptr{UInt8} = (Base.bitcast)(Ptr{UInt8}, (Base.bitcast)(Ptr{Void}, (Base.ad
d_int)((Base.bitcast)(UInt64, SSAValue(4)), (Base.bitcast)(UInt64, SSAValue(3)))::U
Int64)) # line 200:                                                               
      b::UInt8 = (Base.pointerref)(p::Ptr{UInt8}, pos::Int64, 1)::UInt8 # line 201:
      unless (Base.ult_int)(b::UInt8, 0x80)::Bool goto 34 # line 202:
      #temp#@_13::Tuple{Char,Int64} = (Core.tuple)((Base.bitcast)(Char, (Base.zext_
int)(UInt32, b::UInt8)::UInt32), (Base.add_int)(pos::Int64, 1)::Int64)::Tuple{Char,
Int64}                                                                            
      goto 37
      34:  # line 204:
      #temp#@_13::Tuple{Char,Int64} = $(Expr(:invoke, MethodInstance for slow_utf8_
next(::Ptr{UInt8}, ::UInt8, ::Int64, ::Int64), :(Base.slow_utf8_next), :(p), :(b), 
:(pos), :((Core.getfield)(str, :len)::Int64)))                                    
      37: 
      # meta: pop location
      $(Expr(:inbounds, :pop))
      SSAValue(0) = #temp#@_13::Tuple{Char,Int64}
      SSAValue(5) = (Base.getfield)(SSAValue(0), 1)::Char
      SSAValue(6) = (Base.add_int)(1, 1)::Int64
      chr::Char = SSAValue(5)
      SSAValue(7) = (Base.getfield)(SSAValue(0), 2)::Int64
      SSAValue(8) = (Base.add_int)(2, 1)::Int64
      next_pos::Int64 = SSAValue(7) # line 6:
      unless (Base.not_int)((Base.ult_int)((Base.bitcast)(UInt32, chr::Char), (Base
.bitcast)(UInt32, '0'))::Bool)::Bool goto 51                                      
      #temp#@_10::Bool = (Base.not_int)((Base.ult_int)((Base.bitcast)(UInt32, '9'),
 (Base.bitcast)(UInt32, chr::Char))::Bool)::Bool                                  
      goto 53
      51: 
      #temp#@_10::Bool = false
      53: 
      unless #temp#@_10::Bool goto 60 # line 7:
      n::Int64 = (Base.add_int)((Base.mul_int)(n::Int64, 10)::Int64, (Base.sub_int)
((Base.zext_int)(Int64, (Base.bitcast)(UInt32, chr::Char))::Int64, (Base.zext_int)(
Int64, (Base.bitcast)(UInt32, '0'))::Int64)::Int64)::Int64 # line 8:              
      pos::Int64 = next_pos::Int64
      goto 63
      60:  # line 10:
      return $(Expr(:new, Nullable{Int64}, false))
      63: 
      goto 8
      65: 
      $(Expr(:inbounds, :pop)) # line 13:
      return $(Expr(:new, Nullable{Int64}, true, :(n)))
  end::Nullable{Int64}

OK, I understand that the relevant branch of the conditional should not be called. But using UInt8 is still about 15-20% faster (more for longer strings):

using BenchmarkTools

function tryparse_digits(str, start, stop)
    n = 0
    pos = start
    @inbounds while pos ≤ stop
        chr, next_pos = next(str, pos)
        if '0' ≤ chr ≤ '9'
            n = n*10 + (chr - '0')
            pos = next_pos
        else
            return Nullable{Int}()
        end
    end
    Nullable(n)
end

function tryparse_digits(str::Vector{UInt8}, start, stop)
    n = 0
    z = UInt8('0')
    pos = start
    @inbounds while pos ≤ stop
        maybe_digit = str[pos] - z
        if 0 ≤ maybe_digit ≤ 9
            n = n*10 + maybe_digit
            pos += 1
        else
            return Nullable(n, false)
        end
    end
    Nullable(n)
end

str = "1980"
@benchmark tryparse_digits($str, 1, 4)
str8 = convert(Vector{UInt8}, str)
@benchmark tryparse_digits($str8, 1, 4)
julia> @benchmark tryparse_digits($str8, 1, 4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.968 ns (0.00% GC)
  median time:      11.004 ns (0.00% GC)
  mean time:        11.720 ns (0.00% GC)
  maximum time:     865.807 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark tryparse_digits($str, 1, 4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.533 ns (0.00% GC)
  median time:      12.561 ns (0.00% GC)
  mean time:        13.125 ns (0.00% GC)
  maximum time:     78.257 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

I would guess that the main reason for the difference is that next has to:

  • convert UInt8 gotten in operation b = unsafe_load(p, i) to Char(b)
  • perform a check b < 0x80

Those two operations do not have to be performed in your implementation with arrays.

@kristoffer.carlsson Unfortunately ASCIIString does not help here as it defines:

getindex(s::ASCIIString, i::Int) = (x=s.data[i]; ifelse(x < 0x80, Char(x), '\ufffd'))

which has the same cost as standard String here (so String and ASCIIString actually have the same speed in this benchmark).

In this kind of circumstance, you are probably best off just dealing with Vector{UInt8}. You don’t need String at all, and trying to interpret the bytes as Unicode characters will only slow you down.

For example, a similar strategy was used the JSON.jl module: JSON files can contain arbitrary UTF-8 encoded Unicode, but the parsing process only looks at ASCII characters like { and =. (See https://github.com/JuliaIO/JSON.jl/pull/140 and Use StringVector instead of Vector{UInt8} by TotalVerb · Pull Request #199 · JuliaIO/JSON.jl · GitHub)