Working with ASCII strings

question

#1

I am processing very large files (~500GB), parsing integers, dates, and similar. I found that writing custom functions for this (along the lines of, eg, tryparsenext_base10) really speeds up my code (2–20x).

I found that further speedups are achieved by working with bytes instead of strings/characters. I can do this because all of my data is ASCII. Most of the speedups come from not calling slow_utf8_next.

I would like to understand what the best way is to implement these things. Should I convert(Vector{UInt8}, ...) and then use getindex, or use the string directly with codeunit?

I also want to use “views” into strings (to minimize allocation). My understanding is that SubString is precisely for this, by then it works using character indices. If I want views into ASCII strings and want to avoid calling slow_utf8_next, should I stick to Vector{UInt8} instead?


#2

How about using the ASCIIString type from https://github.com/JuliaArchive/LegacyStrings.jl.


#3

Where does slow_utf8_next get called in your code?
I am asking because next(s::String, i::Int) should not call slow_utf8_next if your strings contain only ASCII characters (so maybe actually there is a different performance problem that should be solved for ASCII only strings).


#4

Perhaps I am doing something not the right way, but if in v0.6.0 I define

function tryparse_digits(str, start, stop)
    n = 0
    pos = start
    @inbounds while pos ≤ stop
        chr, next_pos = next(str, pos)
        if '0' ≤ chr ≤ '9'
            n = n*10 + (chr - '0')
            pos = next_pos
        else
            return Nullable{Int}()
        end
    end
    Nullable(n)
end

then

julia> @code_warntype tryparse_digits("1980", 1, 4)
Variables:
  #self#::#tryparse_digits
  str::String
  start::Int64
  stop::Int64
  chr::Char
  next_pos::Int64
  #temp#@_7::Int64
  n::Int64
  pos::Int64
  #temp#@_10::Bool
  p::Ptr{UInt8}
  b::UInt8
  #temp#@_13::Tuple{Char,Int64}

Body:
  begin 
      n::Int64 = 0 # line 3:
      pos::Int64 = start::Int64 # line 4:
      $(Expr(:inbounds, true))
      NewvarNode(:(chr::Char))
      NewvarNode(:(next_pos::Int64))
      8: 
      unless (Base.sle_int)(pos::Int64, stop::Int64)::Bool goto 65 # line 5:
      $(Expr(:inbounds, false))
      # meta: location strings/string.jl next 196
      NewvarNode(:(p::Ptr{UInt8}))
      NewvarNode(:(b::UInt8)) # line 197:
      16:  # line 199:
      $(Expr(:inbounds, false))
      # meta: location strings/string.jl pointer 51
      # meta: location pointer.jl unsafe_convert 37
      SSAValue(4) = $(Expr(:foreigncall, :(:jl_value_ptr), Ptr{Void}, svec(Any), :(
str), 0))                                                                         
      SSAValue(3) = (Core.sizeof)(Base.Int)::Int64
      # meta: pop location
      # meta: pop location
      $(Expr(:inbounds, :pop))
      p::Ptr{UInt8} = (Base.bitcast)(Ptr{UInt8}, (Base.bitcast)(Ptr{Void}, (Base.ad
d_int)((Base.bitcast)(UInt64, SSAValue(4)), (Base.bitcast)(UInt64, SSAValue(3)))::U
Int64)) # line 200:                                                               
      b::UInt8 = (Base.pointerref)(p::Ptr{UInt8}, pos::Int64, 1)::UInt8 # line 201:
      unless (Base.ult_int)(b::UInt8, 0x80)::Bool goto 34 # line 202:
      #temp#@_13::Tuple{Char,Int64} = (Core.tuple)((Base.bitcast)(Char, (Base.zext_
int)(UInt32, b::UInt8)::UInt32), (Base.add_int)(pos::Int64, 1)::Int64)::Tuple{Char,
Int64}                                                                            
      goto 37
      34:  # line 204:
      #temp#@_13::Tuple{Char,Int64} = $(Expr(:invoke, MethodInstance for slow_utf8_
next(::Ptr{UInt8}, ::UInt8, ::Int64, ::Int64), :(Base.slow_utf8_next), :(p), :(b), 
:(pos), :((Core.getfield)(str, :len)::Int64)))                                    
      37: 
      # meta: pop location
      $(Expr(:inbounds, :pop))
      SSAValue(0) = #temp#@_13::Tuple{Char,Int64}
      SSAValue(5) = (Base.getfield)(SSAValue(0), 1)::Char
      SSAValue(6) = (Base.add_int)(1, 1)::Int64
      chr::Char = SSAValue(5)
      SSAValue(7) = (Base.getfield)(SSAValue(0), 2)::Int64
      SSAValue(8) = (Base.add_int)(2, 1)::Int64
      next_pos::Int64 = SSAValue(7) # line 6:
      unless (Base.not_int)((Base.ult_int)((Base.bitcast)(UInt32, chr::Char), (Base
.bitcast)(UInt32, '0'))::Bool)::Bool goto 51                                      
      #temp#@_10::Bool = (Base.not_int)((Base.ult_int)((Base.bitcast)(UInt32, '9'),
 (Base.bitcast)(UInt32, chr::Char))::Bool)::Bool                                  
      goto 53
      51: 
      #temp#@_10::Bool = false
      53: 
      unless #temp#@_10::Bool goto 60 # line 7:
      n::Int64 = (Base.add_int)((Base.mul_int)(n::Int64, 10)::Int64, (Base.sub_int)
((Base.zext_int)(Int64, (Base.bitcast)(UInt32, chr::Char))::Int64, (Base.zext_int)(
Int64, (Base.bitcast)(UInt32, '0'))::Int64)::Int64)::Int64 # line 8:              
      pos::Int64 = next_pos::Int64
      goto 63
      60:  # line 10:
      return $(Expr(:new, Nullable{Int64}, false))
      63: 
      goto 8
      65: 
      $(Expr(:inbounds, :pop)) # line 13:
      return $(Expr(:new, Nullable{Int64}, true, :(n)))
  end::Nullable{Int64}

#5

OK, I understand that the relevant branch of the conditional should not be called. But using UInt8 is still about 15-20% faster (more for longer strings):

using BenchmarkTools

function tryparse_digits(str, start, stop)
    n = 0
    pos = start
    @inbounds while pos ≤ stop
        chr, next_pos = next(str, pos)
        if '0' ≤ chr ≤ '9'
            n = n*10 + (chr - '0')
            pos = next_pos
        else
            return Nullable{Int}()
        end
    end
    Nullable(n)
end

function tryparse_digits(str::Vector{UInt8}, start, stop)
    n = 0
    z = UInt8('0')
    pos = start
    @inbounds while pos ≤ stop
        maybe_digit = str[pos] - z
        if 0 ≤ maybe_digit ≤ 9
            n = n*10 + maybe_digit
            pos += 1
        else
            return Nullable(n, false)
        end
    end
    Nullable(n)
end

str = "1980"
@benchmark tryparse_digits($str, 1, 4)
str8 = convert(Vector{UInt8}, str)
@benchmark tryparse_digits($str8, 1, 4)
julia> @benchmark tryparse_digits($str8, 1, 4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.968 ns (0.00% GC)
  median time:      11.004 ns (0.00% GC)
  mean time:        11.720 ns (0.00% GC)
  maximum time:     865.807 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark tryparse_digits($str, 1, 4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.533 ns (0.00% GC)
  median time:      12.561 ns (0.00% GC)
  mean time:        13.125 ns (0.00% GC)
  maximum time:     78.257 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

#6

I would guess that the main reason for the difference is that next has to:

  • convert UInt8 gotten in operation b = unsafe_load(p, i) to Char(b)
  • perform a check b < 0x80

Those two operations do not have to be performed in your implementation with arrays.

@kristoffer.carlsson Unfortunately ASCIIString does not help here as it defines:

getindex(s::ASCIIString, i::Int) = (x=s.data[i]; ifelse(x < 0x80, Char(x), '\ufffd'))

which has the same cost as standard String here (so String and ASCIIString actually have the same speed in this benchmark).


#7

In this kind of circumstance, you are probably best off just dealing with Vector{UInt8}. You don’t need String at all, and trying to interpret the bytes as Unicode characters will only slow you down.

For example, a similar strategy was used the JSON.jl module: JSON files can contain arbitrary UTF-8 encoded Unicode, but the parsing process only looks at ASCII characters like { and =. (See https://github.com/JuliaIO/JSON.jl/pull/140 and https://github.com/JuliaIO/JSON.jl/pull/199)