What is the Interface of AbstractString?


#1

The Interfaces documentation does not cover AbstractString: https://docs.julialang.org/en/latest/manual/interfaces

What is the minimal set of methods that must be defined for a custom AbstractString type?
Is there a good example of a custom string type somewhere, or should I just follow the pattern of SubString:


#2

There are more string related packages here:

But I haven’t found a detailed answer there, these packages use string macros.


#3

It’ now documented in ?AbstractString: https://github.com/JuliaLang/julia/blob/16620b62443f0ffd3aa5fa58174b5d66a28cc646/base/strings/basic.jl#L3-L41. Should probably be added to interfaces too.


#4

Thanks Milan,

The ?AbstractString docstring doesn’t seem to have a list of required methods, but there is a section commented ## required string functions ## in the source just after the doc string. From this I conclude that the required methods are:

  • ncodeunits(s::MyAbstractString) -> Int
  • codeunit(s::MyAbstractString, i::Integer) -> Union{UInt8, UInt16, UInt32}
  • isvalid(s::MyAbstractString, i::Integer) -> Bool
  • next(s::MyAbstractString, i::Integer) -> Tuple{Char, Int}
  • length(s::MyAbstractString) or define Base.IteratorSize(::MyAbstractString) = Base.SizeUnknown()

Based on looking at substring.jl I guess that it is probably good to specialise some of these functions as well (where the implementation permits): getindex, cmp, pointer, unsafe_convert, nextind, thisind.

I’m experimenting with struct JSONString{S <: AbstractString} j::S; i::Int where j is the whole JSON text and i is the offset of a " delimited JSON string value. The intention is to do parsing of JSON escape sequences in-line in codeunit so that JSON string values can be used in-place without copying or un-escaping.

I’m hoping that this makes field name lookup string comparisons fast because most comparisons will fail early without ever unescaping the full string (the scanner sets a flag for strings with no escaping so that these can shortcut to a direct memcmp).

I’m also hoping for performance gains in the case where a large JSON text is opened, a few values are tweaked, and then it is saved again. In this case there is no need to ever un-escape or re-escape all the unchanged values.


#5

Reading in a bit more detail I see that…

In order for isvalid(s, i) to be an O(1) function, the encoding of s must be self-synchronizing this is a basic assumption of Julia’s generic string support.

This mightn’t be doable in general for JSON strings, which can have escape sequences up to 6 characters long: \uXXXX. But I’m hopeful that I can still come up with something efficient for sequential access in strings with occasional escape sequences.


#6

6 is O(1)


#7

True. The wiki page about self-synchronizing codes talks about “the symbol stream formed by a portion of one code word, is not a valid code word”, which seems not to be the case for JSON escaping. "["\uD834\uDd1e"]. If my index points to the second '' I have to backtrack to the first '\' to see that it is not a valid index. But this is still a constant amount of work.


#8

I plan on documenting this more thoroughly if I can ever get the Pkg3 stuff merged and working on master… it’s a real beast to change all of code loading and package resolution and installation. But it’s getting there.


#9

I have JSON.String <: AbstractString mostly working now.

I’m seeing something unexpected. I have instrumented next to print out its inputs and outputs.
When i compare a JSON.String to a Base.String with no common prefix, I expect to see next called just once. However, it seems that cmp(a::AbstractString, b::AbstractString) calls next one more time than it needs to:

julia> using LazyJSON
julia> json = """\"Foo\\u1234Bar\""""
julia> j = LazyJSON.parse(json)
julia> typeof(j)
LazyJSON.String{String}
julia> j.s
"\"Foo\\u1234Bar\""
julia> j.i
1

julia> j == "XYZ"
next(::JSON.String, 2) -> F, 2
next(::JSON.String, 3) -> o, 3
false

julia> j == "Food"
next(::JSON.String, 2) -> F, 2
next(::JSON.String, 3) -> o, 3
next(::JSON.String, 4) -> o, 4
next(::JSON.String, 10) -> ሴ, 10
next(::JSON.String, 11) -> B, 11
false

julia> j == "FooሴBar"
next(::JSON.String, 2) -> F, 2
next(::JSON.String, 3) -> o, 3
next(::JSON.String, 4) -> o, 4
next(::JSON.String, 10) -> ሴ, 10
next(::JSON.String, 11) -> B, 11
next(::JSON.String, 12) -> a, 12
next(::JSON.String, 13) -> r, 13
true

#10

Example with escape sequences in field names, 992 bytes allocated vs 37k for JSON.jl:

julia> json = """
           {
               "Image\\t Tab": {
                   "Width":  800,
                   "Height": 600,
                   "Title":  "View from 15th Floor",
                   "Thumb\\nail": {
                       "\\ud83e\\udd16 Url": "http://www.example.com/image/481989943",
                       "Height": 125,
                       "Width":  100
                   },
                   "Animated" : false,
                   "IDs": [116, 943, 234, 38793]
               }
           }
       """
julia> function f(json)
           v = LazyJSON.parse(json)
           v["Image\t Tab"]["Thumb\nail"]["🤖 Url"]
       end

julia> @time f(json)
  0.000023 seconds (28 allocations: 992 bytes)
"http://www.example.com/image/481989943"```

vs JSON.jl

julia> function f(json)
           v = JSON.parse(json)
           v["Image\t Tab"]["Thumb\nail"]["🤖 Url"]
       end
julia> @time f(json)
  0.000554 seconds (294 allocations: 36.844 KiB)
"http://www.example.com/image/481989943"

Example without escapes, only 8 allocations:

julia> function f(json)
           v = LazyJSON.parse(json)
           v["Image"]["Thumbnail"]["Url"]
       end

julia> @time f(json)
  0.000006 seconds (8 allocations: 288 bytes)
"http://www.example.com/image/481989943"

#11

But, here is the fun part…

With a 1MB input file, the lazy AbstractString parser takes 2% of the time and 384 bytes vs 7MB.

$ du -sh ec2-2016-11-15.normal.json
1012K	ec2-2016-11-15.normal.json
j = String(read("ec2-2016-11-15.normal.json"))
julia> function f(json)
           v = JSON.parse(json)
           v["shapes"]["scope"]["enum"][1]
       end
julia> @time f(j)
  0.066773 seconds (66.43 k allocations: 7.087 MiB)
"Availability Zone"
julia> function f(json)
           v = LazyJSON.parse(json)
           v["shapes"]["scope"]["enum"][1]
       end
julia> @time f(j)
  0.001392 seconds (12 allocations: 384 bytes)
"Availability Zone"

#12

I’m just wrapping up some rather severe rewriting of the parser in JSON.jl (I started it before I saw Jacob Quinn’s JSON2.jl and before you posted your LazyJSON.jl).
I’m hoping that maybe we can combine some of our efforts, and have a really kick-*ss JSON parser for Julia! :grinning:


#13

I would like to think that this is probably the most complete custom string type: https://github.com/JuliaString/Strs.jl, it is meant to provide complete replacements for both the String and Char types.

When I am done, I’ll try to document what string functionality needed to be implemented, both for a minimal set, and for a complete replacement.


#14

I’ve cleaned up my jsonhack repo and added some documentation and moved it to here: https://github.com/samoconnor/LazyJSON.jl


#15

I’ve recently implemented another AbstractString. This time backed by an IOBuffer. I noticed a few rough edges:

  • The Base.thisind(::String, i) function just calls Base._thisind_str(s, i) (note: no type on s). This is conveniant because I can implement Base.thisind(ms::MyString, i) as Base._thisind_str(ms, i). However, it would be easier if the Base.thisind(::String, i) was changed to Base.thisind(::AbstractString, i), then I wouldn’t need a specialisation at all.

  • The Base.next(s::String, i::Int) is not just a wrapper for _next. So I’ve had to copy/paste it and Base.next_continued(s::String, ...) into MyStrings.jl. It would be nice if Base.next(::String, ...) was widened to Base.next(s::AbstractString, ...)

If thisind and next were tweaked like this, then the custom AbstractString would be simply:

Base.codeunit(s::IOString) = UInt8
Base.ncodeunits(s::IOString) = s.buf.size
Base.codeunit(s::IOString, i::Integer) = s.buf.data[i]
Base.pointer(s::IOString) = pointer(s.buf.data)
Base.pointer(s::IOString, i) = pointer(s.buf.data, i)

Base.convert(::Type{Base.String}, s::IOString) =
    unsafe_string(pointer(s), ncodeunits(s))

Reference: https://github.com/samoconnor/LazyJSON.jl/blob/master/src/IOStrings.jl


#16

Your proposal relies on the assumption that all AbstractString types use UTF-8, but that’s not part of the interface.