What is the Interface of AbstractString?

The Interfaces documentation does not cover AbstractString: https://docs.julialang.org/en/latest/manual/interfaces

What is the minimal set of methods that must be defined for a custom AbstractString type?
Is there a good example of a custom string type somewhere, or should I just follow the pattern of SubString:
https://github.com/JuliaLang/julia/blob/master/base/strings/substring.jl

2 Likes

There are more string related packages here:

But I havenā€™t found a detailed answer there, these packages use string macros.

Itā€™ now documented in ?AbstractString: https://github.com/JuliaLang/julia/blob/16620b62443f0ffd3aa5fa58174b5d66a28cc646/base/strings/basic.jl#L3-L41. Should probably be added to interfaces too.

Thanks Milan,

The ?AbstractString docstring doesnā€™t seem to have a list of required methods, but there is a section commented ## required string functions ## in the source just after the doc string. From this I conclude that the required methods are:

  • ncodeunits(s::MyAbstractString) -> Int
  • codeunit(s::MyAbstractString, i::Integer) -> Union{UInt8, UInt16, UInt32}
  • isvalid(s::MyAbstractString, i::Integer) -> Bool
  • next(s::MyAbstractString, i::Integer) -> Tuple{Char, Int}
  • length(s::MyAbstractString) or define Base.IteratorSize(::MyAbstractString) = Base.SizeUnknown()

Based on looking at substring.jl I guess that it is probably good to specialise some of these functions as well (where the implementation permits): getindex, cmp, pointer, unsafe_convert, nextind, thisind.

Iā€™m experimenting with struct JSONString{S <: AbstractString} j::S; i::Int where j is the whole JSON text and i is the offset of a " delimited JSON string value. The intention is to do parsing of JSON escape sequences in-line in codeunit so that JSON string values can be used in-place without copying or un-escaping.

Iā€™m hoping that this makes field name lookup string comparisons fast because most comparisons will fail early without ever unescaping the full string (the scanner sets a flag for strings with no escaping so that these can shortcut to a direct memcmp).

Iā€™m also hoping for performance gains in the case where a large JSON text is opened, a few values are tweaked, and then it is saved again. In this case there is no need to ever un-escape or re-escape all the unchanged values.

https://github.com/samoconnor/jsonhack/blob/master/src/AbstractString.jl#L27-L39

1 Like

Reading in a bit more detail I see thatā€¦

In order for isvalid(s, i) to be an O(1) function, the encoding of s must be self-synchronizing this is a basic assumption of Juliaā€™s generic string support.

This mightnā€™t be doable in general for JSON strings, which can have escape sequences up to 6 characters long: \uXXXX. But Iā€™m hopeful that I can still come up with something efficient for sequential access in strings with occasional escape sequences.

6 is O(1)

1 Like

True. The wiki page about self-synchronizing codes talks about ā€œthe symbol stream formed by a portion of one code word, is not a valid code wordā€, which seems not to be the case for JSON escaping. "["\uD834\uDd1e"]. If my index points to the second '' I have to backtrack to the first '\' to see that it is not a valid index. But this is still a constant amount of work.

I plan on documenting this more thoroughly if I can ever get the Pkg3 stuff merged and working on masterā€¦ itā€™s a real beast to change all of code loading and package resolution and installation. But itā€™s getting there.

1 Like

I have JSON.String <: AbstractString mostly working now.

Iā€™m seeing something unexpected. I have instrumented next to print out its inputs and outputs.
When i compare a JSON.String to a Base.String with no common prefix, I expect to see next called just once. However, it seems that cmp(a::AbstractString, b::AbstractString) calls next one more time than it needs to:

julia> using LazyJSON
julia> json = """\"Foo\\u1234Bar\""""
julia> j = LazyJSON.parse(json)
julia> typeof(j)
LazyJSON.String{String}
julia> j.s
"\"Foo\\u1234Bar\""
julia> j.i
1

julia> j == "XYZ"
next(::JSON.String, 2) -> F, 2
next(::JSON.String, 3) -> o, 3
false

julia> j == "Food"
next(::JSON.String, 2) -> F, 2
next(::JSON.String, 3) -> o, 3
next(::JSON.String, 4) -> o, 4
next(::JSON.String, 10) -> įˆ“, 10
next(::JSON.String, 11) -> B, 11
false

julia> j == "Fooįˆ“Bar"
next(::JSON.String, 2) -> F, 2
next(::JSON.String, 3) -> o, 3
next(::JSON.String, 4) -> o, 4
next(::JSON.String, 10) -> įˆ“, 10
next(::JSON.String, 11) -> B, 11
next(::JSON.String, 12) -> a, 12
next(::JSON.String, 13) -> r, 13
true

Example with escape sequences in field names, 992 bytes allocated vs 37k for JSON.jl:

julia> json = """
           {
               "Image\\t Tab": {
                   "Width":  800,
                   "Height": 600,
                   "Title":  "View from 15th Floor",
                   "Thumb\\nail": {
                       "\\ud83e\\udd16 Url": "http://www.example.com/image/481989943",
                       "Height": 125,
                       "Width":  100
                   },
                   "Animated" : false,
                   "IDs": [116, 943, 234, 38793]
               }
           }
       """
julia> function f(json)
           v = LazyJSON.parse(json)
           v["Image\t Tab"]["Thumb\nail"]["šŸ¤– Url"]
       end

julia> @time f(json)
  0.000023 seconds (28 allocations: 992 bytes)
"http://www.example.com/image/481989943"```

vs JSON.jl

julia> function f(json)
           v = JSON.parse(json)
           v["Image\t Tab"]["Thumb\nail"]["šŸ¤– Url"]
       end
julia> @time f(json)
  0.000554 seconds (294 allocations: 36.844 KiB)
"http://www.example.com/image/481989943"

Example without escapes, only 8 allocations:

julia> function f(json)
           v = LazyJSON.parse(json)
           v["Image"]["Thumbnail"]["Url"]
       end

julia> @time f(json)
  0.000006 seconds (8 allocations: 288 bytes)
"http://www.example.com/image/481989943"
1 Like

But, here is the fun partā€¦

With a 1MB input file, the lazy AbstractString parser takes 2% of the time and 384 bytes vs 7MB.

$ du -sh ec2-2016-11-15.normal.json
1012K	ec2-2016-11-15.normal.json
j = String(read("ec2-2016-11-15.normal.json"))
julia> function f(json)
           v = JSON.parse(json)
           v["shapes"]["scope"]["enum"][1]
       end
julia> @time f(j)
  0.066773 seconds (66.43 k allocations: 7.087 MiB)
"Availability Zone"
julia> function f(json)
           v = LazyJSON.parse(json)
           v["shapes"]["scope"]["enum"][1]
       end
julia> @time f(j)
  0.001392 seconds (12 allocations: 384 bytes)
"Availability Zone"
2 Likes

Iā€™m just wrapping up some rather severe rewriting of the parser in JSON.jl (I started it before I saw Jacob Quinnā€™s JSON2.jl and before you posted your LazyJSON.jl).
Iā€™m hoping that maybe we can combine some of our efforts, and have a really kick-*ss JSON parser for Julia! :grinning:

2 Likes

I would like to think that this is probably the most complete custom string type: GitHub - JuliaString/Strs.jl: String support package for Julia, it is meant to provide complete replacements for both the String and Char types.

When I am done, Iā€™ll try to document what string functionality needed to be implemented, both for a minimal set, and for a complete replacement.

Iā€™ve cleaned up my jsonhack repo and added some documentation and moved it to here: https://github.com/samoconnor/LazyJSON.jl

2 Likes

Iā€™ve recently implemented another AbstractString. This time backed by an IOBuffer. I noticed a few rough edges:

  • The Base.thisind(::String, i) function just calls Base._thisind_str(s, i) (note: no type on s). This is conveniant because I can implement Base.thisind(ms::MyString, i) as Base._thisind_str(ms, i). However, it would be easier if the Base.thisind(::String, i) was changed to Base.thisind(::AbstractString, i), then I wouldnā€™t need a specialisation at all.

  • The Base.next(s::String, i::Int) is not just a wrapper for _next. So Iā€™ve had to copy/paste it and Base.next_continued(s::String, ...) into MyStrings.jl. It would be nice if Base.next(::String, ...) was widened to Base.next(s::AbstractString, ...)

If thisind and next were tweaked like this, then the custom AbstractString would be simply:

Base.codeunit(s::IOString) = UInt8
Base.ncodeunits(s::IOString) = s.buf.size
Base.codeunit(s::IOString, i::Integer) = s.buf.data[i]
Base.pointer(s::IOString) = pointer(s.buf.data)
Base.pointer(s::IOString, i) = pointer(s.buf.data, i)

Base.convert(::Type{Base.String}, s::IOString) =
    unsafe_string(pointer(s), ncodeunits(s))

Reference: https://github.com/samoconnor/LazyJSON.jl/blob/master/src/IOStrings.jl

Your proposal relies on the assumption that all AbstractString types use UTF-8, but thatā€™s not part of the interface.

I have an SubString-like AbstractString subtype where it would be convenient for the indexes to be the same as an underlying string; i.e. isvalid(s, 1) == false and firstindex(s) != 1 (LazyHTTP.jl ).

I would like to know: Is it valid for firstindex to return > 1 for an AbstractString subtype? @StefanKarpinski ? @nalimilan ?

The AbstractString spec hints at sub-string types that use the same indexes as an underlying string:
https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L31-L32

But, the definition of first(keys(::AbstractString)) implies that isvalid(::AbstractString, 1) is always true:
https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L536-L538
However, the iterate method above uses firstindex to set the start state, which suggests that it might be ok to define firstindex(::MyStringSubtype) != 1 and isvalid(::MyStringSubtype, 1) == false

https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L158-L158

Are the places that assume isvalid(s, 1) == true bugs? (i.e. hard-coded 1 instead of calling firstindex(s))
Or is isvalid(s, 1) == true intended to be an invariant property of the AbstractString interface?

AFAIK AbstractString implementations are supposed to have indices starting at 1. Youā€™re free to use any value for the iteration state, though. (The excerpt of the docs you quote refers to what happens when passing out of bounds indices to nextind, which is a different issue.)

Thanks @nalimilan.
Regarding using non index values for iteration, I can see at least one place where it is assumed that an index is a valid iteration state:
https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L161-L164

Good catch. Though it can be considered as a reasonable and convenient fallback, as one can easily override getindex if needed.