What is the Interface of AbstractString?

samoconnor · February 9, 2018, 6:59am

The Interfaces documentation does not cover AbstractString: https://docs.julialang.org/en/latest/manual/interfaces

What is the minimal set of methods that must be defined for a custom AbstractString type?
Is there a good example of a custom string type somewhere, or should I just follow the pattern of SubString:
https://github.com/JuliaLang/julia/blob/master/base/strings/substring.jl

NaOH · February 9, 2018, 4:05pm

There are more string related packages here:

https://github.com/JuliaString

But I haven’t found a detailed answer there, these packages use string macros.

nalimilan · February 9, 2018, 7:12pm

It’ now documented in ?AbstractString: https://github.com/JuliaLang/julia/blob/16620b62443f0ffd3aa5fa58174b5d66a28cc646/base/strings/basic.jl#L3-L41. Should probably be added to interfaces too.

samoconnor · February 9, 2018, 11:14pm

Thanks Milan,

The ?AbstractString docstring doesn’t seem to have a list of required methods, but there is a section commented ## required string functions ## in the source just after the doc string. From this I conclude that the required methods are:

ncodeunits(s::MyAbstractString) -> Int
codeunit(s::MyAbstractString, i::Integer) -> Union{UInt8, UInt16, UInt32}
isvalid(s::MyAbstractString, i::Integer) -> Bool
next(s::MyAbstractString, i::Integer) -> Tuple{Char, Int}
length(s::MyAbstractString) or define Base.IteratorSize(::MyAbstractString) = Base.SizeUnknown()

Based on looking at substring.jl I guess that it is probably good to specialise some of these functions as well (where the implementation permits): getindex, cmp, pointer, unsafe_convert, nextind, thisind.

I’m experimenting with struct JSONString{S <: AbstractString} j::S; i::Int where j is the whole JSON text and i is the offset of a " delimited JSON string value. The intention is to do parsing of JSON escape sequences in-line in codeunit so that JSON string values can be used in-place without copying or un-escaping.

I’m hoping that this makes field name lookup string comparisons fast because most comparisons will fail early without ever unescaping the full string (the scanner sets a flag for strings with no escaping so that these can shortcut to a direct memcmp).

I’m also hoping for performance gains in the case where a large JSON text is opened, a few values are tweaked, and then it is saved again. In this case there is no need to ever un-escape or re-escape all the unchanged values.

https://github.com/samoconnor/jsonhack/blob/master/src/AbstractString.jl#L27-L39

samoconnor · February 9, 2018, 11:48pm

Reading in a bit more detail I see that…

In order for isvalid(s, i) to be an O(1) function, the encoding of s must be self-synchronizing this is a basic assumption of Julia’s generic string support.

This mightn’t be doable in general for JSON strings, which can have escape sequences up to 6 characters long: \uXXXX. But I’m hopeful that I can still come up with something efficient for sequential access in strings with occasional escape sequences.

StefanKarpinski · February 10, 2018, 12:11am

6 is O(1)

samoconnor · February 10, 2018, 12:19am

True. The wiki page about self-synchronizing codes talks about “the symbol stream formed by a portion of one code word, is not a valid code word”, which seems not to be the case for JSON escaping. "["\uD834\uDd1e"]. If my index points to the second '' I have to backtrack to the first '\' to see that it is not a valid index. But this is still a constant amount of work.

StefanKarpinski · February 10, 2018, 12:48am

I plan on documenting this more thoroughly if I can ever get the Pkg3 stuff merged and working on master… it’s a real beast to change all of code loading and package resolution and installation. But it’s getting there.

samoconnor · February 10, 2018, 5:21am

I have JSON.String <: AbstractString mostly working now.

I’m seeing something unexpected. I have instrumented next to print out its inputs and outputs.
When i compare a JSON.String to a Base.String with no common prefix, I expect to see next called just once. However, it seems that cmp(a::AbstractString, b::AbstractString) calls next one more time than it needs to:

julia> using LazyJSON
julia> json = """\"Foo\\u1234Bar\""""
julia> j = LazyJSON.parse(json)
julia> typeof(j)
LazyJSON.String{String}
julia> j.s
"\"Foo\\u1234Bar\""
julia> j.i
1

julia> j == "XYZ"
next(::JSON.String, 2) -> F, 2
next(::JSON.String, 3) -> o, 3
false

julia> j == "Food"
next(::JSON.String, 2) -> F, 2
next(::JSON.String, 3) -> o, 3
next(::JSON.String, 4) -> o, 4
next(::JSON.String, 10) -> ሴ, 10
next(::JSON.String, 11) -> B, 11
false

julia> j == "FooሴBar"
next(::JSON.String, 2) -> F, 2
next(::JSON.String, 3) -> o, 3
next(::JSON.String, 4) -> o, 4
next(::JSON.String, 10) -> ሴ, 10
next(::JSON.String, 11) -> B, 11
next(::JSON.String, 12) -> a, 12
next(::JSON.String, 13) -> r, 13
true

samoconnor · February 10, 2018, 6:16am

Example with escape sequences in field names, 992 bytes allocated vs 37k for JSON.jl:

julia> json = """
           {
               "Image\\t Tab": {
                   "Width":  800,
                   "Height": 600,
                   "Title":  "View from 15th Floor",
                   "Thumb\\nail": {
                       "\\ud83e\\udd16 Url": "http://www.example.com/image/481989943",
                       "Height": 125,
                       "Width":  100
                   },
                   "Animated" : false,
                   "IDs": [116, 943, 234, 38793]
               }
           }
       """
julia> function f(json)
           v = LazyJSON.parse(json)
           v["Image\t Tab"]["Thumb\nail"]["🤖 Url"]
       end

julia> @time f(json)
  0.000023 seconds (28 allocations: 992 bytes)
"http://www.example.com/image/481989943"```

vs JSON.jl

julia> function f(json)
           v = JSON.parse(json)
           v["Image\t Tab"]["Thumb\nail"]["🤖 Url"]
       end
julia> @time f(json)
  0.000554 seconds (294 allocations: 36.844 KiB)
"http://www.example.com/image/481989943"

Example without escapes, only 8 allocations:

julia> function f(json)
           v = LazyJSON.parse(json)
           v["Image"]["Thumbnail"]["Url"]
       end

julia> @time f(json)
  0.000006 seconds (8 allocations: 288 bytes)
"http://www.example.com/image/481989943"

samoconnor · February 10, 2018, 6:22am

But, here is the fun part…

With a 1MB input file, the lazy AbstractString parser takes 2% of the time and 384 bytes vs 7MB.

$ du -sh ec2-2016-11-15.normal.json
1012K	ec2-2016-11-15.normal.json

j = String(read("ec2-2016-11-15.normal.json"))
julia> function f(json)
           v = JSON.parse(json)
           v["shapes"]["scope"]["enum"][1]
       end
julia> @time f(j)
  0.066773 seconds (66.43 k allocations: 7.087 MiB)
"Availability Zone"

julia> function f(json)
           v = LazyJSON.parse(json)
           v["shapes"]["scope"]["enum"][1]
       end
julia> @time f(j)
  0.001392 seconds (12 allocations: 384 bytes)
"Availability Zone"

ScottPJones · February 10, 2018, 2:37pm

I’m just wrapping up some rather severe rewriting of the parser in JSON.jl (I started it before I saw Jacob Quinn’s JSON2.jl and before you posted your LazyJSON.jl).
I’m hoping that maybe we can combine some of our efforts, and have a really kick-*ss JSON parser for Julia!

ScottPJones · February 10, 2018, 4:49pm

I would like to think that this is probably the most complete custom string type: GitHub - JuliaString/Strs.jl: String support package for Julia, it is meant to provide complete replacements for both the String and Char types.

When I am done, I’ll try to document what string functionality needed to be implemented, both for a minimal set, and for a complete replacement.

samoconnor · February 11, 2018, 5:44am

I’ve cleaned up my jsonhack repo and added some documentation and moved it to here: https://github.com/samoconnor/LazyJSON.jl

samoconnor · February 21, 2018, 12:24pm

I’ve recently implemented another AbstractString. This time backed by an IOBuffer. I noticed a few rough edges:

The Base.thisind(::String, i) function just calls Base._thisind_str(s, i) (note: no type on s). This is conveniant because I can implement Base.thisind(ms::MyString, i) as Base._thisind_str(ms, i). However, it would be easier if the Base.thisind(::String, i) was changed to Base.thisind(::AbstractString, i), then I wouldn’t need a specialisation at all.
The Base.next(s::String, i::Int) is not just a wrapper for _next. So I’ve had to copy/paste it and Base.next_continued(s::String, ...) into MyStrings.jl. It would be nice if Base.next(::String, ...) was widened to Base.next(s::AbstractString, ...)

If thisind and next were tweaked like this, then the custom AbstractString would be simply:

Base.codeunit(s::IOString) = UInt8
Base.ncodeunits(s::IOString) = s.buf.size
Base.codeunit(s::IOString, i::Integer) = s.buf.data[i]
Base.pointer(s::IOString) = pointer(s.buf.data)
Base.pointer(s::IOString, i) = pointer(s.buf.data, i)

Base.convert(::Type{Base.String}, s::IOString) =
    unsafe_string(pointer(s), ncodeunits(s))

Reference: https://github.com/samoconnor/LazyJSON.jl/blob/master/src/IOStrings.jl

nalimilan · February 21, 2018, 12:50pm

Your proposal relies on the assumption that all AbstractString types use UTF-8, but that’s not part of the interface.

samoconnor · September 28, 2018, 4:36am

I have an SubString-like AbstractString subtype where it would be convenient for the indexes to be the same as an underlying string; i.e. isvalid(s, 1) == false and firstindex(s) != 1 (LazyHTTP.jl ).

I would like to know: Is it valid for firstindex to return > 1 for an AbstractString subtype? @StefanKarpinski ? @nalimilan ?

The AbstractString spec hints at sub-string types that use the same indexes as an underlying string:
https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L31-L32

But, the definition of first(keys(::AbstractString)) implies that isvalid(::AbstractString, 1) is always true:
https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L536-L538
However, the iterate method above uses firstindex to set the start state, which suggests that it might be ok to define firstindex(::MyStringSubtype) != 1 and isvalid(::MyStringSubtype, 1) == false

https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L158-L158

Are the places that assume isvalid(s, 1) == true bugs? (i.e. hard-coded 1 instead of calling firstindex(s))
Or is isvalid(s, 1) == true intended to be an invariant property of the AbstractString interface?

nalimilan · September 28, 2018, 7:05am

AFAIK AbstractString implementations are supposed to have indices starting at 1. You’re free to use any value for the iteration state, though. (The excerpt of the docs you quote refers to what happens when passing out of bounds indices to nextind, which is a different issue.)

samoconnor · September 28, 2018, 8:35am

Thanks @nalimilan.
Regarding using non index values for iteration, I can see at least one place where it is assumed that an index is a valid iteration state:
https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L161-L164

nalimilan · September 28, 2018, 8:44am

Good catch. Though it can be considered as a reasonable and convenient fallback, as one can easily override getindex if needed.

Topic		Replies	Views
Help with my string type New to Julia	5	891	May 7, 2018
String indices : byte indexing feels wrong New to Julia strings , unicode	18	1413	December 5, 2023
StringIndex idea (Julia 2.0) Internals & Design strings , indexing	72	3345	March 27, 2024
Breakage due to changes in `String` slicing in v0.7 Internals & Design	35	2313	February 12, 2018
Changes to the representation of Char Internals & Design	14	2853	December 12, 2017

What is the Interface of AbstractString?

Related topics