The ?AbstractString docstring doesn't seem to have a list of required methods, but there is a section commented ## required string functions ## in the source just after the docstring. From this I conclude that the required methods are:
length(s::MyAbstractString) or define Base.IteratorSize(::MyAbstractString) = Base.SizeUnknown()
Based on looking at substring.jl I guess that it is probably good to specialise some of these functions as well (where the implementation permits): getindex, cmp, pointer, unsafe_convert, nextind, thisind.
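A minimal sketch of what such a subtype can look like on current Julia (1.x), where iterate has replaced the next function discussed in this thread. The type name WrappedString is hypothetical, and the choice of methods is an assumption about what suffices for the generic fallbacks to work, not an official list:

```julia
# Hypothetical wrapper type: forwards the low-level string primitives
# to an underlying String and relies on Base's generic fallbacks for the rest.
struct WrappedString <: AbstractString
    data::String
end

Base.ncodeunits(s::WrappedString) = ncodeunits(s.data)
Base.codeunit(s::WrappedString) = codeunit(s.data)              # code unit type (UInt8)
Base.codeunit(s::WrappedString, i::Integer) = codeunit(s.data, i)
Base.isvalid(s::WrappedString, i::Integer) = isvalid(s.data, i)
Base.iterate(s::WrappedString, i::Int=firstindex(s)) = iterate(s.data, i)

w = WrappedString("héllo")
println(length(w))        # generic length, built on isvalid and ncodeunits
println(w == "héllo")     # generic comparison, built on iterate
println(collect(w))
```

With only these definitions, length, ==, collect and printing go through the generic AbstractString code paths, which is where specialising getindex, cmp, nextind or thisind could later pay off.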
I'm experimenting with struct JSONString{S <: AbstractString} j::S; i::Int end, where j is the whole JSON text and i is the offset of a "-delimited JSON string value. The intention is to parse JSON escape sequences in-line in codeunit so that JSON string values can be used in-place without copying or un-escaping.
I'm hoping that this makes field name lookup string comparisons fast because most comparisons will fail early without ever unescaping the full string (the scanner sets a flag for strings with no escaping so that these can shortcut to a direct memcmp).
I'm also hoping for performance gains in the case where a large JSON text is opened, a few values are tweaked, and then it is saved again. In this case there is no need to ever un-escape or re-escape all the unchanged values.
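The early-failing field name comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the package: json_key_equals is a hypothetical helper, it handles only the single-character escapes, it rejects \uXXXX outright, and it omits the memcmp shortcut for strings flagged as escape-free.

```julia
# Hypothetical helper: compare the JSON string value whose opening quote is at
# json[i] against `key`, un-escaping one character at a time and returning
# as soon as a mismatch is found.
function json_key_equals(json::AbstractString, i::Int, key::AbstractString)
    json[i] == '"' || throw(ArgumentError("expected an opening quote at index $i"))
    j = nextind(json, i)
    k = firstindex(key)
    while true
        c = json[j]                      # throws BoundsError if the string is unterminated
        if c == '"'                      # closing quote: equal only if key is used up
            return k > lastindex(key)
        end
        if c == '\\'
            j = nextind(json, j)
            e = json[j]
            e == 'u' && throw(ArgumentError("\\uXXXX escapes are not handled in this sketch"))
            c = e == 'n' ? '\n' :
                e == 't' ? '\t' :
                e == 'r' ? '\r' :
                e == 'b' ? '\b' :
                e == 'f' ? '\f' : e      # '"', '\\' and '/' stand for themselves
        end
        k > lastindex(key) && return false   # JSON value is longer than key
        c == key[k] || return false          # fail early on the first mismatch
        j = nextind(json, j)
        k = nextind(key, k)
    end
end

json = "{\"a\\nb\": 1}"                  # the JSON text {"a\nb": 1}
println(json_key_equals(json, 2, "a\nb"))   # true
println(json_key_equals(json, 2, "ab"))     # false
```

Nothing is allocated and the escaped text is never copied; a lookup against a non-matching key usually returns after the first character.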
In order for isvalid(s, i) to be an O(1) function, the encoding of s must be self-synchronizing; this is a basic assumption of Julia's generic string support.
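For the built-in UTF-8 String this property is easy to see: continuation bytes have the bit pattern 10xxxxxx, so whether an index starts a character can be decided from a bounded number of nearby code units. A small illustration using only Base functions:

```julia
s = "aαb"                       # 'α' takes two code units, at indices 2 and 3

println(isvalid(s, 2))          # true: index 2 is the start of 'α'
println(isvalid(s, 3))          # false: index 3 is inside 'α'
println(thisind(s, 3))          # 2: start of the character containing index 3
println(nextind(s, 2))          # 4: start of the following character

# The second byte of 'α' is a UTF-8 continuation byte (10xxxxxx).
println(codeunit(s, 3) & 0xc0 == 0x80)   # true
```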
This mightn't be doable in general for JSON strings, which can have escape sequences up to 6 characters long: \uXXXX. But I'm hopeful that I can still come up with something efficient for sequential access in strings with occasional escape sequences.
True. The wiki page about self-synchronizing codes talks about "the symbol stream formed by a portion of one code word, is not a valid code word", which seems not to be the case for JSON escaping, for example ["\uD834\uDd1e"]. If my index points to the second '\' I have to backtrack to the first '\' to see that it is not a valid index. But this is still a constant amount of work.
I plan on documenting this more thoroughly if I can ever get the Pkg3 stuff merged and working on master… it's a real beast to change all of code loading and package resolution and installation. But it's getting there.
I have JSON.String <: AbstractString mostly working now.
I'm seeing something unexpected. I have instrumented next to print out its inputs and outputs.
When I compare a JSON.String to a Base.String with no common prefix, I expect to see next called just once. However, it seems that cmp(a::AbstractString, b::AbstractString) calls next one more time than it needs to:
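A minimal sketch of how this kind of measurement could be set up on current Julia (1.x), where iterate has replaced next. CountingString and its calls field are hypothetical names, not part of JSON.jl or LazyJSON.jl, and the count printed depends on the Julia version, so no expected value is given:

```julia
# Hypothetical wrapper that counts how often the generic string code
# asks for the next character.
struct CountingString <: AbstractString
    data::String
    calls::Base.RefValue{Int}
end
CountingString(s::AbstractString) = CountingString(String(s), Ref(0))

Base.ncodeunits(s::CountingString) = ncodeunits(s.data)
Base.codeunit(s::CountingString) = codeunit(s.data)
Base.codeunit(s::CountingString, i::Integer) = codeunit(s.data, i)
Base.isvalid(s::CountingString, i::Integer) = isvalid(s.data, i)

function Base.iterate(s::CountingString, i::Int=firstindex(s))
    s.calls[] += 1                     # record every iteration step
    return iterate(s.data, i)
end

a = CountingString("xyz")
result = cmp(a, "abc")                 # no common prefix with "abc"
println("cmp = ", result, ", iterate calls = ", a.calls[])
```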
I'm just wrapping up some rather severe rewriting of the parser in JSON.jl (I started it before I saw Jacob Quinn's JSON2.jl and before you posted your LazyJSON.jl).
I'm hoping that maybe we can combine some of our efforts, and have a really kick-*ss JSON parser for Julia!
I've recently implemented another AbstractString, this time backed by an IOBuffer. I noticed a few rough edges:
The Base.thisind(::String, i) function just calls Base._thisind_str(s, i) (note: no type on s). This is convenient because I can implement Base.thisind(ms::MyString, i) as Base._thisind_str(ms, i). However, it would be easier if Base.thisind(::String, i) were changed to Base.thisind(::AbstractString, i); then I wouldn't need a specialisation at all.
The Base.next(s::String, i::Int) method is not just a wrapper for _next, so I've had to copy/paste it and Base.next_continued(s::String, ...) into MyStrings.jl. It would be nice if Base.next(::String, ...) were widened to Base.next(s::AbstractString, ...).
If thisind and next were tweaked like this, then the custom AbstractString would be simply:
I have a SubString-like AbstractString subtype where it would be convenient for the indexes to be the same as those of an underlying string; i.e. isvalid(s, 1) == false and firstindex(s) != 1 (LazyHTTP.jl).
I would like to know: Is it valid for firstindex to return > 1 for an AbstractString subtype? @StefanKarpinski ? @nalimilan ?
But, the definition of first(keys(::AbstractString)) implies that isvalid(::AbstractString, 1) is always true: https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L536-L538
However, the iterate method above uses firstindex to set the start state, which suggests that it might be ok to define firstindex(::MyStringSubtype) != 1 and isvalid(::MyStringSubtype, 1) == false
Are the places that assume isvalid(s, 1) == true bugs? (i.e. hard-coded 1 instead of calling firstindex(s))
Or is isvalid(s, 1) == true intended to be an invariant property of the AbstractString interface?
AFAIK AbstractString implementations are supposed to have indices starting at 1. You're free to use any value for the iteration state, though. (The excerpt of the docs you quote refers to what happens when passing out of bounds indices to nextind, which is a different issue.)
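Base's own SubString follows this convention, which may be a useful model for a SubString-like type: its indices always start at 1 and it stores an offset for translating to indices of the parent string. A small illustration (offset is an internal field of SubString, read here only for demonstration):

```julia
text = "hello world"
sub = SubString(text, 7)            # view of "world", no copy of the data

println(firstindex(sub))            # 1: indices are rebased, not shared with the parent
println(sub[1])                     # 'w'
println(sub.offset)                 # 6: internal field, parent index = sub index + offset
println(parent(sub)[1 + sub.offset])  # 'w' again, via the parent string
```

A type that wants to expose parent-string positions can keep 1-based indexing for the AbstractString interface and offer the translation as a separate accessor.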