[ANN]: WIP Strs.jl package ready for alpha review and testing


#1

I’d like to announce the availability for review and comment of my (very WIP!) Strs.jl package,

Please check it out, it’s still rather rough, a lot of code hasn’t been optimized, more needs to be made to use traits instead of having things hard coded, many of my ideas for this I haven’t even started to implement (optional substrings, cached hash values, cached UTF-8 or UTF-16 or raw bytes)

(I still need to make the logo I want for the JuliaString org :slight_smile: 3 concentric circles, of the “Julia” colors, with ASCII/Latin1 text in the other, BMP (probably math, Japanese, maybe Hindo stuff), and finally non-BMP (a few emojis) in the center).

Thanks, and Happy New Year!


#2

Should faster string sorts got to JuliaString or SortingAlgorithms.jl?


#3

I would vote for SortingAlgorithms. Scott, correct me if I’m wrong, but it seems that Strs.jl is meant to be used in lieu of strings in Base. Ideally, a radix-based string sort would be written generally enough that it would work for any type of string, including strings in Base and any additional String types defined in packages (including Strs.jl).

Of course, the basic idea of radix sort sorting using the binary/numerical representation of characters probably conflicts with the notion of most non-ascii character encodings, especially UTF8…

Cheers,
Kevin


#4

Yeah radixsort is useful for grouping. Actually grouping is my primary use case so at some point my attention will switch to radixgroup for fast group by operations.

I think given one of the goals of Strs.jl is O(1) access to characters (bytes would be ok too). I think the sorting method for Strs.jl might need to be specialised for Strs.jl. Not sure yet.


#5

The https://github.com/JuliaString organization is meant to hold packages for better string support in Julia, just as there are organizations for bioinformatics (https://github.com/BioJulia) and plotting in Julia (https://github.com/JuliaPlots).

https://github.com/JuliaCollections/SortingAlgorithms.jl is a nice package in the https://github.com/JuliaCollections organization, which has a number of other interesting packages as well.
You could either submit a PR to add your sorting code to the SortingAlgorithms.jl package, or I imagine as a separate package to JuliaCollections (whatever the owners etc. of the org prefer)

The other packages in JuliaString (such as https://github.com/JuliaString/StringLiterals.jl, https://github.com/JuliaString/ICU.jl, and https://github.com/JuliaString/Format.jl) do not (currently!) use my new Strs.jl package, they are meant to be useful as much as possible no matter what AbstractString type you are using.

Your faster radix sorting for strings would fit in well (I think in either JuliaCollections or JuliaString, TBH), even if you don’t add optimizations for specific Str types.
I’d love to see fast sorting for the “direct indexed” string types in Strs.jl, such as BinaryStr, or any of the ones currently implemented (except for UTF8Str and UTF16Str, those are multi codeunit encodings of Unicode).

If you wish to submit a (MIT licensed) package for faster string sorting to JuliaString, that would be great.


#6

I don’t think you’d want to be doing sorting so much on the encoded forms, but rather on the code points, so only multi- codeunit encodings, such as UTF-8 (i.e. String, LegacyStrings.UTF8String, or UTF8Str), or UTF-16 (LegacyStrings.UTF16String or UTF16Str), would be an issue, and I think you would still just do the sort on the 8-bit or 16-bit code units (the result for UTF-8 should be the same as if you sorted by UTF-32, anyway).


#7

For UTF-8, sorting on code points and code units gives the same result, but for UTF-16, it does not; some small transformations are required.


#8

I’m well aware of that, but actually some software requires sorting by the UTF-16 code points (usually for compatibility reasons because software originally written for Unicode 1.x sorts them by the 16-bit values, and indices in databases are in that order). I knew how to do a Unicode 2.0 compatible sort on UTF-16 long before that https://ssl.icu-project.org/docs/papers/utf16_code_point_order.html came out - it’s fairly obvious, but for most purposes, if you want to sort for human use, you should follow the recommendations and use locale specific collation sequences, after normalization.

Having both sorting with the transformations (which, IMO, should be the default for directly dealing with UTF-16 encoded data) and without (i.e. compatible also with CESU-8 sorting order) is useful (but is the sort of thing that I’d put in a StrExtensions.jl package - I don’t think direct operations on any multi-code unit representations should really be in the base Strs.jl package, probably only simple conversions, but I’ll sort that out later.


#9

BTW, I do hope you can take the time to thoroughly review the code, and would appreciate any suggestions.


#10

@Tamas_Papp I just saw your https://github.com/tpapp/ByteParsers.jl package (while investigating your https://github.com/tpapp/julia-repl package, which as an inveterate Emacs user, is of great interest to me! :nerd_face:) - I think it could benefit from using Strs.jl instead of (or in addition to) Vector{UInt8}.

I see that your code currently makes the string be converted to a Vector{UInt8}, (which may or may not do any real work, depending on the type of the AbstractString). That would not be necessary with my Str types.
You would also not be limited to ASCII characters for the delimiter, something that came up recently (see: Unicode related error when reading a .csv) with somebody trying to use CSV.jl, because a Python program exported strings with Unicode quotes, i.e.:

‘“’: Unicode U+201c (category Pi: Punctuation, initial quote)
’”’: Unicode U+201d (category Pf: Punctuation, final quote)

You could use any of the CodeUnitSingle Str types (currently all of them except for UTF8Str and UTF16Str), such as UniStr, BinaryStr, RawByteStr, RawWordStr, RawCharStr, etc.


#11

Thanks for the suggestion. I meant ByteParsers.jl as temporary measure, and strictly for ASCII data (fortunately, most of the social science data I get is plain vanilla ASCII). But I will keep an eye on your package and revisit the issue once v0.7 is out.


#12

Just merged in a bunch of fixes, now passes runtests :grinning:

@jeff.bezanson Even though this is still WIP, please take a look!

Green means that it is more than 5% faster than String, the number is the ratio of time for String divided by time for that Str encoding for that test.

Also I have some early benchmarking results, on a few functions!








#13

I get the following error in 0.6.2

julia> using Strs
INFO: Precompiling module Strs.
WARNING: could not import Base.thisind into Strs
ERROR: LoadError: LoadError: UndefVarError: Nothing not defined
Stacktrace:
 [1] macro expansion at /Users/amrods/.julia/v0.6/Strs/src/types.jl:159 [inlined]
 [2] anonymous at ./<missing>:?
 [3] include_from_node1(::String) at /Applications/Julia-0.6.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 [4] include(::String) at /Applications/Julia-0.6.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 [5] include_from_node1(::String) at /Applications/Julia-0.6.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 [6] include(::String) at /Applications/Julia-0.6.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 [7] anonymous at ./<missing>:2
while loading /Users/amrods/.julia/v0.6/Strs/src/types.jl, in expression starting on line 156
while loading /Users/amrods/.julia/v0.6/Strs/src/Strs.jl, in expression starting on line 82
ERROR: Failed to precompile Strs to /Users/amrods/.julia/lib/v0.6/Strs.ji.
Stacktrace:
 [1] compilecache(::String) at /Applications/Julia-0.6.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 [2] _require(::Symbol) at /Applications/Julia-0.6.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 [3] require(::Symbol) at /Applications/Julia-0.6.app/Contents/Resources/julia/lib/julia/sys.dylib:?

#14

I haven’t ported it to v0.6 yet, that’s planned, shortly


#15

It is now working on v0.6.2 (and latest master!)


#16

Just a heads up, the Strs.jl package is now passing most all of the String unit tests on v0.6.2 (missing a few tests that were only for new methods in v0.7, such as length, nextind, prevind with extra parameters) and on latest v0.7, so it is no longer so “WIP”.

For most of the performance tests, it is uniformly faster than String (work still needs to be done for Regex,
since the Base Regex support doesn’t support abstract strings, and would need to be extended to be able to handle 16-bit and 32-bit codeunits), frequently 2-10x, sometimes as much as 1000x faster in my testing
(which are rather extensive, testing a large set of files, with String and various Str encodings, on 19 different string/character functions currently).

I would very much appreciate any additional benchmarks, so that Str and Chr can be compared to String and Char with a wide variety of use cases.