Swift string handling


#1

Currently, Swift has much better support for string handling than Julia. This is something I’m working to change in my spare time; I strongly feel that in the future, Julia could have the best string handling support of any language, in terms of features, ease of use, and performance.
Python 3 also has better / faster support for strings than Julia at the moment.
I feel that having something better in Julia would help people who want to do NLP / ML work.


Swift for Tensorflow rationale
#2

Fast string processing is quite important for many NLP applications: checking whether a word starts with a capital letter, whether it ends with “ing”, and so on. This is paramount for NLP, especially for production systems where you can’t assume there is “a dictionary of correct words” mapped to integers. There are spelling mistakes, “joinedwords”, etc.
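As a rough illustration of the per-token checks mentioned above, here is a minimal Python sketch. The function name `surface_features` and the feature names are hypothetical, chosen only for this example; a real pipeline would likely compute many more features.

```python
def surface_features(token: str) -> dict:
    """Return a few simple surface-form features for one token (illustrative only)."""
    return {
        # True when the first character is an uppercase letter.
        "starts_capital": token[:1].isupper(),
        # True when the token ends with the suffix "ing", ignoring case.
        "ends_ing": token.lower().endswith("ing"),
        # True when the token is non-empty and every character is a letter.
        "all_alpha": token.isalpha(),
    }


if __name__ == "__main__":
    for tok in ["Running", "joinedwords", "teh"]:
        print(tok, surface_features(tok))
```

Because these checks work on the raw characters, they still apply to misspelled or run-together words that would be missing from a fixed vocabulary.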


#3

Please stop with the FUD; we all know that you prefer a different string design, but your endless refrain that disagreeing with you equates with a second-class design is tiresome. Swift’s String type reportedly uses a single variable-width encoding internally (UTF-16 last I checked, hence relatively inefficient for mostly-ASCII data), hence has an analogue of Julia’s nextind to increment string indices, while its Character type is actually a grapheme cluster (hence variable-width and relatively slow) analogous to Julia’s graphemes iterator over substrings. It also exposes iterators over code points and code units, but as far as I can tell these don’t support random access (only forwards and backwards iteration). Swift’s strings are mutable with copy-on-write semantics, whereas Julia uses IOBuffer for string building. Tastes differ, but it is not obviously superior to Julia’s approach, nor does it have major functionality that Julia lacks.


#4

I believe that I’m entitled to my own opinion in the matter - it’s not FUD at all.

Since I started using Julia, I found the string support to be buggy, slow, not well tested, limited in functionality, and hard to use.
I can easily show you many examples of all of those problems.
I have been and still am trying to address those issues.


#5

I’ve always pointed out concrete issues with the string design in Julia. It’s not a matter of disagreeing or not with me; it’s about objective facts. The number of bugs that I fixed in the string handling code in the past, and the bugs that have been around for years and are still being found (like not handling the last character of a string correctly in a search if it wasn’t ASCII), as well as the performance issues with strings in Julia, back in v0.3 before I learned about Julia, and now in master, can be easily shown with benchmarking.
The lack of validated string types goes against the strong recommendations of the Unicode organization, W3C, IETF, and other bodies, because of many known security issues.

Have you actually looked at the documentation and source code for Swift string support?

They keep track (like I do in Strs.jl) of properties of strings, like whether it is just ASCII, etc.
Given that almost all text in the world can be represented using just the 16-bit BMP of Unicode, the Swift code has fast paths that optimize that case (so no slow “nextind”-like issues).
You can perform all sorts of indexing operations on utf8, utf16, unicodeScalar and other views of strings, random access is possible, not just iteration. Swift uses a String.Index type (a similar idea was discussed for Julia at one point, I believe), and you can even compare indices from different string types to see if they represent the same position in the string.

This works very intuitively, and performs very well.
IOBuffers can’t handle alternate string types, everything is geared towards data being forced to use UTF-8 encoding, and using them to build strings is rather clumsy, and people tend to write rather inefficient code instead, doing things like str += "ing", which is efficient in both Swift and Python.
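To make the two construction styles concrete, here is a small Python sketch. It uses Python's `io.StringIO` only as a loose analogy to an IOBuffer; it is not Julia's API, and the helper names `build_with_buffer` and `build_with_concat` are hypothetical.

```python
import io


def build_with_buffer(parts):
    """Buffer style: write each piece into an in-memory buffer, then extract the string."""
    buf = io.StringIO()
    for piece in parts:
        buf.write(piece)
    return buf.getvalue()


def build_with_concat(parts):
    """Concatenation style: grow the string with += in a loop."""
    out = ""
    for piece in parts:
        out += piece
    return out


if __name__ == "__main__":
    pieces = ["walk", "ing"]
    print(build_with_buffer(pieces), build_with_concat(pieces))
```

Both functions produce the same string; the disagreement in this thread is about which style a language should make convenient and fast, not about the result.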


#6

Mainly, it seems to boil down to the fact that you don’t like the tradeoffs involved in using the UTF-8 encoding for the base String type. But for you, these aren’t tradeoffs, it’s an “objective fact” that Julia’s strings are second-class, a “fact” that you insert into the discussion at every opportunity, forcing other developers to either defend the same tradeoffs over and over (and over…) or simply let your aspersions stand uncontested.

random access is possible, not just iteration. Swift uses a String.Index type

I saw the String.Index type, but the Swift documentation specifically says that String’s index(_:offsetBy:) method is O(n), which I wouldn’t count as “random access”: https://developer.apple.com/documentation/swift/string/1786175-index … This is not surprising for variable-width encodings, of course. You have consecutive iteration, but can save the iteration state (an index of some type) for later access. (If you call that random access, then a linked list is random access too.)
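The O(n) cost can be seen in a minimal Python sketch that operates on raw UTF-8 bytes. This is not how Swift or Julia implement indexing; the function name `byte_offset_of` is hypothetical, and the input is assumed to be valid UTF-8.

```python
def byte_offset_of(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th code point (0-based) in valid UTF-8 data.

    Every byte before the target has to be inspected, so the cost grows with n.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes have the form 0b10xxxxxx; any other byte starts a code point.
        if (byte & 0xC0) != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


if __name__ == "__main__":
    data = "a\u00e9\u20ac\U0001F600z".encode("utf-8")  # 1-, 2-, 3-, 4- and 1-byte code points
    print([byte_offset_of(data, i) for i in range(5)])
```

Once an offset has been found it can be stored and reused in constant time, which is the "saved iteration state" described above.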

IOBuffers can’t handle alternate string types

Yes they can, as long as the string’s data can be represented by an array of bytes (or be copied/reinterpreted therefrom). But you’ve been agitating for mutable strings with concatenation-based construction for 3+ years now, and I don’t want to re-hash this yet another time.


#7

No - UTF-8, properly implemented, does have its place. I’ve never said otherwise.
I’ve also never described Julia’s strings as “second-class”, I’ve just brought up the issues, as I’ve found them.

It is an objective fact that for the languages used by the majority of the world’s population, UTF-8 encoding takes up more space than the older 8-bit character sets (for Cyrillic and Eastern European languages, typically twice as much space; for many others, such as languages in India, 3x as much), and takes around 50% more space for Chinese, Japanese, and Korean text.
It is also an objective fact that processing UTF-8 is slower than processing UTF-16, and especially slower than UCS-2 (i.e. only BMP characters).
It is another objective fact that you can process already validated UTF-8 much faster than trying to handle unvalidated UTF-8, esp. if you are trying to handle it in a consistent fashion, which came up recently.
It is also an objective fact that Julia’s String does not follow the recommendations of the experts on handling invalid strings (at least since #24999).
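The size figures for Cyrillic and Japanese text can be checked with a short Python sketch. The sample strings and the legacy codecs (`cp1251`, `shift_jis`) are choices made for this example only, and the helper name `encoded_sizes` is hypothetical. The Indian-language figure is not checked here.

```python
def encoded_sizes(text: str, legacy_codec: str) -> dict:
    """Return the encoded length in bytes of `text` under three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        legacy_codec: len(text.encode(legacy_codec)),
        # "utf-16-le" avoids counting the 2-byte byte order mark.
        "utf-16": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    # Six Cyrillic letters: one byte each in cp1251, two bytes each in UTF-8.
    print(encoded_sizes("\u041f\u0440\u0438\u0432\u0435\u0442", "cp1251"))
    # Five hiragana characters: two bytes each in Shift_JIS, three bytes each in UTF-8.
    print(encoded_sizes("\u3053\u3093\u306b\u3061\u306f", "shift_jis"))
```

These are two hand-picked strings, so they only illustrate the ratios; real text mixes in ASCII spaces, digits, and markup, which shifts the numbers.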

That’s true, if you are dealing with things like their String type (i.e. operating on graphemes), or a UTF-8 view, or a string with non-BMP characters.

From what I saw diving into the source code, index operations on pure ASCII strings are O(1), and I think it’s also true for pure BMP strings.
In case you don’t believe me, you can go take a look for yourself at the Swift source code; as with Julia, much of it is actually written in Swift.

Here is a link to methods of speeding up string handling in Swift: Strings, Characters and Performance in Swift, A Deep Dive

The only options are writing out UTF-8 sequences via print, or managing all of the conversions yourself to output a consistent encoding to an IOBuffer. An IOBuffer can handle other encodings about as well as a Vector or a C unsigned char foo[100] can; to me that doesn’t count, since you have to write your own wrappers to make sure that anything written to the IOBuffer is output in a consistent encoding.
(In Strs.jl, I made a version of sprint that adds a type to the context, to allow handling this better)

Agitating? I’ve hardly said anything about it, far from agitating, and I’ve never said that mutable strings should somehow replace immutable strings, rather that having them available in addition to immutable strings is useful, and is a lot easier for people to use than IOBuffers.


#8

I am not a core Julia developer, but I have spent some time understanding the string internals of Julia (and I know a bit about how other languages handle strings). Since the design issue keeps coming up, here is what I think:

  • I guess (not a core dev though) that for Julia 0.7/1.0 the design of string infrastructure is fixed - we need to focus on bugfixing/documentation here;
  • we have AbstractChar and AbstractString so anyone can design a fully functional string package; and personally I support such initiatives - let us have well tested and efficient options so that people can choose what suits them best (this is a beauty of Julia that you are not tied to a functionality shipped in Base);
  • in the long run (Julia 2.0) the design of strings in base probably could be reconsidered if there are several fully functional and tested alternatives so that the community can make an informed choice.

#9

This can be had in Julia 1.x and even the following as efficient also for any UTF-8 (even illegal) strings:

I know you like to have a tag for ASCII, UCS-2 etc. but I believe just a bit indicating “your string is for sure ASCII” (or byte index of first non-ASCII) would be enough for most people; and byte index to end-of-string.
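A minimal Python sketch of that idea follows, assuming a wrapper that records the byte index of the first non-ASCII byte when the string is created. The class `TaggedUtf8` and its members are hypothetical and do not correspond to any existing Julia or Swift type.

```python
class TaggedUtf8:
    """UTF-8 bytes plus the byte index of the first non-ASCII byte (illustrative only)."""

    def __init__(self, text: str):
        self.data = text.encode("utf-8")
        # Index of the first byte >= 0x80, or len(data) when the string is pure ASCII.
        self.first_non_ascii = next(
            (i for i, b in enumerate(self.data) if b >= 0x80), len(self.data)
        )

    @property
    def is_ascii(self) -> bool:
        return self.first_non_ascii == len(self.data)

    def char_at(self, n: int) -> str:
        """Return the n-th code point (0-based)."""
        if n < 0:
            raise IndexError("negative index")
        if n < self.first_non_ascii:
            # Fast path: inside the ASCII prefix, code point index equals byte index.
            return chr(self.data[n])
        # Slow path: decode only the part that starts at the first non-ASCII byte.
        tail = self.data[self.first_non_ascii:].decode("utf-8")
        return tail[n - self.first_non_ascii]


if __name__ == "__main__":
    s = TaggedUtf8("na\u00efve")
    print(s.is_ascii, s.first_non_ascii, s.char_at(1), s.char_at(4))
```

The extra field costs one scan at construction time; whether that trade is worthwhile for a base string type is exactly the kind of question the thread is debating.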


#10

@Palli Your thinking is reasonable, but I feel that the crucial thing is how we get there. In my opinion it is impossible to reach such a conclusion on Discourse. The only way is for someone to make a package implementing what they think is good and let people test it. The string ecosystem is a huge beast with many dark corners. In my opinion the only way to understand all the consequences of a design is to run it live. Otherwise we end up with hundreds of posts on Discourse on some topic and a risk of little fruit from such a discussion.