Reference String Based on Instance?


#1

I was looking over the TextParse package and I was surprised to find custom code in this file https://github.com/JuliaComputing/TextParse.jl/blob/master/src/util.jl and function nonallocating_setindex! that keeps a list of strings around and indexes to them when a string is allocated.

This made me wonder, since this PR means that strings should “technically” be immutable https://github.com/JuliaLang/julia/issues/22193, does the base string type automatically detect if there is an existing reference to a string and if there is just return that reference? If not isn’t this something that should be added? It seems pretty common among garbage collected languages with immutable strings.


#2

Immutability means that I could be done as an optimization. It is not currently done, however. How detect pre-existence of another reference efficiently?


#3

I think the naive solution would just a dictionary or set implementation. A faster and probably more memory efficient structure would be a Trie. My specialty is not in memory allocators and garbage collection so I can’t say for sure.

Also, as where string interning is efficient memory wise there are typically trade-offs, one being the expense of the interning operation. https://en.wikipedia.org/wiki/String_interning suggests that Julia already interns symbols. Perhaps that system could be built on top of?

In all cases I think having the functionality available but perhaps not used by default would help greatly and would remove the necessity for every package that is string heavy to implement its own, usually not inter-operable, string interning mechanism.


#4

At what point would you do the dictionary lookup? Any time a string object is created? That would be quite expensive. Interned objects like symbols are never garbage collected so interning all strings unconditionally would mean that we never free the storage for any string object which would be fairly impractical.

There is a package that provides an opt-in interned string type:

One possibility would be to automatically intern all sufficiently short strings. However, since a pointer to such an interned string is already 16 bytes on most systems, that solution is strictly dominated by storing short (< 16-byte) strings inline. I mentioned that previously which led @xiaodai to prototype the idea here:


#5

I will look into adopting these packages. It would be nice if the community would decide to back one of the solutions at some point before 1.0 to stop fragmentation of implementations.


#6

A trie would be slower, (you have to re-stitch all the characters into a string, everytime you operate with it, unless you fully re-implement all string operations, then maybe only C interop and a few other things would be slow).
But probably more memory efficient. (at least once you have a lot of strings interned)
Particularly, if we had compresses tries.