This has been bothering me since I first started using Julia three years ago.
Julia has immutable strings that are not interned.
Late at night about a week ago, I worked out how to solve it.
It's not actually that hard.
This package solves that, and it does so without breaking garbage collection.
The full explanation and motivational rant is in the readme.
If someone wants to check the math there and make a PR, I’d appreciate it.
My math says that one should expect to use an order of magnitude less memory when using InternedStrings on 10-million-token documents.
I was really pleased when I worked out that it can be done without screwing up garbage collection.
Basically, every interned string is a strong reference, but equal strings are all strong references to the same underlying string.
In some ways this is the opposite of @quinnj’s WeakRefStrings.jl.
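
To make the "strong references to the same string" idea concrete, here is a rough sketch of one way it could work. It is not the package's actual code: `POOL` and `intern_sketch` are hypothetical names made up for illustration, and it assumes Base's `WeakKeyDict` accepts `String` keys. The pool holds its keys only weakly, so a string stays alive exactly as long as some user code still holds a (strong) reference to it.

```julia
# Hypothetical pool of canonical strings. Keys are held weakly, so the pool
# alone never keeps a string alive and the garbage collector can reclaim it.
const POOL = WeakKeyDict{String,Nothing}()

# Hypothetical helper: return the canonical copy of `s`, registering it if new.
# Not thread-safe as written; a real implementation would need locking.
function intern_sketch(s::AbstractString)
    str = String(s)
    existing = getkey(POOL, str, nothing)  # look up an equal string already pooled
    existing !== nothing && return existing
    POOL[str] = nothing                    # first time seen: this copy becomes canonical
    return str
end

a = intern_sketch(string("to", "ken"))  # freshly allocated "token"
b = intern_sketch(string("to", "ken"))  # a second, separately allocated "token"

# `===` on Strings compares contents, so compare addresses instead
# to check whether both names share one object.
pointer(a) == pointer(b)
```

Both `a` and `b` are ordinary strong references, so nothing special is needed at the call site; once every such reference is dropped, the weak key in the pool no longer prevents collection.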