[ANN] InternedStrings.jl: Allocate strings once and reuse them


#1

This has been bothering me since I first started using Julia 3 years ago.
Julia has immutable strings that are not interned.
Late at night about a week ago I worked out how to solve it.
It's not actually that hard.

This package solves that, and it does so without breaking garbage collection.
The full explanation and motivational rant is in the readme.

If someone wants to check the math there and make a PR, I'd appreciate it.
My math says that one should expect to end up using an order of magnitude less memory when using InternedStrings on 10-million-token documents.

I was really pleased when I worked out that it can be done without screwing up garbage collection.
Basically every string is a strong reference, but they are all strong references to the same string.
In some ways this is the opposite of @quinnj's WeakRefStrings.jl
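The core idea can be sketched in a few lines of plain Julia (a toy illustration only, not the package's actual code; the names `POOL` and `my_intern` are made up here):

```julia
# Toy interning pool: map each distinct string's contents to one canonical instance.
# (The real package uses a WeakKeyDict instead of a Dict, so that pooled entries
# can still be garbage-collected once no strong references to them remain.)
const POOL = Dict{String,String}()

my_intern(s::String) = get!(POOL, s, s)  # return the canonical copy, inserting s if new

a = my_intern("hello")
b = my_intern(join(["hel", "lo"]))  # a distinct String object with equal contents
a === b  # true: both names are strong references to the same object
```

Because every caller holds an ordinary strong reference to the shared instance, nothing about GC semantics changes; only the duplicate allocations go away.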


#2

Very cool!


#3

Very cool, but converting a large vector of Strings to InternedString is quite slow, and given how simple the code is, there doesn't seem to be a good way to speed it up. See the example below:

using InternedStrings

# Benchmark: 250M strings drawn from 2.5M distinct zero-padded IDs.
const n = 250_000_000; const grps = n ÷ 100; const strlen = 10; const prefix = "id"
string_vec = rand([prefix * string(k, pad = strlen) for k in 1:grps], n);
to_istring_vec(string_vec) = InternedString.(string_vec);
@time istring_vec = to_istring_vec(string_vec);

#4

Interning strings isn’t a free operation, that is indeed true.
It requires a dictionary lookup, and I don’t think there is any way of dodging that.

With that said:
I’ve not put a huge amount of thought into optimizing it for speed.
The core code that matters is here.
I can't see any obvious ways to make it faster.
There might be some fine-tuning around making it not thread-safe… but all normal operations on WeakKeyDicts in Base are defined with this same thread-safe pattern.

It is hard to compete with a no-op for speed, that is for certain.

I would assume that the cost of calling InternedString, as a portion of any application's running time, is not the critical factor (though of course one needs to profile the application to see).
I would assume it is a fraction of the time it takes even a fairly simple regex tokenizer to run.
I should benchmark that though.


#5

If you feel like seeing if this can be done faster, using Strs.jl as a base for the interned strings, and want to collaborate on it, this would be very nice to have available. (I’m a big fan of interned strings for some things :wink: )


#6

A few things:

- A different hash structure can help a lot; maybe not using a 64-bit hash (Base has a fast CRC-32 now).
- First check whether the string is present before locking. If it isn't present, then after locking you can quickly check whether any entries have been added since your first check; if so, you just need to re-check before adding.

Note: depending on the processor, you might need to perform a memory-barrier operation, to make sure that you read an up-to-date "version" number before you check the first time.
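As a rough sketch, double-checked locking for an intern table might look like this in Julia (all names here are illustrative, and this ignores the version-number/memory-barrier subtlety mentioned above):

```julia
# Double-checked locking sketch: only take the lock on the (rarer) miss path.
const INTERN_LOCK = ReentrantLock()
const INTERN_TABLE = Dict{String,WeakRef}()

function intern_dcl(s::String)
    # 1. Optimistic check without the lock: the common hit case stays lock-free.
    wr = get(INTERN_TABLE, s, nothing)
    if wr !== nothing && wr.value isa String
        return wr.value::String
    end
    # 2. Miss: take the lock, then re-check, because another task may have
    #    inserted s between our first check and acquiring the lock.
    lock(INTERN_LOCK) do
        wr = get(INTERN_TABLE, s, nothing)
        if wr !== nothing && wr.value isa String
            return wr.value::String
        end
        INTERN_TABLE[s] = WeakRef(s)  # weak, so GC can still reclaim it
        return s
    end
end
```

The point is that the lock is only contended when a genuinely new string arrives; repeated lookups of already-interned strings never block each other.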


#7

If you feel like seeing if this can be done faster, using Strs.jl as a base for the interned strings, and want to collaborate on it, this would be very nice to have available. (I’m a big fan of interned strings for some things :wink: )

I think the way to go in that direction would be to make a separate, more generic package:

Intern.jl, which would export a type Intern{T} and could be applied to any (semantically) immutable type,
e.g. Str, String or BigFloat.

Then, when that is done, InternedStrings.jl could be rebuilt as a wrapper for it.
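A hypothetical sketch of what such a generic package's core could look like (every name here, including `Intern`, `POOLS`, and `intern_value`, is made up; a real version would want weak references, like the string package uses):

```julia
# Hypothetical generic interning: one pool per concrete type, canonicalising
# any (semantically) immutable value.  Strong references only, for simplicity.
struct Intern{T}
    value::T
end

const POOLS = IdDict{DataType,Dict}()

function intern_value(x::T) where {T}
    pool = get!(POOLS, T, Dict{T,T}())::Dict{T,T}
    Intern{T}(get!(pool, x, x))  # wrap the canonical instance of x
end

a = intern_value(big"1.5")
b = intern_value(big"1.0" + big"0.5")  # equal BigFloat, separately allocated
a.value === b.value  # true: the same instance is reused
```

This works for any type whose `==`/`hash` agree with semantic equality, which is exactly the "(semantically) immutable" requirement above.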

A few things, a different hash structure can help a lot, maybe not using a 64-bit hash (base has a fast CRC-32 now),

Yeah, I was thinking that: changing the hash function.
Maybe even using a tree-based dict (e.g. DataStructures.jl's SortedDict).
(In the other direction: for ultimate memory savings, at the cost of time, everything could use a Trie or a finite-automaton graph.)

and first checking for whether the string is present before locking (if it isn’t present, after locking you can quickly check to see if any entries have been added since you did the first check, if so, you just need to recheck before adding).

If this kind of double-checked locking works, then it should be added to all the other Base.WeakKeyDict methods, which would also speed up serialisation.


#8

Of course it works :wink:
It's a pretty common technique; I was surprised that the code in Base doesn't do that.
(Of course, you need to think ahead about using data structures that require the minimum of work done while locked. If you really do it well, you can use structures such that no real locking is needed at all: you use compare-and-swap or load-locked/store-conditional instructions to insert the new entry into the table.) On more modern platforms, there's the possibility of using transactional memory.


#9

The new version of InternedStrings.jl has had a pretty major overhaul.
There is no more InternedString type;
see https://github.com/oxinabox/InternedStrings.jl/pull/9

and