Efficient merging of large dictionaries

Hi, I have few (tens, max. hundreds) Dict{String, Int} dictionaries which I want to merge together using merge(+, dicts…).
The dictionaries are fairly large (up to 100k keys), and have lots of common keys.
It’s working as expected, but it’s very slow. Most of the time is spent in the actual merging in ht_keyindex and ht_keyindex2!. Is there anything I could do to make it faster, or did I hit the limit? I see that the merge itself is simple https://github.com/JuliaLang/julia/blob/v1.4.2/base/abstractdict.jl#L216, so I’m wondering if there is anything that could be done for this usecase

Not sure if it would be faster per se but you could try extracting all the key/value pairs from the various Dict objects into a Vector of Tuple{String, Int}, sort them, then create the Dict by iterating over lists at the same time. It will take a lot of memory since you will need all the Dict objects in memory at once. You could also use threading to perform the conversion/sort of the Vectors in parallel.

If it’s faster or not probably depends on the number of cores you have.

1 Like

I would consider something like

function merge_all(combine, dicts)
    d1 = first(dicts)
    result = Dict{keytype(d1),valtype(d1)}()
    for d in dicts
        merge!(combine, result, d)
    end
    result
end

(untested)

I think this is basically the same thing as official implementation, right?

But I just discovered https://github.com/andyferris/Dictionaries.jl so I’ll give them a try.

Yes, for merge!.