I have compared two implementations.
The first one kept the intermediate solutions in a sorted list of pairs, and I had written a merge with complexity O(M+N), where M and N are the lengths of the two merged lists. The processing was distributed using Transducers.jl as
```julia
t = @elapsed c = foldxd(merge_sorted, Map(s -> dictfile(s, reformat_json)), chunks) |> Dictionary
```
At the end, I converted the list to a Dictionary from Dictionaries.jl.
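For context, such an O(M+N) merge of two sorted pair vectors can be sketched as follows (this is an illustrative version, not the exact `merge_sorted` from my code; it assumes both inputs are already sorted by key):

```julia
# Hypothetical sketch of an O(M+N) merge of two vectors of key => value
# pairs, each already sorted by key. Names and signature are assumptions.
function merge_sorted(a::Vector{Pair{K,V}}, b::Vector{Pair{K,V}}) where {K,V}
    out = Vector{Pair{K,V}}()
    sizehint!(out, length(a) + length(b))
    i, j = 1, 1
    # walk both vectors once, always taking the smaller key first
    while i <= length(a) && j <= length(b)
        if first(a[i]) <= first(b[j])
            push!(out, a[i]); i += 1
        else
            push!(out, b[j]); j += 1
        end
    end
    # append the remaining tail of whichever input is not exhausted
    append!(out, @view a[i:end])
    append!(out, @view b[j:end])
    out
end
```

For example, `merge_sorted(["a" => 1, "c" => 3], ["b" => 2])` yields the pairs for `"a"`, `"b"`, `"c"` in key order. Since each element is visited exactly once, the cost is linear in the combined length.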
Processing 3247 chunks led to 2_043_723_639 pairs (the problem is large), and the above finished in 1200 seconds on a machine with 28 cores / 56 threads.
In the second approach, after processing each chunk I stored it in a SortedMultiDict and processed it in the same way as
```julia
@elapsed d = foldxd(Map(s -> dictfile(s, reformat_json)), chunks) do a, b
    t = @elapsed DataStructures.merge!(a, b)
    @info "merging took $(t)s"
    a
end
```
This approach did not finish, as I ran out of memory (the compute node had about 380 GB of RAM).
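For reference, the `DataStructures.merge!` call used above can be tried in isolation on a toy example (keys and values here are made up; duplicate keys are kept, which is what makes SortedMultiDict different from a plain SortedDict):

```julia
using DataStructures

# Two small SortedMultiDicts with an overlapping key "x".
a = SortedMultiDict("x" => 1, "y" => 2)
b = SortedMultiDict("x" => 3)

# merge! inserts all entries of b into a in place; duplicate keys
# are retained, so a ends up with three entries in total.
DataStructures.merge!(a, b)
```

After the merge, `a` holds both values for `"x"` plus the one for `"y"`, so `length(a) == 3`.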
Overall, I am quite happy with how well it went. I do not fully understand the trees used in SortedMultiDict, but I can see room for improvement there.
Tomas