I have compared two implementations.
The first one kept the intermediate solutions in a sorted list of pairs, and I wrote a merge with complexity O(M+N), where M and N are the lengths of the two merged lists. The processing was distributed using Transducers.jl as

```julia
t = @elapsed c = foldxd(merge_sorted, Map(s -> dictfile(s, reformat_json)), chunks) |> Dictionary
```

At the end, the list is converted to a `Dictionary` from Dictionaries.jl.
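The actual `merge_sorted` is not shown in the post; a minimal sketch of such an O(M+N) merge of two sorted vectors of pairs could look like this (the signature and names here are assumptions):

```julia
# Illustrative O(M+N) merge of two key-sorted vectors of pairs.
# Duplicate keys from both inputs are kept, mirroring a multiset merge.
function merge_sorted(a::Vector{Pair{K,V}}, b::Vector{Pair{K,V}}) where {K,V}
    out = similar(a, 0)
    sizehint!(out, length(a) + length(b))   # avoid repeated reallocation
    i, j = 1, 1
    while i <= length(a) && j <= length(b)
        # take the element with the smaller key first
        if first(a[i]) <= first(b[j])
            push!(out, a[i]); i += 1
        else
            push!(out, b[j]); j += 1
        end
    end
    # append whichever tail remains; each element is visited once overall
    append!(out, @view a[i:end])
    append!(out, @view b[j:end])
    return out
end

merge_sorted([1 => "a", 3 => "c"], [2 => "b", 4 => "d"])
# result is sorted by key: 1=>"a", 2=>"b", 3=>"c", 4=>"d"
```

Since each element of either input is pushed exactly once, the cost is linear in M+N, which is what makes this a good reducing function for `foldxd`.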
Processing the 3247 chunks produced 2_043_723_639 pairs (the problem is large), and the above finished in 1200 seconds on a machine with 28 cores / 56 threads.
In the second approach, after processing each chunk I stored the result in a `SortedMultiDict` from DataStructures.jl and merged in the same way:

```julia
@elapsed d = foldxd(Map(s -> dictfile(s, reformat_json)), chunks) do a, b
    t = @elapsed DataStructures.merge!(a, b)
    @info "merging took $(t)s"
    a
end
```
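For context, a tiny self-contained example of what `DataStructures.merge!` does on a `SortedMultiDict` (the two small multidicts stand in for two processed chunks; this relies only on the documented `SortedMultiDict` API):

```julia
using DataStructures

# two small sorted multidicts standing in for two processed chunks
a = SortedMultiDict(1 => "a", 3 => "c")
b = SortedMultiDict(2 => "b", 3 => "x")

merge!(a, b)              # inserts every pair of b into a; keys stay sorted
pairs_after = collect(a)  # key 3 now appears twice: duplicates are kept
```

Because a multidict never overwrites on key collision, the merged container only ever grows, so the accumulated result holds every pair from every chunk.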
This approach did not finish, as I ran out of memory (the compute node had about 380 GB of RAM).
Overall, I am quite happy with how well it went. I do not fully understand the trees used in `SortedMultiDict`, but I can see room for improvement there.
Tomas