Indexing a document using Dictionary

I have compared two implementations.

The first one kept the intermediate results in a sorted list of pairs, and I wrote a merge with complexity O(M+N), where M and N are the lengths of the two lists being merged. The processing was distributed using Transducers as

t = @elapsed c = foldxd(merge_sorted, Map(s -> dictfile(s, reformat_json)), chunks) |> Dictionary

At the end, I convert the list to a Dictionary from Dictionaries.jl.
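
For readers who want to see the shape of such a linear-time merge, here is a minimal sketch in plain Julia. It is not the implementation used in the run above: the body of `merge_sorted`, the toy `chunks`, and the choice to keep both entries when keys are equal are all assumptions made for illustration. Base `reduce` stands in for the distributed `foldxd`.

```julia
# Hypothetical stand-in for the `merge_sorted` mentioned above; the original
# implementation is not shown in this post.
# Merges two vectors of pairs, each already sorted by key, in O(M+N).
function merge_sorted(a::Vector{Pair{K,V}}, b::Vector{Pair{K,V}}) where {K,V}
    out = Vector{Pair{K,V}}()
    sizehint!(out, length(a) + length(b))
    i, j = 1, 1
    while i <= length(a) && j <= length(b)
        # `<=` keeps the merge stable: on equal keys, entries from `a` come first.
        if first(a[i]) <= first(b[j])
            push!(out, a[i]); i += 1
        else
            push!(out, b[j]); j += 1
        end
    end
    # At most one input still has elements left; copy its tail.
    append!(out, @view a[i:end])
    append!(out, @view b[j:end])
    return out
end

# Toy data (assumed): each chunk is a sorted vector of word => document id pairs.
chunks = [
    ["apple" => 1, "pear" => 1],
    ["apple" => 2, "fig" => 2],
    ["fig" => 3, "zebra" => 3],
]

# Serial reduction with Base `reduce`, standing in for the distributed `foldxd`.
merged = reduce(merge_sorted, chunks)
```

Because the merge is associative, the chunks can be combined in any grouping, which is what makes a parallel fold applicable here.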

Processing 3247 led to 2_043_723_639 pairs (the problem is large), and the above finished in 1200 seconds on a machine with 28 cores / 56 threads.

In the second approach, after processing each chunk I store the result in a SortedMultiDict and reduce in the same way as

@elapsed d = foldxd(Map(s -> dictfile(s, reformat_json)), chunks) do a, b
        t = @elapsed DataStructures.merge!(a,b)
        @info "merging took $(t)s"
        a
    end

This approach did not finish, as I ran out of memory (the compute node had about 380 GB of memory).

Overall, I am quite happy with how well it went. I do not understand the trees used in SortedMultiDict well, but I can see room for improvement there.

Tomas