I have compared two implementations.
The first one kept the intermediate solutions in a sorted list of pairs, and I had written a merge with complexity O(M+N), where M and N are the lengths of the two merged lists. The processing was distributed using Transducers.jl as
```julia
t = @elapsed c = foldxd(merge_sorted, Map(s -> dictfile(s, reformat_json)), chunks) |> Dictionary
```
At the end, I converted the list to a Dictionary from Dictionaries.jl.
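For context, such an O(M+N) merge of two sorted pair vectors can be sketched as follows (this is an illustrative version, not the exact `merge_sorted` from my code; it assumes both inputs are already sorted by key):

```julia
# Hypothetical sketch of an O(M+N) merge of two vectors of key => value
# pairs, each already sorted by key. Names and signature are assumptions.
function merge_sorted(a::Vector{Pair{K,V}}, b::Vector{Pair{K,V}}) where {K,V}
    out = Vector{Pair{K,V}}()
    sizehint!(out, length(a) + length(b))
    i, j = 1, 1
    # walk both vectors once, always taking the smaller key first
    while i <= length(a) && j <= length(b)
        if first(a[i]) <= first(b[j])
            push!(out, a[i]); i += 1
        else
            push!(out, b[j]); j += 1
        end
    end
    # append the remaining tail of whichever input is not exhausted
    append!(out, @view a[i:end])
    append!(out, @view b[j:end])
    out
end
```

For example, `merge_sorted(["a" => 1, "c" => 3], ["b" => 2])` yields the pairs for `"a"`, `"b"`, `"c"` in key order. Since each element is visited exactly once, the cost is linear in the combined length.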
Processing 3247 chunks led to 2_043_723_639 pairs (the problem is large), and the above finished in 1200 seconds on a machine with 28 cores / 56 threads.
In the second approach, after processing each chunk I stored it in a SortedMultiDict and processed it in the same way as
```julia
@elapsed d = foldxd(Map(s -> dictfile(s, reformat_json)), chunks) do a, b
    t = @elapsed DataStructures.merge!(a, b)
    @info "merging took $(t)s"
    a
end
```
This approach did not finish, as I ran out of memory (the compute node had about 380 GB of RAM).
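For reference, the `DataStructures.merge!` call used above can be tried in isolation on a toy example (keys and values here are made up; duplicate keys are kept, which is what makes SortedMultiDict different from a plain SortedDict):

```julia
using DataStructures

# Two small SortedMultiDicts with an overlapping key "x".
a = SortedMultiDict("x" => 1, "y" => 2)
b = SortedMultiDict("x" => 3)

# merge! inserts all entries of b into a in place; duplicate keys
# are retained, so a ends up with three entries in total.
DataStructures.merge!(a, b)
```

After the merge, `a` holds both values for `"x"` plus the one for `"y"`, so `length(a) == 3`.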
Overall, I am quite happy with how well it went. I do not fully understand the trees used in SortedMultiDict, but I can see room for improvement there.
Tomas