Indexing a document using Dictionary

Dear All,

I would like to ask whether there is a Dict-like structure that would be suitable for indexing a large collection of documents. Specifically, imagine that each document is composed of a set of words:

doc1 = [1,2,3,4,5]
doc2 = [3,4,5,6]
doc3 = [1,3,4,5]

I would like to store this in a dict containing:

1 => [doc1, doc3],
2 => [doc1],
3 => [doc1, doc2, doc3],
4 => [doc1, doc2, doc3],
5 => [doc1, doc2, doc3],
6 => [doc2],

But I need the solution to scale to tens of millions of documents, where words are mostly unique to a document, and to operate on streams.

At the moment, GitHub - andyferris/Dictionaries.jl: An alternative interface for dictionaries in Julia, for improved productivity and performance by @andyferris seems to me the best solution. Does anyone know of a better solution, or is anyone working on something like this?

Hmm - I’m not sure there is a function for nested indexing over streams like that.

You can always write the code directly, and it would work using either Dict or Dictionary. I’m not exactly sure about the specifics of your problem and what you can afford to keep in-core vs. out-of-core, but I’d start with something like:

```julia
index = Dict{Word, Set{DocId}}()
for (doc_id, doc) in stream
    for word in doc
        # if word is not yet in index, add a new empty set of doc ids,
        # then put doc_id into the new or existing set
        push!(get!(Set{DocId}, index, word), doc_id)
    end
end
```

(With Dictionaries.jl it’s very similar: set! instead of push!, Dictionary instead of Dict, Indices instead of Set, …)

The SortedMultiDict structure from JuliaCollections’ DataStructures.jl might help. SortedMultiDict is slower than Dict because the time per insertion grows logarithmically with n. On the other hand, it avoids the allocation overhead of the arrays-of-arrays in the Dict solution, because it encodes the data as 1=>doc1, 1=>doc3, 2=>doc1, 3=>doc1, 3=>doc2, etc.
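
Under that encoding, the toy documents from the first post could be loaded like this (a minimal sketch; `insert!` keeps duplicate keys, and iteration yields pairs sorted by key):

```julia
using DataStructures

# the example documents from the first post
docs = ["doc1" => [1, 2, 3, 4, 5],
        "doc2" => [3, 4, 5, 6],
        "doc3" => [1, 3, 4, 5]]

smd = SortedMultiDict{Int,String}()
for (doc_id, words) in docs
    for w in words
        insert!(smd, w, doc_id)  # duplicate keys are allowed
    end
end

# iterating smd yields key-sorted pairs: 1 => "doc1", 1 => "doc3", 2 => "doc1", ...
```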

This seems interesting. Your Dictionary was much faster than Dict. I have many cores, so I was thinking about preparing dictionaries in parallel and merging them.


This seems like an interesting option as well. Does SortedMultiDict support efficient merging of two dicts? That would be nice for a map-reduce type of approach.

As I hinted above, I would like to distribute the computation. Therefore I am thinking about converting each doc into two arrays (keys and values) sorted by key, as proposed in the SortedMultiDict suggestion, and using merging of sorted arrays to keep them sorted after merging. This way, the complexity of merging should be O(m+n), where m and n are the lengths of the corresponding arrays. This should also parallelize nicely (ideally using distributed reduction in Transducers). At the end, I will build the Dict in a single sweep.
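
Such an O(m+n) merge of two key-sorted pair vectors could look something like this (a sketch, assuming the intermediate results are plain Vectors of word => doc_id pairs sorted by key):

```julia
# Sketch: merge two vectors of (word => doc_id) pairs, each sorted by key,
# into one key-sorted vector in O(m + n).
function merge_sorted(a::AbstractVector, b::AbstractVector)
    out = Vector{eltype(a)}()
    sizehint!(out, length(a) + length(b))
    i, j = 1, 1
    while i <= length(a) && j <= length(b)
        if first(a[i]) <= first(b[j])   # stable: ties take from `a` first
            push!(out, a[i]); i += 1
        else
            push!(out, b[j]); j += 1
        end
    end
    append!(out, view(a, i:length(a)))  # at most one of these is non-empty
    append!(out, view(b, j:length(b)))
    return out
end
```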

Is the above a bad idea? Has anyone done something similar?

It seems like a reasonable approach to me.

Yes, merge and merge! operations are available for SortedMultiDict. The documentation is here: Sorted Containers · DataStructures.jl
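
For example (a small sketch using the two-array SortedMultiDict constructor; merge returns a new container, while merge! folds the second into the first):

```julia
using DataStructures

# keys and values given as parallel arrays
a = SortedMultiDict([1, 3], ["doc1", "doc1"])
b = SortedMultiDict([1, 2], ["doc3", "doc2"])

m = merge(a, b)  # holds 1=>"doc1", 1=>"doc3", 2=>"doc2", 3=>"doc1"
```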

I have checked the implementation, but it seems to merge by calling insert! sequentially, which is what I am trying to avoid (merge links to this one, which seems to me to be a sequential call of insert!).

Once I am properly behind a computer, I will try it.

Yes, this merge routine requires N*log N operations, where N is the total size of the two dictionaries. A more complicated implementation (looping over both dictionaries with two iterators concurrently to copy the data into new arrays, then building a tree structure above the sorted merged arrays) would require only linear time, but nobody has coded this. You can open an issue on the DataStructures GitHub page to request a more efficient implementation of merge.


I have compared two implementations.

The first one kept the intermediate solutions in a sorted list of pairs, and I had written a merge with complexity O(M+N), where M and N are the lengths of the two merged lists. The processing was distributed using Transducers as:

```julia
t = @elapsed c = foldxd(merge_sorted, Map(s -> dictfile(s, reformat_json)), chunks) |> Dictionary
```

At the end, I put the list into a Dictionary from Dictionaries.jl.
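
That final single sweep over the merged, key-sorted pairs could look roughly like this (a sketch; `build_index` is a hypothetical name, and Int words with String doc ids are assumptions):

```julia
# Sketch: group a key-sorted vector of (word => doc_id) pairs into
# parallel `words` / `postings` arrays in one pass.
function build_index(sorted_pairs)
    words = Int[]
    postings = Vector{Vector{String}}()
    for (w, d) in sorted_pairs
        if isempty(words) || words[end] != w  # a new word starts a new posting list
            push!(words, w)
            push!(postings, String[])
        end
        push!(postings[end], d)
    end
    return words, postings
end
```

The two parallel arrays can then be wrapped as `Dictionary(words, postings)` from Dictionaries.jl, which accepts indices and values directly.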

Processing 3247 chunks led to 2_043_723_639 pairs (the problem is large), and the above finished in 1200 seconds on a machine with 28 cores / 56 threads.

In the second approach, after processing each chunk I stored it in a SortedMultiDict and processed it in the same way:

```julia
@elapsed d = foldxd(Map(s -> dictfile(s, reformat_json)), chunks) do a, b
    t = @elapsed DataStructures.merge!(a, b)
    @info "merging took $(t)s"
    a  # return the merged accumulator
end
```

This approach did not finish, as I ran out of memory (the computational node had about 380 GB).

Overall, I am quite happy with how well it went. I do not understand the trees used in SortedMultiDict well, but I can see room for improvement there.