I would like to ask, if there is a Dict that would be suitable for indexing a large collection of documents. Specifically imagine that each document is composed by a set of words
Hmm - I’m not sure there is a function for nested indexing over streams like that.
You can always write the code directly, and it would work using either Dicts or Dictionarys. I’m not exactly sure about the specifics of your problem, and what you can afford to keep in-core vs out-of-core, but I’d start with something like:
index = Dict{Word, Set{DocId}}()
for (doc_id, doc) in stream
for word in doc
push!(
get!(Set{DocId}, index, word), # if word is not yet in index, add a new empty set of doc ids
doc_id # put doc_id into the new or existing set
)
end
end
(or with Dictionaries.jl its very similar, with set! instead of push!, Dictionary instead of Dict, Indices instead of Set…).
The SortedMultiDict structure from Julia Collections might help. SortedMultiDict is slower than Dict because the time per insertion grows logarithmically with n. On the other hand, it avoids the allocation overhead of arrays-of-arrays in the Dict solution because it encodes the data as 1=>doc1, 1=>doc3, 2=>doc2, 3=>doc1, 3=>doc2, etc.
This seems interesting. Your Dictionary was much faster then Dict. I am having many cores, therefore I was thinking about preparing dictionaries in parallel and merging them.
This seems an interesting option as well. Does SortedMultiDict support efficient merging of two dicts? That would be nice for map-reduce type of approach
As I have hinted above, I would like to distribute the computation. Therefore I am thinking about converting each doc into two arrays (keys-values) sorted according to key, as was proposed in SortedMultiDict, and use merging of sorted arrays to keep them sorted after merging. This way, the complexity of merging should be O(m+n), where m, n are length of corresponding arrays. This should also nicely parallelize (ideally using distributed reduction in Transducers). On the end, I will build the Dict in single sweep.
Is the above a bad idea? Anyone did something similar?
Yes, this merge routine requires N*log N operations, where N is the total size of the two dictionaries. A more complicated code (loop over both dictionaries with two iterators concurrently to copy the data to new arrays and then build a tree structure above the sorted merged arrays) would require only linear time, but nobody has coded this. You can open an issue in DataStructures github page to request a more efficient implementation of merge.
The first one kept the intermediate solutions in a sorted list of pairs and I had written a merge with the complexity O(M+N) where M and N are length of two merge lists. The processing was distributed using Transducers as
t = @elapsed c = foldxd(merge_sorted, Map(s -> dictfile(s, reformat_json)), chunks) |> Dictionary```
And on the end I put list to Dictionary from Dictionaries.jl.
Processing 3247 lead to 2_043_723_639 pairs (the problem is large) and the above has finished in 1200 seconds on a machine with 28 cores / 56 threads.
In the second approach, after processing one chunk I have store it in sortedmultidict, and process it in the same way as
@elapsed d = foldxd(Map(s -> dictfile(s, reformat_json)), chunks) do a, b
t = @elapsed DataStructures.merge!(a,b)
@info "merging took $(t)s"
a
end
This approach did not finished, as I ran out of the memory (the computational node had about 380Gb of mem).
Overall, I am quite happy how well it went. I do not understand well trees used in sortedmultidict, but I can see a space for an improvement there.