Indexing a document using Dictionary

Dear All,

I would like to ask whether there is a Dict-like structure that would be suitable for indexing a large collection of documents. Specifically, imagine that each document is composed of a set of words:

doc1 = [1,2,3,4,5]
doc2 = [3,4,5,6]
doc3 = [1,3,4,5]

I would like to store this in a dict containing:

1 => [doc1, doc3],
2 => [doc1],
3 => [doc1, doc2, doc3],
4 => [doc1, doc2, doc3],
5 => [doc1, doc2, doc3],
6 => [doc2],

But I need the solution to scale to tens of millions of documents, where words are mostly unique to a document, and to operate on streams.

At the moment, GitHub - andyferris/Dictionaries.jl: An alternative interface for dictionaries in Julia, for improved productivity and performance by @andyferris seems to me the best solution. Does anyone know of a better solution, or is anyone working on something like this?

Hmm - I’m not sure there is a function for nested indexing over streams like that.

You can always write the code directly, and it would work using either Dict or Dictionary. I’m not exactly sure about the specifics of your problem and what you can afford to keep in-core vs. out-of-core, but I’d start with something like:

```julia
index = Dict{Word, Set{DocId}}()
for (doc_id, doc) in stream
    for word in doc
        # if word is not yet in index, add a new empty set of doc ids,
        # then put doc_id into the new or existing set
        push!(get!(Set{DocId}, index, word), doc_id)
    end
end
```

(With Dictionaries.jl it’s very similar: set! instead of push!, Dictionary instead of Dict, Indices instead of Set, …)

The SortedMultiDict structure from JuliaCollections’ DataStructures.jl might help. SortedMultiDict is slower than Dict because the time per insertion grows logarithmically with n. On the other hand, it avoids the allocation overhead of the arrays-of-arrays in the Dict solution, because it encodes the data as 1=>doc1, 1=>doc3, 2=>doc1, 3=>doc1, 3=>doc2, etc.
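
Under that encoding, the toy documents from the first post could be loaded like this (a minimal sketch; `insert!` keeps duplicate keys, and iteration yields pairs sorted by key):

```julia
using DataStructures

# the example documents from the first post
docs = ["doc1" => [1, 2, 3, 4, 5],
        "doc2" => [3, 4, 5, 6],
        "doc3" => [1, 3, 4, 5]]

smd = SortedMultiDict{Int,String}()
for (doc_id, words) in docs
    for w in words
        insert!(smd, w, doc_id)  # duplicate keys are allowed
    end
end

# iterating smd yields key-sorted pairs: 1 => "doc1", 1 => "doc3", 2 => "doc1", ...
```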

This seems interesting. Your Dictionary was much faster than Dict. I have many cores, so I was thinking about preparing dictionaries in parallel and merging them.


This seems like an interesting option as well. Does SortedMultiDict support efficient merging of two dicts? That would be nice for a map-reduce type of approach.

As I hinted above, I would like to distribute the computation. Therefore I am thinking about converting each doc into two arrays (keys and values) sorted by key, as proposed in the SortedMultiDict suggestion, and using merging of sorted arrays to keep them sorted after merging. This way, the complexity of merging should be O(m+n), where m and n are the lengths of the corresponding arrays. This should also parallelize nicely (ideally using distributed reduction in Transducers). At the end, I will build the Dict in a single sweep.
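
Such an O(m+n) merge of two key-sorted pair vectors could look something like this (a sketch, assuming the intermediate results are plain Vectors of word => doc_id pairs sorted by key):

```julia
# Sketch: merge two vectors of (word => doc_id) pairs, each sorted by key,
# into one key-sorted vector in O(m + n).
function merge_sorted(a::AbstractVector, b::AbstractVector)
    out = Vector{eltype(a)}()
    sizehint!(out, length(a) + length(b))
    i, j = 1, 1
    while i <= length(a) && j <= length(b)
        if first(a[i]) <= first(b[j])   # stable: ties take from `a` first
            push!(out, a[i]); i += 1
        else
            push!(out, b[j]); j += 1
        end
    end
    append!(out, view(a, i:length(a)))  # at most one of these is non-empty
    append!(out, view(b, j:length(b)))
    return out
end
```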

Is the above a bad idea? Has anyone done something similar?

It seems like a reasonable approach to me.

Yes, merge and merge! operations are available for SortedMultiDict. The documentation is here: Sorted Containers · DataStructures.jl
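
For example (a small sketch using the two-array SortedMultiDict constructor; merge returns a new container, while merge! folds the second into the first):

```julia
using DataStructures

# keys and values given as parallel arrays
a = SortedMultiDict([1, 3], ["doc1", "doc1"])
b = SortedMultiDict([1, 2], ["doc3", "doc2"])

m = merge(a, b)  # holds 1=>"doc1", 1=>"doc3", 2=>"doc2", 3=>"doc1"
```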

I have checked the implementation, but it seems to merge by calling insert! sequentially, which is what I am trying to avoid (merge links to this one, which seems to me to be a sequential call of insert!).

Once I am properly behind a computer, I will try it.

Yes, this merge routine requires N*log N operations, where N is the total size of the two dictionaries. A more complicated implementation (looping over both dictionaries with two iterators concurrently to copy the data into new arrays, then building a tree structure above the sorted merged arrays) would require only linear time, but nobody has coded this. You can open an issue on the DataStructures GitHub page to request a more efficient implementation of merge.


I have compared two implementations.

The first one kept the intermediate solutions in a sorted list of pairs, and I had written a merge with complexity O(M+N), where M and N are the lengths of the two merged lists. The processing was distributed using Transducers as:

```julia
t = @elapsed c = foldxd(merge_sorted, Map(s -> dictfile(s, reformat_json)), chunks) |> Dictionary
```

At the end, I put the list into a Dictionary from Dictionaries.jl.
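
That final single sweep over the merged, key-sorted pairs could look roughly like this (a sketch; `build_index` is a hypothetical name, and Int words with String doc ids are assumptions):

```julia
# Sketch: group a key-sorted vector of (word => doc_id) pairs into
# parallel `words` / `postings` arrays in one pass.
function build_index(sorted_pairs)
    words = Int[]
    postings = Vector{Vector{String}}()
    for (w, d) in sorted_pairs
        if isempty(words) || words[end] != w  # a new word starts a new posting list
            push!(words, w)
            push!(postings, String[])
        end
        push!(postings[end], d)
    end
    return words, postings
end
```

The two parallel arrays can then be wrapped as `Dictionary(words, postings)` from Dictionaries.jl, which accepts indices and values directly.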

Processing 3247 chunks led to 2_043_723_639 pairs (the problem is large), and the above finished in 1200 seconds on a machine with 28 cores / 56 threads.

In the second approach, after processing each chunk I stored it in a SortedMultiDict and processed it in the same way:

```julia
@elapsed d = foldxd(Map(s -> dictfile(s, reformat_json)), chunks) do a, b
    t = @elapsed DataStructures.merge!(a, b)
    @info "merging took $(t)s"
    a  # return the merged accumulator
end
```

This approach did not finish, as I ran out of memory (the computational node had about 380 GB).

Overall, I am quite happy with how well it went. I do not understand the trees used in SortedMultiDict well, but I can see room for improvement there.