TextAnalysis and individual TF-IDF

I have a case with a fixed corpus of documents and one document that acts as a query. The task is to calculate the similarity between the fixed documents and a document sent by a user. For now we are using TextAnalysis with Corpus/DocumentTermMatrix, and the code looks something like this:

using TextAnalysis
using Distances: cosine_dist

function count_tfidf(crps::Corpus, request::TokenDocument)
  # take a copy of the fixed corpus
  local l_crps = Corpus(documents(crps) |> copy)

  # add the new doc into the copy for processing
  push!(l_crps, request)

  # merge terms to avoid recalculating the lexicon for all the docs
  local terms = keys(merge(lexicon(crps), ngrams(request))) |> collect

  # calc the full tf-idf
  local doc_tf_idf = DocumentTermMatrix(l_crps, terms) |> tf_idf
  # take the row with our doc
  local val = doc_tf_idf[end, :]

  # calculate similarity (with Distances.cosine_dist as a sample only)
  return map(eachrow(doc_tf_idf[1:end-1, :])) do row
    cosine_dist(row, val)
  end
end

The issue is that even though we avoid a full lexicon update for new documents (we reuse the lexicon from the fixed corpus), the DocumentTermMatrix and tf_idf calls still perform the calculation for all the documents.

It would be good to have some individual tf_idf method that can use an existing matrix and calculate weights only for a requested set of terms or tokens. We don't need to store the query documents there or use them to recalculate the lexicon/DTM of our static corpus.
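What I have in mind is roughly the following: cache the corpus-side statistics once and weight only the query row. A rough sketch of the idea (the helper name query_tfidf_scores is made up, query terms outside the fixed lexicon are simply dropped, and the TF normalisation only approximates what tf_idf does internally):

using TextAnalysis
using Distances: cosine_dist

# One-time setup over the fixed corpus (assumes update_lexicon!(crps) has been called).
m            = DocumentTermMatrix(crps)
corpus_dtm   = dtm(m)                      # sparse counts, documents × terms
corpus_tfidf = tf_idf(m)                   # tf-idf of the fixed corpus, computed once
ndocs        = size(corpus_dtm, 1)
# idf depends only on the fixed corpus, so it can be cached alongside corpus_tfidf
idf = log.(ndocs ./ max.(vec(sum(corpus_dtm .> 0, dims = 1)), 1))

# Hypothetical per-query helper: weights only the query row, the corpus is never recomputed.
function query_tfidf_scores(request)
    counts = vec(dtv(request, lexicon(crps)))       # query counts over the fixed lexicon
    q      = counts ./ max(sum(counts), 1) .* idf   # tf * idf for the query row only
    return [cosine_dist(row, q) for row in eachrow(corpus_tfidf)]
end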

Technically I see a way to calculate a DocumentTermMatrix for that single document by creating a separate Corpus - https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/dtm.jl#L57. I also see the Features · TextAnalysis docs page with the following code for getting the term row for a single document from the corpus and for a given lexicon:

julia> dtv(crps[1], lexicon(crps))
1×6 Array{Int64,2}:
 1  2  0  1  1  1

But I don't see any way to calculate tf_idf or tf_bm25 for the new row only, without recalculating all the weights for all the documents. I cannot afford to do that in real time, but I need to produce a response in any case.
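For BM25 in particular, the per-row weighting seems to need only corpus statistics that could be cached (document frequencies, number of documents, average document length). A sketch of the usual BM25 term weighting with a simple log(N/df) idf for a single count vector; the helper name and the k1/b defaults are illustrative, not the library's API:

# BM25-style weights for one count vector, from cached corpus statistics only.
# bm25_row and its k1/b defaults are illustrative, not TextAnalysis API.
function bm25_row(counts::AbstractVector, doc_freqs::AbstractVector,
                  ndocs::Integer, avg_doclen::Real; k1 = 2.0, b = 0.75)
    doclen = sum(counts)
    idf    = log.(ndocs ./ max.(doc_freqs, 1))
    denom  = counts .+ k1 * (1 - b + b * doclen / avg_doclen)
    return idf .* counts .* (k1 + 1) ./ denom
end

# cached once for the fixed corpus:
#   doc_freqs  = vec(sum(corpus_dtm .> 0, dims = 1))
#   avg_doclen = sum(corpus_dtm) / size(corpus_dtm, 1)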

So, is there any way to calculate TF/TF-IDF/BM25 other than directly reusing this code:

function tf_idf!(dtm::SparseMatrixCSC{T}, tfidf::SparseMatrixCSC{F}) where {T <: Real, F <: AbstractFloat}
    rows = rowvals(dtm)
    dtmvals = nonzeros(dtm)
    tfidfvals = nonzeros(tfidf)
    @assert size(dtmvals) == size(tfidfvals)

    n, p = size(dtm)

    # TF tells us what proportion of a document is defined by a term
    words_in_documents = F.(sum(dtm, dims=2))
    oneval = one(F)

    # IDF tells us how rare a term is in the corpus
    documents_containing_term = vec(sum(dtm .> 0, dims=1))
    idf = log.(n ./ documents_containing_term)

    for i = 1:p
       for j in nzrange(dtm, i)
          row = rows[j]
          tfidfvals[j] = dtmvals[j] / max(words_in_documents[row], oneval) * idf[i]
       end
    end

    return tfidf
end

from https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/tf_idf.jl#L134
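Looking at that implementation, the only quantities tied to the whole corpus are idf and the per-document word counts; the inner loop is otherwise per-element. So the same formula could, in principle, be applied to a single new sparse row against a cached idf vector, something like this (a sketch only, not library API):

using SparseArrays

# Apply the per-element formula from tf_idf! to one sparse count vector,
# reusing an idf vector cached from the fixed corpus (sketch only).
function tf_idf_row(counts::SparseVector{<:Real}, idf::Vector{F}) where {F <: AbstractFloat}
    words_in_document = max(F(sum(counts)), one(F))
    is, vs = findnz(counts)
    return sparsevec(is, F.(vs) ./ words_in_document .* idf[is], length(counts))
end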

Have a look at
https://zgornel.github.io/StringAnalysis.jl/dev/examples/#Dimensionality-reduction-1
and
https://zgornel.github.io/StringAnalysis.jl/dev/examples/#Semantic-Analysis-1

By using the “no projection hack” you can use a random projection model to embed the query and the corpus, and calculate the similarity between them.
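To sketch the rough idea behind those examples without the StringAnalysis API (see the linked docs for the actual RP model), assuming a documents × terms tf-idf matrix corpus_tfidf and a query tf-idf vector q like the ones sketched above: project both with the same random matrix, then compare with cosine distance. As I read it, the “no projection hack” amounts to skipping the projection, so the comparison happens directly on the weighted term vectors.

using Random
using Distances: cosine_dist

k = 100                                              # projection dimension (illustrative)
nterms = size(corpus_tfidf, 2)
R = randn(MersenneTwister(0), k, nterms) ./ sqrt(k)  # one random projection, shared by all

corpus_emb = corpus_tfidf * R'                       # ndocs × k embedded corpus
query_emb  = R * q                                   # k-dimensional embedded query

dists = [cosine_dist(corpus_emb[i, :], query_emb) for i in 1:size(corpus_emb, 1)]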


Thank you, I will check it. BTW, what was the reason for implementing functions similar to TextAnalysis as a separate project?

TextAnalysis is a big package with some issues (personal opinion, of course) in code quality, functionality, features and development/PR merge speed.

Basically, I needed fast deployment of a stable version, better typing, information-retrieval-oriented methods, fewer large dependencies (i.e. Flux.jl) and a cleaner API, so I hard-forked it.
