I have a case with a fixed corpus of documents and a single query document. The task is to calculate the similarity between the fixed documents and a document sent by a user. For now we are using TextAnalysis with Corpus/DocumentTermMatrix and have code like this:
function count_tfidf(crps::Corpus, request::TokenDocument)
    # take a copy of the fixed corpus
    local l_crps = Corpus(documents(crps) |> copy)
    # add the new doc into the copy for processing
    push!(l_crps, request)
    # merge terms to avoid recalculation for all the docs
    local terms = keys(merge(lexicon(crps), ngrams(request))) |> collect
    # calc full tf-idf
    local doc_tf_idf = DocumentTermMatrix(l_crps, terms) |> tf_idf
    # take the row with our doc
    local val = doc_tf_idf[end, :]
    # calculate similarity (with Distances.cosine_dist as a sample only)
    return map(eachrow(doc_tf_idf[1:end-1, :])) do row
        cosine_dist(row, val)
    end
end
The issue here is that even though we avoid a full lexicon update for new documents (we reuse the lexicon of the fixed corpus), the DocumentTermMatrix and tf_idf methods still perform the calculation for all the documents.
It would be good to have an individual tf_idf method which could use an existing matrix and calculate weights only for a requested set of terms or tokens. We don't need to store the query documents anywhere, or use them to recalculate the lexicon/DTM of our static corpus.
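For illustration, here is a minimal sketch of what such a method could look like, assuming the corpus statistics (document count and per-term document frequencies) are cached up front. `CorpusStats` and `query_tf_idf` are hypothetical names, not TextAnalysis API:

```julia
# Sketch: precompute corpus statistics once, then weight only the query.
# CorpusStats / query_tf_idf are hypothetical, not part of TextAnalysis.

struct CorpusStats
    n::Int                # number of documents in the fixed corpus
    df::Dict{String,Int}  # documents_containing_term for each term
end

function query_tf_idf(stats::CorpusStats, counts::Dict{String,Int})
    total = max(sum(values(counts)), 1)  # words in the query document
    weights = Dict{String,Float64}()
    for (term, c) in counts
        dfi = get(stats.df, term, 0)
        dfi == 0 && continue             # term unseen in corpus: no idf, skip (or smooth)
        # same formula as the library: tf * log(n / df)
        weights[term] = (c / total) * log(stats.n / dfi)
    end
    return weights
end

stats = CorpusStats(3, Dict("julia" => 2, "text" => 1))
w = query_tf_idf(stats, Dict("julia" => 2, "rust" => 1))
```

The fixed corpus is never touched per request; only the query's own term counts are weighted.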
Technically I see a way to calculate a DocumentTermMatrix for just that one document, by creating a separate Corpus for it - https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/dtm.jl#L57. The Features page of the TextAnalysis docs also shows the following code for getting the terms row for a single document of the corpus, given a lexicon:
julia> dtv(crps[1], lexicon(crps))
1×6 Array{Int64,2}:
1 2 0 1 1 1
But I don’t see any way to calculate tf_idf or tf_bm25 for the new row only, without recalculating all the weights for all documents. Recalculating everything is too slow to do in real time, yet I need to produce a response in any case.
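To make the goal concrete, a sketch of the per-request path I have in mind, with plain vectors standing in for the dtv row and a precomputed corpus row, and a hand-rolled cosine distance standing in for Distances.cosine_dist:

```julia
using LinearAlgebra

n  = 3                   # documents in the fixed corpus (cached once)
df = [2, 1, 3, 1, 2, 1]  # cached documents_containing_term per lexicon term
idf = log.(n ./ df)

q = [1, 2, 0, 1, 1, 1]                 # dtv-style count row for the query
q_tfidf = (q ./ max(sum(q), 1)) .* idf # tf * idf for the query row only

# cosine distance against one precomputed corpus tf-idf row
# (stand-in for Distances.cosine_dist)
row = [0.1, 0.0, 0.2, 0.0, 0.3, 0.0]
cos_dist = 1 - dot(q_tfidf, row) / (norm(q_tfidf) * norm(row))
```

The corpus tf-idf matrix would be computed once at startup; per request only the query row is weighted and compared.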
So, is there any way to calculate TF/TF-IDF/BM25 other than directly reusing the code:
function tf_idf!(dtm::SparseMatrixCSC{T}, tfidf::SparseMatrixCSC{F}) where {T <: Real, F <: AbstractFloat}
    rows = rowvals(dtm)
    dtmvals = nonzeros(dtm)
    tfidfvals = nonzeros(tfidf)
    @assert size(dtmvals) == size(tfidfvals)

    n, p = size(dtm)

    # TF tells us what proportion of a document is defined by a term
    words_in_documents = F.(sum(dtm, dims=2))
    oneval = one(F)

    # IDF tells us how rare a term is in the corpus
    documents_containing_term = vec(sum(dtm .> 0, dims=1))
    idf = log.(n ./ documents_containing_term)

    for i = 1:p
        for j in nzrange(dtm, i)
            row = rows[j]
            tfidfvals[j] = dtmvals[j] / max(words_in_documents[row], oneval) * idf[i]
        end
    end

    return tfidf
end
from https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/tf_idf.jl#L134
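For a single new document, the body of that function collapses to one pass over the query's nonzeros, provided `idf` is cached from the fixed corpus. A sketch with the SparseArrays stdlib (`tf_idf_row` is my own name, not an existing TextAnalysis method):

```julia
using SparseArrays

# Sketch: tf_idf! reduced to one document. `idf` is cached from the fixed
# corpus; only the query's nonzero entries are touched.
function tf_idf_row(dtv_row::SparseVector{<:Real}, idf::AbstractVector{<:Real})
    inds, vals = findnz(dtv_row)
    total = max(sum(vals), one(eltype(vals)))  # words in the query document
    out = spzeros(Float64, length(dtv_row))
    for (i, v) in zip(inds, vals)
        out[i] = v / total * idf[i]            # same per-entry formula as tf_idf!
    end
    return out
end

idf = log.(3 ./ [2, 1, 3, 1, 2, 1])  # n = 3 docs, cached document frequencies
row = tf_idf_row(sparsevec([1, 2, 4, 5, 6], [1, 2, 1, 1, 1], 6), idf)
```

This does per-entry `setindex!` on a sparse vector, which is fine for a single short query row but would not be the right pattern for a whole matrix.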