[ANN] New package: WordFrequencyDistributions.jl

I’d like to belatedly announce a little package I made, WordFrequencyDistributions.jl: a Julia implementation of some of the techniques for estimating and analyzing word frequency statistics from R. Harald Baayen’s book of the same name (Springer, 2001).

The main alternative implementation of these methods, to my knowledge, is the R package zipfR. Although my package is more limited in scope, it performs much better, especially when running Monte Carlo simulations.

I’ve recently added some benchmarks to the package, covering a complete workflow that consists of the following steps (sketched in code after the list):

  1. Reading and tokenizing the complete works of Shakespeare into a Vector{String} (that’s just short of a million words)
  2. Initializing a Corpus struct from it
  3. Computing some statistics, such as finding the 10 most frequent words and calculating the rate of vocabulary growth at 20 equispaced points in the text
  4. Interpolating an expected vocabulary size at various points in the text
  5. Checking the dispersion of a given word in the text across 40 equally-sized chunks
  6. Running a Monte Carlo simulation in which we randomly shuffle the text 100 times, and analyze the distribution of a given word in the actual text compared to in the 100 shuffled versions
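
To make these steps concrete, here is a rough plain-Julia sketch of what steps 2 through 6 compute. This is *not* the package’s API, just the underlying statistics spelled out in base Julia; the file path, the chosen word, and the simple approximation used for step 4 are all placeholders.

```julia
# Plain-Julia sketch of steps 2-6 -- not the package's API.
using Random

# Step 1: read and tokenize (path and whitespace tokenization are placeholders)
text = String.(split(read("shakespeare.txt", String)))
N = length(text)

# Step 2: in the package this is a Corpus struct; here a Dict of counts stands in
counts = Dict{String,Int}()
for w in text
    counts[w] = get(counts, w, 0) + 1
end

# Step 3a: the 10 most frequent words
top10 = partialsort(collect(counts), 1:10; by = last, rev = true)

# Step 3b: vocabulary size at 20 equispaced points in the text
points = round.(Int, range(N / 20, N; length = 20))
growth = [length(Set(view(text, 1:n))) for n in points]

# Step 4: expected vocabulary size for a random n-token sample
# (a simple approximation of the interpolation formula)
EV(n) = sum(1 - (1 - n / N)^f for f in values(counts))
expected = EV.(points)

# Step 5: dispersion of one word across 40 equal-sized chunks
word = "dagger"                       # placeholder word
chunklen = N ÷ 40
chunk(v, i) = view(v, (i - 1) * chunklen + 1 : i * chunklen)
observed = count(i -> word in chunk(text, i), 1:40)

# Step 6: Monte Carlo -- compare against 100 random shuffles of the text
null = map(1:100) do _
    s = shuffle(text)
    count(i -> word in chunk(s, i), 1:40)
end
```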

My package does steps 2 through 6 in about 1.4 seconds, while zipfR takes roughly twice as long, and Python’s nltk takes about 33 times longer for the same routine. (I didn’t measure the loading and tokenizing time, since that’s outside the scope of my package, but that part was comparable in all three languages.)
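
If you want to reproduce timings like these on your own corpus, BenchmarkTools is the usual tool; `analyze` below is just a hypothetical wrapper around steps 2 through 6, not a function from the package.

```julia
using BenchmarkTools

# Hypothetical driver: wrap steps 2-6 in one function so the whole
# workflow can be timed as a unit. Fill in the actual calls you use.
function analyze(tokens::Vector{String})
    # build the Corpus and compute the statistics here
end

@btime analyze($text)   # `$` interpolates the global so timing isn't skewed by it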

It’s also memory-efficient: the Corpus struct representing the text occupies only about 3.52 MB of RAM, roughly 65% of the size of the text itself. (zipfR is comparable in this respect.)
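
For anyone who wants to check a figure like this themselves, `Base.summarysize` reports the size of an object plus everything it references. Calling the constructor as `Corpus(text)` is an assumption on my part, based on step 2 above.

```julia
using WordFrequencyDistributions

c = Corpus(text)                        # assumed constructor, per step 2
Base.summarysize(c) / 1024^2            # Corpus footprint in MiB (recursive)
Base.summarysize(text) / 1024^2         # compare against the raw token vector
```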

This package is also more generic. Although the intended use is for counting words/tokens in a Vector{String}, it could theoretically work for any sort of item (any Vector{T}) for which you might want to analyze frequency statistics.
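
As a hypothetical example, you could feed it a token stream that isn’t text at all; this sketch assumes the Corpus constructor accepts an arbitrary Vector{T}, as described above.

```julia
using WordFrequencyDistributions

# DNA bases instead of words -- any Vector{T} of tokens, assuming
# the constructor is as generic as described above.
bases = rand(['A', 'C', 'G', 'T'], 100_000)   # a Vector{Char}
c = Corpus(bases)
# ...the same frequency and dispersion statistics would then apply to `bases`.
```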
