[ANN] New package: WordFrequencyDistributions.jl

I’d like to belatedly announce a little package I made, WordFrequencyDistributions.jl: a Julia implementation of some of the techniques for estimating and analyzing word frequency statistics from R. Harald Baayen’s book of the same name (Springer, 2001).

The main alternative implementation of these methods, to my knowledge, is the R package zipfR. Although my package is more limited in scope, it performs much better, especially when running Monte Carlo simulations.

I’ve recently added some benchmarks to the package, covering a complete workflow that consists of the following steps (sketched in code after the list):

  1. Reading and tokenizing the complete works of Shakespeare into a Vector{String} (that’s just short of a million words)
  2. Initializing a Corpus struct from it
  3. Computing some statistics, such as finding the 10 most frequent words and calculating the rate of vocabulary growth at 20 equispaced points in the text
  4. Interpolating an expected vocabulary size at various points in the text
  5. Checking the dispersion of a given word in the text across 40 equally-sized chunks
  6. Running a Monte Carlo simulation in which we randomly shuffle the text 100 times, and analyze the distribution of a given word in the actual text compared to in the 100 shuffled versions
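
To make these steps concrete, here is a rough plain-Julia sketch of what steps 2 through 6 compute. This is *not* the package’s API, just the underlying statistics spelled out in base Julia; the file path, the chosen word, and the simple approximation used for step 4 are all placeholders.

```julia
# Plain-Julia sketch of steps 2-6 -- not the package's API.
using Random

# Step 1: read and tokenize (path and whitespace tokenization are placeholders)
text = String.(split(read("shakespeare.txt", String)))
N = length(text)

# Step 2: in the package this is a Corpus struct; here a Dict of counts stands in
counts = Dict{String,Int}()
for w in text
    counts[w] = get(counts, w, 0) + 1
end

# Step 3a: the 10 most frequent words
top10 = partialsort(collect(counts), 1:10; by = last, rev = true)

# Step 3b: vocabulary size at 20 equispaced points in the text
points = round.(Int, range(N / 20, N; length = 20))
growth = [length(Set(view(text, 1:n))) for n in points]

# Step 4: expected vocabulary size for a random n-token sample
# (a simple approximation of the interpolation formula)
EV(n) = sum(1 - (1 - n / N)^f for f in values(counts))
expected = EV.(points)

# Step 5: dispersion of one word across 40 equal-sized chunks
word = "dagger"                       # placeholder word
chunklen = N ÷ 40
chunk(v, i) = view(v, (i - 1) * chunklen + 1 : i * chunklen)
observed = count(i -> word in chunk(text, i), 1:40)

# Step 6: Monte Carlo -- compare against 100 random shuffles of the text
null = map(1:100) do _
    s = shuffle(text)
    count(i -> word in chunk(s, i), 1:40)
end
```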

My package does steps 2 through 6 in about 1.4 seconds, while zipfR takes roughly twice as long, and Python’s nltk takes about 33 times longer for the same routine. (I didn’t measure the loading and tokenizing time, since that’s outside the scope of my package, but that part was comparable in all three languages.)
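
If you want to reproduce timings like these on your own corpus, BenchmarkTools is the usual tool; `analyze` below is just a hypothetical wrapper around steps 2 through 6, not a function from the package.

```julia
using BenchmarkTools

# Hypothetical driver: wrap steps 2-6 in one function so the whole
# workflow can be timed as a unit. Fill in the actual calls you use.
function analyze(tokens::Vector{String})
    # build the Corpus and compute the statistics here
end

@btime analyze($text)   # `$` interpolates the global so timing isn't skewed by it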

It’s also memory-efficient: the Corpus struct representing the text occupies only about 3.52 MB of RAM, roughly 65% of the size of the text itself. (zipfR is comparable in this respect.)
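
For anyone who wants to check a figure like this themselves, `Base.summarysize` reports the size of an object plus everything it references. Calling the constructor as `Corpus(text)` is an assumption on my part, based on step 2 above.

```julia
using WordFrequencyDistributions

c = Corpus(text)                        # assumed constructor, per step 2
Base.summarysize(c) / 1024^2            # Corpus footprint in MiB (recursive)
Base.summarysize(text) / 1024^2         # compare against the raw token vector
```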

This package is also more generic. Although the intended use is for counting words/tokens in a Vector{String}, it could theoretically work for any sort of item (any Vector{T}) for which you might want to analyze frequency statistics.
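
As a hypothetical example, you could feed it a token stream that isn’t text at all; this sketch assumes the Corpus constructor accepts an arbitrary Vector{T}, as described above.

```julia
using WordFrequencyDistributions

# DNA bases instead of words -- any Vector{T} of tokens, assuming
# the constructor is as generic as described above.
bases = rand(['A', 'C', 'G', 'T'], 100_000)   # a Vector{Char}
c = Corpus(bases)
# ...the same frequency and dispersion statistics would then apply to `bases`.
```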
