Kullback-Leibler divergence for vector and normal distribution

I want to calculate the Kullback-Leibler divergence between data I collected in a vector x, which I interpret as samples from an unknown distribution, and the standard normal distribution. The maths behind the KL divergence are straightforward. My naive approach would be to

  • choose a number of bins
  • make a histogram of x
  • discretize the density of the normal distribution according to the bins
  • calculate the KL divergence of two vectors using for example kldivergence from StatsBase
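A minimal sketch of those steps, assuming StatsBase and Distributions are available (the variable names and the square-root bin-count heuristic are illustrative choices, not prescribed anywhere):

```julia
using StatsBase, Distributions

x = randn(1000)                       # stand-in for the collected data

nbins = round(Int, sqrt(length(x)))   # one common heuristic
edges = range(minimum(x), maximum(x); length = nbins + 1)
h = fit(Histogram, x, edges)
p = h.weights ./ sum(h.weights)       # empirical bin probabilities

# Discretize the standard normal over the same bins: probability mass
# per bin (CDF differences), not density values at the bin centers.
q = [cdf(Normal(), edges[i + 1]) - cdf(Normal(), edges[i]) for i in 1:nbins]
q ./= sum(q)                          # renormalize over the covered range

kl = kldivergence(p, q)               # empty bins may need extra care
```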

I wonder how good an approach that is (conceptually and implementation-wise). Is there a Julia package with more refined methods? What about the sensitivity with respect to the number of bins?

Thanks for all the answers!

I want to calculate the Kullback-Leibler divergence

It’s implemented in Distances.jl (as one of its metrics). It is not in this other cool package, just used there as a dependency: SequencerJ.jl/index.md at master · turingtest37/SequencerJ.jl · GitHub

and in the original Python version. See: sequencer.org

@Palli thanks for your quick answer. If I understood correctly, the KL divergence in Distances.jl calculates the divergence between two vectors, so conceptually it does the same thing as kldivergence from StatsBase.

I did not understand the use of the SequencerJ.jl package. What problem does it solve, and how would I use it for my case?

Discretizing the normal is the right thing to do.
I’m sure you know this, but you want to normalize it as a discrete distribution, not as a density (i.e. ignore the bin widths). Also, you can add a very small constant to every bin to avoid numerical issues if there are any zero bins.
I’ve heard it said that a good number of bins is the square root of the number of samples.
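The smoothing advice above might look like this in code (`smoothed_kl` and the constant `ε` are illustrative names, not from any package):

```julia
# Add a tiny constant to every bin count before normalizing, so empty
# bins cause neither log(0) nor a division by zero.
function smoothed_kl(counts_p, counts_q; ε = 1e-10)
    p = (counts_p .+ ε) ./ sum(counts_p .+ ε)
    q = (counts_q .+ ε) ./ sum(counts_q .+ ε)
    return sum(p[i] * log(p[i] / q[i]) for i in eachindex(p))
end

smoothed_kl([5, 0, 3], [4, 2, 2])   # the zero bin is no longer a problem
```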

If you need to compare two densities, especially in high dimensions, consider using a density ratio: GitHub - JuliaEarth/DensityRatioEstimation.jl: Density ratio estimation in Julia

You can express the KL-divergence in terms of the estimated ratio and that is usually more robust. All you need are samples from the two densities, no need to create bins.
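To illustrate the identity behind this (this is not DensityRatioEstimation.jl's API, just the underlying idea): since KL(p ‖ q) = E_p[log(p(x)/q(x))], a Monte Carlo estimate only needs samples from p and the value of the ratio at those samples. In this toy example both densities are known normals, so the ratio is exact:

```julia
using Distributions, Statistics

p, q = Normal(0.5, 1.0), Normal()   # example pair with known densities
xs = rand(p, 100_000)               # samples from p only

# KL(p ‖ q) = E_p[log(p(x)/q(x))], estimated without any bins.
kl_mc = mean(log.(pdf.(p, xs) ./ pdf.(q, xs)))

# Closed form for KL(N(μ, 1) ‖ N(0, 1)) is μ²/2, i.e. 0.125 here.
kl_exact = 0.5^2 / 2
```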


I did not understand the use of Sequencer.jl package.

It’s off-topic but cool. I didn’t read your question too carefully and thought you were looking for an implementation, which I remembered having used there; then I realized the KL divergence is only in a dependency (Distances.jl) and edited my answer.