I want to calculate the Kullback-Leibler divergence between data I collected in a vector x, which I interpret as samples from an unknown distribution, and the standard normal distribution. The maths behind the KL divergence is straightforward. My naive approach would be to
choose a number of bins
make a histogram of x
discretize the density of the normal distribution according to the bins
calculate the KL divergence of the two vectors using, for example, kldivergence from StatsBase
I wonder how good an approach that is (conceptually and implementation-wise). Is there a Julia package with more refined methods? And how sensitive is the result to the number of bins?
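For concreteness, here is a minimal sketch of those steps (the sample vector and bin count are placeholder choices, assuming Distributions.jl and StatsBase.jl):

```julia
using Distributions, StatsBase

x = randn(1_000)                     # placeholder for the collected samples
nbins = 30                           # arbitrary choice
edges = range(minimum(x), maximum(x); length = nbins + 1)

# empirical pmf of x over the bins
h = fit(Histogram, x, edges)
p = h.weights ./ sum(h.weights)

# discretize the standard normal over the same bins via its cdf
d = Normal()
q = [cdf(d, edges[i + 1]) - cdf(d, edges[i]) for i in 1:nbins]
q ./= sum(q)                         # renormalize away the truncated tail mass

kldivergence(p, q)                   # finite only if q > 0 wherever p > 0
```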
@Palli thanks for your quick answer. If I understood correctly, the KL divergence in Distances.jl calculates the distance between two vectors, so conceptually it does the same thing as kldivergence from StatsBase.
I did not understand the use of the Sequencer.jl package. What problem does it solve, and how would I use it in my case?
Discretizing the normal is the right thing to do.
I’m sure you know this, but you want to normalize it as a discrete distribution, not as a density (i.e. ignore the bin widths). Also, you can add a very small constant to everything to avoid numerical issues if there are any zero bins.
I’ve heard it said that a good number of bins is the square root of the number of samples.
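As a concrete illustration of the smoothing (the ε value here is an arbitrary choice):

```julia
# additive smoothing: pad the raw bin masses before normalizing, so the KL
# divergence stays finite when a bin of q is empty while the same bin of p is not
smooth(w; ε = 1e-10) = (w .+ ε) ./ sum(w .+ ε)

counts = [0, 3, 7, 0]    # hypothetical raw histogram counts with empty bins
p = smooth(counts)       # strictly positive and sums to 1
```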
You can express the KL divergence in terms of an estimated density ratio p(x)/q(x), and that is usually more robust. All you need are samples from the two densities; there is no need to create bins.
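One way to make this concrete (not necessarily what was meant above, which may refer to direct density-ratio estimation): KL(P‖Q) = E_P[log p(x) − log q(x)], and since Q here is the standard normal in closed form, only log p needs to be estimated at the sample points, e.g. with a kernel density estimate. A sketch assuming KernelDensity.jl:

```julia
using Distributions, KernelDensity, Statistics

x = randn(1_000) .* 1.2 .+ 0.3   # placeholder samples from the unknown P
q = Normal()                     # the known reference distribution Q

k = InterpKDE(kde(x))            # kernel density estimate of p, evaluable pointwise
# Monte Carlo estimate of E_P[log p(x) - log q(x)] over the samples
kl = mean(log.(pdf.(Ref(k), x)) .- logpdf.(q, x))
```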
I did not understand the use of the Sequencer.jl package.
It’s off-topic, but cool. I didn’t read your question carefully enough and thought you were looking for an implementation, which I remembered having used there; then I realized it was only in a dependency, and I edited my answer.
Hi, I have a question regarding the KL divergence between two bivariate distributions P(x, y) and Q(x, y), where x is discrete but y is continuous. My strategy is to discretize the continuous variable using bins and then estimate the two joint distributions over the resulting grid. Does that sound okay?
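If it helps, here is a minimal sketch of that strategy (the paired samples, bin edges, and ε are all placeholder choices, assuming StatsBase.jl's multivariate histograms):

```julia
using StatsBase

# hypothetical paired samples: x discrete in 1:3, y continuous
xp, yp = rand(1:3, 5_000), randn(5_000)
xq, yq = rand(1:3, 5_000), randn(5_000) .+ 0.5

xedges = 0.5:1.0:3.5                  # one bin per discrete level of x
yedges = range(-5, 5; length = 21)    # arbitrary binning for y

hp = fit(Histogram, (xp, yp), (xedges, yedges))
hq = fit(Histogram, (xq, yq), (xedges, yedges))

# flatten the 2-d bin counts to joint pmfs, with a small ε so empty bins don't give Inf
ε = 1e-10
p = vec(hp.weights) .+ ε
q = vec(hq.weights) .+ ε
kldivergence(p ./ sum(p), q ./ sum(q))
```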