Calculating Entropy from data

statistics

#1

I have a vector of data sampled from a random process and want to calculate the entropy of the random variable.
I wrote this code

using Distributions
using StatsBase

function calc_entropy1( x::Vector{Float64} )
    N        = length( x );
    h        = fit( Histogram, x );
    bin_size = step( h.edges[1] );
    p        = h.weights ./ N;
    # Now I have to calculate \sum p_i \log(p_i), but need to normalize
    # the p_i s with the bin length to get densities. So ideally, I should have
    # p_i -> p_i/bin_length and then use this to calculate entropy. Of course
    # I have to multiply with the bin length to emulate the integration,
    # so one of those 'bin_length's in the denominator will go away
    return -mapreduce( y -> (y == 0) ? 0.0 : y * log( 2, y / bin_size ), +, p );
end

function calc_entropy2( x::Vector{Float64} )
    N        = length( x );
    h        = fit( Histogram, x );
    p        = h.weights ./ N;
    # Now I have to calculate \sum p_i \log(p_i), with no normalization
    return -mapreduce( y -> (y == 0) ? 0.0 : y * log( 2, y ), +, p );
end

X   = Normal( 0, 1 );
e_t = entropy( X );
println( "Actual entropy = ", e_t ); # Gives 1.4189385332046727
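As a side check (my own sketch, not part of the original post), the bin-width normalization in calc_entropy1 only shifts the discrete entropy by a constant log2(bin_size), since the p_i sum to 1:

```julia
# Toy bin probabilities and bin width (hypothetical values, just to
# illustrate the identity; any valid histogram would do).
p = [0.2, 0.3, 0.5]
Δ = 0.1

# Discrete entropy in bits, and the density-normalized version.
h_disc = -sum( q * log( 2, q )     for q in p )
h_diff = -sum( q * log( 2, q / Δ ) for q in p )

# -Σ p_i log2(p_i/Δ) = -Σ p_i log2(p_i) + log2(Δ) * Σ p_i
@assert isapprox( h_diff, h_disc + log( 2, Δ ) )
```

So the two estimators should differ by exactly log2 of the bin width, which explains the constant offset between them in the runs below.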

Now if I run the empirical calculations, I get

julia> x = rand(X,10); calc_entropy1( x ), calc_entropy2( x )
(2.4464393446710155, 2.4464393446710155)

julia> x = rand(X,100); calc_entropy1( x ), calc_entropy2( x )
(2.0081915715864564, 2.0081915715864564)

julia> x = rand(X,1000); calc_entropy1( x ), calc_entropy2( x )
(2.1493245852107323, 2.1493245852107323)

julia> x = rand(X,10000); calc_entropy1( x ), calc_entropy2( x )
(2.0558271324757027, 3.0558271324757023)

julia> x = rand(X,100000); calc_entropy1( x ), calc_entropy2( x )
(2.0589125311519716, 3.058912531151971)

julia> x = rand(X,1000000); calc_entropy1( x ), calc_entropy2( x )
(2.0617080562382912, 3.0617080562382917)

julia> x = rand(X,10000000); calc_entropy1( x ), calc_entropy2( x )
(2.062396029108722, 3.0623960291087227)

As you can see, both calculations are wrong!
So what is the correct way to calculate entropy from data?

Thanks,
vish


#2

I didn’t look carefully at the code, but I think we need to converge on a single, standard, stable implementation of entropy. I wrote one a long time ago in Images.jl for N-dimensional grayscale images: https://github.com/JuliaImages/Images.jl/blob/master/src/algorithms.jl#L94-L139

StatsBase also has one here: https://github.com/JuliaStats/StatsBase.jl/blob/master/src/scalarstats.jl#L390-L414 They also have other, more generalized versions.

Perhaps we’ve hit a numerical issue? Or perhaps it is a bug in one of these implementations? It would be great to benchmark all of them, settle on one, put it in StatsBase, and eliminate the rest.

If you have the time to debug this, please come up with a benchmark; that would be super helpful.
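For reference, the StatsBase version linked above works on a probability vector and takes an optional logarithm base (a small usage sketch of my own, applied to hypothetical normalized histogram weights):

```julia
using StatsBase

# Hypothetical normalized bin probabilities (sum to 1).
p = [0.25, 0.25, 0.5]

h_nats = entropy( p )      # natural log, i.e. nats
h_bits = entropy( p, 2 )   # base-2, i.e. bits

@assert isapprox( h_bits, 1.5 )
@assert isapprox( h_nats, h_bits * log( 2 ) )
```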


#3

Hi @juliohm, thanks for the pointers.
It seems I was making a silly mistake: the entropy() function calculates it in nats, while I was computing it in bits. And since I have a continuous distribution, I have to follow calc_entropy1() (with the bin-width normalization) for the calculations.
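To make the resolution concrete (my own sketch of the unit conversion, assuming Distributions’ entropy returns nats):

```julia
using Distributions

X = Normal( 0, 1 )

h_nats = entropy( X )        # 0.5 * log(2πe) ≈ 1.4189 nats
h_bits = h_nats / log( 2 )   # ≈ 2.0471 bits, in line with calc_entropy1

# Closed form for a standard normal in bits: 0.5 * log2(2πe)
@assert isapprox( h_bits, 0.5 * log( 2, 2 * π * ℯ ) )
```

The remaining gap between 2.0471 and the ~2.06 empirical values is histogram discretization bias, not a bug.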