Calculating Entropy from data

statistics

#1

I have a vector of data sampled from a random process and want to calculate the entropy of the random variable.
I wrote this code

using Distributions
using StatsBase

function calc_entropy1( x::Vector{Float64} )
    N        = length( x );
    h        = fit( Histogram, x );
    bin_size = step( h.edges[1] );
    p        = h.weights ./ N;
    # Now I have to calculate \sum p_i \log(p_i), but need to normalize
    # the p_i s with the bin length to get densities. So ideally, I should have
    # p_i -> p_i/bin_length and then use this to calculate entropy. Of course
    # I have to multiply with the bin length to emulate the integration,
    # so one of those 'bin_length's in the denominator will go away
    return -mapreduce( y -> (y == 0) ? 0.0 : y * log( 2, y / bin_size ), +, p );
end

function calc_entropy2( x::Vector{Float64} )
    N        = length( x );
    h        = fit( Histogram, x );
    p        = h.weights ./ N;
    # Now I have to calculate \sum p_i \log(p_i), with no normalization
    return -mapreduce( y -> (y == 0) ? 0.0 : y * log( 2, y ), +, p );
end

X   = Normal( 0, 1 );
e_t = entropy( X );
println( "Actual entropy = ", e_t ); # Gives 1.4189385332046727
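As a side check (my own sketch, not part of the original post), the bin-width normalization in calc_entropy1 only shifts the discrete entropy by a constant log2(bin_size), since the p_i sum to 1:

```julia
# Toy bin probabilities and bin width (hypothetical values, just to
# illustrate the identity; any valid histogram would do).
p = [0.2, 0.3, 0.5]
Δ = 0.1

# Discrete entropy in bits, and the density-normalized version.
h_disc = -sum( q * log( 2, q )     for q in p )
h_diff = -sum( q * log( 2, q / Δ ) for q in p )

# -Σ p_i log2(p_i/Δ) = -Σ p_i log2(p_i) + log2(Δ) * Σ p_i
@assert isapprox( h_diff, h_disc + log( 2, Δ ) )
```

So the two estimators should differ by exactly log2 of the bin width, which explains the constant offset between them in the runs below.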

Now if I run the empirical calculations, I get

julia> x = rand(X,10); calc_entropy1( x ), calc_entropy2( x )
(2.4464393446710155, 2.4464393446710155)

julia> x = rand(X,100); calc_entropy1( x ), calc_entropy2( x )
(2.0081915715864564, 2.0081915715864564)

julia> x = rand(X,1000); calc_entropy1( x ), calc_entropy2( x )
(2.1493245852107323, 2.1493245852107323)

julia> x = rand(X,10000); calc_entropy1( x ), calc_entropy2( x )
(2.0558271324757027, 3.0558271324757023)

julia> x = rand(X,100000); calc_entropy1( x ), calc_entropy2( x )
(2.0589125311519716, 3.058912531151971)

julia> x = rand(X,1000000); calc_entropy1( x ), calc_entropy2( x )
(2.0617080562382912, 3.0617080562382917)

julia> x = rand(X,10000000); calc_entropy1( x ), calc_entropy2( x )
(2.062396029108722, 3.0623960291087227)

As you can see, both calculations are wrong!
So what is the correct way to calculate entropy from data?

Thanks,
vish


#2

I didn’t look carefully at the code, but I think we need to converge on a single, standard, stable implementation of entropy. I wrote one a long time ago in Images.jl for N-dimensional grayscale images: https://github.com/JuliaImages/Images.jl/blob/master/src/algorithms.jl#L94-L139

StatsBase also has one here: https://github.com/JuliaStats/StatsBase.jl/blob/master/src/scalarstats.jl#L390-L414 They also have other, more generalized versions.

Perhaps we’ve hit a numerical issue? Or perhaps it is a bug in one of these implementations? It would be great to benchmark all of them, settle on one, put it in StatsBase, and eliminate the rest.

If you have the time to debug this, please come up with a benchmark; that would be super helpful.
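For reference, the StatsBase version linked above works on a probability vector and takes an optional logarithm base (a small usage sketch of my own, applied to hypothetical normalized histogram weights):

```julia
using StatsBase

# Hypothetical normalized bin probabilities (sum to 1).
p = [0.25, 0.25, 0.5]

h_nats = entropy( p )      # natural log, i.e. nats
h_bits = entropy( p, 2 )   # base-2, i.e. bits

@assert isapprox( h_bits, 1.5 )
@assert isapprox( h_nats, h_bits * log( 2 ) )
```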


#3

Hi @juliohm, thanks for the pointers.
It seems I was making a silly mistake: the entropy() function calculates it in nats, while I was computing it in bits. And since I have a continuous distribution, I have to follow calc_entropy1() (with the bin-width normalization) for the calculations.
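To make the resolution concrete (my own sketch of the unit conversion, assuming Distributions’ entropy returns nats):

```julia
using Distributions

X = Normal( 0, 1 )

h_nats = entropy( X )        # 0.5 * log(2πe) ≈ 1.4189 nats
h_bits = h_nats / log( 2 )   # ≈ 2.0471 bits, in line with calc_entropy1

# Closed form for a standard normal in bits: 0.5 * log2(2πe)
@assert isapprox( h_bits, 0.5 * log( 2, 2 * π * ℯ ) )
```

The remaining gap between 2.0471 and the ~2.06 empirical values is histogram discretization bias, not a bug.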