I have a vector of data sampled from a random process and want to calculate the entropy of the random variable.
I wrote this code:
using Distributions
using StatsBase
function calc_entropy1(x::Vector{Float64})
    N = length(x)
    h = fit(Histogram, x)
    bin_size = step(h.edges[1])
    p = h.weights ./ N
    # Now I have to calculate \sum p_i \log(p_i), but need to normalize
    # the p_i's with the bin width to get densities. So ideally, I should have
    # p_i -> p_i/bin_size and then use this to calculate the entropy. Of course,
    # I also have to multiply by the bin width to emulate the integration,
    # so one of those bin_size's in the denominator cancels.
    return -mapreduce(y -> y == 0 ? 0.0 : y * log2(y / bin_size), +, p)
end
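To make the comment concrete: writing p_i = n_i/N for the bin probabilities and \Delta for the bin width, the Riemann-sum approximation of the differential entropy is

-\sum_i (p_i/\Delta) \log_2(p_i/\Delta) \, \Delta = -\sum_i p_i \log_2(p_i/\Delta),

which is exactly what calc_entropy1 computes.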
function calc_entropy2(x::Vector{Float64})
    N = length(x)
    h = fit(Histogram, x)
    p = h.weights ./ N
    # Here I just calculate \sum p_i \log(p_i), with no normalization
    return -mapreduce(y -> y == 0 ? 0.0 : y * log2(y), +, p)
end
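Since \sum_i p_i = 1, the two estimators differ only by a constant:

calc_entropy1(x) = calc_entropy2(x) + \log_2(\Delta).

This is consistent with the numbers below: the two values coincide where the automatic binning apparently chooses \Delta = 1, and differ by exactly one bit once it chooses \Delta = 0.5.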
X = Normal(0, 1)
e_t = entropy(X)
println("Actual entropy = ", e_t)  # Gives 1.4189385332046727
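For reference, that printed value is in nats: the differential entropy of a Gaussian is \frac{1}{2}\ln(2\pi e\sigma^2), which for \sigma = 1 is \approx 1.4189 nats, i.e. \frac{1}{2}\log_2(2\pi e) \approx 2.0472 bits. If I read the Distributions docs correctly, entropy also accepts a base as a second argument:

entropy(X, 2)  # same entropy in bits, ≈ 2.0472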
Now if I run the empirical calculations, I get:
julia> x = rand(X,10); calc_entropy1( x ), calc_entropy2( x )
(2.4464393446710155, 2.4464393446710155)
julia> x = rand(X,100); calc_entropy1( x ), calc_entropy2( x )
(2.0081915715864564, 2.0081915715864564)
julia> x = rand(X,1000); calc_entropy1( x ), calc_entropy2( x )
(2.1493245852107323, 2.1493245852107323)
julia> x = rand(X,10000); calc_entropy1( x ), calc_entropy2( x )
(2.0558271324757027, 3.0558271324757023)
julia> x = rand(X,100000); calc_entropy1( x ), calc_entropy2( x )
(2.0589125311519716, 3.058912531151971)
julia> x = rand(X,1000000); calc_entropy1( x ), calc_entropy2( x )
(2.0617080562382912, 3.0617080562382917)
julia> x = rand(X,10000000); calc_entropy1( x ), calc_entropy2( x )
(2.062396029108722, 3.0623960291087227)
As you can see, neither calculation converges to the analytical value!
So what is the correct way to estimate entropy from data?
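For completeness, here is an equivalent way of writing the first estimator using StatsBase's histogram normalization (a sketch only; calc_entropy3 is just an illustrative name, and I'm assuming normalize(h, mode=:pdf) scales the bin weights to density values as documented):

using StatsBase, LinearAlgebra

function calc_entropy3(x::Vector{Float64})
    h = normalize(fit(Histogram, x), mode=:pdf)  # weights become density values f_i = p_i/Δ
    Δ = step(h.edges[1])
    # approximate -∫ f(x) log2(f(x)) dx by -Σ_i f_i log2(f_i) Δ
    return -sum(f -> f == 0 ? 0.0 : f * log2(f) * Δ, h.weights)
end

Since f_i = p_i/\Delta, this should agree with calc_entropy1.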
Thanks,
vish