Entropy calculation from data - comparison with matlab

statistics

#1

I want to calculate the entropy of a continuous random variable from a set of observations.

First I wrote the following code in Julia:

using Distributions
using StatsBase
using PyPlot

function calc_entropy( x::Vector{Float64} )
    N       = length( x );
    h       = fit( Histogram, x );
    bin_size    = h.edges[1].step.hi;
    p       = h.weights ./ N;
    # H(X) = ∫f(x) log(f(x)) dx ≈ ∑ p(x)/δ log( p(x)/δ ) δ
    sum     = 0.00;
    for y ∈ p
        if y > 0.00
            sum += y*log(y/bin_size)
        end
    end
    return -sum;
end

X   = Normal( 0, 1 );
e_t = entropy( X );

n_samples   = 50:50:20000;
n_iter      = 200;

entropies   = zeros( length(n_samples) );
for i = 1:length(n_samples)
    for iter = 1:n_iter
        x       = rand( X, n_samples[i] );
        e_cal   = calc_entropy( x );    # Empirical Entropy
        err     = (e_t-e_cal)^2;
        entropies[i] += e_cal;
    end
end

# Calculate avg MSE
entropies ./=n_iter;

figure();
plot( n_samples, entropies, "b-", label = "empirical entropy" );
plot( n_samples, e_t*ones(length(n_samples)), "b--", label = "theoretical" );
xlabel( "#Samples" );
ylabel( "H(X)" );
legend();   grid();

nothing

and got this result:

Here you can see that there is some numerical instability happening at a x = 2750, x = 7700, x = 16250 etc.,

Now recreated the whole experiment in matlab with the code

% ----------- File : calc_entropy.m ----------------------
function [ emp_entropy ] = calc_entropy( samples )
%CALC_ENTROPY 
    [ w, e ] = histcounts( samples );
    bin_size    = e(2)-e(1);
    w = w ./ sum(w);
    emp_entropy = 0.00;
    for i = 1:length(w)
        if( w(i) > 0.0 )
            emp_entropy = emp_entropy + (w(i) * log(w(i)/bin_size));
        end
    end
    emp_entropy = -1 * emp_entropy;
end
% ------------ End of calc_entropy.m ---------------------

% ----------------- File: main.m -------------------------
% Calculate the entropy with respect to sample size

X_mu    = 0;
X_sigma = 1;

% Theoretical entropy -- in nats
H_theoretical   = 1/2 * log( 2 * pi * exp(1) * (X_sigma^2) );

% Do Monte-Carlo
n_samples   = 50:50:20000;
n_iter      = 200;

H_empirical     = zeros( length(n_samples), 1 );
for n_sample = 1:length(n_samples)
    for i = 1:n_iter
        x = X_sigma * randn(n_sample,1) + X_mu; % Create samples
        h = calc_entropy( x );
        H_empirical(n_sample,1) = H_empirical(n_sample,1) + h;
    end
end
H_empirical = H_empirical ./ n_iter;

figure();
hold on;
plot( n_samples, H_theoretical * ones(1,length(n_samples)), 'b--', 'DisplayName', 'Theoretical' );
plot( n_samples, H_empirical, 'b-', 'DisplayName', 'Empirical' );
grid on;
xlabel( '#Samples' );
ylabel( 'H(X)' );
legend( 'show' );
% ----------------- End of main.m ------------------------

And the result looks like:

Here it’s smooth with no jumps in graph and is converging to the actual entropy.

But the julia code, the convergence is not there (the tail goes parallel to theoretical, possibly numerical problem).

Any thoughts on why this is happening is welcome.

Another thing I found is the run time.
In Julia (after running it 2/3 times so that pre compilations are over),

julia> @time include( "main.jl" )
 91.812708 seconds (812.40 M allocations: 18.993 GiB, 2.80% gc time)

But in Matlab,

>> tic; main; toc
Elapsed time is 4.372108 seconds.

That’s a huge difference!! How can I speed up my julia code??

Thanks,
vish


#2

https://docs.julialang.org/en/latest/manual/performance-tips/

Also, to make the examples comparable, factor out the function that works on the counts. Are you actually sure that histograms are calculated and interpreted the same way for both pieces of code?


#3

OT but potentially helpful style advice:

  • ; are not needed in Julia
  • it’s a bit strange to call a variable sum, as you are overwriting a function that does what you want. For example
    sum     = 0.00;
    for y ∈ p
        if y > 0.00
            sum += y*log(y/bin_size)
        end
    end
    return -sum;

could be replaced by the far more readable:

-sum(y*log(y/bin_size) for y in p if y > 0)

Again OT, I’m also unsure that a histogram is the best way to estimate entropy, maybe approximating the density in a different way could be more helpful (for example kde ).

As for the difference of the two outputs, check that Julia and Matlab are using the same number of bins. As far as I know Julia uses sturges rule, which makes the number of bins grow very slowly. Maybe run a test where you give the number of bins by hand and compare the results again.