Entropy calculation from data - comparison with matlab

v-i-s-h · August 22, 2017, 6:07am

I want to calculate the entropy of a continuous random variable from a set of observations.

First I wrote the following code in Julia:

using Distributions
using StatsBase
using PyPlot

function calc_entropy( x::Vector{Float64} )
    N       = length( x );
    h       = fit( Histogram, x );
    bin_size    = h.edges[1].step.hi;
    p       = h.weights ./ N;
    # H(X) = ∫f(x) log(f(x)) dx ≈ ∑ p(x)/δ log( p(x)/δ ) δ
    sum     = 0.00;
    for y ∈ p
        if y > 0.00
            sum += y*log(y/bin_size)
        end
    end
    return -sum;
end

X   = Normal( 0, 1 );
e_t = entropy( X );

n_samples   = 50:50:20000;
n_iter      = 200;

entropies   = zeros( length(n_samples) );
for i = 1:length(n_samples)
    for iter = 1:n_iter
        x       = rand( X, n_samples[i] );
        e_cal   = calc_entropy( x );    # Empirical Entropy
        err     = (e_t-e_cal)^2;
        entropies[i] += e_cal;
    end
end

# Calculate avg MSE
entropies ./=n_iter;

figure();
plot( n_samples, entropies, "b-", label = "empirical entropy" );
plot( n_samples, e_t*ones(length(n_samples)), "b--", label = "theoretical" );
xlabel( "#Samples" );
ylabel( "H(X)" );
legend();   grid();

nothing

and got this result:

Here you can see that there is some numerical instability happening at a x = 2750, x = 7700, x = 16250 etc.,

Now recreated the whole experiment in matlab with the code

% ----------- File : calc_entropy.m ----------------------
function [ emp_entropy ] = calc_entropy( samples )
%CALC_ENTROPY 
    [ w, e ] = histcounts( samples );
    bin_size    = e(2)-e(1);
    w = w ./ sum(w);
    emp_entropy = 0.00;
    for i = 1:length(w)
        if( w(i) > 0.0 )
            emp_entropy = emp_entropy + (w(i) * log(w(i)/bin_size));
        end
    end
    emp_entropy = -1 * emp_entropy;
end
% ------------ End of calc_entropy.m ---------------------

% ----------------- File: main.m -------------------------
% Calculate the entropy with respect to sample size

X_mu    = 0;
X_sigma = 1;

% Theoretical entropy -- in nats
H_theoretical   = 1/2 * log( 2 * pi * exp(1) * (X_sigma^2) );

% Do Monte-Carlo
n_samples   = 50:50:20000;
n_iter      = 200;

H_empirical     = zeros( length(n_samples), 1 );
for n_sample = 1:length(n_samples)
    for i = 1:n_iter
        x = X_sigma * randn(n_sample,1) + X_mu; % Create samples
        h = calc_entropy( x );
        H_empirical(n_sample,1) = H_empirical(n_sample,1) + h;
    end
end
H_empirical = H_empirical ./ n_iter;

figure();
hold on;
plot( n_samples, H_theoretical * ones(1,length(n_samples)), 'b--', 'DisplayName', 'Theoretical' );
plot( n_samples, H_empirical, 'b-', 'DisplayName', 'Empirical' );
grid on;
xlabel( '#Samples' );
ylabel( 'H(X)' );
legend( 'show' );
% ----------------- End of main.m ------------------------

And the result looks like:

Here it’s smooth with no jumps in graph and is converging to the actual entropy.

But the julia code, the convergence is not there (the tail goes parallel to theoretical, possibly numerical problem).

Any thoughts on why this is happening is welcome.

Another thing I found is the run time.
In Julia (after running it 2/3 times so that pre compilations are over),

julia> @time include( "main.jl" )
 91.812708 seconds (812.40 M allocations: 18.993 GiB, 2.80% gc time)

But in Matlab,

>> tic; main; toc
Elapsed time is 4.372108 seconds.

That’s a huge difference!! How can I speed up my julia code??

Thanks,
vish

Tamas_Papp · August 22, 2017, 7:14am

https://docs.julialang.org/en/latest/manual/performance-tips/

Also, to make the examples comparable, factor out the function that works on the counts. Are you actually sure that histograms are calculated and interpreted the same way for both pieces of code?

piever · August 22, 2017, 9:47am

OT but potentially helpful style advice:

; are not needed in Julia
it’s a bit strange to call a variable sum, as you are overwriting a function that does what you want. For example

    sum     = 0.00;
    for y ∈ p
        if y > 0.00
            sum += y*log(y/bin_size)
        end
    end
    return -sum;

could be replaced by the far more readable:

-sum(y*log(y/bin_size) for y in p if y > 0)

Again OT, I’m also unsure that a histogram is the best way to estimate entropy, maybe approximating the density in a different way could be more helpful (for example kde ).

As for the difference of the two outputs, check that Julia and Matlab are using the same number of bins. As far as I know Julia uses sturges rule, which makes the number of bins grow very slowly. Maybe run a test where you give the number of bins by hand and compare the results again.

Topic		Replies	Views
Calculating Entropy from data Statistics statistics	2	3084	August 22, 2017
A simple Monte Carlo in Julia and C++ New to Julia question	15	6894	March 15, 2018
[ANN] EntropyHub.jl Package Announcements package , announcement , time-series , math , nonlinear	11	1569	July 14, 2021
Quite bad performance of Julia 0.6.4 vs Python+Numpy General Usage	26	5332	November 13, 2018
Why is this Julia code considerably slower than Matlab New to Julia performance	64	8631	March 5, 2017

Entropy calculation from data - comparison with matlab

Related topics