Inconsistent results trying to measure feature importance

Hello, I am trying to build a method to generically measure feature importance in ML models, but I am getting very inconsistent results that depend on the individual run, and I am very p…off and I don’t understand why.

I am using two methods: mean accuracy decrease and explained variance using the Sobol total index. In both cases, I compute them either by randomly shuffling the specific column to omit (setting it to the column mean doesn’t change anything), refitting and measuring, or, for the ML algorithms that support predicting while ignoring columns, by specifying the column to ignore.
Still, the results are super random, even though the fit itself is quite good.
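For reference, here is a minimal sketch of how such an explained-variance (“Sobol-style”) index could be computed from the full-model and reduced-model predictions — this is my own illustrative estimator, not necessarily the formula that BetaML’s `sobol_index` actually uses:

```julia
using Statistics

# Illustrative "share of variance no longer explained" when a dimension is
# dropped: mean squared difference between the two prediction vectors,
# normalised by the variance of the full-model predictions.
sobol_total_approx(ŷ, ŷ₋d) = mean((ŷ .- ŷ₋d).^2) / var(ŷ)
```

With identical prediction vectors the index is 0; the more the reduced model’s predictions diverge from the full model’s, the larger it gets.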

Here is the code (it needs BetaML master for `sobol_index` and the `ignore_dims` keyword):

using BetaML, StableRNGs
using LinearAlgebra: norm
using Random: shuffle
TESTRNG = StableRNG(123)

# Synthetic data generation
# x1: high importance, x2: little importance, x3: mixed effects with x1, x4: highly correlated with x1 but no effects on Y, x5 and x6: no effects on Y 
N     = 2000
D     = 6
xa    = rand(copy(TESTRNG),0:0.0001:10,N,3)
xb    = (xa[:,1] .* 2 .* rand(0.8:0.001:1.2)) .+ 10 
xc    = rand(copy(TESTRNG),0:0.0001:10,N,D-4)
x     = hcat(xa,xb,xc)  
y     = [10*r[1]-r[2]-0.1*r[3]*r[1] for r in eachrow(x) ]
((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.8,0.2],rng=copy(TESTRNG))

# full cols model:
m = RandomForestEstimator(n_trees=300)
fit!(m, xtrain, ytrain)
ŷtest = predict(m,xtest)
loss = norm(ytest-ŷtest)/length(ytest) # this is good

loss_by_cols  = zeros(D)
sobol_by_cols = zeros(D)
loss_by_cols2  = zeros(D)
sobol_by_cols2 = zeros(D)
diffest_bycols = zeros(D)
for d in 1:D
    println("Doing modelling without dimension $d ....")
    xd_train = hcat(xtrain[:,1:d-1],shuffle(xtrain[:,d]),xtrain[:,d+1:end])
    xd_test = hcat(xtest[:,1:d-1],shuffle(xtest[:,d]),xtest[:,d+1:end])  
    md = RandomForestEstimator(n_trees=300)
    fit!(md, xd_train, ytrain)
    ŷdtest = predict(md,xd_test)
    loss_by_cols[d]  = norm(ytest-ŷdtest)/length(ytest)
    sobol_by_cols[d] = sobol_index(ŷtest,ŷdtest) 
    ŷdtest2 = predict(m,xtest,ignore_dims=d)
    loss_by_cols2[d]  = norm(ytest-ŷdtest2)/length(ytest)
    sobol_by_cols2[d] = sobol_index(ŷtest,ŷdtest2) 
    diffest_bycols[d] = norm(ŷdtest-ŷdtest2)/length(ytest)
end
# Expected order of sortperm([losses]): ~ [5,6,4,3,2,1], but not actually

I have also tried using a DecisionTree.jl model and Shapley values (method and example from the ShapML package), but I have the same problem… often variables without any effect on y are reported as important. In the example below the 5th column, which has no effect whatsoever on y, is reported as important… is the example I have chosen particularly hard for some reason?

using ShapML
import BetaML: partition
using DecisionTree
using DataFrames
using Gadfly  # Plotting
using StableRNGs
using Statistics

# Synthetic data generation
# x1: high importance, x2: little importance, x3: mixed effects with x1, x4: highly correlated with x1 but no effects on Y, x5 and x6: no effects on Y 
TESTRNG = StableRNG(123)
N     = 2000
D     = 6
xa    = rand(copy(TESTRNG),0:0.0001:10,N,3) #  x1, x2, x3
xb    = (xa[:,1] .* 2 .* rand(0.8:0.001:1.2)) .+ 10 # x4
xc    = rand(copy(TESTRNG),0:0.0001:10,N,D-4) # x5, x6
x     = hcat(xa,xb,xc)  
y     = [10*r[1]-r[2]-0.1*r[3]*r[1] for r in eachrow(x) ]
((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.8,0.2],rng=copy(TESTRNG))

# full cols model:
m = DecisionTreeRegressor()
fit!(m, xtrain, ytrain)
predict(m, xtrain)

function predict_function(model, data)
  data_pred = DataFrame(y_pred = predict(model, Matrix(data)))
  return data_pred
end

explain   = DataFrame(x[1:300, :],:auto) 
reference = DataFrame(x,:auto) 

sample_size = 60  # Number of Monte Carlo samples.
# Compute stochastic Shapley values.
data_shap = ShapML.shap(explain = explain,
                        reference = reference,
                        model = m,
                        predict_function = predict_function,
                        sample_size = sample_size,
                        seed = 1)

show(data_shap, allcols = true)

data_plot = combine(groupby(data_shap, [:feature_name])) do subdf
            (mean_effect = mean(abs.(subdf.shap_effect)),)
end

data_plot = sort(data_plot, order(:mean_effect, rev = true))
baseline = round(data_shap.intercept[1], digits = 1)
p = plot(data_plot, y = :feature_name, x = :mean_effect, Coord.cartesian(yflip = true),
         Scale.y_discrete, Geom.bar(position = :dodge, orientation = :horizontal),
         Theme(bar_spacing = 1mm),
         Guide.xlabel("|Shapley effect| (baseline = $baseline)"), Guide.ylabel(nothing),
         Guide.title("Feature Importance - Mean Absolute Shapley Value"))


Can’t help with your code, but perhaps you can use some ideas from my work centuries ago:
Back in the day, I used Shannon’s entropy calculated for every connection between two neurons, removed the least important connections in order until the error became too big, ran an incremental learning step, recalculated the entropies, … and so on, until the network wasn’t able to reproduce the learning set anymore, which meant the network was too small for the task. With that I ended up with the smallest network that was still able to reproduce the problem.
It was about recurrent neural networks learning to reproduce nonlinear chemical reactions like the Belousov–Zhabotinsky reaction.
To be analyzable the NN needs to be minimal, hence the reduction with Shannon’s entropy; but I also tried to linearize the NN (a linear step function for each neuron), which also worked surprisingly well (within some limits). The idea here is that the typical step function is nearly linear around x=0, and if the input values of a neuron are always in this linear area, the nonlinear step function can be replaced by a linear function with the same gradient as the original nonlinear one at x=0 (this may only apply to recurrent neural networks when they are near their stable attractor).
There were some caveats and it wasn’t as trivial as it sounds here, but I don’t remember the details. And I don’t know if the above idea is suitable for today’s large networks.
The work is only available in German: “Neuronale Netzwerke mit Rückkopplung und nichtlineare chemische Reaktionen” (“Recurrent Neural Networks and Nonlinear Chemical Reactions”). If interested, I hope I can provide the text; it was LaTeX, but I don’t know if I still have it on my hard disk :grimacing:
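The entropy part of the pruning loop described above could be sketched roughly like this — a toy illustration of mine, not the original method: estimate the Shannon entropy of the signal a connection carries (here simply a vector of values) via histogram binning, so that near-constant, low-entropy connections become pruning candidates:

```julia
# Toy sketch: Shannon entropy (in bits) of a vector of values, estimated by
# histogram binning. A connection whose signal is near-constant has entropy
# close to zero and would be pruned first.
function signal_entropy(values; nbins = 16)
    lo, hi = extrema(values)
    hi == lo && return 0.0                      # constant signal → zero entropy
    counts = zeros(Int, nbins)
    for v in values
        b = min(nbins, 1 + floor(Int, (v - lo) / (hi - lo) * nbins))
        counts[b] += 1
    end
    p = counts ./ length(values)
    return -sum(pᵢ * log2(pᵢ) for pᵢ in p if pᵢ > 0)
end
```

A pruning loop would then repeatedly remove the connection with the lowest entropy, run a retraining step, and stop once the reconstruction error becomes too large.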


Thank you for your contribution, but it’s a bit too specific for my needs… I’m trying to understand whether a generic approach to “feature importance” estimation for black-box ML models is actually feasible… all the examples I found online are either not reproducible, or they are not based on synthetic data for which we actually know the importance of the features for the label and in which some variable is correlated with another or has an interaction component, as in my example.


Ah, ok. That is an exclusion criterion for my approach.

Crazy, the issue was the StableRNG random seed!!!
EDIT: it’s me, I am an idi… in trying to get reproducible results I made them too reproducible and used the same seed for x1, x2, x3 and for x5, x6… argh… 2 days wasted this way… :sob: :sob:
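For anyone landing here with the same symptom: the fix is to draw all random columns from a single RNG stream instead of copying the same seeded RNG for each block, so x5 and x6 are no longer replicas of x1 and x2. A minimal sketch of the corrected data generation:

```julia
using StableRNGs

N, D = 2000, 6
rng  = StableRNG(123)                     # one stream, advanced by each call
xa   = rand(rng, 0:0.0001:10, N, 3)       # x1, x2, x3
xb   = (xa[:,1] .* 2 .* rand(rng, 0.8:0.001:1.2)) .+ 10   # x4
xc   = rand(rng, 0:0.0001:10, N, D-4)     # x5, x6 — now independent of xa
x    = hcat(xa, xb, xc)
```

With `copy(TESTRNG)` each block restarted from the same state, so `xc[:,1] == xa[:,1]` and the “irrelevant” columns were perfect copies of the relevant ones.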