Inconsistent results trying to measure feature importance

Hello, I am trying to build a method to generically measure feature importance in ML models, but I am getting very inconsistent results that depend on the individual run, and I am very p…off and I don’t understand why.

I am using two methods: mean accuracy decrease and explained variance via the Sobol total index. In both cases I compute them either by randomly shuffling the column to omit (setting it to the column mean doesn’t change anything), refitting and measuring, or, for the ML algorithms that support predicting while ignoring columns, by specifying the column to ignore on the full model.
Still, the results are close to random, even though the fit itself is quite good.
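
For clarity, this is roughly what I mean by the Sobol total index here (my own sketch; the exact estimator implemented by BetaML’s sobol_index may normalise differently): the mean squared difference between the full-model predictions and the predictions obtained without a column, relative to the variance of the full-model predictions.

# Rough sketch of the Sobol-style total index I have in mind (not necessarily
# BetaML's exact formula): share of output variance attributable to the
# permuted / ignored column
using Statistics
sobol_total_sketch(ŷfull, ŷreduced) = mean((ŷfull .- ŷreduced).^2) / (2 * var(ŷfull))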

Here is the code (it needs BetaML master for sobol_index and the ignore_dims keyword):

# Packages and RNG used below (BetaML master, as noted above)
using BetaML, StableRNGs
using Random: shuffle
using LinearAlgebra: norm
TESTRNG = StableRNG(123)

# Synthetic data generation
# x1: high importance, x2: little importance, x3: mixed effects with x1, x4: highly correlated with x1 but no effects on Y, x5 and x6: no effects on Y 
N     = 2000
D     = 6
xa    = rand(copy(TESTRNG),0:0.0001:10,N,3) # x1, x2, x3
xb    = (xa[:,1] .* 2 .* rand(0.8:0.001:1.2)) .+ 10 # x4
xc    = rand(copy(TESTRNG),0:0.0001:10,N,D-4) # x5, x6
x     = hcat(xa,xb,xc)  
y     = [10*r[1]-r[2]-0.1*r[3]*r[1] for r in eachrow(x) ]
((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.8,0.2],rng=copy(TESTRNG))


# full cols model:
m = RandomForestEstimator(n_trees=300)
fit!(m,xtrain,ytrain)
ŷtest = predict(m,xtest)
loss = norm(ytest-ŷtest)/length(ytest) # this is good

loss_by_cols  = zeros(D)
sobol_by_cols = zeros(D)
loss_by_cols2  = zeros(D)
sobol_by_cols2 = zeros(D)
diffest_bycols = zeros(D)
for d in 1:D
    println("Doing modelling without dimension $d ....")
    # Approach 1: permute column d in both train and test sets, then refit from scratch
    xd_train = hcat(xtrain[:,1:d-1],shuffle(xtrain[:,d]),xtrain[:,d+1:end])
    xd_test = hcat(xtest[:,1:d-1],shuffle(xtest[:,d]),xtest[:,d+1:end])
    md = RandomForestEstimator(n_trees=300)
    fit!(md,xd_train,ytrain)
    ŷdtest = predict(md,xd_test)
    loss_by_cols[d]  = norm(ytest-ŷdtest)/length(ytest)
    sobol_by_cols[d] = sobol_index(ŷtest,ŷdtest) 
    # Approach 2: keep the full model but ignore dimension d at prediction time
    ŷdtest2 = predict(m,xtest,ignore_dims=d)
    loss_by_cols2[d]  = norm(ytest-ŷdtest2)/length(ytest)
    sobol_by_cols2[d] = sobol_index(ŷtest,ŷdtest2) 
    diffest_bycols[d] = norm(ŷdtest-ŷdtest2)/length(ytest)
end
# Expected order of sortperm(losses), least to most important: ~ [5,6,4,3,2,1], but that's not what I get
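
To be explicit, at the end of the script I just compare the rankings with sortperm on the vectors filled in the loop (least to most important column):

println("permute & refit, loss : ", sortperm(loss_by_cols))
println("ignore_dims,     loss : ", sortperm(loss_by_cols2))
println("permute & refit, sobol: ", sortperm(sobol_by_cols))
println("ignore_dims,     sobol: ", sortperm(sobol_by_cols2))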

I have also tried a DecisionTree.jl model with Shapley values (method and example from the ShapML package), but I have the same problem… often variables without any effect on y are reported as important… in the example below the 5th column, which has no effect whatsoever on y, is reported as important… is the example I have chosen particularly hard for some reason??

using ShapML
import BetaML: partition
using DecisionTree
using DataFrames
using Gadfly  # Plotting
using StableRNGs
using Statistics


# Synthetic data generation
# x1: high importance, x2: little importance, x3: mixed effects with x1, x4: highly correlated with x1 but no effects on Y, x5 and x6: no effects on Y 
TESTRNG = StableRNG(123)
N     = 2000
D     = 6
xa    = rand(copy(TESTRNG),0:0.0001:10,N,3) #  x1, x2, x3
xb    = (xa[:,1] .* 2 .* rand(0.8:0.001:1.2)) .+ 10 # x4
xc    = rand(copy(TESTRNG),0:0.0001:10,N,D-4) # x5, x6
x     = hcat(xa,xb,xc)  
y     = [10*r[1]-r[2]-0.1*r[3]*r[1] for r in eachrow(x) ]
((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.8,0.2],rng=copy(TESTRNG))


# full cols model:
m = DecisionTreeRegressor()
fit!(m, xtrain, ytrain)
predict(m, xtrain)


# Prediction wrapper required by ShapML: takes the model and a DataFrame,
# returns a one-column DataFrame with the predictions
function predict_function(model, data)
  data_pred = DataFrame(y_pred = predict(model, Matrix(data)))
  return data_pred
end

explain   = DataFrame(x[1:300, :],:auto)  # instances whose predictions we want to explain
reference = DataFrame(x,:auto)            # background data for the Monte Carlo sampling

sample_size = 60  # Number of Monte Carlo samples.
#------------------------------------------------------------------------------
# Compute stochastic Shapley values.
data_shap = ShapML.shap(explain = explain,
                        reference = reference,
                        model = m,
                        predict_function = predict_function,
                        sample_size = sample_size,
                        seed = 1
                        )

show(data_shap, allcols = true)

# mean absolute Shapley effect per feature
data_plot = combine(groupby(data_shap, [:feature_name])) do subdf
    (mean_effect = mean(abs.(subdf.shap_effect)),)
end

data_plot = sort(data_plot, order(:mean_effect, rev = true))
baseline = round(data_shap.intercept[1], digits = 1)
p = plot(data_plot, y = :feature_name, x = :mean_effect, Coord.cartesian(yflip = true),
         Scale.y_discrete, Geom.bar(position = :dodge, orientation = :horizontal),
         Theme(bar_spacing = 1mm),
         Guide.xlabel("|Shapley effect| (baseline = $baseline)"), Guide.ylabel(nothing),
         Guide.title("Feature Importance - Mean Absolute Shapley Value"))
p

[Plot: feature importance as mean absolute Shapley value; x5 shows up among the important features]

Can’t help with your code, but perhaps you can use some ideas from my work from centuries ago:
Back in the day, I used Shannon’s entropy, calculated for every connection between two neurons: remove the least important connections one by one until the error becomes too big, run an incremental learning step, recalculate the entropies, and so on, until the network is no longer able to reproduce the learning set, which means the network has become too small for the task. With that I ended up with the smallest network that was still able to reproduce the problem.
It was about recurrent neural networks learning to reproduce nonlinear chemical reactions like the Belousov–Zhabotinsky reaction.
To analyze the NN usefully it needs to be minimal, hence the reduction with Shannon’s entropy, but I also tried to linearize the NN (a linear step function for each neuron), which also worked surprisingly well (within some limits). The idea here is that the typical step function is nearly linear around x=0, and if the input values of a neuron always stay in this linear region, the nonlinear step function can be replaced by a linear function with the same gradient at x=0 as the original nonlinear one (this may only apply to recurrent neural networks when they are near their stable attractor).
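A tiny numerical illustration of that point (with tanh standing in for the activation; this is not code from the original work): near x = 0 the function is practically indistinguishable from its tangent line.

f(x)    = tanh(x)      # typical sigmoidal "step"-like activation
flin(x) = x            # its linearisation at x = 0 (the gradient there is 1)
xs      = -0.3:0.01:0.3
maximum(abs.(f.(xs) .- flin.(xs)))   # ≈ 0.009, i.e. nearly linear in this range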
There were some caveats and it wasn’t as trivial as it sounds here, but I don’t remember the details. And I don’t know whether the above idea is suitable for today’s large networks.
The work is only available in German: “Neuronale Netzwerke mit Rückkopplung und nichtlineare chemische Reaktionen” (“Neural networks with feedback and nonlinear chemical reactions”). (If interested, I hope I can provide the text; it was written in LaTeX, but I don’t know if I still have it on my hard disk :grimacing: )


Thank you for your contribution, but it’s a bit too specific for my needs… I’m trying to understand whether a generic approach to “feature importance” estimation for black-box ML models is actually feasible… all the examples I found online are either not reproducible, or they are not based on synthetic data for which we actually know how important each feature is for the label and where some variables are correlated or have an interaction component, as in my example.


Ah, ok. That is an exclusion criterion for my approach.

Crazy, the issue was the StableRNG random seed!!!
EDIT: it’s me, I am an idi… to get reproducible results I went too far with reproducibility and used the same seed for x1, x2, x3 and for x5, x6… argh… 2 days wasted this way… :sob: :sob:
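
In other words, copy(TESTRNG) restarts the generator from the same state for both xa and xc, so x5 and x6 come out identical to x1 and x2. A minimal fix is to draw everything from a single RNG stream (or use distinct seeds), e.g.:

rng = StableRNG(123)
xa  = rand(rng, 0:0.0001:10, N, 3)                        # x1, x2, x3
xb  = (xa[:,1] .* 2 .* rand(rng, 0.8:0.001:1.2)) .+ 10    # x4
xc  = rand(rng, 0:0.0001:10, N, D-4)                      # x5, x6 — no longer copies of x1, x2
x   = hcat(xa, xb, xc)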
