Hello, I am trying to build a method to generically measure feature importance in ML models, but I am getting very inconsistent results that change from run to run, I am very p…off, and I don't understand why.
I am using two methods: mean accuracy decrease and explained variance via the Sobol total index. In both cases I remove a column's information either by randomly shuffling that column (setting it to the column mean doesn't change anything), refitting and measuring, or, for the ML algorithms that support predicting while ignoring columns, by specifying the column to ignore at prediction time.
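(For context, the "explained variance / Sobol" quantity I have in mind is roughly the share of the prediction variance attributable to the removed column, i.e. how much the predictions change, relative to their overall variance, when one column's information is removed. This is just my rough sketch, not necessarily the exact sobol_index implementation in BetaML:)

using Statistics
# rough sketch: share of prediction variance "explained away" when one column is removed
sobol_total_sketch(ŷ_full, ŷ_reduced) = sum(abs2, ŷ_full .- ŷ_reduced) / sum(abs2, ŷ_full .- mean(ŷ_full))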
Still, the results are super random, even though the fit itself is quite good.
Here is the code (it needs BetaML master for the sobol_index and the ignore_dims keyword):
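For completeness, this is the setup I use. TESTRNG below is just a fixed, copyable RNG so that data generation and partitioning are reproducible; I use a StableRNG here, but any seeded AbstractRNG should do:

using Random, LinearAlgebra, Statistics, StableRNGs, BetaML
TESTRNG = StableRNG(123)   # fixed, copyable RNG (my choice; use whatever seeded RNG you prefer)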
# Synthetic data generation
# x1: high importance, x2: little importance, x3: mixed effects with x1, x4: highly correlated with x1 but no effects on Y, x5 and x6: no effects on Y
N = 2000
D = 6
xa = rand(copy(TESTRNG),0:0.0001:10,N,3)
xb = (xa[:,1] .* 2 .* rand(0.8:0.001:1.2)) .+ 10
xc = rand(copy(TESTRNG),0:0.0001:10,N,D-4)
x = hcat(xa,xb,xc)
y = [10*r[1]-r[2]-0.1*r[3]*r[1] for r in eachrow(x) ]
((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.8,0.2],rng=copy(TESTRNG))
# full cols model:
m = RandomForestEstimator(n_trees=300)
fit!(m,xtrain,ytrain)
ŷtest = predict(m,xtest)
loss = norm(ytest-ŷtest)/length(ytest) # this is good
loss_by_cols = zeros(D)
sobol_by_cols = zeros(D)
loss_by_cols2 = zeros(D)
sobol_by_cols2 = zeros(D)
diffest_bycols = zeros(D)
for d in 1:D
    println("Doing modelling without dimension $d ...")
    # (a) permutation importance: shuffle column d (in both train and test) and refit
    xd_train = hcat(xtrain[:,1:d-1], shuffle(xtrain[:,d]), xtrain[:,d+1:end])
    xd_test  = hcat(xtest[:,1:d-1],  shuffle(xtest[:,d]),  xtest[:,d+1:end])
    md = RandomForestEstimator(n_trees=300)
    fit!(md, xd_train, ytrain)
    ŷdtest = predict(md, xd_test)
    loss_by_cols[d]  = norm(ytest-ŷdtest)/length(ytest)
    sobol_by_cols[d] = sobol_index(ŷtest, ŷdtest)
    # (b) same measures, but using the full model and ignoring column d at prediction time
    ŷdtest2 = predict(m, xtest, ignore_dims=d)
    loss_by_cols2[d]  = norm(ytest-ŷdtest2)/length(ytest)
    sobol_by_cols2[d] = sobol_index(ŷtest, ŷdtest2)
    # how different the two "reduced" predictions are from each other
    diffest_bycols[d] = norm(ŷdtest-ŷdtest2)/length(ytest)
end
# Expected importance ranking, i.e. sortperm(loss_by_cols): roughly [5,6,4,3,2,1] (least to most important),
# but that's not what I actually get — the ordering changes from run to run.
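And this is how I look at the rankings afterwards, nothing fancy, just sortperm on the four arrays (lowest loss / lowest index = least important column):

@show sortperm(loss_by_cols)
@show sortperm(sobol_by_cols)
@show sortperm(loss_by_cols2)
@show sortperm(sobol_by_cols2)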