Hello, I ran a small benchmark of missing-value imputation: Julia's BetaML (predictMissing, which uses GMM, plus a custom random-forest imputation), scikit-learn's SimpleImputer (SI), IterativeImputer (II) and KNNImputer (KK), and R's mice, using the default settings for each.
Below are the results, the code and the whole folder for replication.
The datasets are either real data (from Eurostat) or standard datasets available in R/Python; on each repetition I set a given share of the cells to missing at random and then compared the imputed values with the true ones.
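The masking-and-scoring procedure can be sketched as follows (Python/NumPy here purely for illustration; the actual benchmark below is in Julia, and the names mask_at_random and scores are mine):

```python
import numpy as np

def mask_at_random(X, share=0.05, seed=0):
    """Return a copy of X with `share` of its cells set to NaN,
    plus the (row, col) indices of the masked cells."""
    rng = np.random.default_rng(seed)
    Xm = X.astype(float).copy()
    n_miss = int(round(X.size * share))
    flat = rng.permutation(X.size)[:n_miss]
    rows, cols = np.unravel_index(flat, X.shape)
    Xm[rows, cols] = np.nan
    return Xm, rows, cols

def scores(imputed, true_vals):
    """The two error measures reported below: l-2 norm of the error
    vector and an element-wise mean relative error (BetaML's
    meanRelError may use a slightly different definition)."""
    err = np.linalg.norm(imputed - true_vals)
    mre = np.mean(np.abs(imputed - true_vals) / np.abs(true_vals))
    return err, mre
```

Each imputer is then run on the masked matrix and scored on the masked cells only; for example, the FeatureMeans baseline is simply `np.nanmean(Xm, axis=0)[cols]`.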
I am not as happy with my BetaML as I would have liked… the new ("experimental") SKL II is often (though not always) more precise than BetaML and, above all, always much faster :-/
BetaML however seems faster and more precise than mice with random forests (when the latter works: across the different imputation algorithms I tried, mice seems to have many more cases where it fails to impute).
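For context on why SKL II does well: scikit-learn's IterativeImputer regresses each feature that has missing values on the remaining features, cycling over the features for several rounds (essentially the MICE idea with a single imputation). A minimal round-robin version with plain least squares (NumPy only, a toy stand-in, not the real scikit-learn implementation):

```python
import numpy as np

def round_robin_impute(Xm, n_iter=10):
    """Toy iterative imputation: start from column means, then repeatedly
    regress each column that has missing cells on all the other columns
    and refresh the imputed values with the regression predictions."""
    X = Xm.copy()
    miss = np.isnan(Xm)
    mr, mc = np.where(miss)
    X[mr, mc] = np.nanmean(Xm, axis=0)[mc]              # initial fill: column means
    n, p = X.shape
    for _ in range(n_iter):
        for j in range(p):
            rows = miss[:, j]
            if not rows.any():
                continue
            others = [k for k in range(p) if k != j]
            A = np.column_stack([np.ones(n), X[:, others]])  # design matrix with intercept
            beta, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            X[rows, j] = A[rows] @ beta                      # refresh the imputations
    return X
```

On data whose columns are strongly (near-)linearly related, even this sketch can beat column means by a wide margin, which is consistent with the SKL_II rows in the tables below.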
The results
Dataset #1:
*** Missing imputation error:
Model seconds l-2 norm Mean relative error
FeatureMeans 0.0 596.6448409053057 1.002319122696992
DoubleMeans 0.0 391.3604130836722 0.4992976985073335
WeightedFeatureMeans 0.0002 185.54305731037456 0.24713646528063324
BetaML_GMM 0.0045 407.0153362387738 0.5848515520904161
BetaML_RF 0.5695 357.3661669306017 0.4124802945377765
SKL_SI 0.0013 596.6448409053057 1.002319122696992
SKL_II 0.2994 117.64702366023792 0.22315626247905915
SKL_KK 0.0068 382.00934994066654 0.3914208771982257
R_Mice 0.6232 458.2614093467284 0.5925242903090201
Dataset #2:
*** Missing imputation error:
Model seconds l-2 norm Mean relative error
FeatureMeans 0.0 903078.9798906296 1.25318222691199
DoubleMeans 0.0003 861754.4184396429 0.8043798346561916
WeightedFeatureMeans 0.0001 762695.5849791698 0.7154890655613636
BetaML_GMM 0.0048 938069.2558905445 0.999509592226039
BetaML_RF 0.6378 731256.0714149184 0.6282147232408608
SKL_SI 0.0012 903078.9798906296 1.25318222691199
SKL_II 0.2356 464588.8941069747 0.36446355762622834
SKL_KK 0.0066 818916.3868256389 0.7239568628528212
R_Mice 0.8301 927755.5437161991 1.0087824960054428
Dataset #5:
*** Missing imputation error:
Model seconds l-2 norm Mean relative error
FeatureMeans 0.0 143.7332322239149 0.35907456781270847
DoubleMeans 0.0001 246.59598230195132 0.629338588720551
WeightedFeatureMeans 0.0001 231.5981128929205 0.5843971246956442
BetaML_GMM 0.0521 150.66354278820586 0.34582233638053206
BetaML_RF 0.5843 120.33788747811619 0.2652753264263067
SKL_SI 0.0013 143.7332322239149 0.35907456781270847
SKL_II 0.0579 123.00558694040816 0.27674480047151734
SKL_KK 0.0042 124.55466487677543 0.27205170841370924
R_Mice 0.2425 196.63804019966136 0.4258825776275952
Dataset #3:
*** Missing imputation error:
Model seconds l-2 norm Mean relative error
FeatureMeans 0.0142 1.8487052538983752e6 1.0018178079202642
DoubleMeans 0.0346 2.0663045913534556e6 0.9269925535629163
WeightedFeatureMeans 0.0624 1.168453036458433e6 0.6236022143512215
BetaML_GMM 0.0768 1.835235185954893e6 0.9737602166227163
BetaML_RF 81.756 772775.0617528623 0.36588450784615
SKL_SI 0.0082 1.8487052538983752e6 1.0018178079202642
SKL_II 8.0322 180564.47240994236 0.08747282957993283
SKL_KK 0.0998 818424.1889356236 0.3898851587897578
R_Mice 9.3572 NA NA
The code
cd(@__DIR__)
using Pkg
Pkg.activate(".")
#Pkg.add("BetaML")
#Pkg.add("XLSX")
#Pkg.add("PyCall")
#Pkg.add("RCall")
#Pkg.add("DataFrames")
using XLSX, BetaML, Statistics, LinearAlgebra, PyCall, RCall, Dates, DataFrames
# Loading some datasets..
iDataFilename = "missing_forest_product_example2.xlsx" # From Eurostat
expValue = XLSX.readdata(iDataFilename,"Sheet 1","A11:BE46" )
replace!(expValue,":" => missing)
expValue = expValue[[1;5:end],[1;18:2:49]] # years 2000-2015
expQ = XLSX.readdata(iDataFilename,"Sheet 2","A11:BE46" )
replace!(expQ,":" => missing)
expQ = expQ[[1;5:end],[1;18:2:49]]
expP = similar(expQ)
expP[1,:] = expQ[1,:]
expP[:,1] = expQ[:,1]
expP[2:end,2:end] = expValue[2:end,2:end] ./ expQ[2:end,2:end]
replace!(expP,NaN => missing)
replace!(expP,Inf => missing)
nmRows = [all(.! ismissing.(expP[r,:])) for r in 1:size(expP,1)]
expP = expP[nmRows,:]
tDataFull = convert(Array{Float64,2},expP[2:end,2:end])
replace!(expQ,0 => missing)
nmRows = [all(.! ismissing.(expQ[r,:])) for r in 1:size(expQ,1)]
expQ = expQ[nmRows,:]
tDataFull1 = convert(Array{Float64,2},expQ[2:end,2:end])
iDataFilename = "missing_example_1.xlsx" # From Eurostat
expQ = XLSX.readdata(iDataFilename,"Sheet 1","A11:AO42")
replace!(expQ,":" => missing)
replace!(expQ,0 => missing)
expQ = expQ[[1;4:end],[1;2:2:end]]
nmRows = [all(.! ismissing.(expQ[r,:])) for r in 1:size(expQ,1)]
expQ = expQ[nmRows,:]
tDataFull2 = convert(Array{Float64,2},expQ[2:end,2:end])
iDataFilename = "agr_classes_by_nuts2.xlsx" # From Eurostat
iTable = XLSX.readdata(iDataFilename,"Sheet 1","A10:BV250")
replace!(iTable,":" => missing)
iTable = iTable[[1;3:end],:]
nmRows = [all(.! ismissing.(iTable[r,:])) for r in 1:size(iTable,1)]
iTable = iTable[nmRows,:]
tDataFull3 = convert(Array{Float64,2},iTable[2:end,2:end])
SKD = pyimport("sklearn.datasets")
X_full, y_full = SKD.fetch_california_housing(return_X_y=true)
tDataFull4 = X_full
# Dataset #5: the R airquality dataset, keeping only the complete rows
airq = rcopy(R"""airquality""")
tDataFull5 = Matrix(dropmissing(airq))
# Defining our benchmark function. Ugly, but I started with two cases and then extended...
function testImputation(tData;missingShare=0.05,printReport=true,repetitions=5)
tDataMissed = copy(tData)
tDataMissed = convert(Array{Union{Missing,Float64},2},tDataMissed)
nR,nC = size(tData)
nMissing = Int64(round(nR*nC*missingShare))
err_BetaML_GMM = 0.0
mre_BetaML_GMM = 0.0
sec_BetaML_GMM = 0.0
err_BetaML_RF = 0.0
mre_BetaML_RF = 0.0
sec_BetaML_RF = 0.0
err_FeatureMeans = 0.0
mre_FeatureMeans = 0.0
sec_FeatureMeans = 0.0
err_DoubleMeans = 0.0
mre_DoubleMeans = 0.0
sec_DoubleMeans = 0.0
err_WeightedFeatureMeans = 0.0
mre_WeightedFeatureMeans = 0.0
sec_WeightedFeatureMeans = 0.0
err_SKL_SI = 0.0
mre_SKL_SI = 0.0
sec_SKL_SI = 0.0
err_SKL_II = 0.0
mre_SKL_II = 0.0
sec_SKL_II = 0.0
err_SKL_KK = 0.0
mre_SKL_KK = 0.0
sec_SKL_KK = 0.0
err_Mice_RF = 0.0
mre_Mice_RF = 0.0
sec_Mice_RF = 0.0
println("$(Dates.format(now(), "HH:MM")): Starting $repetitions repetitions.. ")
# Loading external libs..
R"""
library(mice)
"""
SKI = pyimport("sklearn.impute")
np = pyimport("numpy")
for rep in 1:repetitions
println("$(Dates.format(now(), "HH:MM")): Starting repetition $rep of $repetitions.. ")
mCellsID = shuffle(1:nR*nC)[1:nMissing]
mCellsRC = fill((0,0),nMissing)
for (i,cell) in enumerate(mCellsID)
r = Int64(ceil(cell/nC))
c = cell % nC
c = (c == 0) ? nC : c
mCellsRC[i] = (r,c)
tDataMissed[r,c] = missing
end
missingTrue = [tData[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
nmCellsMask = .! ismissing.(tDataMissed)
sortedDims = sortperm(makeColVector(sum(nmCellsMask,dims=1))) # dims sorted from the one with the most missing values to the one with the fewest
# Simple average...
println("$(Dates.format(now(), "HH:MM")): Averages...")
timeBef = Dates.DateTime(now())
cMeans = [mean(skipmissing(tDataMissed[:,i])) for i in 1:nC]
imputedMissing_FeatureMeans = [cMeans[mCellsRC[i][2]] for i in 1:nMissing]
timeAfter = Dates.DateTime(now())
sec_FeatureMeans += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
timeBef = Dates.DateTime(now())
cMeans = [mean(skipmissing(tDataMissed[:,i])) for i in 1:nC]
rMeans = [mean(skipmissing(tDataMissed[i,:])) for i in 1:nR]
imputedMissing_DoubleMeans = sqrt.( [ abs(rMeans[mCellsRC[i][1]] * cMeans[mCellsRC[i][2]]) for i in 1:nMissing] )
timeAfter = Dates.DateTime(now())
sec_DoubleMeans += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
timeBef = Dates.DateTime(now())
cMeans = [mean(skipmissing(tDataMissed[:,i])) for i in 1:nC]
rL1 = [sum( skipmissing([abs(tDataMissed[r,c]) for c in 1:nC])) for r in 1:nR]
rWeights = rL1 ./ mean(rL1 )
imputedMissing_WeightedFeatureMeans = [cMeans[mCellsRC[i][2]] * rWeights[mCellsRC[i][1]] for i in 1:nMissing] # Feature avg by l1 norm of the row/average l1 norm of all rows
timeAfter = Dates.DateTime(now())
sec_WeightedFeatureMeans += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
imputedData_FeatureMeans = copy(tData)
[imputedData_FeatureMeans[mc[1],mc[2]] = imputedMissing_FeatureMeans[i] for (i,mc) in enumerate(mCellsRC)]
imputedData_WeightedFeatureMeans = copy(tData)
[imputedData_WeightedFeatureMeans[mc[1],mc[2]] = imputedMissing_WeightedFeatureMeans[i] for (i,mc) in enumerate(mCellsRC)]
err_FeatureMeans += norm(missingTrue - imputedMissing_FeatureMeans)
err_DoubleMeans += norm(missingTrue - imputedMissing_DoubleMeans)
err_WeightedFeatureMeans += norm(missingTrue - imputedMissing_WeightedFeatureMeans)
mre_FeatureMeans += meanRelError(imputedMissing_FeatureMeans,missingTrue,normRec=false)
mre_DoubleMeans += meanRelError(imputedMissing_DoubleMeans,missingTrue,normRec=false)
mre_WeightedFeatureMeans += meanRelError(imputedMissing_WeightedFeatureMeans,missingTrue,normRec=false)
# BetaML..
println("$(Dates.format(now(), "HH:MM")): BetaML_GMM...")
timeBef = Dates.DateTime(now())
imputedData_BetaML_GMM = predictMissing(tDataMissed).X̂
timeAfter = Dates.DateTime(now())
sec_BetaML_GMM += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
imputedMissing_BetaML_GMM = [imputedData_BetaML_GMM[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
err_BetaML_GMM += norm(missingTrue - imputedMissing_BetaML_GMM)
mre_BetaML_GMM += meanRelError(imputedMissing_BetaML_GMM,missingTrue,normRec=false)
println("$(Dates.format(now(), "HH:MM")): BetaML_RF...")
timeBef = Dates.DateTime(now())
forestModels = Array{Forest,1}(undef,nC)
imputedMissing_BetaML_RF = Array{Float64,1}(undef,nMissing)
workingData = copy(imputedData_FeatureMeans) # start from the feature-means imputation (copy, to avoid mutating it)
currentImputedMissing = [workingData[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
for epoch in 1:2 # two passes over all the dimensions
for d in sortedDims
# training...
nmydata = workingData[.! ismissing.(tDataMissed[:,d]),:] # I train the RF model only with true y, while features may have imputed missing data
X = nmydata[:, [1:(d-1);(d+1):end] ]
y = convert(Array{Float64,1},nmydata[:,d])
dfor = buildForest(X,y) # ,maxDepth=5
forestModels[d] = dfor
# imputing missing values in d...
for i in 1:nMissing
mc = mCellsRC[i][2]
if mc != d
continue
end
mr = mCellsRC[i][1]
mFeatures = workingData[mr,[1:(mc-1);(mc+1):end] ]'
workingData[mr,mc] = predict(forestModels[mc],mFeatures)[1]
end
end
postEpochImputedMissing = [workingData[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
currentImputedMissing = postEpochImputedMissing
end
imputedMissing_BetaML_RF = currentImputedMissing
timeAfter = Dates.DateTime(now())
sec_BetaML_RF += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
err_BetaML_RF += norm(missingTrue - imputedMissing_BetaML_RF)
mre_BetaML_RF += meanRelError(imputedMissing_BetaML_RF,missingTrue,normRec=false)
# ScikitLearn..
tDataMissedPython = copy(tDataMissed)
replace!(tDataMissedPython,missing=>np.nan)
println("$(Dates.format(now(), "HH:MM")): SKL.SI...")
timeBef = Dates.DateTime(now())
imp = SKI.SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(tDataMissedPython)
imputedData_SKL_SI = imp.transform(tDataMissedPython)
timeAfter = Dates.DateTime(now())
sec_SKL_SI += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
imputedMissing_SKL_SI = [imputedData_SKL_SI[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
err_SKL_SI += norm(missingTrue - imputedMissing_SKL_SI)
mre_SKL_SI += meanRelError(imputedMissing_SKL_SI,missingTrue,normRec=false)
println("$(Dates.format(now(), "HH:MM")): SKL.II...")
SKexp = pyimport("sklearn.experimental.enable_iterative_imputer") # required to enable the still-experimental IterativeImputer
timeBef = Dates.DateTime(now())
imp = SKI.IterativeImputer() # max_iter=10, random_state=0
imp.fit(tDataMissedPython)
imputedData_SKL_II = imp.transform(tDataMissedPython)
timeAfter = Dates.DateTime(now())
sec_SKL_II += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
imputedMissing_SKL_II = [imputedData_SKL_II[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
err_SKL_II += norm(missingTrue - imputedMissing_SKL_II)
mre_SKL_II += meanRelError(imputedMissing_SKL_II,missingTrue,normRec=false)
println("$(Dates.format(now(), "HH:MM")): SKL.KK...")
timeBef = Dates.DateTime(now())
imp = SKI.KNNImputer() # n_neighbors=2, weights="uniform"
imputedData_SKL_KK = imp.fit_transform(tDataMissedPython)
timeAfter = Dates.DateTime(now())
sec_SKL_KK += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
imputedMissing_SKL_KK = [imputedData_SKL_KK[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
err_SKL_KK += norm(missingTrue - imputedMissing_SKL_KK)
mre_SKL_KK += meanRelError(imputedMissing_SKL_KK,missingTrue,normRec=false)
# R Mice...
println("$(Dates.format(now(), "HH:MM")): R.Mice...")
timeBef = Dates.DateTime(now())
R"""
library(mice)
#data <- $(DataFrame(tDataMissed, :auto))
#impObject <- mice($(DataFrame(tDataMissed, :auto)),m=1,maxit=50,meth='rf',printFlag=FALSE)
impObject <- mice($(DataFrame(tDataMissed, :auto)),m=1,meth='rf',printFlag=FALSE)
completedData <- complete(impObject,1)
"""
imputedData_Mice_RF = Matrix(rcopy(R"""completedData"""))
timeAfter = Dates.DateTime(now())
sec_Mice_RF += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
imputedMissing_Mice_RF = [imputedData_Mice_RF[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
err_Mice_RF += norm(missingTrue - imputedMissing_Mice_RF)
mre_Mice_RF += meanRelError(imputedMissing_Mice_RF,missingTrue,normRec=false)
end
println("$(Dates.format(now(), "HH:MM")): Done all the $repetitions repetitions..")
err_FeatureMeans = err_FeatureMeans / repetitions
mre_FeatureMeans = mre_FeatureMeans / repetitions
sec_FeatureMeans = sec_FeatureMeans /repetitions
err_WeightedFeatureMeans = err_WeightedFeatureMeans / repetitions
mre_WeightedFeatureMeans = mre_WeightedFeatureMeans / repetitions
sec_WeightedFeatureMeans = sec_WeightedFeatureMeans/repetitions
err_DoubleMeans = err_DoubleMeans / repetitions
mre_DoubleMeans = mre_DoubleMeans / repetitions
sec_DoubleMeans = sec_DoubleMeans /repetitions
err_BetaML_GMM = err_BetaML_GMM / repetitions
mre_BetaML_GMM = mre_BetaML_GMM / repetitions
sec_BetaML_GMM = sec_BetaML_GMM /repetitions
err_BetaML_RF = err_BetaML_RF / repetitions
mre_BetaML_RF = mre_BetaML_RF / repetitions
sec_BetaML_RF = sec_BetaML_RF /repetitions
err_SKL_SI = err_SKL_SI / repetitions
mre_SKL_SI = mre_SKL_SI / repetitions
sec_SKL_SI = sec_SKL_SI /repetitions
err_SKL_II = err_SKL_II / repetitions
mre_SKL_II = mre_SKL_II / repetitions
sec_SKL_II = sec_SKL_II /repetitions
err_SKL_KK = err_SKL_KK / repetitions
mre_SKL_KK = mre_SKL_KK / repetitions
sec_SKL_KK = sec_SKL_KK /repetitions
err_Mice_RF = err_Mice_RF / repetitions
mre_Mice_RF = mre_Mice_RF / repetitions
sec_Mice_RF = sec_Mice_RF /repetitions
if(printReport)
println("*** Missing imputation error:")
println("Model \t\t seconds \t l-2 norm \t Mean relative error")
println("FeatureMeans \t\t $sec_FeatureMeans\t $err_FeatureMeans\t $mre_FeatureMeans")
println("DoubleMeans \t\t $sec_DoubleMeans\t $err_DoubleMeans\t $mre_DoubleMeans")
println("WeightedFeatureMeans \t\t $sec_WeightedFeatureMeans\t $err_WeightedFeatureMeans\t $mre_WeightedFeatureMeans")
println("BetaML_GMM \t\t $sec_BetaML_GMM\t $err_BetaML_GMM \t $mre_BetaML_GMM")
println("BetaML_RF \t\t $sec_BetaML_RF\t $err_BetaML_RF \t $mre_BetaML_RF")
println("SKL_SI \t\t $sec_SKL_SI\t $err_SKL_SI\t $mre_SKL_SI")
println("SKL_II \t\t $sec_SKL_II\t $err_SKL_II\t $mre_SKL_II")
println("SKL_KK \t\t $sec_SKL_KK\t $err_SKL_KK\t $mre_SKL_KK")
println("R_Mice \t\t $sec_Mice_RF\t $err_Mice_RF\t $mre_Mice_RF")
end
return (FeatureMeans = err_FeatureMeans, DoubleMeans = err_DoubleMeans, WeightedFeatureMeans = err_WeightedFeatureMeans, BetaML_GMM = err_BetaML_GMM, BetaML_RF = err_BetaML_RF, SKL_SI = err_SKL_SI, SKL_II = err_SKL_II, SKL_KK = err_SKL_KK, Mice_RF = err_Mice_RF)
end
testImputation(tDataFull1,repetitions=1, missingShare=0.05) # warm-up run: this measures compilation time!
testImputation(tDataFull1,repetitions=10, missingShare=0.02)
testImputation(tDataFull2,repetitions=10, missingShare=0.02)
testImputation(tDataFull5,repetitions=10, missingShare=0.02)
testImputation(tDataFull3,repetitions=5, missingShare=0.02) # warning: takes several minutes on a good PC
#testImputation(tDataFull4,repetitions=2, missingShare=0.05) # warning: takes several hours on a good PC
The zipped folder: Nextcloud