Missing imputation: comparision of BetaML, Python SKL, R Mice

Hello, I did a small benchmarking of Julia BetaML (predictMissing - that uses GMM - and custom random forest prediction), SciKitLearn SI, II and KK-based imputators and R Mice, using the default values for each.
Below are the results, the code and the whole folder for replication.

The datasets are real data (from Eurostat) or datasets in R/Python where on each repetition I set as missed a given share of the data at random and then compared the imputed value with the true values.

I am not so happy with my BetaML as I would have liked… the new (“experimental”) SKL II is often (but not always) more precise than BetaML and above all always much faster :-/
BetaML seems however faster and more precise than MICE using random forest (when it works, as MICE - for the different imputator algorithms I did try - seems to have many more cases when it doesn’t impute)

The results

Datasetset #1 : 

*** Missing imputation error:
Model            	seconds         l-2 norm        Mean relative error
FeatureMeans             0.0     596.6448409053057       1.002319122696992
DoubleMeans              0.0     391.3604130836722       0.4992976985073335
WeightedFeatureMeans     0.0002  185.54305731037456      0.24713646528063324
BetaML_GMM               0.0045  407.0153362387738       0.5848515520904161
BetaML_RF                0.5695  357.3661669306017       0.4124802945377765
SKL_SI           	  0.0013  596.6448409053057       1.002319122696992
SKL_II           	  0.2994  117.64702366023792      0.22315626247905915
SKL_KK           	  0.0068  382.00934994066654      0.3914208771982257
R_Mice           	  0.6232  458.2614093467284       0.5925242903090201


Dataset #2:
*** Missing imputation error:
Model            	seconds         l-2 norm        Mean relative error
FeatureMeans             0.0     903078.9798906296       1.25318222691199
DoubleMeans              0.0003  861754.4184396429       0.8043798346561916
WeightedFeatureMeans     0.0001  762695.5849791698       0.7154890655613636
BetaML_GMM               0.0048  938069.2558905445       0.999509592226039
BetaML_RF                0.6378  731256.0714149184       0.6282147232408608
SKL_SI           	  0.0012  903078.9798906296       1.25318222691199
SKL_II           	  0.2356  464588.8941069747       0.36446355762622834
SKL_KK           	  0.0066  818916.3868256389       0.7239568628528212
R_Mice           	  0.8301  927755.5437161991       1.0087824960054428


Dataset #5:
*** Missing imputation error:
Model          	  seconds         l-2 norm        Mean relative error
FeatureMeans             0.0     143.7332322239149       0.35907456781270847
DoubleMeans              0.0001  246.59598230195132      0.629338588720551
WeightedFeatureMeans     0.0001  231.5981128929205       0.5843971246956442
BetaML_GMM               0.0521  150.66354278820586      0.34582233638053206
BetaML_RF                0.5843  120.33788747811619      0.2652753264263067
SKL_SI           	  0.0013  143.7332322239149       0.35907456781270847
SKL_II           	  0.0579  123.00558694040816      0.27674480047151734
SKL_KK           	  0.0042  124.55466487677543      0.27205170841370924
R_Mice           	  0.2425  196.63804019966136      0.4258825776275952

Dataset #3:
*** Missing imputation error:
Model            	  seconds         l-2 norm        Mean relative error
FeatureMeans             0.0142  1.8487052538983752e6    1.0018178079202642
DoubleMeans              0.0346  2.0663045913534556e6    0.9269925535629163
WeightedFeatureMeans     0.0624  1.168453036458433e6     0.6236022143512215
BetaML_GMM               0.0768  1.835235185954893e6     0.9737602166227163
BetaML_RF                81.756  772775.0617528623       0.36588450784615
SKL_SI           	  0.0082  1.8487052538983752e6    1.0018178079202642
SKL_II           	  8.0322  180564.47240994236      0.08747282957993283
SKL_KK           	  0.0998  818424.1889356236       0.3898851587897578
R_Mice           	  9.3572  NA			   NA

The code

cd(@__DIR__)
using Pkg
Pkg.activate(".")
#Pkg.add("BetaML")
#Pkg.add("XLSX")
#Pkg.add("PyCall")
#Pkg.add("RCall")
#Pkg.add("DataFrames")

using XLSX, BetaML, Statistics, LinearAlgebra, PyCall, RCall, Dates, DataFrames

# Loading some datasets..

iDataFilename     = "missing_forest_product_example2.xlsx" # From Eurostat
expValue          = XLSX.readdata(iDataFilename,"Sheet 1","A11:BE46" )
replace!(expValue,":" => missing)
expValue          = expValue[[1;5:end],[1;18:2:49]] # years 2000-2015
expQ              = XLSX.readdata(iDataFilename,"Sheet 2","A11:BE46" )
replace!(expQ,":" => missing)
expQ              = expQ[[1;5:end],[1;18:2:49]]
expP              = similar(expQ)
expP[1,:]         = expQ[1,:]
expP[:,1]         = expQ[:,1]
expP[2:end,2:end] = expValue[2:end,2:end] ./ expQ[2:end,2:end]
replace!(expP,NaN => missing)
replace!(expP,Inf => missing)
nmRows            = [all(.! ismissing.(expP[r,:])) for r in 1:size(expP,1)]
expP              = expP[nmRows,:]
tDataFull         = convert(Array{Float64,2},expP[2:end,2:end])
replace!(expQ,0 => missing)
nmRows            = [all(.! ismissing.(expQ[r,:])) for r in 1:size(expQ,1)]
expQ              = expQ[nmRows,:]
tDataFull1        = convert(Array{Float64,2},expQ[2:end,2:end])


iDataFilename = "missing_example_1.xlsx" # From Eurostat
expQ = XLSX.readdata(iDataFilename,"Sheet 1","A11:AO42")
replace!(expQ,":" => missing)
replace!(expQ,0 => missing)
expQ = expQ[[1;4:end],[1;2:2:end]]
nmRows = [all(.! ismissing.(expQ[r,:])) for r in 1:size(expQ,1)]
expQ = expQ[nmRows,:]
tDataFull2 = convert(Array{Float64,2},expQ[2:end,2:end])


iDataFilename = "agr_classes_by_nuts2.xlsx" # From Eurostat
iTable = XLSX.readdata(iDataFilename,"Sheet 1","A10:BV250")
replace!(iTable,":" => missing)
iTable = iTable[[1;3:end],:]
nmRows = [all(.! ismissing.(iTable[r,:])) for r in 1:size(iTable,1)]
iTable = iTable[nmRows,:]
tDataFull3 = convert(Array{Float64,2},iTable[2:end,2:end])


SKD            = pyimport("sklearn.datasets")
X_full, y_full = SKD.fetch_california_housing(return_X_y=true)
tDataFull4      = X_full

R"""
data <- airquality
"""
tDataFull = rcopy(R"""data<-airquality""")
tDataFull5 = Matrix(dropmissing(tDataFull))

# Defining our benchmark function. Ugly, but I started with two cases and then extended...
function testImputation(tData;missingShare=0.05,printReport=true,repetitions=5)
    tDataMissed = copy(tData)
    tDataMissed = convert(Array{Union{Missing,Float64},2},tDataMissed)
    nR,nC = size(tData)

    nMissing = Int64(round(nR*nC*missingShare))

    err_BetaML_GMM    = 0.0
    mre_BetaML_GMM    = 0.0
    sec_BetaML_GMM    = 0.0
    err_BetaML_RF     = 0.0
    mre_BetaML_RF     = 0.0
    sec_BetaML_RF    = 0.0
    err_FeatureMeans  = 0.0
    mre_FeatureMeans  = 0.0
    sec_FeatureMeans  = 0.0
    err_DoubleMeans   = 0.0
    mre_DoubleMeans   = 0.0
    sec_DoubleMeans   = 0.0
    err_WeightedFeatureMeans  = 0.0
    mre_WeightedFeatureMeans  = 0.0
    sec_WeightedFeatureMeans  = 0.0
    err_SKL_SI        = 0.0
    mre_SKL_SI        = 0.0
    sec_SKL_SI        = 0.0
    err_SKL_II        = 0.0
    mre_SKL_II        = 0.0
    sec_SKL_II        = 0.0
    err_SKL_KK        = 0.0
    mre_SKL_KK        = 0.0
    sec_SKL_KK        = 0.0
    err_Mice_RF       = 0.0
    mre_Mice_RF       = 0.0
    sec_Mice_RF       = 0.0

    println("$(Dates.format(now(), "HH:MM")): Starting repetition  of repetitions.. ")

    # Loading external libs..
    R"""
    library(mice)
    """
    SKI = pyimport("sklearn.impute")
    np  = pyimport("numpy")

    for rep in 1:repetitions
        println("$(Dates.format(now(), "HH:MM") )) : Starting repetition $rep of $repetitions.. ")
        mCellsID = shuffle(1:nR*nC)[1:nMissing]
        mCellsRC  =  fill((0,0),nMissing)
        for (i,cell) in enumerate(mCellsID)
            r = Int64(ceil(cell/nC))
            c = cell % nC
            c = (c == 0) ? nC : c
            mCellsRC[i] = (r,c)
            tDataMissed[r,c] = missing
        end

        missingTrue = [tData[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
        nmCellsMask = .! ismissing.(tDataMissed)
        sortedDims = sortperm(makeColVector(sum(nmCellsMask,dims=1))) # sorted from the dim with more missing values

        # Simple average...
        println("$(Dates.format(now(), "HH:MM")): Averages...")
        timeBef   = Dates.DateTime(now())
        cMeans    = [mean(skipmissing(tDataMissed[:,i])) for i in 1:nC]
        imputedMissing_FeatureMeans  =   [cMeans[mCellsRC[i][2]] for i in 1:nMissing]
        timeAfter   = Dates.DateTime(now())
        sec_FeatureMeans += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)

        timeBef   = Dates.DateTime(now())
        cMeans    = [mean(skipmissing(tDataMissed[:,i])) for i in 1:nC]
        rMeans    = [mean(skipmissing(tDataMissed[i,:])) for i in 1:nR]
        imputedMissing_DoubleMeans  = sqrt.( [ abs(rMeans[mCellsRC[i][1]] * cMeans[mCellsRC[i][2]]) for i in 1:nMissing]  )
        timeAfter   = Dates.DateTime(now())
        sec_DoubleMeans += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)

        timeBef   = Dates.DateTime(now())
        cMeans = [mean(skipmissing(tDataMissed[:,i])) for i in 1:nC]
        rL1    = [sum(  skipmissing([abs(tDataMissed[r,c]) for c in 1:nC])) for r in 1:nR]
        rWeights  = rL1 ./ mean(rL1 )
        imputedMissing_WeightedFeatureMeans =  [cMeans[mCellsRC[i][2]] * rWeights[mCellsRC[i][1]] for i in 1:nMissing] # Feature avg by l1 norm of the row/average l1 norm of all rows
        timeAfter   = Dates.DateTime(now())
        sec_WeightedFeatureMeans += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)

        imputedData_FeatureMeans = copy(tData)
        [imputedData_FeatureMeans[mc[1],mc[2]] =  imputedMissing_FeatureMeans[i]  for (i,mc) in enumerate(mCellsRC)]
        imputedData_WeightedFeatureMeans = copy(tData)
        [imputedData_WeightedFeatureMeans[mc[1],mc[2]] =  imputedMissing_WeightedFeatureMeans[i]  for (i,mc) in enumerate(mCellsRC)]
        err_FeatureMeans +=  norm(missingTrue - imputedMissing_FeatureMeans)
        err_DoubleMeans  +=  norm(missingTrue - imputedMissing_DoubleMeans)
        err_WeightedFeatureMeans  +=  norm(missingTrue - imputedMissing_WeightedFeatureMeans)
        mre_FeatureMeans +=  meanRelError(imputedMissing_FeatureMeans,missingTrue,normRec=false)
        mre_DoubleMeans  +=  meanRelError(imputedMissing_DoubleMeans,missingTrue,normRec=false)
        mre_WeightedFeatureMeans  +=  meanRelError(imputedMissing_WeightedFeatureMeans,missingTrue,normRec=false)

        # BetaML..
        println("$(Dates.format(now(), "HH:MM")): BetaML_GMM...")
        timeBef   = Dates.DateTime(now())
        imputedData_BetaML_GMM = predictMissing(tDataMissed).XĚ‚
        timeAfter   = Dates.DateTime(now())
        sec_BetaML_GMM += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)

        imputedMissing_BetaML_GMM = [imputedData_BetaML_GMM[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
        err_BetaML_GMM += norm(missingTrue - imputedMissing_BetaML_GMM)
        mre_BetaML_GMM += meanRelError(imputedMissing_BetaML_GMM,missingTrue,normRec=false)

        println("$(Dates.format(now(), "HH:MM")): BetaML_FM...")
        timeBef   = Dates.DateTime(now())
        forestModels = Array{Forest,1}(undef,nC)
        imputedMissing_BetaML_RF = Array{Float64,1}(undef,nMissing)

        workingData = imputedData_FeatureMeans
        currentImputedMissing = [workingData[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
        for i in 1:2
        for d in sortedDims
            # training...
            nmydata = workingData[.! ismissing.(tDataMissed[:,d]),:] # I train the RF model only with true y, while features may have imputed missing data
            X    = nmydata[:, [1:(d-1);(d+1):end] ]
            y    = convert(Array{Float64,1},nmydata[:,d])
            dfor = buildForest(X,y) # ,maxDepth=5
            forestModels[d] = dfor 
            # imputing missing values in d...
            for i in 1:nMissing
                mc  = mCellsRC[i][2]
                if mc != d
                    continue
                end
                mr  = mCellsRC[i][1]
                mFeatures = workingData[mr,[1:(mc-1);(mc+1):end] ]'
                workingData[mr,mc] = predict(forestModels[mc],mFeatures)[1]
            end
        end
        postEpochImputedMissing = [workingData[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
        currentImputedMissing = postEpochImputedMissing
        end
        imputedMissing_BetaML_RF = currentImputedMissing

        timeAfter   = Dates.DateTime(now())
        sec_BetaML_RF += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
        err_BetaML_RF +=  norm(missingTrue - imputedMissing_BetaML_RF)
        mre_BetaML_RF +=  meanRelError(imputedMissing_BetaML_RF,missingTrue,normRec=false)

        # ScikitLearn..
        tDataMissedPython = copy(tDataMissed)
        replace!(tDataMissedPython,missing=>np.nan)
        println("$(Dates.format(now(), "HH:MM")): SKL.SI...")
        timeBef   = Dates.DateTime(now())
        imp = SKI.SimpleImputer(missing_values=np.nan, strategy="mean")
        imp.fit(tDataMissedPython)
        imputedData_SKL_SI = imp.transform(tDataMissedPython)
        timeAfter   = Dates.DateTime(now())
        sec_SKL_SI += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
        imputedMissing_SKL_SI = [imputedData_SKL_SI[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
        err_SKL_SI +=  norm(missingTrue - imputedMissing_SKL_SI)
        mre_SKL_SI +=  meanRelError(imputedMissing_SKL_SI,missingTrue,normRec=false)

        println("$(Dates.format(now(), "HH:MM")): SKL.II...")
        SKexp = pyimport("sklearn.experimental.enable_iterative_imputer")
        timeBef   = Dates.DateTime(now())
        imp = SKI.IterativeImputer() # max_iter=10, random_state=0
        imp.fit(tDataMissedPython)
        imputedData_SKL_II = imp.transform(tDataMissedPython)
        timeAfter   = Dates.DateTime(now())
        sec_SKL_II += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
        imputedMissing_SKL_II = [imputedData_SKL_II[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
        err_SKL_II +=  norm(missingTrue - imputedMissing_SKL_II)
        mre_SKL_II +=  meanRelError(imputedMissing_SKL_II,missingTrue,normRec=false)

        println("$(Dates.format(now(), "HH:MM")): SKL.KK...")
        timeBef   = Dates.DateTime(now())
        imp = SKI.KNNImputer() # n_neighbors=2, weights="uniform"
        imputedData_SKL_KK = imp.fit_transform(tDataMissedPython)
        timeAfter   = Dates.DateTime(now())
        sec_SKL_KK += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
        imputedMissing_SKL_II = [imputedData_SKL_II[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
        imputedMissing_SKL_KK = [imputedData_SKL_KK[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
        err_SKL_KK +=  norm(missingTrue - imputedMissing_SKL_KK)
        mre_SKL_KK +=  meanRelError(imputedMissing_SKL_KK,missingTrue,normRec=false)

        # R Mice...
        println("$(Dates.format(now(), "HH:MM")): R.Mice...")
        timeBef   = Dates.DateTime(now())
        R"""
        library(mice)
        #data <- $(DataFrame(tDataMissed, :auto))
        #impObject <- mice($(DataFrame(tDataMissed, :auto)),m=1,maxit=50,meth='rf',printFlag=FALSE)
        impObject <- mice($(DataFrame(tDataMissed, :auto)),m=1,meth='rf',printFlag=FALSE)
        completedData <- complete(impObject,1)
        """
        imputedData_Mice_RF = Matrix(rcopy(R"""completedData"""))
        timeAfter   = Dates.DateTime(now())
        sec_Mice_RF += (Dates.DateTime(timeAfter) - Dates.DateTime(timeBef)) / Millisecond(1000)
        imputedMissing_Mice_RF = [imputedData_Mice_RF[mCellsRC[i][1],mCellsRC[i][2]] for i in 1:nMissing]
        err_Mice_RF += norm(missingTrue - imputedMissing_Mice_RF)
        mre_Mice_RF += meanRelError(imputedMissing_Mice_RF,missingTrue,normRec=false)
    end

    println("$(Dates.format(now(), "HH:MM")): Done all the $repetitions repetitions..")

    err_FeatureMeans  = err_FeatureMeans / repetitions
    mre_FeatureMeans  = mre_FeatureMeans / repetitions
    sec_FeatureMeans = sec_FeatureMeans /repetitions
    err_WeightedFeatureMeans  = err_WeightedFeatureMeans / repetitions
    mre_WeightedFeatureMeans  = mre_WeightedFeatureMeans / repetitions
    sec_WeightedFeatureMeans = sec_WeightedFeatureMeans/repetitions
    err_DoubleMeans   = err_DoubleMeans / repetitions
    mre_DoubleMeans   = mre_DoubleMeans / repetitions
    sec_DoubleMeans = sec_DoubleMeans /repetitions
    err_BetaML_GMM    = err_BetaML_GMM / repetitions
    mre_BetaML_GMM    = mre_BetaML_GMM / repetitions
    sec_BetaML_GMM = sec_BetaML_GMM /repetitions
    err_BetaML_RF     = err_BetaML_RF / repetitions
    mre_BetaML_RF     = mre_BetaML_RF / repetitions
    sec_BetaML_RF = sec_BetaML_RF /repetitions
    err_SKL_SI        = err_SKL_SI / repetitions
    mre_SKL_SI        = mre_SKL_SI / repetitions
    sec_SKL_SI = sec_SKL_SI /repetitions
    err_SKL_II        = err_SKL_II / repetitions
    mre_SKL_II        = mre_SKL_II / repetitions
    sec_SKL_II = sec_SKL_II /repetitions
    err_SKL_KK        = err_SKL_KK / repetitions
    mre_SKL_KK        = mre_SKL_KK / repetitions
    sec_SKL_KK        = sec_SKL_KK /repetitions
    err_Mice_RF       = err_Mice_RF / repetitions
    mre_Mice_RF       = mre_Mice_RF / repetitions
    sec_Mice_RF       = sec_Mice_RF /repetitions

    if(printReport)
        println("*** Missing imputation error:")
        println("Model \t\t seconds \t l-2 norm \t Mean relative error")
        println("FeatureMeans \t\t $sec_FeatureMeans\t $err_FeatureMeans\t $mre_FeatureMeans")
        println("DoubleMeans \t\t $sec_DoubleMeans\t $err_DoubleMeans\t $mre_DoubleMeans")
        println("WeightedFeatureMeans \t\t $sec_WeightedFeatureMeans\t $err_WeightedFeatureMeans\t $mre_WeightedFeatureMeans")
        println("BetaML_GMM \t\t $sec_BetaML_GMM\t $err_BetaML_GMM \t $mre_BetaML_GMM")
        println("BetaML_RF \t\t $sec_BetaML_RF\t $err_BetaML_RF \t $mre_BetaML_RF")
        println("SKL_SI \t\t $sec_SKL_SI\t $err_SKL_SI\t $mre_SKL_SI")
        println("SKL_II \t\t $sec_SKL_II\t $err_SKL_II\t $mre_SKL_II")
        println("SKL_KK \t\t $sec_SKL_KK\t $err_SKL_KK\t $mre_SKL_KK")
        println("R_Mice \t\t $sec_Mice_RF\t $err_Mice_RF\t $mre_Mice_RF")
    end

    return (FeatureMeans = err_FeatureMeans, DoubleMeans = err_DoubleMeans, WeightedFeatureMeans = err_WeightedFeatureMeans, BetaML_GMM = err_BetaML_GMM, BetaML_RF = err_BetaML_RF, SKL_SI = err_SKL_SI, SKL_II = err_SKL_II, SKL_KK = err_SKL_KK, Mice_RF = err_Mice_RF)
end

testImputation(tDataFull1,repetitions=1, missingShare=0.05) # This measure compile time !

testImputation(tDataFull1,repetitions=10, missingShare=0.02)
testImputation(tDataFull2,repetitions=10, missingShare=0.02)
testImputation(tDataFull5,repetitions=10, missingShare=0.02)
testImputation(tDataFull3,repetitions=5, missingShare=0.02) # warning: take several minutes on a good pc
#testImputation(tDataFull4,repetitions=2, missingShare=0.05) # warning: takes several hours on a good pc

The zipped folder: Nextcloud

1 Like