How can I use a linear model with NaN parameters due to missing training data?

Hi all!

I am trying to build a model that predicts electrical power from a sequence of binary computer instructions. I thought of using multiple linear regression, since there are a lot of variables I can extract from each instruction. To give a little context, for each instruction I know:

  • which opcode it is
  • what is its binary weight
  • if it has an immediate operand
  • what is the operand
  • if the destination is equal to the source

To get the best fit, I need to fit against different possible combinations of these predictor variables. My current formula is this:

julia> formula = Term(mean_power_sym) ~ ConstantTerm(1) +
           (Term(:mnemonic) & (ConstantTerm(1) + Term(binary_weight_sym))) * (
               Term(apsr_sym) + Term(conditional_sym) + Term(dest_source_eq_sym) +
               Term(barrel_shift_sym) * Term(has_barrel_shift_sym) + 
               Term(has_immediate_sym)
           )
FormulaTerm
Response:
  Base power mean (W)(unknown)
Predictors:
  1
  mnemonic(unknown)
  APSR (s flag)(unknown)
  Is conditional(unknown)
  Dest reg == source reg(unknown)
  Barrel shift amount(unknown)
  Has barrel shift(unknown)
  Has immediate operand(unknown)
  mnemonic(unknown) & Binary weight(unknown)
  Barrel shift amount(unknown) & Has barrel shift(unknown)
  mnemonic(unknown) & APSR (s flag)(unknown)
  mnemonic(unknown) & Is conditional(unknown)
  mnemonic(unknown) & Dest reg == source reg(unknown)
  mnemonic(unknown) & Barrel shift amount(unknown)
  mnemonic(unknown) & Has barrel shift(unknown)
  mnemonic(unknown) & Has immediate operand(unknown)
  mnemonic(unknown) & Barrel shift amount(unknown) & Has barrel shift(unknown)
  mnemonic(unknown) & Binary weight(unknown) & APSR (s flag)(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Is conditional(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Dest reg == source reg(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Barrel shift amount(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Has barrel shift(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Has immediate operand(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Barrel shift amount(unknown) & Has barrel shift(unknown)

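Even though I collected a large number of data points (around 11,000), not all the combinations are present in the dataset. I end up with lots of parameters that could not be estimated: they are reported as 0.0 with NaN for the standard error and everything derived from it (I report here a few excerpts):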

julia> model = lm(formula, df_temp)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Base power mean (W) ~ 1 + mnemonic + APSR (s flag) + Is conditional + Dest reg == source reg + Barrel shift amount + Has barrel shift + Has immediate operand + mnemonic & Binary weight + Barrel shift amount & Has barrel shift + mnemonic & APSR (s flag) + mnemonic & Is conditional + mnemonic & Dest reg == source reg + mnemonic & Barrel shift amount + mnemonic & Has barrel shift + mnemonic & Has immediate operand + mnemonic & Barrel shift amount & Has barrel shift + mnemonic & Binary weight & APSR (s flag) + mnemonic & Binary weight & Is conditional + mnemonic & Binary weight & Dest reg == source reg + mnemonic & Binary weight & Barrel shift amount + mnemonic & Binary weight & Has barrel shift + mnemonic & Binary weight & Has immediate operand + mnemonic & Binary weight & Barrel shift amount & Has barrel shift

Coefficients:
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                  Coef.     Std. Error       t  Pr(>|t|)      Lower 95%      Upper 95%
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept)                                                                 0.0804049      0.0305935      2.63    0.0086    0.0204363      0.140373
mnemonic: add                                                              -0.0209991      0.0306201     -0.69    0.4929   -0.0810198      0.0390215
mnemonic: and                                                               0.000384739    0.0309979      0.01    0.9901   -0.0603766      0.0611461
mnemonic: asr                                                              -5.10902e-6     0.0385573     -0.00    0.9999   -0.0755842      0.0755739
mnemonic: b                                                                 0.00127954     0.00981586     0.13    0.8963   -0.0179612      0.0205203
mnemonic: bfc                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bfi                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bic                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bl                                                                0.0146442      0.0162697      0.90    0.3681   -0.0172472      0.0465356
mnemonic: blx                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bx                                                                0.170029       0.0430545      3.95    <1e-04    0.0856352      0.254424
mnemonic: clz                                                              -0.000283997    0.0478817     -0.01    0.9953   -0.0941403      0.0935723
mnemonic: cmn                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cmp                                                              -0.00888138     0.0309722     -0.29    0.7743   -0.0695923      0.0518295
mnemonic: cpsid                                                             0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cpsie                                                             0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: dsb                                                               0.00188294     0.0744446      0.03    0.9798   -0.144041       0.147807
mnemonic: eor                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: ldr                                                              -0.00117276     0.0316519     -0.04    0.9704   -0.0632159      0.0608704
mnemonic: ldrb                                                              0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: ldrd                                                              0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: ldrh                                                              0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: lsl                                                               0.000285234    0.0376906      0.01    0.9940   -0.0735949      0.0741654
mnemonic: lsr                                                               0.000237562    0.0385573      0.01    0.9951   -0.0753415      0.0758166
mnemonic: mla                                                              -0.00216897     0.0368344     -0.06    0.9530   -0.0743708      0.0700328
mnemonic: mls                                                              -0.000555183    0.0328027     -0.02    0.9865   -0.0648542      0.0637438
mnemonic: mov                                                              -0.0045436      0.030818      -0.15    0.8828   -0.0649523      0.0558651
mnemonic: movt                                                              0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: movw                                                              0.0          NaN            NaN       NaN     NaN            NaN
    # ... #
APSR (s flag)                                                               4.4252e-5      0.00770131     0.01    0.9954   -0.0150516      0.0151401
Is conditional                                                             -0.0380549      0.00146575   -25.96    <1e-99   -0.0409281     -0.0351818
Dest reg == source reg                                                     -3.92217e-5     0.00770131    -0.01    0.9959   -0.0151351      0.0150567
Barrel shift amount                                                         0.0          NaN            NaN       NaN     NaN            NaN
Has barrel shift                                                            0.00963344     0.0475971      0.20    0.8396   -0.083665       0.102932
Has immediate operand                                                       0.000640066    0.0297818      0.02    0.9829   -0.0577374      0.0590175
mnemonic: adc & Binary weight                                               0.000240231    0.000629472    0.38    0.7027   -0.000993643    0.0014741
mnemonic: add & Binary weight                                               0.00212642     0.000132819   16.01    <1e-56    0.00186607     0.00238677
mnemonic: and & Binary weight                                               0.000341044    0.000568978    0.60    0.5489   -0.000774252    0.00145634
mnemonic: asr & Binary weight                                               0.00036824     0.00146006     0.25    0.8009   -0.00249372     0.0032302
mnemonic: b & Binary weight                                                -0.00237033     0.0024252     -0.98    0.3284   -0.00712414     0.00238348
mnemonic: bfc & Binary weight                                               0.000310183    0.00024293     1.28    0.2017   -0.000166002    0.000786367
mnemonic: bfi & Binary weight                                               0.000346015    0.000227824    1.52    0.1288   -0.00010056     0.000792589
mnemonic: bic & Binary weight                                               0.000266862    0.00048624     0.55    0.5831   -0.000686251    0.00121998
mnemonic: bl & Binary weight                                                0.000550701    0.00113644     0.48    0.6280   -0.00167691     0.00277831
mnemonic: blx & Binary weight                                              -0.0243293      0.00392476    -6.20    <1e-09   -0.0320225     -0.0166361
    # ... #
Barrel shift amount & Has barrel shift                                      4.45507e-6     0.00218019     0.00    0.9984   -0.0042691      0.00427801
mnemonic: add & APSR (s flag)                                               0.00210831     0.00782543     0.27    0.7876   -0.0132309      0.0174475
mnemonic: and & APSR (s flag)                                               0.000169296    0.00996732     0.02    0.9864   -0.0193684      0.019707
mnemonic: asr & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: b & APSR (s flag)                                                 0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bfc & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bfi & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bic & APSR (s flag)                                               0.000322832    0.0103932      0.03    0.9752   -0.0200497      0.0206953
mnemonic: bl & APSR (s flag)                                                0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: blx & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bx & APSR (s flag)                                                0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: clz & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cmn & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cmp & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cpsid & APSR (s flag)                                             0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cpsie & APSR (s flag)                                             0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: dsb & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: eor & APSR (s flag)                                              -0.000587628    0.0104193     -0.06    0.9550   -0.0210112      0.019836
 # ... #

This happens because I can’t realistically train on every combination of instructions and operands. I can still estimate the contribution of some features on their own, though. For instance, if an instruction has a barrel shift, I can predict the effect of the shift regardless of which instruction it is.
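To make the cause concrete, here is a minimal sketch with made-up toy data (the two mnemonics and the single flag are hypothetical, not taken from my dataset). It only uses the LinearAlgebra standard library and shows that an interaction column for a combination that never occurs is all zeros, which leaves the design matrix rank-deficient:

```julia
using LinearAlgebra

# Hypothetical toy data: two mnemonics and one binary flag (say, the S flag).
# "add" is observed with and without the flag; "b" never has the flag set.
mnemonic = ["add", "add", "add", "b", "b", "b"]
flag     = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]

# Dummy coding for the categorical variable, with "add" as the reference level.
is_b = Float64.(mnemonic .== "b")

# Design matrix columns: intercept, mnemonic: b, flag, mnemonic: b & flag
X = hcat(ones(length(flag)), is_b, flag, is_b .* flag)

# The interaction column is identically zero because (b, flag = 1) never occurs,
# so X has more columns than its rank and that coefficient is not identifiable.
@show size(X, 2) rank(X)
@show all(iszero, X[:, 4])
```

As far as I understand, this is the same situation as the rows with NaN standard errors above, just on a much smaller scale.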

This can be achieved in two ways:

  1. changing the model formula so that these variables are not correlated anymore. This is the best option from a theoretical point of view, but for the instructions I did train on I’d like to keep the extra accuracy given by the correlation.
  2. instructing the model so that it doesn’t throw errors when I try to predict an instruction it has no data for, while still benefiting from the ones I did train on.

My question is: is there a way to do number 2? Is it theoretically correct? For example, can I tell it “if you can’t use the correlated parameter, use the uncorrelated one”? The naive approach I can think of is to use a sequence of progressively smaller models wrapped in try/catch: “if this one throws an error, try the next one, which is smaller”.
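The naive fallback idea could be sketched roughly as follows. Everything here is hypothetical: `full_model` and `reduced_model` are plain functions standing in for fitted models, the coefficient values are illustrative only, and a `KeyError` stands in for whatever error a real model raises on an unseen mnemonic.

```julia
# Hypothetical stand-ins for fitted models, ordered from most to least specific.
# The coefficient values are illustrative only.
per_mnemonic = Dict("add" => -0.021, "bx" => 0.170)

# Throws a KeyError when the mnemonic was never seen during training.
full_model(row)    = 0.080 + per_mnemonic[row.mnemonic] + 0.0096 * row.has_barrel_shift
# Ignores the mnemonic entirely, so it works for any instruction.
reduced_model(row) = 0.080 + 0.0096 * row.has_barrel_shift

# Try each model in order and return the first finite prediction.
function predict_with_fallback(models, row)
    for m in models
        try
            yhat = m(row)
            isfinite(yhat) && return yhat
        catch err
            # Only swallow the "unseen level" error; anything else is a real bug.
            err isa KeyError || rethrow()
        end
    end
    return missing  # no model could handle this row
end

models = (full_model, reduced_model)
seen   = (mnemonic = "add",  has_barrel_shift = 1)
unseen = (mnemonic = "movt", has_barrel_shift = 1)

predict_with_fallback(models, seen)    # handled by full_model
predict_with_fallback(models, unseen)  # falls back to reduced_model
```

In a real version each smaller model would presumably have to be fitted separately, since its coefficients would not generally be the same as the matching ones in the full model.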

I am using GLM.jl right now, which by the way feels a little undocumented…? Would another package be better for this kind of thing? I have heard of Flux.jl, MLJ.jl and maybe MLBase.jl, but whatever I use has to be able to train with categorical variables, since my “mnemonic” variable is a categorical string.

Thanks!