How can I use a linear model with NaN parameters due to missing train data?

bertulli · July 15, 2023, 10:09am

Hi all!

I am trying to build a model to predict electrical power from a sequence of binary computer instructions. I thought of using multiple linear regression, since there are a lot of variables I can extract from each instruction. To give a little context, for each instruction I record:

- which opcode it is
- what its binary weight is
- whether it has an immediate operand
- what the operand is
- whether the destination is equal to the source

To get the best fit, I need to fit against different combinations of these predictor variables. My current formula is:

julia> formula = Term(mean_power_sym) ~ ConstantTerm(1) +
           (Term(:mnemonic) & (ConstantTerm(1) + Term(binary_weight_sym))) * (
               Term(apsr_sym) + Term(conditional_sym) + Term(dest_source_eq_sym) +
               Term(barrel_shift_sym) * Term(has_barrel_shift_sym) + 
               Term(has_immediate_sym)
           )
FormulaTerm
Response:
  Base power mean (W)(unknown)
Predictors:
  1
  mnemonic(unknown)
  APSR (s flag)(unknown)
  Is conditional(unknown)
  Dest reg == source reg(unknown)
  Barrel shift amount(unknown)
  Has barrel shift(unknown)
  Has immediate operand(unknown)
  mnemonic(unknown) & Binary weight(unknown)
  Barrel shift amount(unknown) & Has barrel shift(unknown)
  mnemonic(unknown) & APSR (s flag)(unknown)
  mnemonic(unknown) & Is conditional(unknown)
  mnemonic(unknown) & Dest reg == source reg(unknown)
  mnemonic(unknown) & Barrel shift amount(unknown)
  mnemonic(unknown) & Has barrel shift(unknown)
  mnemonic(unknown) & Has immediate operand(unknown)
  mnemonic(unknown) & Barrel shift amount(unknown) & Has barrel shift(unknown)
  mnemonic(unknown) & Binary weight(unknown) & APSR (s flag)(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Is conditional(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Dest reg == source reg(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Barrel shift amount(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Has barrel shift(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Has immediate operand(unknown)
  mnemonic(unknown) & Binary weight(unknown) & Barrel shift amount(unknown) & Has barrel shift(unknown)

Even though I collected a large number of data points (around 11,000), not all the combinations are present in the dataset. I end up with lots of parameters being NaN (here are a few excerpts):

julia> model = lm(formula, df_temp)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Base power mean (W) ~ 1 + mnemonic + APSR (s flag) + Is conditional + Dest reg == source reg + Barrel shift amount + Has barrel shift + Has immediate operand + mnemonic & Binary weight + Barrel shift amount & Has barrel shift + mnemonic & APSR (s flag) + mnemonic & Is conditional + mnemonic & Dest reg == source reg + mnemonic & Barrel shift amount + mnemonic & Has barrel shift + mnemonic & Has immediate operand + mnemonic & Barrel shift amount & Has barrel shift + mnemonic & Binary weight & APSR (s flag) + mnemonic & Binary weight & Is conditional + mnemonic & Binary weight & Dest reg == source reg + mnemonic & Binary weight & Barrel shift amount + mnemonic & Binary weight & Has barrel shift + mnemonic & Binary weight & Has immediate operand + mnemonic & Binary weight & Barrel shift amount & Has barrel shift

Coefficients:
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                  Coef.     Std. Error       t  Pr(>|t|)      Lower 95%      Upper 95%
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept)                                                                 0.0804049      0.0305935      2.63    0.0086    0.0204363      0.140373
mnemonic: add                                                              -0.0209991      0.0306201     -0.69    0.4929   -0.0810198      0.0390215
mnemonic: and                                                               0.000384739    0.0309979      0.01    0.9901   -0.0603766      0.0611461
mnemonic: asr                                                              -5.10902e-6     0.0385573     -0.00    0.9999   -0.0755842      0.0755739
mnemonic: b                                                                 0.00127954     0.00981586     0.13    0.8963   -0.0179612      0.0205203
mnemonic: bfc                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bfi                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bic                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bl                                                                0.0146442      0.0162697      0.90    0.3681   -0.0172472      0.0465356
mnemonic: blx                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bx                                                                0.170029       0.0430545      3.95    <1e-04    0.0856352      0.254424
mnemonic: clz                                                              -0.000283997    0.0478817     -0.01    0.9953   -0.0941403      0.0935723
mnemonic: cmn                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cmp                                                              -0.00888138     0.0309722     -0.29    0.7743   -0.0695923      0.0518295
mnemonic: cpsid                                                             0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cpsie                                                             0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: dsb                                                               0.00188294     0.0744446      0.03    0.9798   -0.144041       0.147807
mnemonic: eor                                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: ldr                                                              -0.00117276     0.0316519     -0.04    0.9704   -0.0632159      0.0608704
mnemonic: ldrb                                                              0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: ldrd                                                              0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: ldrh                                                              0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: lsl                                                               0.000285234    0.0376906      0.01    0.9940   -0.0735949      0.0741654
mnemonic: lsr                                                               0.000237562    0.0385573      0.01    0.9951   -0.0753415      0.0758166
mnemonic: mla                                                              -0.00216897     0.0368344     -0.06    0.9530   -0.0743708      0.0700328
mnemonic: mls                                                              -0.000555183    0.0328027     -0.02    0.9865   -0.0648542      0.0637438
mnemonic: mov                                                              -0.0045436      0.030818      -0.15    0.8828   -0.0649523      0.0558651
mnemonic: movt                                                              0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: movw                                                              0.0          NaN            NaN       NaN     NaN            NaN
    # ... #
APSR (s flag)                                                               4.4252e-5      0.00770131     0.01    0.9954   -0.0150516      0.0151401
Is conditional                                                             -0.0380549      0.00146575   -25.96    <1e-99   -0.0409281     -0.0351818
Dest reg == source reg                                                     -3.92217e-5     0.00770131    -0.01    0.9959   -0.0151351      0.0150567
Barrel shift amount                                                         0.0          NaN            NaN       NaN     NaN            NaN
Has barrel shift                                                            0.00963344     0.0475971      0.20    0.8396   -0.083665       0.102932
Has immediate operand                                                       0.000640066    0.0297818      0.02    0.9829   -0.0577374      0.0590175
mnemonic: adc & Binary weight                                               0.000240231    0.000629472    0.38    0.7027   -0.000993643    0.0014741
mnemonic: add & Binary weight                                               0.00212642     0.000132819   16.01    <1e-56    0.00186607     0.00238677
mnemonic: and & Binary weight                                               0.000341044    0.000568978    0.60    0.5489   -0.000774252    0.00145634
mnemonic: asr & Binary weight                                               0.00036824     0.00146006     0.25    0.8009   -0.00249372     0.0032302
mnemonic: b & Binary weight                                                -0.00237033     0.0024252     -0.98    0.3284   -0.00712414     0.00238348
mnemonic: bfc & Binary weight                                               0.000310183    0.00024293     1.28    0.2017   -0.000166002    0.000786367
mnemonic: bfi & Binary weight                                               0.000346015    0.000227824    1.52    0.1288   -0.00010056     0.000792589
mnemonic: bic & Binary weight                                               0.000266862    0.00048624     0.55    0.5831   -0.000686251    0.00121998
mnemonic: bl & Binary weight                                                0.000550701    0.00113644     0.48    0.6280   -0.00167691     0.00277831
mnemonic: blx & Binary weight                                              -0.0243293      0.00392476    -6.20    <1e-09   -0.0320225     -0.0166361
    # ... #
Barrel shift amount & Has barrel shift                                      4.45507e-6     0.00218019     0.00    0.9984   -0.0042691      0.00427801
mnemonic: add & APSR (s flag)                                               0.00210831     0.00782543     0.27    0.7876   -0.0132309      0.0174475
mnemonic: and & APSR (s flag)                                               0.000169296    0.00996732     0.02    0.9864   -0.0193684      0.019707
mnemonic: asr & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: b & APSR (s flag)                                                 0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bfc & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bfi & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bic & APSR (s flag)                                               0.000322832    0.0103932      0.03    0.9752   -0.0200497      0.0206953
mnemonic: bl & APSR (s flag)                                                0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: blx & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: bx & APSR (s flag)                                                0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: clz & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cmn & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cmp & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cpsid & APSR (s flag)                                             0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: cpsie & APSR (s flag)                                             0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: dsb & APSR (s flag)                                               0.0          NaN            NaN       NaN     NaN            NaN
mnemonic: eor & APSR (s flag)                                              -0.000587628    0.0104193     -0.06    0.9550   -0.0210112      0.019836
 # ... #

This happens because I can’t realistically train on every combination of instructions and operands. I can still estimate the contribution of some features on their own, though. For instance, if the instruction has a barrel shift, I can predict the effect regardless of which instruction it is.
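To illustrate why an unobserved combination leaves a parameter without an estimate, here is a minimal sketch that uses only the LinearAlgebra standard library, not GLM.jl. The data, the names `is_b` and `s_flag`, and the use of `pinv` are all assumptions made for illustration. The zero coefficients with NaN standard errors in the table above look consistent with the same effect, but that is a guess about what GLM.jl does internally.

```julia
using LinearAlgebra

# Hypothetical toy data: the combination (mnemonic == "b") & (S flag set)
# never occurs, so the interaction column below is all zeros.
is_b   = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
s_flag = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
X = hcat(ones(6), is_b, s_flag, is_b .* s_flag)   # intercept, main effects, interaction
y = [1.0, 1.5, 2.0, 2.1, 1.4, 1.9]

# The design matrix has 4 columns but only 3 independent ones,
# so the 4 parameters cannot all be identified from this data.
design_rank = rank(X)

# Minimum-norm least-squares solution: the coefficient of the
# unidentifiable interaction column comes out as (numerically) zero.
beta = pinv(X) * y

# Predicting an unseen combination then falls back to the additive
# main effects, because the interaction contributes nothing.
unseen = [1.0, 1.0, 1.0, 1.0]
prediction = dot(unseen, beta)
```

With a zero coefficient on the missing interaction, the prediction for an unseen combination is just the sum of the intercept and the two main effects, which is the behaviour I am hoping to get from the real model.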

This can be achieved in two ways:

1. Change the model formula so that these variables are not correlated anymore. This is the best option from a theoretical point of view, but for the instructions I did train on, I’d like to keep the extra accuracy given by the correlation.
2. Instruct the model so that it doesn’t throw errors when I try to fit an instruction, while still benefiting from the ones I did train on.

My question is: is there a way to do number 2? Is it theoretically sound? For example, can I tell the model “if you can’t use the correlated parameter, use the uncorrelated one”? The naive approach I can think of is to use multiple models in a sequence of try/catch blocks: “if this one throws an error, try the next one, which is smaller”.
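The naive fallback idea could be sketched roughly as follows. This is not GLM.jl code: plain group means stand in for the fitted models, explicit `haskey` checks replace try/catch, and the data and the names `group_means` and `predict_fallback` are hypothetical.

```julia
using Statistics

# Hypothetical toy data: (mnemonic, has_barrel_shift, measured power).
obs = [("add", true, 2.0), ("add", true, 2.2), ("add", false, 1.0),
       ("mov", false, 0.8), ("mov", false, 1.0)]

# Mean of the response over all rows that share the same key.
function group_means(keyfn, rows)
    groups = Dict{Any, Vector{Float64}}()
    for r in rows
        push!(get!(groups, keyfn(r), Float64[]), r[3])
    end
    return Dict(k => mean(v) for (k, v) in groups)
end

by_pair     = group_means(r -> (r[1], r[2]), obs)  # most specific "model"
by_mnemonic = group_means(r -> r[1], obs)          # smaller "model"
overall     = mean(r[3] for r in obs)              # last resort

# Try the most specific lookup first, then fall back to coarser ones.
function predict_fallback(mnemonic, has_shift)
    haskey(by_pair, (mnemonic, has_shift)) && return by_pair[(mnemonic, has_shift)]
    haskey(by_mnemonic, mnemonic) && return by_mnemonic[mnemonic]
    return overall
end
```

Here `predict_fallback("mov", true)` uses the per-mnemonic mean because that exact pair was never observed, and `predict_fallback("ldr", true)` falls through to the overall mean. Checking up front whether a combination was seen avoids relying on exceptions for control flow.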

I am using GLM.jl right now, which by the way feels a little undocumented…? Would another package be better for this kind of thing? I have heard of Flux.jl, MLJ.jl and maybe MLBase.jl, but it has to be able to train with categorical variables, since my “mnemonic” variable is a categorical string.

Thanks!
