Hi all!
I am trying to build a model that predicts electrical power from a sequence of binary computer instructions. I thought of using multiple linear regression, since there are many variables I can extract from each instruction. To give a little context:
- which opcode it is
- what is its binary weight
- if it has an immediate operand
- what is the operand
- if the destination is equal to the source
To get the best fit, I need to fit against different possible combinations of these predictor variables. My current formula is this:
julia> formula = Term(mean_power_sym) ~ ConstantTerm(1) +
(Term(:mnemonic) & (ConstantTerm(1) + Term(binary_weight_sym))) * (
Term(apsr_sym) + Term(conditional_sym) + Term(dest_source_eq_sym) +
Term(barrel_shift_sym) * Term(has_barrel_shift_sym) +
Term(has_immediate_sym)
)
FormulaTerm
Response:
Base power mean (W)(unknown)
Predictors:
1
mnemonic(unknown)
APSR (s flag)(unknown)
Is conditional(unknown)
Dest reg == source reg(unknown)
Barrel shift amount(unknown)
Has barrel shift(unknown)
Has immediate operand(unknown)
mnemonic(unknown) & Binary weight(unknown)
Barrel shift amount(unknown) & Has barrel shift(unknown)
mnemonic(unknown) & APSR (s flag)(unknown)
mnemonic(unknown) & Is conditional(unknown)
mnemonic(unknown) & Dest reg == source reg(unknown)
mnemonic(unknown) & Barrel shift amount(unknown)
mnemonic(unknown) & Has barrel shift(unknown)
mnemonic(unknown) & Has immediate operand(unknown)
mnemonic(unknown) & Barrel shift amount(unknown) & Has barrel shift(unknown)
mnemonic(unknown) & Binary weight(unknown) & APSR (s flag)(unknown)
mnemonic(unknown) & Binary weight(unknown) & Is conditional(unknown)
mnemonic(unknown) & Binary weight(unknown) & Dest reg == source reg(unknown)
mnemonic(unknown) & Binary weight(unknown) & Barrel shift amount(unknown)
mnemonic(unknown) & Binary weight(unknown) & Has barrel shift(unknown)
mnemonic(unknown) & Binary weight(unknown) & Has immediate operand(unknown)
mnemonic(unknown) & Binary weight(unknown) & Barrel shift amount(unknown) & Has barrel shift(unknown)
Even though I collected a large number of data points (around 11’000), not all the combinations are present in the dataset. I end up with lots of coefficients reported as 0.0 with NaN standard errors (here are a few excerpts):
julia> model = lm(formula, df_temp)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Base power mean (W) ~ 1 + mnemonic + APSR (s flag) + Is conditional + Dest reg == source reg + Barrel shift amount + Has barrel shift + Has immediate operand + mnemonic & Binary weight + Barrel shift amount & Has barrel shift + mnemonic & APSR (s flag) + mnemonic & Is conditional + mnemonic & Dest reg == source reg + mnemonic & Barrel shift amount + mnemonic & Has barrel shift + mnemonic & Has immediate operand + mnemonic & Barrel shift amount & Has barrel shift + mnemonic & Binary weight & APSR (s flag) + mnemonic & Binary weight & Is conditional + mnemonic & Binary weight & Dest reg == source reg + mnemonic & Binary weight & Barrel shift amount + mnemonic & Binary weight & Has barrel shift + mnemonic & Binary weight & Has immediate operand + mnemonic & Binary weight & Barrel shift amount & Has barrel shift
Coefficients:
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 0.0804049 0.0305935 2.63 0.0086 0.0204363 0.140373
mnemonic: add -0.0209991 0.0306201 -0.69 0.4929 -0.0810198 0.0390215
mnemonic: and 0.000384739 0.0309979 0.01 0.9901 -0.0603766 0.0611461
mnemonic: asr -5.10902e-6 0.0385573 -0.00 0.9999 -0.0755842 0.0755739
mnemonic: b 0.00127954 0.00981586 0.13 0.8963 -0.0179612 0.0205203
mnemonic: bfc 0.0 NaN NaN NaN NaN NaN
mnemonic: bfi 0.0 NaN NaN NaN NaN NaN
mnemonic: bic 0.0 NaN NaN NaN NaN NaN
mnemonic: bl 0.0146442 0.0162697 0.90 0.3681 -0.0172472 0.0465356
mnemonic: blx 0.0 NaN NaN NaN NaN NaN
mnemonic: bx 0.170029 0.0430545 3.95 <1e-04 0.0856352 0.254424
mnemonic: clz -0.000283997 0.0478817 -0.01 0.9953 -0.0941403 0.0935723
mnemonic: cmn 0.0 NaN NaN NaN NaN NaN
mnemonic: cmp -0.00888138 0.0309722 -0.29 0.7743 -0.0695923 0.0518295
mnemonic: cpsid 0.0 NaN NaN NaN NaN NaN
mnemonic: cpsie 0.0 NaN NaN NaN NaN NaN
mnemonic: dsb 0.00188294 0.0744446 0.03 0.9798 -0.144041 0.147807
mnemonic: eor 0.0 NaN NaN NaN NaN NaN
mnemonic: ldr -0.00117276 0.0316519 -0.04 0.9704 -0.0632159 0.0608704
mnemonic: ldrb 0.0 NaN NaN NaN NaN NaN
mnemonic: ldrd 0.0 NaN NaN NaN NaN NaN
mnemonic: ldrh 0.0 NaN NaN NaN NaN NaN
mnemonic: lsl 0.000285234 0.0376906 0.01 0.9940 -0.0735949 0.0741654
mnemonic: lsr 0.000237562 0.0385573 0.01 0.9951 -0.0753415 0.0758166
mnemonic: mla -0.00216897 0.0368344 -0.06 0.9530 -0.0743708 0.0700328
mnemonic: mls -0.000555183 0.0328027 -0.02 0.9865 -0.0648542 0.0637438
mnemonic: mov -0.0045436 0.030818 -0.15 0.8828 -0.0649523 0.0558651
mnemonic: movt 0.0 NaN NaN NaN NaN NaN
mnemonic: movw 0.0 NaN NaN NaN NaN NaN
# ... #
APSR (s flag) 4.4252e-5 0.00770131 0.01 0.9954 -0.0150516 0.0151401
Is conditional -0.0380549 0.00146575 -25.96 <1e-99 -0.0409281 -0.0351818
Dest reg == source reg -3.92217e-5 0.00770131 -0.01 0.9959 -0.0151351 0.0150567
Barrel shift amount 0.0 NaN NaN NaN NaN NaN
Has barrel shift 0.00963344 0.0475971 0.20 0.8396 -0.083665 0.102932
Has immediate operand 0.000640066 0.0297818 0.02 0.9829 -0.0577374 0.0590175
mnemonic: adc & Binary weight 0.000240231 0.000629472 0.38 0.7027 -0.000993643 0.0014741
mnemonic: add & Binary weight 0.00212642 0.000132819 16.01 <1e-56 0.00186607 0.00238677
mnemonic: and & Binary weight 0.000341044 0.000568978 0.60 0.5489 -0.000774252 0.00145634
mnemonic: asr & Binary weight 0.00036824 0.00146006 0.25 0.8009 -0.00249372 0.0032302
mnemonic: b & Binary weight -0.00237033 0.0024252 -0.98 0.3284 -0.00712414 0.00238348
mnemonic: bfc & Binary weight 0.000310183 0.00024293 1.28 0.2017 -0.000166002 0.000786367
mnemonic: bfi & Binary weight 0.000346015 0.000227824 1.52 0.1288 -0.00010056 0.000792589
mnemonic: bic & Binary weight 0.000266862 0.00048624 0.55 0.5831 -0.000686251 0.00121998
mnemonic: bl & Binary weight 0.000550701 0.00113644 0.48 0.6280 -0.00167691 0.00277831
mnemonic: blx & Binary weight -0.0243293 0.00392476 -6.20 <1e-09 -0.0320225 -0.0166361
# ... #
Barrel shift amount & Has barrel shift 4.45507e-6 0.00218019 0.00 0.9984 -0.0042691 0.00427801
mnemonic: add & APSR (s flag) 0.00210831 0.00782543 0.27 0.7876 -0.0132309 0.0174475
mnemonic: and & APSR (s flag) 0.000169296 0.00996732 0.02 0.9864 -0.0193684 0.019707
mnemonic: asr & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: b & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: bfc & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: bfi & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: bic & APSR (s flag) 0.000322832 0.0103932 0.03 0.9752 -0.0200497 0.0206953
mnemonic: bl & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: blx & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: bx & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: clz & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: cmn & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: cmp & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: cpsid & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: cpsie & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: dsb & APSR (s flag) 0.0 NaN NaN NaN NaN NaN
mnemonic: eor & APSR (s flag) -0.000587628 0.0104193 -0.06 0.9550 -0.0210112 0.019836
# ... #
This happens because I can’t realistically train on every combination of instructions and operands. I can still estimate the contribution of some features on their own, though. For instance, if an instruction has a barrel shift, I can predict the effect of the shift regardless of which instruction it is.
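To illustrate where the NaN rows come from, here is a minimal sketch on made-up data (the mnemonics, the flag values, and all variable names are hypothetical, not taken from my dataset). It uses only the LinearAlgebra standard library. When a mnemonic is never observed together with a flag, the interaction column for that pair is all zeros, so the design matrix has fewer independent columns than parameters and that coefficient cannot be estimated.

```julia
using LinearAlgebra

# Hypothetical toy data: two mnemonics and one binary flag.
# "add" is observed with and without the flag; "bx" is never observed with it.
mnemonic = ["add", "add", "add", "bx", "bx", "bx"]
flag     = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]

# Dummy column for the non-reference mnemonic level.
is_bx = Float64.(mnemonic .== "bx")

# Columns: intercept, mnemonic: bx, flag, mnemonic: bx & flag
X = hcat(ones(length(flag)), is_bx, flag, is_bx .* flag)

n_params    = size(X, 2)   # number of coefficients the formula asks for
n_estimable = rank(X)      # number of coefficients the data can identify

println("columns in the design matrix: ", n_params)
println("rank of the design matrix:    ", n_estimable)
println("interaction column all zero:  ", all(iszero, X[:, 4]))
```

The rank comes out lower than the number of columns, which is consistent with the pattern in the table above: one coefficient per unobserved combination is pinned to 0.0 and gets no standard error.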
I can see two ways to deal with the untrained combinations:
- changing the model formula so that these variables are not correlated anymore. This is the best option from a theoretical point of view, but for the instructions I did train I’d like to keep the extra accuracy given by the correlation.
- instructing the model so that it doesn’t throw errors when I try to fit an instruction, while still benefiting from the ones I did train.
My question is: is there a way to do number 2? Is it theoretically sound? For example, can I tell the model “if you can’t use the correlated parameter, use the uncorrelated one”? The naive approach I can think of is to use multiple models in a sequence of try/catch blocks: “if this throws an error, try the next one, which is smaller”.
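For concreteness, here is a rough sketch of that fallback idea on made-up data. It is not GLM.jl code: it uses plain least squares from the LinearAlgebra standard library, and every name in it (`full_row`, `reduced_row`, `predict_power`, the toy measurements) is hypothetical. Instead of try/catch, it records which (mnemonic, flag) pairs were seen during training. It uses the model with per-mnemonic flag effects for those pairs, and falls back to the smaller model with a single shared flag effect otherwise.

```julia
using LinearAlgebra

# Hypothetical toy training data: "bx" was never measured with the flag set.
train_mnemonic = ["add", "add", "add", "add", "bx", "bx"]
train_flag     = [0, 1, 0, 1, 0, 0]
train_power    = [1.0, 1.5, 1.0, 1.5, 2.0, 2.0]

mnemonic_levels = unique(train_mnemonic)

# Reduced model row: one dummy per mnemonic plus one flag effect shared by all.
reduced_row(m, f) = vcat(Float64.(mnemonic_levels .== m), Float64(f))

# Full model row: one dummy per mnemonic plus a separate flag effect per mnemonic.
full_row(m, f) = vcat(Float64.(mnemonic_levels .== m),
                      Float64.(mnemonic_levels .== m) .* f)

# Stack the row vectors into a design matrix (one observation per row).
design(rowfn, ms, fs) =
    permutedims(reduce(hcat, [rowfn(m, f) for (m, f) in zip(ms, fs)]))

X_reduced = design(reduced_row, train_mnemonic, train_flag)
X_full    = design(full_row, train_mnemonic, train_flag)

# The reduced design has full column rank, so ordinary least squares works.
β_reduced = X_reduced \ train_power
# The full design has an all-zero column ("bx" & flag), so use the
# pseudo-inverse, which returns the minimum-norm least-squares solution.
β_full = pinv(X_full) * train_power

# Combinations that actually occur in the training data.
seen = Set(zip(train_mnemonic, train_flag))

# Use the full model where it is supported by data, else the reduced one.
function predict_power(m, f)
    if (m, f) in seen
        return dot(full_row(m, f), β_full)
    else
        return dot(reduced_row(m, f), β_reduced)
    end
end

println("add with flag (trained):   ", predict_power("add", 1))
println("bx with flag (untrained):  ", predict_power("bx", 1))
```

In this sketch the untrained pair borrows the flag effect estimated from the other mnemonics, which is the behaviour I am after. A mnemonic that never appears in training at all is not handled, and whether mixing two separately fitted models like this is statistically sound is exactly what I am unsure about.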
I am using GLM.jl right now, which by the way feels a little undocumented…? Would another package be better for this kind of thing? I have heard of Flux.jl, MLJ.jl and maybe MLBase.jl, but it has to be able to train with categorical variables, since my “mnemonic” variable is a categorical string.
Thanks!