How to embed expert knowledge/constraints in ML training?

I want to train a model to predict the timber volume growth in a forest as a function of pedological/site variables and climatic data.

So, I have something like this:

ip_point     X      Y    vol   vol_growth   sand_pc   soil_depth   precipitation   temp
       1  92.1   12.3    100           10       0.2           50             200     15
       2  91.4   15.2    121            9       0.5           60             150     16
       3  90.2   14.1    110            8       0.3           55             180     14
       4  89.8   13.5     95            7       0.4           52             170     15

Of course, I could “simply” train to predict vol_growth as a function of the other variables (including vol), but I want to constrain the growth to follow a logistic function, i.e. to follow this equation (in the growth/state space):

\dot V = aV - (a/b) V^2

And I want to predict the two parameters a and b, but of course at each point I have a single observation. So my question is: how do I train an ML model while making sure that, with respect to the vol variable, it follows a logistic function? Add a penalty term to the loss? Fit an unrestricted model and then fit its predictions (e.g. with LsqFit) to the logistic model?
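
For reference, the same equation in the standard logistic form, which makes the roles of the two parameters explicit (a is the intrinsic growth rate, b the asymptotic volume, i.e. the carrying capacity), together with its textbook closed-form solution:

\dot V = a V \left(1 - \frac{V}{b}\right), \qquad V(t) = \frac{b}{1 + \left(\frac{b}{V_0} - 1\right) e^{-a t}}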

Is vol_growth here a measurement of \dot V?

yes, it is \dot V (in my real dataset, I have V in two observations 5 years apart)

In that case, standard linear least squares would be applicable:

A = [V V.^2]   # regressor matrix with columns V and V²
y = V_dot      # observed growth rates
x = A\y        # linear least-squares solution
a = x[1]       # coefficient of V
b = -a/x[2]    # since the coefficient of V² is -a/b

This minimizes
\sum_i \big(\dot V_i - (a V_i - \tfrac{a}{b} V_i^2)\big)^2

This implicitly makes the assumption that there are only measurement errors present in \dot V; if you expect significant errors also in the measurement of V, you may want to consider an errors-in-variables approach.
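
For completeness, one standard errors-in-variables method is total least squares; a minimal sketch on dummy V and V_dot (all it needs):

using LinearAlgebra
V     = 20 .+ 10 .* rand(100)                            # dummy volumes
V_dot = 0.5 .* V .- 0.005 .* V.^2 .+ 0.1 .* randn(100)   # dummy noisy growth
M = [V V.^2 V_dot]            # augmented data matrix [A y]
F = svd(M)
n = F.V[:, end]               # right singular vector of the smallest singular value
x = -n[1:2] ./ n[3]           # TLS solution of [V V.^2] * x ≈ V_dot
a = x[1]
b = -a / x[2]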


I am not sure I understood, but this fits \dot V to the function; I can’t see the relation to the other variables... My idea is that the growth is influenced by the other variables (soil, climate, …).

Ah I see, in that case, you could make a and b functions of these predictors, a = a(p_c, d, p) or something like that.
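
For example, a minimal sketch of that idea with LsqFit, assuming (purely for illustration) that a and b are linear in three dummy covariates:

using LsqFit

N    = 200
X    = rand(N, 3)                          # dummy covariates (e.g. sand_pc, soil_depth, precipitation)
V    = 20 .+ 10 .* rand(N)                 # current volumes
a_tr = 0.5 .+ X * [0.1, 0.2, 0.1]          # "true" a(covariates)
b_tr = 100 .+ X * [10.0, 5.0, 8.0]         # "true" b(covariates)
Vdot = a_tr .* V .- (a_tr ./ b_tr) .* V.^2 .+ 0.01 .* randn(N)

# a and b are linear functions of the covariates; p packs their coefficients
function growth_model(XV, p)
    Xc, Vc = XV[:, 1:end-1], XV[:, end]
    a = Xc * p[1:3] .+ p[4]
    b = Xc * p[5:7] .+ p[8]
    return a .* Vc .- (a ./ b) .* Vc .^ 2
end

p0  = [0.1, 0.1, 0.1, 0.5, 1.0, 1.0, 1.0, 50.0]   # rough starting values
res = curve_fit(growth_model, hcat(X, V), Vdot, p0)
coef(res)                                          # estimated coefficients of a(·) and b(·)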

The problem feels a bit underspecified though, considering that you have no evolution over time in your dataset. I could imagine this approach working well if you had several measurements from each site, but with data like this I’m not sure.


I don’t know what kind of “ML” you are interested in, but constrained splines might help you. Constrained P-splines have nice properties and have been around for a while (and can handle monotonicity and other shape constraints). You can check out this book on P-splines, which has a large amount of nicely written R code that is extremely simple to translate to Julia: https://psplines.bitbucket.io/

Simon Wood (of mgcv fame) has a paper on embedding shape-constrained splines in generalized additive models here: “Shape constrained additive models”, Statistics and Computing, with an accompanying R package.


In my experience (which is admittedly getting pretty ancient at this point), the best way to embed expert knowledge into simple (and frequently small) tabular datasets like these is through feature engineering. That’s exactly what @baggepinnen is doing by regressing against both V and V.^2: instead of needing a complicated non-linear model to learn that such a nonlinearity is important, you just use a linear model with the non-linearity you expect to be relevant as an input.

But time is indeed a major missing variable. You typically want to predict growth over the next X months, and in that case you will also want to feed your model windowed historical data, e.g. precipitation totals over the past month, over the past three months, over the past year, etc. Again, those kinds of engineered features are things that can be hard for a complicated model to “learn” from the basic dataset, but once you’ve engineered them, simple models like decision trees can give great performance.
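
For instance, a minimal sketch of building such windowed features from a monthly precipitation series (hypothetical variable names, plain Julia):

precip = 100 .* rand(120)                          # 10 years of monthly precipitation totals (dummy)

# total precipitation over the last w months up to and including month t
window_sum(x, t, w) = sum(@view x[max(1, t-w+1):t])

features = [(p1  = window_sum(precip, t, 1),
             p3  = window_sum(precip, t, 3),
             p12 = window_sum(precip, t, 12)) for t in 12:length(precip)]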


I am attempting to “solve” the issue (for now on dummy data) by creating a multi-branch neural network, where one branch predicts a and b based on soil, precipitation (at time t, t-1, etc.) and another branch carries the volumes across the network. Finally, a terminal layer embeds the structure that I want to maintain (logistic with respect to the current volume), computed as a function of (a, b, v0).
This way I avoid estimating a and b separately, as they are unobserved.

This is the code on dummy data:

using BetaML, Plots

# Dummy data creation....
N = 4000
x = rand(1:0.001:2,N,5)
e = rand(N)/1000
a = @. x[:,1]^2 - 2 * x[:,2]^2 + 3 * x[:,3]^2 - 4 * x[:,4]^1 + 2 * x[:,5] + e + 10
b = @. (2 *x[:,1]^2 + 4 * x[:,2]^2 + 5 * x[:,3] + 4 * x[:,4]^1 + 2 * x[:,5] + e *10) *100
v0 = rand(17:0.001:22,N)
xv = hcat(x,v0)
dv = @. a * v0 - (a/b) * v0^2


# Method 1: direct estimation of dv...
layers = [DenseLayer(6,10,f=relu),
          DenseLayer(10,10,f=relu),
          DenseLayer(10,1,f=identity)]
m = NeuralNetworkEstimator(epochs=200, batch_size=4, layers=layers)
dv_est = fit!(m, xv, dv)

hcat(dv,dv_est)
mse(dv,dv_est) # 51.61

# Method 2: Structure estimation...
# I create a NN with a branch for the parameters a and b and a separate branch for the volume, and then I create a (weightless) function layer that computes the growth from the two parameters and the current volume.
dvcomp(x) = [x[1] * x[3] - (x[1] / x[2]) * x[3]^2] # x is [a, b, V] - output must be a vector
l1_ab  = DenseLayer(5,10,f=relu)
l1_v   = ReplicatorLayer(1)
l1     = GroupedLayer([l1_ab,l1_v])
l2_ab  = DenseLayer(10,10,f=relu)
l2_v   = ReplicatorLayer(1)
l2     = GroupedLayer([l2_ab,l2_v])
l3_ab  = DenseLayer(10,2,f=identity)
l3_v   = ReplicatorLayer(1)
l3     = GroupedLayer([l3_ab,l3_v])
l4     = VectorFunctionLayer(3,f=dvcomp)

layers=[l1,l2,l3,l4]
m = NeuralNetworkEstimator(epochs=200, batch_size=4, layers=layers)
dv_est = fit!(m, xv, dv)

hcat(dv,dv_est)
mse(dv,dv_est) # 5.23
plot(dv,dv_est, label=nothing,xlabel="obs",ylabel="est")

The predictions (on dummy data) seem very good:

However, I may be doing something very dumb, as the model structure seems not to be respected... going to look into this...


I mean, it looks like you’re training models with hundreds of parameters… and then testing against the training data? With 4000 data points? That’s a lot of parameters for not much data; surely there’s a goodly amount of overfitting going on.


Time for splines? :slightly_smiling_face:


Sure, but I wasn’t considering that in this first attempt... In my application I have hundreds of thousands of data points, and I’ll do a train/test split and cross-validate to control for overfitting…
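
For the record, a minimal held-out evaluation sketch along those lines, reusing N, xv, dv and layers from the script below and plain index shuffling rather than any specific splitting utility:

using Random
idx    = shuffle(1:N)
ntrain = round(Int, 0.8*N)
itrain, itest = idx[1:ntrain], idx[ntrain+1:end]

m_ho   = NeuralNetworkEstimator(epochs=100, batch_size=8, layers=layers)
fit!(m_ho, xv[itrain, :], dv[itrain])
dv_hat = predict(m_ho, xv[itest, :])
mse(dv[itest], dv_hat)        # out-of-sample error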

The only problem I have now is that when I use the “structured” version the network is very hard to train, in the sense that in half of the attempts (depending on the initial values of the weights), with the standard gradient-based optimisers (SGD or ADAM), the loss gets stuck in some local minimum...
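
One generic workaround for that (not from this thread, just a sketch) is a multi-restart loop: rebuild and retrain the structured network a few times and keep the run with the lowest training loss. Here build_structured_layers() is a hypothetical helper wrapping the l1..l4 construction from the revised code below, so that each attempt starts from a fresh random weight initialisation:

function best_of_restarts(xv, dv; attempts=5)
    best_m, best_loss = nothing, Inf
    for _ in 1:attempts
        # fresh layers => fresh random initial weights on every attempt
        m_try  = NeuralNetworkEstimator(epochs=100, batch_size=8, layers=build_structured_layers())
        dv_try = fit!(m_try, xv, dv)
        l      = mse(dv, dv_try)
        if l < best_loss
            best_m, best_loss = m_try, l
        end
    end
    return best_m, best_loss
end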

Revised code
using Pkg
Pkg.activate(@__DIR__)
using Revise

using Random, Pipe, BetaML, Plots

Random.seed!(200) #100

# Data creation....
N = 5000
x = rand(1.5:0.001:2,N,5)
e = rand(N)/1000
a = @. (x[:,1]^2 - 2 * x[:,2]^2 + 3 * x[:,3]^2 - 4 * x[:,4]^1 + 2 * x[:,5] + e + 10) / 100
b = @. (2 *x[:,1]^2 + 4 * x[:,2]^2 + 5 * x[:,3] + 4 * x[:,4]^1 + 2 * x[:,5] + e + 10) * 10
v0 = rand(20:0.001:minimum(b),N)
xv = hcat(x,v0)
xvvsq = hcat(x, v0, v0.^2)
dv = @. a * v0 - (a/b) * v0^2
scatter(v0, dv)

# Method 1: direct estimation of dv...
layers = [DenseLayer(6,10,f=relu),
          DenseLayer(10,10,f=relu),
          DenseLayer(10,1,f=relu)]
m = NeuralNetworkEstimator(epochs=100, batch_size=4, layers=layers)
dv_est = fit!(m, fit!(Scaler(),xv), dv)

hcat(dv,dv_est)
mse(dv,dv_est) # 51.61
scatter(dv,dv_est, label=nothing,xlabel="obs",ylabel="est")

# Method 2: Structure estimation...
# I create a NN with a branch for the parameters a and b and a separate branch for the volume, and then I create a (weightless) function layer that computes the growth from the two parameters and the current volume.
dvcomp(x) = [x[1] * x[3] - (0.001*x[2]) * x[3]^2] # x is [a, c, V], with a/b = 0.001*c - output must be a vector
l1_ab  = DenseLayer(5,5,f=relu)
l1_v   = ReplicatorLayer(1)
l1     = GroupedLayer([l1_ab,l1_v])
l2_ab  = DenseLayer(5,5,f=relu)
l2_v   = ReplicatorLayer(1)
l2     = GroupedLayer([l2_ab,l2_v])
l3_ab  = DenseLayer(5,2,f=relu)
l3_v   = ReplicatorLayer(1)
l3     = GroupedLayer([l3_ab,l3_v])
l4     = VectorFunctionLayer(3,f=dvcomp)

layers=[l1,l2,l3,l4]
m = NeuralNetworkEstimator(epochs=100, batch_size=8, layers=layers)
dv_est = fit!(m, xv, dv)

hcat(dv,dv_est)
mse(dv,dv_est) # 5.23

scatter(dv,dv_est, label=nothing,xlabel="obs",ylabel="est")

function getabv(m,x,i)
    r = x[i,:]
    comp_layers = parameters(m).nnstruct.layers
    xi_last = @pipe forward(comp_layers[1],r) |> forward(comp_layers[2],_) |> forward(comp_layers[3],_)
    a = xi_last[1]
    b = (a/xi_last[2])/0.001  # invert the 0.001 scaling used in dvcomp
    v = xi_last[3]
    return (a,b,v)
end

# Checking logistic model is enforced...
v = 0:500
j = rand(1:N)
xv_check = transpose(hcat([vcat(x[j,:],v[i]) for i in 1:length(v)]...))
(ae,be,vj) = getabv(m,xv,j)   # estimated a, b and the (replicated) volume for observation j
(a[j],b[j])                   # compare with the "true" values
dv_est = predict(m, xv_check)
plot(v,dv_est, label=nothing, xlabel="vol", ylabel="dvol", title="NN growth predictions on a single plot") # This should give me a bell shape curve

The neural network predictions on a single pixel now follow the logistic growth:
