Parameter inference problem with matrix inversions

TL;DR: I have a parameter inference problem that involves inferring a matrix of a linear system, but I am having trouble either avoiding the trivial solution (matrix = all zeros) or getting usable results.

I have a system akin to
\vec{X}(t) = \vec{b} + diag(\vec{\lambda}(\delta))[\vec{X}(t-\delta) - \vec{b}] + M\vec{Z}(t,\delta)

where \vec{X} is my data; \vec{b} and \vec{\lambda} are parameter vectors and M is a square parameter matrix; \vec{Z} is noise. Everything has size m = 12 (or m \times m). I’m trying to see whether I can infer the parameters by minimizing \vec{Z} across different time step lengths simultaneously, something like

\min_{\vec{b}, \vec{\lambda}, M} \sum_{i, t, \delta} Z_i(t, \delta)^2

so that, for each time point t and step length \delta, the residual vector being squared and summed is
\vec{Z}(t, \delta) = M^{-1}\{\vec{X}(t) - \vec{b} - diag(\vec{\lambda}(\delta))[\vec{X}(t-\delta) - \vec{b}]\}

I’m having trouble finding a way to get usable results out of this. I was not sure I wanted my matrix M to be positive semidefinite, because I did not necessarily want the different coordinates X_i to have symmetrical relationships with the noise coordinates Z_j. Since trying to solve directly for a matrix A = M^{-1} yields the trivial solution A_{ij} = 0, I instead tried solving the linear system inside the loss function. Although I did get a result, I am satisfied with neither how it looks nor the performance (a single gradient calculation takes about 15 s). Here’s a paraphrased version of the code:

using ForwardDiff
using LinearAlgebra
using Optimization

function myloss_systemsolve(
    parameters::AbstractVector{T},
    sample_series::AbstractMatrix;
    max_stepsize::Integer = 90,
) where {T<:Real}
    nvars, nsamples = size(sample_series)
    # Unpack the flat parameter vector: decay arguments, offsets b, then M (column-major).
    lamb_arg_vec = parameters[1:nvars]
    b_vec = parameters[nvars+1:2*nvars]
    system_matrix = reshape(parameters[2*nvars+1:end], (nvars, nvars))

    loss = zero(T)

    for stepsize = 1:max_stepsize
        lamb_vec = exp.(-stepsize * lamb_arg_vec)
        for t = (stepsize+1):nsamples
            # Recover Z(t, δ) by solving M * Z = X(t) - b - diag(λ(δ)) [X(t-δ) - b]
            residuals =
                system_matrix \ (
                    sample_series[:, t] - b_vec -
                    lamb_vec .* (sample_series[:, t-stepsize] - b_vec)
                )
            loss += sum(abs2, residuals)
        end
    end
    return loss
end


function myfit_systemsolve(
    sample_series::AbstractMatrix{<:Real};
    maxiters::Integer = 10000,
    max_stepsize::Integer = 90,
)
    nvars, nsamples = size(sample_series)

    localloss(x) = myloss_systemsolve(x, sample_series; max_stepsize = max_stepsize)

    opt_params_a = 0.001 * randn(2 * nvars)
    opt_params_a[1:nvars] = max.(1e-8, opt_params_a[1:nvars])
    opt_params_b = vec(Matrix{Float64}(LinearAlgebra.I, (nvars, nvars)))
    opt_params = vcat(opt_params_a, opt_params_b)


    lb_lamb = zeros(nvars)
    lb_b = fill(-Inf, nvars)
    lb_sys = fill(-Inf, nvars^2)
    lb = vcat(lb_lamb, lb_b, lb_sys)

    ub_lamb = fill(Inf, nvars)
    ub_b = fill(Inf, nvars)
    ub_sys = fill(Inf, nvars^2)
    ub = vcat(ub_lamb, ub_b, ub_sys)

    opt_func = OptimizationFunction((x, p) -> localloss(x), Optimization.AutoForwardDiff())
    opt_prob = OptimizationProblem{true}(opt_func, opt_params; lb = lb, ub = ub)
    opt_solution = solve(
        opt_prob,
        Optimization.LBFGS(30),
        maxiters = maxiters,
    )

    println(opt_solution.stats)

    lamb_vec = opt_solution.u[1:nvars]
    b_vec = opt_solution.u[nvars+1:2*nvars]
    system_matrix = reshape(opt_solution.u[2*nvars+1:end], (nvars, nvars))
    system_matrix .= system_matrix ./ maximum(system_matrix)

    return lamb_vec, b_vec, system_matrix
end

You can see that at the end I rescale the entire matrix M, because I am getting very large entries in M (particularly on the diagonal).

I then tried to optimize A = M^{-1} directly, adding a log-determinant regularization term to try to ensure invertibility so that the optimizer avoids the trivial solution:

function myloss_direct(
    parameters::AbstractVector{T},
    sample_series::AbstractMatrix;
    max_stepsize::Integer = 90,
) where {T<:Real}
    nvars, nsamples = size(sample_series)
    lamb_arg_vec = parameters[1:nvars]
    b_vec = parameters[nvars+1:2*nvars]
    # Here the tail of the parameter vector holds A = M^{-1} directly.
    inv_system_matrix = reshape(parameters[2*nvars+1:end], (nvars, nvars))

    # Log-determinant penalty to keep A away from singularity (and the all-zero solution).
    loss = zero(T) - 0.001 * logdet(inv_system_matrix)

    for stepsize = 1:max_stepsize
        lamb_vec = exp.(-stepsize * lamb_arg_vec)
        for t = (stepsize+1):nsamples
            residuals =
                inv_system_matrix * (
                    sample_series[:, t] - b_vec -
                    lamb_vec .* (sample_series[:, t-stepsize] - b_vec)
                )
            loss += sum(abs2, residuals)
        end
    end
    return loss
end


function myfit_direct(
    sample_series::AbstractMatrix{<:Real};
    maxiters::Integer = 10000,
    max_stepsize::Integer = 90,
)
    nvars, nsamples = size(sample_series)

    localloss(x) = myloss_direct(x, sample_series; max_stepsize = max_stepsize)

    opt_params_a = 0.001 * randn(2 * nvars)
    opt_params_a[1:nvars] = max.(1e-8, opt_params_a[1:nvars])
    opt_params_b = vec(Matrix{Float64}(LinearAlgebra.I, (nvars, nvars)))
    opt_params = vcat(opt_params_a, opt_params_b)


    lb_lamb = zeros(nvars)
    lb_b = fill(-Inf, nvars)
    lb_sys = fill(-Inf, nvars^2)
    lb = vcat(lb_lamb, lb_b, lb_sys)

    ub_lamb = fill(Inf, nvars)
    ub_b = fill(Inf, nvars)
    ub_sys = fill(Inf, nvars^2)
    ub = vcat(ub_lamb, ub_b, ub_sys)

    opt_func = OptimizationFunction((x, p) -> localloss(x), Optimization.AutoForwardDiff())
    opt_prob = OptimizationProblem{true}(opt_func, opt_params; lb = lb, ub = ub)
    opt_solution = solve(
        opt_prob,
        Optimization.LBFGS(30),
        maxiters = maxiters,
    )

    println(opt_solution.stats)

    lamb_vec = opt_solution.u[1:nvars]
    b_vec = opt_solution.u[nvars+1:2*nvars]
    system_matrix = inv(reshape(opt_solution.u[2*nvars+1:end], (nvars, nvars)))
    system_matrix .= system_matrix ./ maximum(system_matrix)

    return lamb_vec, b_vec, system_matrix
end

The problem is that the optimizer runs into negative determinants during the line search:

ERROR: DomainError with -1.0:
log was called with a negative real argument but will only return a complex result if called with a complex argument. Try log(Complex(x)).
DomainError detected in the user `f` function. This occurs when the domain of a function is violated.
For example, `log(-1.0)` is undefined because `log` of a real number is defined to only output real
numbers, but `log` of a negative number is complex valued and therefore Julia throws a DomainError
by default. Cases to be aware of include:

* `log(x)`, `sqrt(x)`, `cbrt(x)`, etc. where `x<0`
* `x^y` for `x<0` floating point `y` (example: `(-1.0)^(1/2) == im`)
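
(I suppose I could sidestep the DomainError itself with LinearAlgebra’s logabsdet, which returns log|det A| together with its sign, along the lines of

    loss = zero(T) - 0.001 * first(logabsdet(inv_system_matrix))

but the determinant could still change sign during a large line-search step, so that feels more like a patch than a fix.)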

Finally, I am now pondering whether I should just give up on allowing M to be non-symmetric and use something like the system-solve version with a Cholesky-type factorization inside it. I’m not really sure how to achieve that yet; I guess I need a function that transforms part of the parameter vector into a triangular matrix inside the loss function, something like the sketch below.
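
Roughly what I have in mind (just a sketch; tril_from_params is a placeholder name I am making up here, and I exponentiate the diagonal so the factor stays invertible):

function tril_from_params(params::AbstractVector{T}, nvars::Integer) where {T<:Real}
    # Build a lower-triangular factor L from nvars*(nvars+1)/2 parameters, filling
    # column by column; exp on the diagonal keeps L (and hence M = L*L') invertible.
    L = zeros(T, nvars, nvars)
    k = 1
    for j = 1:nvars, i = j:nvars
        L[i, j] = (i == j) ? exp(params[k]) : params[k]
        k += 1
    end
    return L
end

Inside the loss I would then build M = L * L' from this factor (giving up on asymmetry), or do the solve as two triangular solves with L and L'.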

Any thoughts would be appreciated here.

You only need to factorize once: lu(system_matrix) outside of the loop, then reuse it.

You mean like this?

    factorized = lu(system_matrix)
    for stepsize = 1:max_stepsize
        lamb_vec = exp.(-stepsize * lamb_arg_vec)
        for t = (stepsize+1):nsamples
            residuals =
                factorized \ (
                    sample_series[:, t] - b_vec -
                    lamb_vec .* (sample_series[:, t-stepsize] - b_vec)
                )
            loss += sum(abs2, residuals)
        end
    end

Surprisingly, the previous version now takes about 10 s per gradient, while this version takes about 19 s per gradient.

julia> @btime ForwardDiff.gradient(testloss, opt_params)
  9.669 s (36315808 allocations: 35.85 GiB)
168-element Vector{Float64}:
      -1.3287476712792678e7
      -1.3912004307969796e7
      -9.338172896703465e6
      -1.649469889841166e7
...

vs

julia> @btime ForwardDiff.gradient(testloss, opt_params)
  19.793 s (31776686 allocations: 30.10 GiB)
168-element Vector{Float64}:
      -1.3287476712792678e7
      -1.3912004307969796e7
      -9.338172896703465e6
      -1.649469889841166e7
...

The lu-factorized version allocates less memory but is about 2x slower. Might that be because \ is choosing a different (more suitable) factorization? Should I figure out which one it is and try to do it outside the loop, like you suggested with lu?

EDIT: This might just be because opt_params contains the identity matrix (my chosen initialization), so \ might be detecting that the system matrix is diagonal… Let me check further.
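
One way I could check (a sketch, reusing the names from above): perturb the initialization so the matrix block is dense rather than exactly the identity, and re-run the benchmark for both versions.

using BenchmarkTools  # provides @btime

perturbed_params = copy(opt_params)
perturbed_params[2*nvars+1:end] .+= 0.1 .* randn(nvars^2)  # knock the matrix block off the identity
@btime ForwardDiff.gradient(testloss, $perturbed_params)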

Putting it inside the loss function won’t help — if I understand your problem correctly, the basic issue is that your \Vert Z(M) \Vert^2 is minimized when M^{-1} is the zero matrix, i.e. when M goes to infinity. So putting it in the loss function, or similar algebraic re-arrangements, won’t improve matters — your problem is ill-posed.

In order to get a finite answer for M you will need additional constraints, e.g. \Vert M \Vert_F needs to be bounded or something like that.

I absolutely agree with the problem being ill-posed; I actually had an L2 regularization term on all the parameters to try to make it a bit less so, which in a sense included the Frobenius norm of M as well.
My intuition behind putting the system solve inside the loss function was that, if a determinant calculation or inversion is implied in some form, the algorithm would avoid letting the determinant vanish because of the NaNs from the divisions by zero. And for some reason it does seem to avoid the M_{ij}^{-1} = 0 solution even without the regularization.

With this version (with the system solve but without the L2 regularization I mentioned) I did get LBFGS to stop after a few iterations:
SciMLBase.OptimizationStats(360, 9350.245136022568, 511, 511, 0)

I’m not particularly confident that LBFGS stopping before the maximum allowed number of iterations means it converged, but plotting the residuals gives me back something that looks more stationary than the original series, for example in the time evolution of one of the coordinates of the system.

I would also like to compare this with the L2-regularized version and see what conclusions I can draw from that.

M = 0 is not the problem. The difficulty is that the optimization will want to drive M towards an arbitrarily large (nonsingular!) matrix, since that drives your loss function \Vert Z \Vert^2 to zero.

You have to formulate an optimization problem that makes sense before you worry about algorithms.

Regarding performance, reverse-mode differentiation should be much more efficient than forward mode differentiation, in principle, for a scalar-valued function (loss function) of many (> 100) parameters. (You can even work out a closed-form reverse-mode/adjoint expression for the gradient for this simple function.)
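
(For example, in the directly parametrized version each term of the loss is \Vert A r \Vert^2 with r = \vec{X}(t) - \vec{b} - diag(\vec{\lambda}(\delta))[\vec{X}(t-\delta) - \vec{b}], and the gradient of that term with respect to A is simply 2 (A r) r^T, so the gradient in A is a sum of outer products that can be accumulated in the same loops that evaluate the loss.)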

Yes, sorry, I edited the other post to have M_{ij}^{-1}=0 instead, my bad.
I did try Zygote but it is slower than ForwardDiff for this particular problem.

If your optimization is not driving M^{-1} towards zero (without regularization), then it’s not working correctly. (M^{-1} won’t reach zero in a finite number of steps, because M after any finite number of steps will be finite. But it should be going in that direction.)

That’s a good perspective. I guess my next step will be to run the optimization while monitoring the norm of M (or of its inverse) and see whether that is in fact happening. If it is, I imagine the step after that would be a regularization that keeps M bounded, like you suggested. Thanks for the insight!
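
Concretely, I am thinking of something along these lines (just a sketch, reusing the names from my fit function; it assumes the newer Optimization.jl callback signature (state, loss), and mu is a penalty weight I would still have to tune):

# Inside myloss_systemsolve, before `return loss`: penalize the Frobenius norm of M.
loss += mu * sum(abs2, system_matrix)

# In myfit_systemsolve: a callback that monitors the norm of M at every step;
# returning false tells the solver to keep going.
monitor = (state, l) -> begin
    M_current = reshape(state.u[2*nvars+1:end], (nvars, nvars))
    println("Frobenius norm of M: ", norm(M_current), "   loss: ", l)
    return false
end
opt_solution = solve(opt_prob, Optimization.LBFGS(30); maxiters = maxiters, callback = monitor)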