Make model matrix with raw data (Non-Real, Complex numbers and other structures) in StatsModels and possible application with Metida (spatial covariance with shortest path)

Is it possible to use Complex numbers and custom types in StatsModels as continuous variables?

So, why I made this thread? :slight_smile:

With little abuse of StatsModels.jl I made this categorical term and got row columns in model matrix:

import StatsModels: ContrastsMatrix, AbstractContrasts, modelcols

mutable struct RawCoding <: AbstractContrasts
end
function StatsModels.ContrastsMatrix(contrasts::RawCoding, levels::AbstractVector{T}) where T
    ContrastsMatrix(ones(1,1),
                             [""],
                             levels,
                             contrasts)
end
function StatsModels.modelcols(t::CategoricalTerm{RawCoding, T, N}, d::NamedTuple) where T where N
    d[t.sym]
end

Why It was helpful for me?

I’m working on Metida.jl, and I thought, that I can try to make spatial structured covariance matrix with data where observation have complicated properties (for simple example: coordinates; but it can be any user data structures). So for that I only need to make model matrix with StatsModels and method to get distance (or any metric) between two observations.

DataFrame:

using DataFrames

vy = [1.0,1,2,3,2,3,2,3,4,5,6,6,3,2]
vf = ["a","a","a","a","a","a","a","b","b","b","b","b","b","b"]
vc = [CartesianIndex(1, 1),
CartesianIndex(4, 5),
CartesianIndex(5, 10),
CartesianIndex(2, 7),
CartesianIndex(7, 7),
CartesianIndex(10, 7),
CartesianIndex(6, 15),
CartesianIndex(8, 3),
CartesianIndex(4, 8),
CartesianIndex(2, 8),
CartesianIndex(3, 6),
CartesianIndex(9, 12),
CartesianIndex(8, 11),
CartesianIndex(1, 10),]
vcx = map(x-> x[2], vc) 
vcy = map(x-> x[1], vc)
df = DataFrame(vy = vy, vf = vf, vc = vc, vcx = vcx, vcy = vcy)

where vy - dependent variable, vf - categorical regressor, vc - vector of coordinates (vcx , vcy - same but as simple vectors).

At first simple example of spatial exponential model:

using Metida
lmm = Metida.LMM(@formula(vy ~ vf), df;
    repeated = Metida.VarEffect(Metida.@covstr(vcx+vcy|1), Metida.SPEXP),
    )
Metida.fit!(lmm)

result:

Linear Mixed Model: vy ~ vf
Random 1:
   No
Repeated:
    Model: :(vcx + vcy)|1
    Type: SPEXP (2)
Blocks: 1, Maximum block size: 14
Status: converged (No Errors)
    -2 logREML: 43.3528    BIC: 48.3226

    Fixed-effects parameters:
────────────────────────────────────────────────
               Coef.  Std. Error     z  Pr(>|z|)
────────────────────────────────────────────────
(Intercept)  1.98684    0.473562  4.20    <1e-04
vf: b        2.14717    0.666787  3.22    0.0013
────────────────────────────────────────────────
    Variance components:
    θ vector: [1.25307, 0.331533]
  Residual   σ²    var     1.57018
  Residual   θ     theta   0.331533

and if we make method for Metida to calculate distance we can use column with coordinates:

function Metida.edistance(mx::AbstractMatrix{<:CartesianIndex}, i::Int, j::Int)
    return sqrt((mx[i, 1][1] - mx[j, 1][1])^2 + (mx[i, 1][2] - mx[j, 1][2])^2)
end
lmm = Metida.LMM(@formula(vy ~ vf), df;
    repeated = Metida.VarEffect(Metida.@covstr(vc|1), Metida.SPEXP; coding = Dict(:vc => Metida.RawCoding())),
    )
Metida.fit!(lmm)

result:

Linear Mixed Model: vy ~ vf
Random 1: 
   No
Repeated:
    Model: vc|1
    Type: SPEXP (2)
Blocks: 1, Maximum block size: 14
Status: converged (No Errors)
    -2 logREML: 43.3528    BIC: 48.3226

    Fixed-effects parameters:
────────────────────────────────────────────────
               Coef.  Std. Error     z  Pr(>|z|)
────────────────────────────────────────────────
(Intercept)  1.98684    0.473562  4.20    <1e-04
vf: b        2.14717    0.666787  3.22    0.0013
────────────────────────────────────────────────
    Variance components:
    θ vector: [1.25307, 0.331533]
  Residual   σ²    var     1.57018
  Residual   θ     theta   0.331533

as expected results is the same…

But if we want some complicated distance? For example shortest way? I used using AStarSearch.jl to use maze:

using AStarSearch, Plots

const UP = CartesianIndex(-1, 0)
const DOWN = CartesianIndex(1, 0)
const LEFT = CartesianIndex(0, -1)
const RIGHT = CartesianIndex(0, 1)
const DIRECTIONS = [UP, DOWN, LEFT, RIGHT]

manhattan(a::CartesianIndex, b::CartesianIndex) = sum(abs.((b - a).I))

function mazeneighbours(maze, p)
  res = CartesianIndex[]
  for d in DIRECTIONS
      n = p + d
      if 1 ≤ n[1] ≤ size(maze)[1] && 1 ≤ n[2] ≤ size(maze)[2] && !maze[n]
          push!(res, n)
      end
  end
  return res
end

function solvemaze(maze, start, goal)
  currentmazeneighbours(state) = mazeneighbours(maze, state)
  return astar(currentmazeneighbours, start, goal, heuristic=manhattan)
end

maze = [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0;
        0 1 0 0 0 1 0 0 0 0 1 0 0 0 0;
        0 1 0 0 1 0 1 0 0 0 1 0 0 1 0;
        0 0 0 1 1 0 1 0 1 0 1 0 1 0 0;
        1 0 1 0 0 0 0 0 1 0 0 1 0 0 0;
        0 0 0 0 0 1 1 1 1 0 0 0 1 0 0;
        1 1 1 0 0 1 0 0 0 0 0 0 0 1 0;
        0 0 0 0 1 0 0 0 1 0 0 1 1 1 0;
        0 0 1 1 0 0 1 1 0 1 0 0 0 0 0;
        0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] .== 1

And covariance based on A* path now:

Metida.edistance(mx::AbstractMatrix{<:CartesianIndex}, i::Int, j::Int) = begin
    res = solvemaze(maze, mx[i, 1], mx[j, 1])
    res.cost
end
lmm = Metida.LMM(@formula(vy ~ vf), df;
    repeated = Metida.VarEffect(Metida.@covstr(vc|1), Metida.SPEXP; coding = Dict(:vc => Metida.RawCoding())),
    )
Metida.fit!(lmm)

result:

Linear Mixed Model: vy ~ vf
Random 1:
   No
Repeated:
    Model: vc|1
    Type: SPEXP (2)
Blocks: 1, Maximum block size: 14
Status: converged (No Errors)
    -2 logREML: 43.3107    BIC: 48.2805

    Fixed-effects parameters:
────────────────────────────────────────────────
               Coef.  Std. Error     z  Pr(>|z|)
────────────────────────────────────────────────
(Intercept)  1.98521    0.472616  4.20    <1e-04
vf: b        2.14289    0.663757  3.23    0.0012
────────────────────────────────────────────────
    Variance components:
    θ vector: [1.25128, 0.447061]
  Residual   σ²    var     1.5657
  Residual   θ     theta   0.447061

So we have slightly different results as expected. (This minimal working example can be executed with “dev” brunch of Metida).

Inspirational materials:

Davis BJK, Curriero FC. Development and Evaluation of Geostatistical Methods for Non-Euclidean-Based Spatial Covariance Matrices. Math Geosci. 2019 Aug;51(6):767-791. doi: 10.1007/s11004-019-09791-y. Epub 2019 Mar 14. PMID: 31827631; PMCID: PMC6905632.

https://www.researchgate.net/publication/346030087_Estimation_of_Shortest_Path_Covariance_Matrices