Overfitting when building kriging surrogate model

Hello,

I’m using Surrogates.jl to build a surrogate model from data in a CSV file. There are 24 input parameters and 1 output variable in my data. I have 500 samples: 80% are used for the training set and the other 20% for the test set.

using CSV
using DataFrames
using Dates
using LinearAlgebra
using Plots
using Statistics
using StatsPlots
using Surrogates
using SurrogatesPolyChaos
using Lathe.preprocess: TrainTestSplit
default()  # reset Plots defaults

df = DataFrame(CSV.File("result.csv"))
train, test = TrainTestSplit(df, 0.8)  # 80/20 train/test split
dim = 24  # number of input parameters

# Surrogates.jl expects inputs as a vector of points (here NTuple{24, Float64})
x_train = values.(eachrow(train[:, 1:dim]))
y_train = train[:, end]

x_test = values.(eachrow(test[:, 1:dim]))
y_test = test[:, end]

lower_bound = [
    200, 200, 200, 200, 200, 200, 
    200, 200, 200, 200, 200, 200,
    180, 180, 180, 180, 180, 180,
    300, 300, 300, 300, 300, 300
]
upper_bound = [
    230, 230, 230, 230, 230, 230,
    275, 275, 275, 275, 275, 275,
    200, 200, 200, 200, 200, 200,
    350, 350, 350, 350, 350, 350
]
p = 2 .* ones(dim)  # smoothness exponent per dimension (p = 2 gives a Gaussian-like kernel)
# One length-scale hyperparameter per dimension, scaled by the input spread
theta = [0.5 / max(1e-6 * norm(upper_bound .- lower_bound), std(x_i[i] for x_i in x_train))^p[i] for i in 1:dim]
mymodel = Kriging(x_train, y_train, lower_bound, upper_bound; p = p, theta = theta)

# Prediction
ys_test = mymodel.(x_test)
ys_train = mymodel.(x_train)

# Model assessment criteria
function mae(x, y)
    return sum(abs.(x - y)) / length(x)
end

function mape(x, y)
    return sum(abs.((x - y) ./ y)) / length(x)
end

function rmse(x, y)
    return sqrt(sum(((x - y).^2) / length(x)))
end

function mse(x, y)
    return sum(((x - y).^2) / length(x))
end

function r2(x,y)
    sse = sum((x - y).^2)
    sst = sum((y .- mean(y)).^2)
    return 1 - sse / sst
end

println("      ASSESSMENT CRITERIA            TRAIN              TEST")
println("      Mean Absolute Error       ", mae(ys_train, y_train), "      ", mae(ys_test, y_test))
println("Mean Absolute Percentage Error  ", mape(ys_train, y_train), "      ", mape(ys_test, y_test))
println("    Root Mean Square Error      ", rmse(ys_train, y_train), "      ", rmse(ys_test, y_test))
println("       Mean Square Error        ", mse(ys_train, y_train), "      ", mse(ys_test, y_test))
println("           R Square             ", r2(ys_train, y_train), "      ", r2(ys_test, y_test))
      ASSESSMENT CRITERIA      TRAIN              TEST
      Mean Absolute Error       0.0      8.600517533587887
Mean Absolute Percentage Error  0.0      0.0032085499707193293
    Root Mean Square Error      0.0      10.959428095549947
       Mean Square Error        0.0      120.10906418152953
           R Square             1.0      -0.018184443821864793

After computing these metrics, I found that my model is overfitting: the training error is zero and the training R² is 1 (as expected, since Kriging interpolates the training points exactly), while the test error is large and the test R² is negative. Can someone help me? Thanks a lot!

This is what hyperparameter tuning is for. Did you try different values of theta?
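
For a first pass, a simple scan over multiplicative scalings of theta already shows how sensitive the test error is. A sketch, reusing x_train, y_train, the bounds, p, theta, and rmse from the post above (the grid of scale factors is arbitrary):

# Try a few scalings of the initial theta and compare test RMSE
for s in (0.01, 0.1, 1.0, 10.0, 100.0)
    m = Kriging(x_train, y_train, lower_bound, upper_bound; p = p, theta = s .* theta)
    println("scale = ", s, "   test RMSE = ", rmse(m.(x_test), y_test))
end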

Note that there are some improvements in progress here: More Kriging improvements (log likelihood, noise variance, more tests) by archermarx · Pull Request #379 · SciML/Surrogates.jl · GitHub

Is the hyperparameter tuning method from Cross-validation · ScikitLearn.jl available for surrogate models?

using ScikitLearn.CrossValidation: cross_val_score

accuracy = cross_val_score(mymodel, x_test, y_test; cv=5)

I tried this, but it fails:

MethodError: no method matching is_classifier(::Kriging{Vector{NTuple{24, Float64}}, Vector{Float64}, Vector{Int64}, Vector{Int64}, Vector{Float64}, Vector{Float64}, Float64, Vector{Float64}, Float64, Matrix{Float64}})
Closest candidates are:
  is_classifier(!Matched::ScikitLearnBase.BaseClassifier) at ~/.julia/packages/ScikitLearnBase/QnTv8/src/ScikitLearnBase.jl:37
  is_classifier(!Matched::ScikitLearnBase.BaseRegressor) at ~/.julia/packages/ScikitLearnBase/QnTv8/src/ScikitLearnBase.jl:38
  is_classifier(!Matched::ScikitLearn.Skcore.GridSearchCV) at ~/.julia/packages/ScikitLearn/ssekP/src/grid_search.jl:566
  ...

Stacktrace:
  [1] cross_val_score(estimator::Function, X::Vector{NTuple{24, Float64}}, y::Vector{Float64}; scoring::Nothing, cv::Int64, n_jobs::Int64, verbose::Int64, fit_params::Nothing)
    @ ScikitLearn.Skcore ~/.julia/packages/ScikitLearn/ssekP/src/cross_validation.jl:277
  [2] top-level scope
    @ ~/Code_linux/caebigdata/gl-reservoirs-control/telemac/Telemac-1D_surrogates.ipynb:5
  [3] eval
    @ ./boot.jl:373 [inlined]
  [4] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1196
  [5] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
  [6] invokelatest
    @ ./essentials.jl:714 [inlined]
  [7] (::VSCodeServer.var"#164#165"{VSCodeServer.NotebookRunCellArguments, String})()
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.6.24/scripts/packages/VSCodeServer/src/serve_notebook.jl:19
  [8] withpath(f::VSCodeServer.var"#164#165"{VSCodeServer.NotebookRunCellArguments, String}, path::String)
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.6.24/scripts/packages/VSCodeServer/src/repl.jl:184
  [9] notebook_runcell_request(conn::VSCodeServer.JSONRPC.JSONRPCEndpoint{Base.PipeEndpoint, Base.PipeEndpoint}, params::VSCodeServer.NotebookRunCellArguments)
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.6.24/scripts/packages/VSCodeServer/src/serve_notebook.jl:13
 [10] dispatch_msg(x::VSCodeServer.JSONRPC.JSONRPCEndpoint{Base.PipeEndpoint, Base.PipeEndpoint}, dispatcher::VSCodeServer.JSONRPC.MsgDispatcher, msg::Dict{String, Any})
    @ VSCodeServer.JSONRPC ~/.vscode/extensions/julialang.language-julia-1.6.24/scripts/packages/JSONRPC/src/typed.jl:67
 [11] serve_notebook(pipename::String, outputchannel_logger::Base.CoreLogging.SimpleLogger; crashreporting_pipename::String)
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.6.24/scripts/packages/VSCodeServer/src/serve_notebook.jl:136
 [12] top-level scope
    @ ~/.vscode/extensions/julialang.language-julia-1.6.24/scripts/notebook/notebook.jl:32
 [13] include(mod::Module, _path::String)
    @ Base ./Base.jl:418
 [14] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:292
 [15] _start()
    @ Base ./client.jl:495
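
Presumably this fails because cross_val_score expects an estimator implementing the ScikitLearnBase interface (fit!, predict, …), and the Surrogates.jl Kriging struct is just a callable object. A hand-rolled k-fold loop sidesteps the interface entirely. A minimal sketch, where kfold_rmse is a hypothetical helper reusing rmse and the variables defined above (k = 5 folds is arbitrary):

using Random, Statistics

function kfold_rmse(x, y, lb, ub; p, theta, k = 5)
    n = length(y)
    idx = shuffle(1:n)
    folds = [idx[i:k:n] for i in 1:k]   # k disjoint validation folds
    scores = Float64[]
    for f in folds
        tr = setdiff(idx, f)            # train on everything outside the fold
        m = Kriging(x[tr], y[tr], lb, ub; p = p, theta = theta)
        push!(scores, rmse(m.(x[f]), y[f]))
    end
    return mean(scores)
end

kfold_rmse(x_train, y_train, lower_bound, upper_bound; p = p, theta = theta)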

Are there any tools in Julia to do hyperparameter tuning for surrogate models?

It’s differentiable (Making Kriging differentiable and better initial hyperparams · Issue #368 · SciML/Surrogates.jl · GitHub), so you can just define an OptimizationProblem and optimize the hyperparameters directly. Check out the examples there and the sketch below.
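
In Optimization.jl terms that would look roughly like the following. Here loss is assumed to be the negative kriging log-likelihood of the hyperparameter vector [p; theta], and AutoForwardDiff/BFGS are just one choice of AD backend and optimizer:

using Optimization, OptimizationOptimJL

# loss(u): negative log-likelihood of the hyperparameters u = [p; theta]
optf = OptimizationFunction((u, _) -> loss(u), Optimization.AutoForwardDiff())
prob = OptimizationProblem(optf, [p; theta])
sol = solve(prob, BFGS())   # sol.u holds the optimized [p; theta]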

I have checked that issue. After defining the kriging_logpdf function and a loss function, I tried to find the minimum of the loss function, which corresponds to the optimal hyperparameters. But the optimization fails: the value of the loss function is Inf.

function kriging_logpdf(params, x, y, lb, ub)
    d = length(params) ÷ 2
    p = params[1:d]
    theta = params[d+1:end]
    surr = Kriging(x, y, lb, ub; p, theta)

    n = length(y)
    y_minus_1μ = y - ones(length(y), 1) * surr.mu
    Rinv = surr.inverse_of_R

    term1 = only(-(y_minus_1μ' * surr.inverse_of_R * y_minus_1μ) / 2 / surr.sigma)
    term2 = -log((2π * surr.sigma)^(n/2) / sqrt(det(Rinv)))

    logpdf = term1 + term2
    return logpdf
end

loss_func = params -> -kriging_logpdf(params, x_train, y_train, lower_bound, upper_bound)

using Optim

result = optimize(loss_func, [p; theta], BFGS())
 * Status: failure

 * Candidate solution
    Final objective value:     Inf

 * Found with
    Algorithm:     BFGS

 * Convergence measures
    |x - x'|               = 0.00e+00 ≤ 0.0e+00
    |x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|         = NaN ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = NaN ≰ 0.0e+00
    |g(x)|                 = NaN ≰ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    0
    f(x) calls:    1
    ∇f(x) calls:   1

I don’t know what’s wrong with my function.
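
One likely culprit, judging from “Iterations: 0, f(x) calls: 1”: the objective is already Inf at the starting point. With n = 400 training samples, the factor (2π * surr.sigma)^(n/2) overflows Float64 once 2π * surr.sigma exceeds roughly 35, so term2 evaluates to -Inf before the optimizer ever takes a step; det(Rinv) of a 400×400 matrix can under- or overflow the same way. Computing the log-likelihood directly in the log domain avoids both. A sketch of a stabler variant (same Kriging fields mu, sigma, inverse_of_R as above; logabsdet replaces the explicit determinant):

using LinearAlgebra

function kriging_logpdf_stable(params, x, y, lb, ub)
    d = length(params) ÷ 2
    p = params[1:d]
    theta = params[d+1:end]
    surr = Kriging(x, y, lb, ub; p, theta)

    n = length(y)
    r = y .- surr.mu                 # residuals about the Kriging mean
    Rinv = surr.inverse_of_R

    term1 = -(r' * Rinv * r) / (2 * surr.sigma)
    # Same as -log((2π σ)^(n/2) / sqrt(det(Rinv))), but without forming the
    # overflowing powers and determinant explicitly
    term2 = -(n / 2) * log(2π * surr.sigma) + first(logabsdet(Rinv)) / 2
    return term1 + term2
end

If the stabilized loss is finite but BFGS later steps into negative theta (which makes the correlation matrix invalid), optimizing over log(theta) and exponentiating inside the loss is a common way to keep the hyperparameters positive.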

Open an issue.