Hi all — I’m having some problems making this bit of code run at a reasonable speed. In essence, I have functions defined within functions, where the innermost function evaluates a sum. Then, I end up combining functions of these forms and repeating the process — which is not particularly fast and has a large amount of allocations. Any suggestions on how to improve this without massively changing the structure of the code overall? (For those who care, structure comes from RKHS theory and directly translates from the mathematical formalism, so I would really like to maintain it.)
MWE included, where the resulting function d isn’t too complicated (i.e. one would want to repeat the process substituting d for k_hat perhaps) and n very small.
using BenchmarkTools
using LinearAlgebra
# define example data
n = 64
θs = range(0, stop = 2pi, length = n)
obs_map(θ::Float64) = [sin(θ); cos(θ)]
X_obs = reduce(hcat, obs_map.(θs))
sample_f = ones(n)
sample_eval = [0.1, 0.2]
# def of initial kernel, will always accept two vectors, return a float 
k(x::Union{Vector{Float64}, SubArray{Float64}}, y::Union{Vector{Float64}, SubArray{Float64}}) = exp(-1*norm(x - y)^2 / 0.05)
# the problem?
function make_integralop(k::Function, X::Matrix{Float64})
    colsX = eachcol(X)
    function K(f::Vector{Float64})
        function g(x)
            return mean((k(x, y) for y in colsX) .* f)
        end
        
        return g
    end
end
# example use case
K = make_integralop(k, X_obs)
q = K(sample_f)
k_hat(x,y) = k(x,y) / (q(x) * q(y))
K_hat = make_integralop(k_hat, X_obs)
d = K_hat(sample_f)
@btime d(sample_eval) 
# 276.334 μs (8906 allocations: 798.41 KiB) for n = 64
# 68.330 ms (2011010 allocations: 183.81 MiB) for n = 1000
