Improving speed with iterative sums and functions within functions

Hi all — I’m having some problems making this bit of code run at a reasonable speed. In essence, I have functions defined within functions, where the innermost function evaluates a sum. I then end up combining functions of these forms and repeating the process, which is not particularly fast and produces a large number of allocations. Any suggestions on how to improve this without massively changing the overall structure of the code? (For those who care, the structure comes from RKHS theory and is a direct translation of the mathematical formalism, so I would really like to keep it.)

An MWE is included below, in which the resulting function d isn’t too complicated (in practice one might repeat the process again, substituting d for k_hat) and n is very small.

using BenchmarkTools
using LinearAlgebra
using Statistics  # `mean` is defined in Statistics, not LinearAlgebra

# define example data
n = 64
θs = range(0, stop = 2pi, length = n)
obs_map(θ::Float64) = [sin(θ); cos(θ)]
X_obs = reduce(hcat, obs_map.(θs))
sample_f = ones(n)
sample_eval = [0.1, 0.2]

# def of initial kernel, will always accept two vectors, return a float 
k(x::Union{Vector{Float64}, SubArray{Float64}}, y::Union{Vector{Float64}, SubArray{Float64}}) = exp(-1*norm(x - y)^2 / 0.05)

# the problem?
function make_integralop(k::Function, X::Matrix{Float64})
    colsX = eachcol(X)

    function K(f::Vector{Float64})
        function g(x)
            return mean((k(x, y) for y in colsX) .* f)
        end
        
        return g
    end
end

# example use case
K = make_integralop(k, X_obs)
q = K(sample_f)
k_hat(x,y) = k(x,y) / (q(x) * q(y))
K_hat = make_integralop(k_hat, X_obs)
d = K_hat(sample_f)

@btime d(sample_eval) 
# 276.334 μs (8906 allocations: 798.41 KiB) for n = 64
# 68.330 ms (2011010 allocations: 183.81 MiB) for n = 1000

Welcome to the forum!

I think there might be two problems here.

The broadcast (k(x, y) for y in colsX) .* f inside g allocates a new vector on every call. You can avoid this by writing it as
return mean(k(x, y)*fval for (fval, y) in zip(f, colsX))

See whether this already fixes things. If not, you are likely running into the issue with captured variables; the section of the Julia performance tips on the performance of captured variables describes the fix, sketched below.
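The pattern that section recommends is wrapping the captured variables in a let block, so the closure works with fresh local bindings. A minimal sketch of how that might look here (make_integralop_let is just an illustrative name, it folds in the zip change from above, and whether it actually helps depends on whether f and colsX are being boxed in your version):

# uses `mean` from Statistics, as in the MWE above
function make_integralop_let(k, X::Matrix{Float64})
    colsX = eachcol(X)
    function K(f::Vector{Float64})
        # `let` rebinds the captured variables locally, avoiding boxed captures
        let colsX = colsX, f = f
            g(x) = mean(k(x, y) * fval for (fval, y) in zip(f, colsX))
            return g
        end
    end
end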

Please report back whether this helps! :slight_smile:


Your single biggest culprit for allocations is the definition

k(x::Union{Vector{Float64}, SubArray{Float64}}, y::Union{Vector{Float64}, SubArray{Float64}}) = exp(-1*norm(x - y)^2 / 0.05)

The part x - y allocates a new array every time this is called. Because this array is so small, a significant amount of performance is potentially being lost here.
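A quick way to see this (rough sketch, with LinearAlgebra loaded as in the MWE; call it once first so compilation doesn’t inflate the count):

x = rand(2); y = rand(2)
norm(x - y)^2               # warm-up call, forces compilation
@allocated norm(x - y)^2    # nonzero: x - y creates a temporary vector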

You could do something else, like replace norm(x-y)^2 with something non-allocating (like sum(abs2,Broadcast.Broadcasted(-, (x,y))) ALTHOUGH I DO NOT RECOMMEND THIS PARTICULAR ONE BECAUSE IT’S USING THE INTERNAL Broadcast.Broadcasted AND MAY BREAK IN A FUTURE VERSION OF JULIA). But something that will be much more convenient here (because your vectors are small and of fixed size) is to instead change the input vectors to SVector from StaticArrays.jl.

Below I include a trivial rewrite that does this, along with some similar changes to your mean call. I’ll also remark that you had grossly over-typed your function signatures. There is no need for any type restrictions here, since you don’t use multiple dispatch, and argument annotations (virtually?) never give a performance benefit. I only loosened them somewhat (some relaxations were necessary to accept the new value types, others I did purely for style). This alone reduces allocations by a factor of 10:

using BenchmarkTools
using LinearAlgebra
using StaticArrays

# define example data
n = 64
θs = range(0, stop = 2pi, length = n)
obs_map(θ) = SVector(sincos(θ)...)
X_obs = obs_map.(θs)
sample_f = ones(n)
sample_eval = @SVector [0.1, 0.2]

# def of initial kernel, will always accept two vectors, return a float 
k(x::AbstractVector, y::AbstractVector) = exp(-1*norm(x - y)^2 / 0.05)

# the problem?
function make_integralop(k, X::AbstractVector{<:AbstractVector})

    function K(f::AbstractVector)
        function g(x)
            zXf = zip(X,f)
            return sum(((Xi,fi),) -> k(x, Xi) * fi, zXf) / length(zXf)
        end
        
        return g
    end
end

# example use case
K = make_integralop(k, X_obs)
q = K(sample_f)
k_hat(x,y) = k(x,y) / (q(x) * q(y))
K_hat = make_integralop(k_hat, X_obs)
d = K_hat(sample_f)

@btime $d($sample_eval)

You could call euclidean(x, y)^2 from Distances.jl, for example. Or just write a loop:

# compute norm(x - y)^2, without allocations
function distance2(x::AbstractArray, y::AbstractArray)
    sum = float(abs2(zero(eltype(x)) - zero(eltype(y))))
    @simd for i in eachindex(x, y)
        @inbounds sum += abs2(x[i] - y[i])
    end
    return sum
end
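
For completeness, the Distances.jl route might look like this for the kernel from the original post (assuming the package is installed; sqeuclidean(x, y) is the squared Euclidean distance, i.e. the same quantity as euclidean(x, y)^2):

using Distances

# non-allocating squared distance from Distances.jl
k(x, y) = exp(-sqeuclidean(x, y) / 0.05)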

(This, along with norm(x - y), is such a common function to compute that I wish it were built into LinearAlgebra.)

You can also use the identity norm(x-y)^2 == dot(x,x) + dot(y,y) - 2real(dot(x,y)) (up to roundoff error … it won’t calculate small values of norm(x - y) accurately due to cancellation error, but maybe you don’t care), which makes an extra pass over the arrays but exploits BLAS1 calls.
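As a sketch, that version might be written as follows (distance2_dots is just an illustrative name; same caveat about cancellation when x and y are nearly equal):

using LinearAlgebra

# norm(x - y)^2 without forming x - y; two extra passes, but built from dot calls
distance2_dots(x, y) = dot(x, x) + dot(y, y) - 2real(dot(x, y))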


Thanks all! With these improvements there’s at least a 5x speed-up (12.709 ms (10002 allocations: 187.53 KiB) for n = 1000), most of which came from changing the definition of k(x, y) to use something non-allocating.

Also, good note on over-typing my function signatures. I’ll try to remember that in the future. :slight_smile: