Store a sum of parameterized scalar functions

Hello everyone,

I need to compute the same sum of scalar functions (constant, linear, Gaussian kernel…)
for a large number of different values. I would like to create a structure that holds the sum of these functions to speed up the computations.

I have created an abstract type ParamFcn for these scalar functions. Each function is then a subtype of ParamFcn. For instance, rbf is the subtype for a Gaussian kernel parameterized by a mean μ and standard deviation σ.

I have created a subtype of ParamFcn called AddedParamFcn that holds a tuple of ParamFcn to compute the sum of the scalar functions.

However, even for simple computations, this approach is actually slower than summing the individual functions directly (see the example below). I would expect some speed-up from this approach, since the functions can be precomputed.

Should I use a Tuple to store the different ParamFcn functions, or can I create an `Array{ParamFcn,1}` (I don’t know how to implement this second option), since I know the number of scalar functions in advance?

What is a better way to implement this?

import Base: show, zero, +

using SpecialFunctions: erf  # erf is used in the ψ₀ evaluation below

# Type to hold 1D functions
abstract type ParamFcn end

# zero(::T) where {T <: ParamFcn} = T()

struct AddedParamFcn{T <: Tuple} <: ParamFcn
    list::T
end
function show(io::IO, Σg::AddedParamFcn)
    println(io, "AddedParamFcn:")
    for g in Σg.list
        println(io, "  $g")
    end
end

"""
    g₁::ParamFcn + g₂::ParamFcn

Add the parameterized functions so that `(g₁ + g₂)(z) = g₁(z) + g₂(z)`.
"""
+(g::ParamFcn, Σg::AddedParamFcn) = AddedParamFcn((g, Σg.list...))
+(Σg::AddedParamFcn, g::ParamFcn) = AddedParamFcn((Σg.list..., g))
function +(Σg₁::AddedParamFcn, Σg₂::AddedParamFcn)
    AddedParamFcn((Σg₁.list..., Σg₂.list...))
end

+(g::ParamFcn...) = AddedParamFcn(g)

(Σg::AddedParamFcn)(z) = mapreduce(g->g(z),+, Σg.list)



struct constant <:ParamFcn
end

const Cst = constant()

(C::constant)(z::T) where {T<:Real} = 1.0


struct linear <:ParamFcn
end

const Lin = linear()

(L::linear)(z::T) where {T<:Real} = z

# Gaussian Kernel
struct rbf <: ParamFcn
    μ::Float64
    σ::Float64
end


function show(io::IO, N::rbf)
    println(io, "Gaussian kernel with mean $(N.μ) and std $(N.σ)")
end

(N::rbf)(z::T)  where {T<:Real} = 1/(N.σ*√(2*π))*exp(-(0.5/N.σ^2)*(z-N.μ)^2)


struct ψ₀ <: ParamFcn
    ξ₀::Float64
    σ₀::Float64
end


function show(io::IO, F::ψ₀)
    println(io, "ψ₀ function with mean $(F.ξ₀) and std $(F.σ₀)")
end


function (F::ψ₀)(z::T) where {T<:Real}
    @assert F.σ₀ >= 0.0 "std must be >= 0"
    Δ₀ = (1/(F.σ₀*√2))*(z-F.ξ₀)
    return 0.5*((z-F.ξ₀)*(1-erf(Δ₀))-F.σ₀*√(2/π)*exp(-Δ₀^2))
end

Speed benchmark

using BenchmarkTools  # for @btime

function time_estim()
    @btime begin
        z = 0.0
        for i = 1:1000
            z = randn()
            1.0 + z + rbf(1.0, 1.0)(z) + ψ₀(1.0, 2.0)(z)
        end
    end

    @btime U = Cst + Lin + rbf(1.0, 1.0) + ψ₀(1.0, 2.0)

    @btime begin
        U = Cst + Lin + rbf(1.0, 1.0) + ψ₀(1.0, 2.0)
        z = 0
        for i = 1:1000
            z = randn()
            U(z)
        end
    end
end

time_estim()

32.604 μs (0 allocations: 0 bytes)
0.015 ns (0 allocations: 0 bytes)
33.814 μs (0 allocations: 0 bytes)

For me, the difference between the benchmarks is below the noise level, and I don’t see any obvious larger performance improvements. That said, AFAIK:

  • A specialized method is compiled for each set of parameter types, not for values. So putting your runtime parameters ξ₀, σ₀ into a struct ψ₀ will not create an optimized version of the function (F::ψ₀)(z::T) for each set of runtime parameter values. (This can be done with the Val type or metaprogramming, if your set of parameters is not too large.)
  • Small thing: put the @assert in the type constructor instead of the call.
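For instance, here is a minimal sketch of the Val idea (the gaussian function and the value σ = 2.0 are made up for illustration, and this only pays off for a small, fixed set of parameter values):

```julia
# Sketch: lift a runtime parameter into the type domain with Val, so a
# specialized method is compiled per parameter *value*, not just per type.
gaussian(z, ::Val{σ}) where {σ} = 1 / (σ * √(2π)) * exp(-(0.5 / σ^2) * z^2)

g2(z) = gaussian(z, Val(2.0))  # σ = 2.0 is now baked into the method's type
g2(0.0)                        # ≈ 0.1995
```

Since σ is a type parameter here, the compiler can constant-fold the 1/(σ√(2π)) prefactor into the specialized method body.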

Thank you for your answer. Sorry, I have not fully understood how to improve the code for part 1. Can you give me an example?

If you want (g₁ + g₂)(z) = g₁(z) + g₂(z), why not simply use z -> g₁(z) + g₂(z)?


How can I create this new function by looping over my list of scalar functions?

Try foldl
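One way to read the foldl suggestion, as a sketch with plain anonymous functions (not the ParamFcn types from the original post):

```julia
# Hypothetical list of scalar functions to sum
fns = (z -> 1.0, z -> z, z -> exp(-z^2))

# Fold the tuple into a single closure that sums the functions pointwise
fsum = foldl((f, g) -> (z -> f(z) + g(z)), fns)

fsum(0.0)  # 1.0 + 0.0 + 1.0 = 2.0
```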


Thanks, I have included this function into the code by calling mapreduce.

As mentioned by @AndiMD, it would be great if each scalar function, or the sum of the scalar functions, could have a specialized method based on the values of the parameters and not simply on their types.
I don’t know how to do this, though.

It is not clear to me why you expect that doing this will speed up your computations. You can do this to organize your code nicely (like you do with AddedParamFcn), but there will be no speed benefit compared to an unrolled version.

The best you can hope for is probably getting the same speed, if the compiler specializes everything, which should happen if you store the functions in a Tuple (definitely not an Array, don’t do that).

This will happen automatically for the (Σg::AddedParamFcn)(z) method. This is a key feature of Julia.
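To see why the Tuple matters: a Tuple's type records the concrete type of every element, while a Vector{ParamFcn} only records the abstract element type. A sketch using the types from the original post:

```julia
tup = (Cst, Lin, rbf(1.0, 1.0))          # ::Tuple{constant, linear, rbf} — fully concrete
vec = ParamFcn[Cst, Lin, rbf(1.0, 1.0)]  # ::Vector{ParamFcn} — abstract eltype

# Summing over `tup` lets the compiler unroll and inline each call;
# summing over `vec` dispatches dynamically on every element.
sum(g -> g(0.5), tup)
sum(g -> g(0.5), vec)  # same result, but slower and allocating
```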

Incidentally,

mapreduce(g->g(z),+, Σg.list)

can be written as

mapreduce(g, +, Σg.list)

In this case, I might have misunderstood something? Please correct me:
The method (Σg::AddedParamFcn)(z) should specialize for all types of Σg, which includes all type combinations stored in the Tuple Σg.list. But there will not be an optimized method for each set of values (like Σg.list[1].σ==2.0)

As long as types are concrete or parametrized to make this possible, you get a fully specialized method automatically (apart from some corner cases I won’t go into here).

So that method will be specialized for all concrete types, including tuples with your rbf type.

You can check this with @code_warntype. I would recommend reading

https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-tips-1

it is very useful and nicely written.
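For instance, with the types from the original post, a quick check might look like:

```julia
U = Cst + Lin + rbf(1.0, 1.0)
@code_warntype U(0.5)  # the inferred types should all be concrete (no Any in red)
```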

Thank you for your explanations

@Tamas_Papp, your simplification doesn’t seem to work:

(Σg::AddedParamFcn)(z::T) where {T<:Real} = mapreduce(g,+, Σg.list)

fi = rbf(1.0, 1.0)
U = +Lin
for i=1:10
    U += fi
end

U(1.0)

UndefVarError: g not defined

Stacktrace:
[1] (::AddedParamFcn{Tuple{linear,rbf,rbf,rbf,rbf,rbf,rbf,rbf,rbf,rbf,rbf}})(::Float64) at /media/mat/HDD/TransportMap/src/function.jl:40
[2] top-level scope at In[16]:1

You are right, I overlooked the fact that you are mapping over the gs.

I am getting a speed-up of roughly 4 (see code below) by using this storage of the additive map.

function time_estim()
    @btime begin
        z = 0.0
        for i = 1:10000
            z = randn()
            z + mapreduce(i -> rbf(1, i)(z), +, 1:10)
        end
    end

    @btime begin
        U = +Lin
        for i = 1:10
            U += rbf(1.0, i)
        end
    end

    @btime begin
        U = +Lin
        for i = 1:10
            U += rbf(1.0, i)
        end
        z = 0
        for i = 1:10000
            z = randn()
            U(z)
        end
    end
end

time_estim()

4.128 ms (320001 allocations: 6.41 MiB)
245.894 ns (20 allocations: 1.33 KiB)
1.187 ms (20020 allocations: 313.83 KiB)

Are you sure your benchmarks do the same thing? Maybe use a summation variable to make sure nothing is optimized away.
Here are some variants, with a comparison to a simple function rbffct. Providing the init argument to mapreduce helps a lot.

function rbffct(z::T,μ,σ)  where {T<:Real}
    1/(σ*√(2*π))*exp(-(0.5/σ^2)*(z-μ)^2)
end


function time_estim()
    
    @btime begin
        s=0.0
        for i=1:10000
            z = randn()
            s += mapreduce(i->rbf(1, i)(z),+,1:10; init=0.0)
        end
        s
    end

    @btime begin
        s=0.0
        U = mapreduce(i->rbf(1, i),+,1:10)
        for i=1:10000
            z = randn()
            s += U(z)
        end
        s
    end
    
    @btime begin
        s=0.0
        U = +Lin
        for i=1:10
            U += rbf(1.0, i)
        end
        for i=1:10000
            z = randn()
            s += U(z)
        end
        s
    end

    @btime begin
        s=0.0
        for i=1:10000
            z = randn()
            s += mapreduce(i->rbffct(z, 1, i),+,1:10)
        end
        s
    end

    @btime begin
        s=0.0
        U(z) = mapreduce(i->rbffct(z, 1, i),+,1:10)
        for i=1:10000
            z = randn()
            s += U(z)
        end
        s
    end
    
end

time_estim()

944.986 μs (0 allocations: 0 bytes)
1.566 ms (30017 allocations: 469.98 KiB)
1.630 ms (30020 allocations: 470.08 KiB)
1.205 ms (0 allocations: 0 bytes)
1.192 ms (0 allocations: 0 bytes)