Large memory allocation in function within function

Hey everyone,

Apologies for the frequent questions, but I just dont quite get behind the compiler as of yet. I have the following problem: I have a custom struct and some data, and I want to call functions based on the tuple (struct, data). Something like this:

using Distributions
using BenchmarkTools

abstract type Supertype{F<:VariateForm} end
struct MyStruct{F,T} <: Supertype{F}
    feat1::Vector{T}
    feat2::Matrix{T}
    distr::Vector{Distribution{F}}
    MyStruct{F,T}(feat1, feat2, distr) where {F,T} = new(feat1, feat2, distr)
end
#Methods
MyStruct(feat1::AbstractVector{T}, feat2::AbstractMatrix{T}, distr::AbstractVector{<:Distribution{F}}) where {F,T} = MyStruct{F,T}(feat1, feat2, distr)

#Initialize object and some random data
MyObject = MyStruct([.4, .6], [0.9 0.1; 0.1 0.9], [Normal(0,1), Normal(5,1)])
obs = rand(10^6,1)

Now I define an intermediate function:

#Function1
function fun1(mystruct::Supertype, observations::AbstractArray)
    hcat(map(distribution -> logpdf.(distribution, observations), mystruct.distr)...)
end
log_evidence = fun1(MyObject, obs)

Based on the output of this function, I want to do some further calculation. I can do that either by directly using the output of this function, or by calling the function within a function (and thus keeping the (struct, data) tuple:

#Function2 - directly take input of fun1 / fun2 independent of struct in minimal example
function fun2(mystruct::Supertype, log_evidence::AbstractArray)
alpha = zeros( size(log_evidence) )
    for t in 1:size(log_evidence)[1]
        alpha[t,:] = log_evidence[t,:]
    end
end

#Function3 - Calculate fun1 within function3
function fun3(mystruct::Supertype, observations::AbstractArray)
log_evidence = fun1(mystruct, obs)
alpha = zeros( size(log_evidence) )
    for t in 1:size(log_evidence)[1]
        alpha[t,:] = log_evidence[t,:]
    end
end

Obviously, fun3 is slower, but even though the function within fun3 hardly consumes memory or time, fun3 consumes 3 times as much memory as fun2 and is 25 times slower, see:

fun1(MyObject, obs)
fun2(MyObject, log_evidence)
fun3(MyObject, obs)
@benchmark fun1(MyObject, obs)
@benchmark fun2(MyObject, log_evidence)
@benchmark fun3(MyObject, obs)

(1) What is the reason for this gigantic discrepancy? I dont quite get how so much memory can be allocated if the actual inner function does not even consume 10% of the additional memory of fun3 against fun2, nevermind the speed discrepancy. There seem to be many threads on this but I still failed to grasp it, i.e. I dont get why this should be a ā€œglobalā€ variable problem?

(2) Also, I could just use directly fun2, but would rather keep the order of input (mystruct, data) for all similar functions. I could archieve similar things if I first define ā€˜fun2ā€™ and then apply Multiple Dispatch on it like this:

#Not implemented
function fun2(MyObject, obs)
    fun2(MyObject, fun1(MyObject, obs) )
end

This gives similar performance for some reason while I can keep the input order (mystruct, data). However, I did not get my head around using Multiple dispatch for this problem, as mystruct has to be the same in both functions, and log_evidence and obs have the same type, i.e. I dont know how I would discriminate these two cases such that the compiler knows which method to use?

Thank you again and apologies for the wall of text!

Best regards,

##################################################################################
Edit: A solution would be by defining fun2 differently and then do MD, like this:

#Not implemented - fun2 independent of struct in minimal example
function fun2(distr, log_evidence::AbstractArray)
alpha = zeros( size(log_evidence) )
    for t in 1:size(log_evidence)[1]
        alpha[t,:] = log_evidence[t,:]
    end
end

function fun2(MyObject, obs)
    fun2(MyObject.distr, fun1(MyObject, obs) )
end

but I would rather avoid this as the idea is to apply ā€˜fun2ā€™ for different Types with the same observationsā€¦

This doesnā€™t appear to be a global variable problem.

Instead, this is where the @code_warntype comes in handy - if you use it on fun1 youā€™ll see that Julia is unable to infer the type of the output (itā€™s Any). With fun2 you have a function barrier that prevents the loss of performance associated with unknown types (see Performance Tips Ā· The Julia Language ) similar to your proposed solution.

2 Likes

Thank you for your answer!

I had a look at it and redefined fun3 the following way below. That way, memory allocation is roughly the one from fun1 + fun2 (expected), the computational time is still roughly 1.5 times higher than the individual times of fun1 + fun2.

function fun4(mystruct::Supertype, observations::AbstractArray)

log_evidence = fun1(MyObject, obs)::Array{Float64,2} #NEW function barrier
alpha = zeros( size(log_evidence) )

    for t in 1:size(log_evidence)[1]
        alpha[t,:] = log_evidence[t,:]
    end
end

@benchmark fun1(MyObject, obs)
@benchmark fun2(MyObject, log_evidence)
@benchmark fun4(MyObject, obs)

(1) Is there a better way to trim that down/use function barries? I tried initializing an empty array and then perform the computation already, but it didnt really improve things (I guess it just overrode the empty array when I assigned the function to the variable). Allocating an Matrix with a Matrix seems to be not working for me:

log_evidence = fun1(MyObject, obs)::Array{Float64,2}
alpha = zeros( size(log_evidence) )

typeof(log_evidence) #Matrix{Float64}
typeof(alpha) #Matrix{Float64}

#not working
fill!(alpha, log_evidence)

Do I need to initialize an empty array and then fill it rowwise with a for loop?

(2) Also, if I have a function with only 1 input that varies, and that input has the same type for all different variations, then i guess I cannot use Multiple dispatch? In this example that would be:

function fun2(mystruct::Supertype, observations::AbstractArray)
#do computations as above
end
function fun2(mystruct::Supertype, log_evidence::AbstractArray)
#do computations as above
end

What I could do is probably define a custom type for log_evidence?

Thank you, again, for your time!

Best regards,
###############################################################################

Edit:
For large T, the difference seems to vanish and the computation time is roughly the same as its individual parts. I take it that I probably should have used

function fun1(mystruct::Supertype, observations::AbstractArray)::Array{Float64,2}
    hcat(map(distribution -> logpdf.(distribution, observations), mystruct.distr)...)
end

straight so I wont need to that within other functions.

Iā€™m not on my computer, so sorry if Iā€™m way off, but this seems strange to me. Julia is column major, so doing this row-wise is not optimal and very complicated and verbose.

Canā€™t you just use

alpha = copy(log_evidence) 

Unfortunately, the rest of your problem is too hard for me to solve on a phone, so I can just pick nits :upside_down_face:

Thank you for your reply! This for loop is just a placeholder for a much much larger loop that will be the same regardless of the input. I put it in so I have a minimal example that works.

Okay, but I was guessing that you wouldnā€™t have used this example without having some misconception about memory layout and array copying. Also you appeared to be explicitly asking for how to copy an array, and whether you should do it row-wise:

Why deliberately use bad code as an example/placeholder? It seems to distract from your main point.

Here is how Iā€™d implement it

# Same speed as original but output is inferred correctly
function fun1!(output::Matrix, mystruct::Supertype, observations)
    for i in eachindex(mystruct.distr)
        output[:, i] .= logpdf.(mystruct.distr[i], vec(observations))
    end
    return output
end
fun1(mystruct::Supertype, observations) = fun1!(zeros(length(observations), length(mystruct.distr)), mystruct, observations)

# 23ms compared with 43ms original or 240ms for fun3
function fun2(mystruct::Supertype, observations)
    log_evidence = fun1(mystruct, observations)
    alpha = zero(log_evidence)
    for t in eachindex(log_evidence)
        alpha[t] = log_evidence[t]
    end
    return alpha
end

A few things to note - it appears that fun1 is actually easy to get wrong. I naively assumed that using a double for-loop over observations and distributions would be fastest but it is actually considerably slower than the broadcasted version shown (by a factor of 10!). Iā€™m not quite sure why this (is Distributions doing something strange?). The version here is the same speed as yours but the output is correctly inferred and so there is no speed loss in fun2. Note that a common style in Julia is to have lots of simple functions composed together in this way which tends to help inference.

Another thing is that your use of 1:size(log_evidence)[1] isnā€™t very performant. Better is 1:size(log_evidence, 1). (Again, @code_warntype reveals this.)

As you learn the typical style of Julia code, youā€™ll find that you are automatically writing fairly performant code without many type annotations; a sprinkling of @benchmark and @code_warntype and youā€™ll spot the remaining problems. It does take a bit of time to learn though but the rewards are good.

2 Likes

Thank you so much!

I did not manage to make the output type stable but this is a very clever trick! Also thanks for the indexing advice!