Function that executes two functions is slower than the two functions separately

So I have three types I defined as Dist, Kernel and State. I am running a function g defined as follows:

function g(x::Kernel, s::State)
    d = get_dist(x, s)
    return sample(d)
end

Where d is a Dist and g returns a State.

Now if I have a function f that does g sequentially twice for different kernels:

function f(x::Kernel, y::Kernel, s::State)
    s = g(x, s)
    s = g(y, s)
end

It runs about 3 ms slower than expected. For reference, g(x, s) takes about 7–8 ms and g(y, s) takes less than 2 ms, while f runs at about 13 ms. Is there any way to optimize further? What is the exact cause of this discrepancy?

State has two named tuples as fields, while Kernel is an abstract type with singleton subtypes. I tried not redefining the variable s in the function f. Weirdly enough, that version of the function consumed less memory (I expected it to consume more, since I’m basically defining two new states) but took longer to run.

Was the function g called before in your Julia session?

If you alter your function in the following way,

function f(x::Kernel, y::Kernel, s::State)
    @time "first g call" s = g(x, s)
    @time "second g call" s = g(y, s)
end

I think you’ll notice that the first g call also has some compilation time reported.

I suggest trying this as a test:

# do a `g` call here first, to trigger compilation
@time "compiling g" g(x, s)

function f(x::Kernel, y::Kernel, s::State)
    @time "first g call" s = g(x, s)
    @time "second g call" s = g(y, s)
end

Now, when you call f, you should see similar execution times for the two g calls.

Sorry, I guess my question wasn’t clear at all; that’s my bad. The first and second g calls actually dispatch to different methods, because x and y are different concrete types. I would expect the first call to take more time, so it’s not a compilation issue.

My problem is that the sum of the times taken by the two g calls on their own is less than the time taken by f, by an amount that’s significant for my purposes.

Can you provide a full minimal working example? And how are you measuring the execution times?

Also, are the two different methods of g type stable? That is, can the output of the first line in the function f be inferred from the input types? (What does @code_warntype f(x, y, s) output?)

I don’t think some dynamic dispatch would account for more than a µs of runtime cost, but it would be good to know anyway.


In fact, it was my bad: after a closer reading, I see that this is a “whole larger than the sum of its parts” issue.

Please check @Sevi’s post. That is the way 🙂


A minimal working example might be too complicated to produce. But I was measuring execution times using @benchmark (i.e. @benchmark f($x, $y, $s)). @code_warntype gives me all blue, so I think I’m good in that regard.

I think I have an idea of what is going on, though. Calling g(y, s) changes parts of s that g(x, s) depends on but wouldn’t change itself. So I’m guessing there might be some optimization that happens when just running g(x, s) on its own. Just a guess, though.

That does indeed sound strange…

When you benchmark the individual runs of g, are you using the same s?


Full problem too big for cache, but subproblems fit?


Yes, I was using the same s, whereas in f the s would randomly change after the first call to g.

Is there any way to check this? s is about 18 MB in memory.

Then this might be why the timings are different. Depending on how s is changed internally, the function just might need longer to execute.

If you benchmark exactly the same calls that happen inside f, i.e. using the modified s to benchmark the second call to g, do you still get different timings?
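Something like this (a sketch assuming your definitions of g, x, y and s are in scope):

```julia
using BenchmarkTools

# State after the first call, exactly as it is inside `f`:
s2 = g(x, s)

# Benchmark the second call on that modified state, not the original `s`:
@benchmark g($y, $s2)
```

If this benchmark matches the second `@time` inside f, then the extra cost really does come from the first call mutating state the second call depends on.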

Actually never mind. Even when I use the same s in f without altering its value, I get the same run times, so I don’t think it has to do with any efficiency lost in s changing.

18 MB makes this seem likely. 18 MB is in the ballpark of typical L3 cache sizes.

You can try a smaller s.
If you’re on Linux, you can use LinuxPerf.jl to see how the cache misses change between evaluating components separately, versus together.
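For example, something along these lines (a sketch; the exact set of perf events to request is up to you, and the package works on Linux only):

```julia
using LinuxPerf

# Count cache references/misses for the separate calls...
@pstats "cache-references,cache-misses" g(x, s)
@pstats "cache-references,cache-misses" g(y, s)

# ...and for the combined call, then compare the miss rates:
@pstats "cache-references,cache-misses" f(x, y, s)
```

If the miss rate for f is noticeably higher than for the individual calls, that would point at the working set of the combined computation no longer fitting in cache.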

You could try blocking your algorithm, if that’s at all possible.
That is, break s into chunks that fit in cache, and operate on these chunks.
