Profiler: how to understand time spent in functions

I am optimizing a bubble sort function which also returns (-1)^(number of swaps), i.e. the signature of the permutation.

using BenchmarkTools, Profile, StatProfilerHTML

function sort_signature!(state::Vector{T}) where T <: Integer
    signature = 1
    nparticles = length(state)
    for ipass in 1:nparticles
        for iswap in 1:(nparticles - ipass)
            a = state[iswap]
            b = state[iswap + 1]
            if a == b
                return 0
            elseif a > b
                signature *= -1
                state[iswap] = b
                state[iswap + 1] = a
            end
        end
    end
    return signature
end

function foo(N)
    a = rand(Int8, 8)
    for _ in 1:N
        sort_signature!(a)
    end
end

foo(10)

@profile foo(100000000)

statprofilehtml(from_c=true)

The output is:

[statprofilehtml flame graph screenshot]

The time spent in sort_signature! is mainly consumed by get_index, >, == and iterate. However, there is a large portion of time that is not accounted for. What does it mean?

Someone can probably do better than me on this, but I suppose the answer is that this is “self time.” I.e., time executing code in the body of a function that isn’t easily attributed to another function. For example, time spent on loop overhead. You could ask the same question about why nothing is shown inside getindex.

It’s not a fantastic answer. Hopefully someone else can elaborate.


A remark: I don’t think your profiling here is as good as it could be. Note that you call sort_signature! a bunch of times, but after the first time the input is already sorted and there isn’t much work to do. Notice, for example, that there are no calls to setindex! caught in your profile.

It still has to run your double loop and make some checks, but the a > b check is always false on the sorted input (so it never needs to swap entries) and the a == b check either never triggers (if the input has no duplicate entries) or always triggers at the same comparison. This also means that branch prediction will do a fantastic job here, giving somewhat optimistic results.

You’d do better to make b = similar(a) and call sort_signature!(copy!(b,a)) so that it has some work to do every time. The branch predictor may still be unusually good in this case, however, due to the short input size. If you generate a new random vector each time, or just use a longer input, the branch predictor will get knocked down to its average performance.
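
For the fresh-input option, a sketch of what I mean (rand! is from the Random stdlib; foo_fresh is just a name I made up):

using Random

function foo_fresh(N)
    a = Vector{Int8}(undef, 8)
    for _ in 1:N
        rand!(a)              # refill with fresh random values each iteration
        sort_signature!(a)
    end
end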


All this said, on a function this small I wouldn’t expect profiling to be very instructive. You’d likely do better to benchmark this function and try slight variations to make it faster. In general, unpredictable branches (e.g., a > b on unsorted values) are hard on performance. It may be faster to always write back the values (for example, using min and max to choose which goes where, and using some other way to decide whether/how to update signature); see the sketch after the next paragraph.

I also notice that signature is mostly a parity bit. You could start it at signature = 0, increment it every time you do a swap, and return iseven(signature) ? 1 : -1 (except at your return 0 exit, obviously). I don’t know whether this will be any faster, though.
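
To make the min/max and parity ideas concrete, here is a rough sketch (the name is mine, and I haven’t benchmarked it):

function sort_signature_branchless!(state::Vector{T}) where T <: Integer
    nswaps = 0
    nparticles = length(state)
    for ipass in 1:nparticles
        for iswap in 1:(nparticles - ipass)
            a = state[iswap]
            b = state[iswap + 1]
            a == b && return 0
            state[iswap] = min(a, b)       # always write back, no swap branch
            state[iswap + 1] = max(a, b)
            nswaps += (a > b)              # Bool promotes to Int: counts swaps
        end
    end
    return iseven(nswaps) ? 1 : -1
end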

I agree, except that I have no idea how to optimize the function “self time”.

Right, but copy! probably takes more time than this function itself. I probably should write a sort_signature!(b, a) method for benchmarking purposes.
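
One way to spell that (a hypothetical two-argument method where the copy happens inside the call, so a stays untouched):

function sort_signature!(b::Vector{T}, a::Vector{T}) where T <: Integer
    copy!(b, a)               # scratch copy so `a` stays unsorted
    return sort_signature!(b)
end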

One tip: marking all the getindex and setindex! calls with @inbounds (or, equivalently, marking the inner loop with @inbounds) results in a significant speedup in my tests.
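
Concretely, something like this (the same function with the inner loop marked; @inbounds removes the bounds checks, so only use it when the index arithmetic is provably in range):

function sort_signature!(state::Vector{T}) where T <: Integer
    signature = 1
    nparticles = length(state)
    for ipass in 1:nparticles
        @inbounds for iswap in 1:(nparticles - ipass)
            a = state[iswap]
            b = state[iswap + 1]
            if a == b
                return 0
            elseif a > b
                signature *= -1
                state[iswap] = b
                state[iswap + 1] = a
            end
        end
    end
    return signature
end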

Really? On my laptop

> julia foo.jl
  5.504674 seconds (100.00 M allocations: 5.960 GiB, 3.38% gc time)
> julia --check-bounds=no foo.jl
  5.313396 seconds (100.00 M allocations: 5.960 GiB, 3.21% gc time)

where foo.jl is

function sort_signature!(state::Vector{T}) where T <: Integer
    signature = 1
    nparticles = length(state)
    for ipass in 1:nparticles
        for iswap in 1:(nparticles - ipass)
            a = state[iswap]
            b = state[iswap + 1]
            if a == b
                return 0
            elseif a > b
                signature *= -1
                state[iswap] = b
                state[iswap + 1] = a
            end
        end
    end
    return signature
end

function foo(N)
    a = rand(Int8, 8)
    for _ in 1:N
        b = similar(a)
        sort_signature!(copy!(b, a))
    end
end

foo(10)

@time foo(100000000)

This should concern you. This function isn’t supposed to be allocating anything. Move b = similar(a) outside of the loop. This could be dominating the run time.

On v1.8.0 I was seeing almost a 2x speedup from @inbounds. But it’s possible on more recent versions the compiler is smart enough to see that it doesn’t need them without help from @inbounds.

Note that @time is not the best tool for benchmarking. Check out @benchmark from the BenchmarkTools package. You can have it do setup (like copy!(b,a)) outside of the actual timing so that you only measure your function call. Test functions like your foo end up timing the setup as well.
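
For example (with evals=1 so the setup re-runs before every call, since sort_signature! mutates its input):

using BenchmarkTools

a = rand(Int8, 8)
@benchmark sort_signature!(b) setup=(b = copy($a)) evals=1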

Still not much difference:

> julia foo.jl
  3.506782 seconds (2 allocations: 128 bytes)
> julia --check-bounds=no foo.jl
  3.534420 seconds (2 allocations: 128 bytes)

The function foo is

function foo(N)
    a = rand(Int8, 8)
    b = similar(a)
    for _ in 1:N
        sort_signature!(copy!(b, a))
    end
end