Profiler: how to understand time spent in functions

I am optimizing a bubble sort function which also returns (-1)^(number of swaps), i.e. the signature of the permutation.

using BenchmarkTools, Profile, StatProfilerHTML

function sort_signature!(state::Vector{T}) where T <: Integer
    signature = 1
    nparticles = length(state)
    for ipass in 1:nparticles
        for iswap in 1:(nparticles - ipass)
            a = state[iswap]
            b = state[iswap + 1]
            if a == b
                return 0
            elseif a > b
                signature *= -1
                state[iswap] = b
                state[iswap + 1] = a
            end
        end
    end
    return signature
end

function foo(N)
    a = rand(Int8, 8)
    for _ in 1:N
        sort_signature!(a)
    end
end

foo(10)

@profile foo(100000000)

statprofilehtml(from_c=true)

The output is:

[statprofilehtml flame graph screenshot]

The time spent in sort_signature! is mainly consumed by get_index, >, == and iterate. However, there is a large portion of time that is not accounted for. What does it mean?

Someone can probably do better than me on this, but I suppose the answer is that this is “self time.” I.e., time executing code in the body of a function that isn’t easily attributed to another function. For example, time spent on loop overhead. You could ask the same question about why nothing is shown inside getindex.

It’s not a fantastic answer. Hopefully someone else can elaborate.


A remark: I don’t think your profiling here is as good as it could be. Note that you call sort_signature! a bunch of times, but after the first time the input is already sorted and there isn’t much work to do. Notice, for example, that there are no calls to setindex! caught in your profile.

It still has to run your double loop and make some checks, but the a > b check is always false on the sorted input (so it never needs to swap entries) and the a == b check either never triggers (if the input has no duplicate entries) or always triggers at the same comparison. This also means that branch prediction will do a fantastic job here, giving somewhat optimistic results.

You’d do better to make b = similar(a) and call sort_signature!(copy!(b,a)) so that it has some work to do every time. The branch predictor may still be unusually good in this case, however, due to the short input size. If you generate a new random vector each time, or just use a longer input, the branch predictor will get knocked down to its average performance.
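
For the fresh-input option, a sketch of what I mean (rand! is from the Random stdlib; foo_fresh is just a name I made up):

using Random

function foo_fresh(N)
    a = Vector{Int8}(undef, 8)
    for _ in 1:N
        rand!(a)              # refill with fresh random values each iteration
        sort_signature!(a)
    end
end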


All this said, on a function this small I wouldn’t expect profiling to be very instructive. You’d likely do better to benchmark this function and try slight variations to make it faster. In general, unpredictable branches (e.g., a > b on unsorted values) are hard on performance. It may be faster to always write back the values (for example, using min and max to choose which goes where, and using some other way to decide whether/how to update signature); see the sketch after the next paragraph.

I also notice that signature is mostly a parity bit. You could start it at signature = 0, increment it every time you do a swap, and return iseven(signature) ? 1 : -1 (except at your return 0 exit, obviously). I don’t know whether this will be any faster, though.
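
To make the min/max and parity ideas concrete, here is a rough sketch (the name is mine, and I haven’t benchmarked it):

function sort_signature_branchless!(state::Vector{T}) where T <: Integer
    nswaps = 0
    nparticles = length(state)
    for ipass in 1:nparticles
        for iswap in 1:(nparticles - ipass)
            a = state[iswap]
            b = state[iswap + 1]
            a == b && return 0
            state[iswap] = min(a, b)       # always write back, no swap branch
            state[iswap + 1] = max(a, b)
            nswaps += (a > b)              # Bool promotes to Int: counts swaps
        end
    end
    return iseven(nswaps) ? 1 : -1
end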

I agree, except that I have no idea how to optimize the function “self time”.

Right, but copy! probably takes more time than this function itself. I probably should write a sort_signature!(b, a) method for benchmarking purposes.
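
One way to spell that (a hypothetical two-argument method where the copy happens inside the call, so a stays untouched):

function sort_signature!(b::Vector{T}, a::Vector{T}) where T <: Integer
    copy!(b, a)               # scratch copy so `a` stays unsorted
    return sort_signature!(b)
end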

One tip: marking all the getindex and setindex! calls with @inbounds (or, equivalently, marking the inner loop with @inbounds) results in a significant speedup in my tests.
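
Concretely, something like this (the same function with the inner loop marked; @inbounds removes the bounds checks, so only use it when the index arithmetic is provably in range):

function sort_signature!(state::Vector{T}) where T <: Integer
    signature = 1
    nparticles = length(state)
    for ipass in 1:nparticles
        @inbounds for iswap in 1:(nparticles - ipass)
            a = state[iswap]
            b = state[iswap + 1]
            if a == b
                return 0
            elseif a > b
                signature *= -1
                state[iswap] = b
                state[iswap + 1] = a
            end
        end
    end
    return signature
end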

Really? On my laptop

> julia foo.jl
  5.504674 seconds (100.00 M allocations: 5.960 GiB, 3.38% gc time)
> julia --check-bounds=no foo.jl
  5.313396 seconds (100.00 M allocations: 5.960 GiB, 3.21% gc time)

where foo.jl is

function sort_signature!(state::Vector{T}) where T <: Integer
    signature = 1
    nparticles = length(state)
    for ipass in 1:nparticles
        for iswap in 1:(nparticles - ipass)
            a = state[iswap]
            b = state[iswap + 1]
            if a == b
                return 0
            elseif a > b
                signature *= -1
                state[iswap] = b
                state[iswap + 1] = a
            end
        end
    end
    return signature
end

function foo(N)
    a = rand(Int8, 8)
    for _ in 1:N
        b = similar(a)
        sort_signature!(copy!(b, a))
    end
end

foo(10)

@time foo(100000000)

This should concern you. This function isn’t supposed to be allocating anything. Move b = similar(a) outside of the loop. This could be dominating the run time.

On v1.8.0 I was seeing almost a 2x speedup from @inbounds. But it’s possible on more recent versions the compiler is smart enough to see that it doesn’t need them without help from @inbounds.

Note that @time is not the best tool for benchmarking. Check out @benchmark from the BenchmarkTools package. You can have it do setup (like copy!(b,a)) outside of the actual timing so that you only measure your function call. Test functions like your foo end up timing the setup as well.
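
For example (with evals=1 so the setup re-runs before every call, since sort_signature! mutates its input):

using BenchmarkTools

a = rand(Int8, 8)
@benchmark sort_signature!(b) setup=(b = copy($a)) evals=1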

Still not much difference:

> julia foo.jl
  3.506782 seconds (2 allocations: 128 bytes)
> julia --check-bounds=no foo.jl
  3.534420 seconds (2 allocations: 128 bytes)

The function foo is

function foo(N)
    a = rand(Int8, 8)
    b = similar(a)
    for _ in 1:N
        sort_signature!(copy!(b, a))
    end
end