Try-catch statement really slow even if catch is never executed

Consider the following functions that add the elements of a vector:

slow(v) = begin
    s = v |> eltype |> zero
    for ele in v
        try 
            s += ele
        catch
            rethrow()
        end
    end
    s
end

fast(v) = begin
    s = v |> eltype |> zero
    for ele in v
        s += ele
    end
    s
end

We can time these functions with:

using BenchmarkTools

const v = rand(Int64, 1_000_000)

@btime slow(v) # 14.6 milliseconds
@btime fast(v) # 155.9 microseconds

slow is almost 100 times slower even though the catch clause is never executed!

Looking around on the internet, other languages like C# and Java apparently don’t have this problem (I did not check for myself, so take that with a grain of salt). Is there any optimization to be done here to fix this? Otherwise try-catch clauses become almost impossible to use in very high-performance applications that may very rarely need to execute the slow catch clause.

I wonder if this is a regression?

The equivalent of that example is putting the try outside the loop:

slow(v) = begin
    s = v |> eltype |> zero
    try
        for ele in v
            s += ele
        end
    catch
        rethrow()
    end
    s
end

For my use case the try would need to be in an inner loop. Actually, the try is inside a function that is called while processing streams of data. In case of an error, that function resets any state that was altered, providing a transaction-like interface.
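
Roughly what I mean, as a minimal sketch (RunningSum, transform, and flow_one! are made-up names for illustration, not my actual code):

# Sketch of the transaction-like pattern: snapshot, mutate, roll back on error.
mutable struct RunningSum
    total::Float64
end

transform(x::Real) = 2.0 * x
transform(x) = error("unsupported item type: $(typeof(x))")

flow_one!(state::RunningSum, x) = begin
    old = state.total                  # snapshot the state we are about to mutate
    try
        state.total += x               # first mutation
        state.total += transform(x)    # may throw after the state was already touched
    catch
        state.total = old              # roll back so the caller sees a consistent state
        rethrow()
    end
    state.total
end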

So I guess we should limit our discussion to the poor performance of try-catch inside loops.

From your benchmark it seems that the try/catch has an overhead of about 14.6 nanoseconds per try*. Are you sure that really is important in your case? I have the impression that that cannot get much faster.

  • 14.6 milliseconds = 14.6e6 nanoseconds, spread over 1,000,000 iterations -> 14.6 nanoseconds per try.
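
Spelled out, using the numbers from the first post:

# Per-iteration cost from the timings above:
14.6e-3 / 1_000_000 * 1e9                 # 14.6 ns per iteration with try/catch
(14.6e-3 - 155.9e-6) / 1_000_000 * 1e9    # ≈ 14.4 ns of that is try/catch overhead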

I am trying to build a DAG-based workflow abstraction that is nearly as performant as native code.

Example use might look like:

# Control
control_sum_benchmark(items) = begin
    v = Vector{Float64}(undef, length(items))
    for (i, item) in enumerate(items)
        v[i] = item[1] + item[2]
    end
    v
end


# Experiment
fs = FlowSource(Tuple{Int64, Float64})
x, y = Split(fs)
s = Sum(x, y)
materialize(s)

flow_all_items!(fs, items) = begin
    for item in items
        flow!(fs, item)
    end
end

# Setup
N = 1_000_000
items1 = rand(Int64, N)
items2 = rand(Float64, N)
items = [i for i in zip(items1, items2)]

# Benchmark
@btime flow_all_items!(fs, items) 
@btime control_sum_benchmark(items); 

Obviously, real use cases would use nodes more complicated than a simple Sum. Great care has been taken to achieve such performance. The transaction-like feature was a nice addition, but it would destroy the native-performance promise (it slows this code down by about 50 times).

So basically, if flow! already takes about 15 ns per item anyway (possible with very large and complex workflows), then the try statement won’t matter. But otherwise all the time is wasted on setting up the try-catch, which seems unreasonably more expensive than a comparable if-else.
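
For comparison, a per-item branch can be measured with the same setup as the first post (a sketch; branchy and its guard condition are only illustrative):

# Same accumulation, guarded by a cheap per-item branch instead of try/catch.
branchy(v) = begin
    s = v |> eltype |> zero
    for ele in v
        ele == typemin(eltype(v)) && error("sentinel value hit")  # illustrative guard
        s += ele
    end
    s
end

@btime branchy(v)   # compare against slow(v) and fast(v) above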

There seems to be a significant overhead for catch statements in C++ as well, according to this discussion: https://pspdfkit.com/blog/2020/performance-overhead-of-exceptions-in-cpp/

I guess that, to overcome this, one has to write custom error-handling code that uses as much information as possible from the type of data at hand.


I haven’t checked, but I’d be willing to bet that this is due to a difference in vectorization. I’d assume the try/catch doesn’t allow the loop to be vectorized, while the “naive” version does.
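
One way to check that guess, assuming the definitions from the first post are still loaded:

using InteractiveUtils   # for @code_llvm; loaded automatically in the REPL

@code_llvm debuginfo=:none fast(v)   # should show <N x i64> vector adds if the loop was SIMD-vectorized
@code_llvm debuginfo=:none slow(v)   # scalar adds plus exception-frame bookkeeping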

The usual recommendation, irrespective of language, is “don’t do exception management in a hot loop”. Be aware that both C# and Java are (for this purpose) static languages, so they can get a lot more guarantees out of code that doesn’t throw. Very high-performance algorithms tend to check the necessary conditions before entering the hot loop, instead of relying on a compiler optimization to do that for them.
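
In Julia terms that advice might look something like this (a sketch; safe_sum and the specific checks are only illustrative):

# Validate up front, keep the hot loop free of try/catch.
safe_sum(v) = begin
    eltype(v) <: Number || throw(ArgumentError("expected numeric elements"))
    s = v |> eltype |> zero
    @inbounds @simd for i in eachindex(v)   # hot loop stays branch- and try-free
        s += v[i]
    end
    s
end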


If it takes less than that, we are talking about a few CPU operations, where alignment and vectorization come into play. In that case, as pointed out above, one would really want to avoid testing inside the loop.

Yes. Perhaps it is just best to forget about performance for such a trivial example and focus on more practical use cases where workflows would be larger and more complex (and where performance would remain high even with the try-catch).
