Consider the following functions that add the elements of a vector:
slow(v) = begin
    s = v |> eltype |> zero
    for ele in v
        try
            s += ele
        catch
            rethrow()
        end
    end
    s
end
fast(v) = begin
    s = v |> eltype |> zero
    for ele in v
        s += ele
    end
    s
end
We can time these functions with:
using BenchmarkTools
const v = rand(Int64, 1_000_000)
@btime slow(v) # 14.6 milliseconds
@btime fast(v) # 155.9 microseconds
slow is almost 100 times slower even though the catch clause is never executed!
Looking around the internet, other languages like C# and Java apparently don’t have this problem (I did not check for myself, so take that with a grain of salt). Is there any optimization to be done here to fix this? Otherwise try-catch clauses become almost impossible to use in very high-performance applications that may only very rarely need to execute the slow catch clause.
For my use case the try would need to be in an inner loop. Specifically, the try is inside a function that is called while processing streams of data. In the case of an error, that function resets any state that was altered, to provide a transaction-like interface.
So I guess we should limit our discussion to the poor performance of try-catch inside loops.
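To make that pattern concrete, here is a minimal sketch (Node, state, step!, and flow_item! are all hypothetical names, not my real API): the function snapshots the state before touching it, and the catch restores the snapshot.

mutable struct Node
    state::Vector{Float64}
end

step!(node::Node, item) = push!(node.state, item)  # stand-in for the real per-item work

# Called once per item from the inner loop, so the try is entered per item.
flow_item!(node::Node, item) = begin
    snapshot = copy(node.state)  # remember the state before this item
    try
        step!(node, item)        # may throw partway through its work
    catch
        node.state = snapshot    # undo any partial changes (transaction-like)
        rethrow()
    end
end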
From your benchmark it seems that the try/catch has an overhead of 14.6 nanoseconds per try*. Are you sure that really is important in your case? I have the impression that it cannot get much faster.
*14.6 milliseconds = 14.6e6 nanoseconds; over 1,000,000 iterations that is 14.6 nanoseconds per try.
I am trying to build a DAG based workflow abstraction that is nearly as performant as native code.
Example use might look like:
# Control
control_sum_benchmark(items) = begin
    v = Vector{Float64}(undef, length(items))
    for (i, item) in enumerate(items)
        v[i] = item[1] + item[2]
    end
    v
end
# Experiment
fs = FlowSource(Tuple{Int64, Float64})
x, y = Split(fs)
s = Sum(x, y)
materialize(s)
flow_all_items!(fs, items) = begin
    for item in items
        flow!(fs, item)
    end
end
# Setup
N = 1_000_000
items1 = rand(Int64, N)
items2 = rand(Float64, N)
items = [i for i in zip(items1, items2)]
# Benchmark
@btime flow_all_items!(fs, items)
@btime control_sum_benchmark(items);
Obviously, real use cases would use nodes more complicated than a simple Sum. Great care has been taken to achieve this level of performance. The transaction-like feature was a nice addition, but it would destroy the native-performance promise (it slows this code down about 50 times).
So basically, if flow! takes about 15 ns per item anyway (possible with very large and complex workflows), then the try statement won’t matter. But otherwise all the time is wasted on setting up a try-catch, which seems unreasonably more expensive than a comparable if-else.
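For comparison, one could time an if-else guarded variant of the summation (the sentinel check below is just a stand-in condition, chosen so the compiler cannot delete the branch); I haven’t run this yet, so I won’t guess at the numbers:

guarded(v) = begin
    s = v |> eltype |> zero
    for ele in v
        ele == typemin(eltype(v)) && error("sentinel hit")  # stand-in error check
        s += ele
    end
    s
end

@btime guarded(v)  # compare against slow(v) and fast(v)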
I haven’t checked, but I’d be willing to bet that this is due to a difference in vectorization. I’d assume the try/catch doesn’t allow the loop to be vectorized, while the “naive” version does.
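One way to check that (assuming the slow/fast definitions from the first post) is to look at the generated LLVM IR for a vector block:

using InteractiveUtils
@code_llvm fast(v)  # a "vector.body" label means the loop was vectorized
@code_llvm slow(v)  # if no vector block shows up, the try blocked SIMD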
The usual recommendation, irrespective of language, is “don’t do exception management in a hot loop”. Be aware that both C# and Java are (for this purpose) static languages, so they can get a lot more guarantees out of code that doesn’t throw. Very high-performance algorithms tend to check the necessary conditions before entering the hot loop, instead of relying on a compiler optimization to do that for them.
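Applied to the original example, that would mean hoisting the try out of the hot loop, so the exception frame is set up once per call rather than once per element. A minimal sketch, which I’d expect (but haven’t verified) to recover most of the speed:

fast_guarded(v) = begin
    s = v |> eltype |> zero
    try
        for ele in v
            s += ele  # hot loop carries no per-iteration exception setup
        end
    catch
        rethrow()
    end
    s
end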
If it takes less than that, we are talking about just a few CPU operations, where alignment and vectorization come into play. In that case, as pointed out above, one would really want to avoid testing inside the loop.
Yes. Perhaps it is best to forget about performance for such a trivial example and focus on more practical use cases where workflows are larger and more complex (and where performance would remain high even with the try-catch).