Another solution would be to add an extremely small value to the argument of the `log`, so that it never goes to `-Inf`. The values of `x` for which this makes a difference are so small that I would expect the overall result to remain unchanged (as long as it is not incredibly small itself).
```julia
function loop2(X)
    acc = zero(eltype(X))
    e = nextfloat(acc)
    @simd for x in X
        acc += x*(1+log(x+e))
    end
    acc
end
```
Let us benchmark this (see below for the complete copy-pastable code):
```julia
using BenchmarkTools

# array size
N = 1_000_000

# fraction of zero values in the array
frac = 0.1

# generate a random array with approximately the required fraction of zeros
X = rand(N);
Nz = round(Int, frac*N)
idx = rand(1:N, Nz);
X[idx] .= 0.0;

# no handling of zeros -> NaN
@btime loop0($X)

# version with a test
@btime loop1($X)

# the version above
@btime loop2($X)
```
which gives, for a large fraction of zeros (10%):
```
julia> @btime loop0($X)
  9.627 ms (0 allocations: 0 bytes)
NaN

julia> @btime loop1($X)
  10.464 ms (0 allocations: 0 bytes)
225700.31035719207

julia> @btime loop2($X)
  16.853 ms (0 allocations: 0 bytes)
225700.31035719207
```
and for a smaller fraction of zeros (1%):
```
julia> @btime loop0($X)
  9.514 ms (0 allocations: 0 bytes)
NaN

julia> @btime loop1($X)
  10.643 ms (0 allocations: 0 bytes)
247655.2276365137

julia> @btime loop2($X)
  10.916 ms (0 allocations: 0 bytes)
247655.2276365137
```
My interpretation:

- performing tests has a cost, which is greatly reduced by the branch-prediction features of the CPU (one could actually expect benchmarked times to be slightly underestimated because of this; it might be a good idea to confirm this benchmark on real data in a real context);
- however, taking the short-circuit branch saves the costly computation of a logarithm, whereas adding a small value to `x` means that the `log` is always computed, even when its value is useless;
- handling subnormal floating-point values comes with an associated cost, which might be why the benchmarked time of `loop2` increases with the fraction of zero values. Increasing the "epsilon" value (e.g. replacing `nextfloat(0.0)` with something like `1e-100`) helps improve performance, at the cost of a potential change in your results.
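The subnormal effect mentioned above is easy to check from the REPL; a quick sketch (values for `Float64`):

```julia
e = nextfloat(0.0)       # 5.0e-324, the smallest positive Float64
issubnormal(e)           # true: operations on such values can be slow
isfinite(log(e))         # true: log(e) ≈ -744.44, so no -Inf anymore

# A larger epsilon such as 1e-100 is still absorbed by rounding for any
# realistically-sized x, so it should not change the accumulated result:
0.5 + 1e-100 == 0.5      # true: 1e-100 is far below eps(0.5) ≈ 1.1e-16
```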
In any case, I would think that you have much to gain from vectorizing this loop, which (1) is not trivial because of the `log`, and (2) becomes even harder if you keep the version with tests.

If this is a very hot loop and you don't have too many zero values in your input, I would suggest using a SIMD-friendly math library such as SLEEF to vectorize your loop. There are various Julia projects helping you to do this, which you can find on GitHub (maybe @Elrod would like to add more details). If you go down this road, having a loop body without a test will probably help.
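To illustrate what a branch-free formulation looks like with explicit indexing, here is a minimal sketch. Note that Base's `@simd` will not vectorize the `log` call itself; swapping the annotation for something like `@turbo` from LoopVectorization.jl (which dispatches to SIMD-friendly `log` implementations) is the kind of approach meant above — treat that substitution as an assumption to verify against the package's documentation.

```julia
function loop2_branchless(X)
    acc = zero(eltype(X))
    e = nextfloat(acc)
    # No test in the body: every iteration does exactly the same work,
    # which is the shape a SIMD math library needs in order to evaluate
    # log on several elements at once.
    @inbounds @simd for i in eachindex(X)
        acc += X[i] * (1 + log(X[i] + e))
    end
    acc
end
```

Because `log(nextfloat(0.0))` is finite, zero entries contribute `0.0 * (finite)` and the accumulator stays well-defined without any branch.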
Complete benchmarking code
```julia
using BenchmarkTools

function loop0(X)
    acc = zero(eltype(X))
    @simd for x in X
        acc += x*(1+log(x))
    end
    acc
end

function loop1(X)
    acc = zero(eltype(X))
    @simd for x in X
        if x > 0.0
            acc += x*(1+log(x))
        end
    end
    acc
end

function loop2(X)
    acc = zero(eltype(X))
    e = nextfloat(acc)
    @simd for x in X
        acc += x*(1+log(x+e))
    end
    acc
end

frac = 0.01
N = 1_000_000
Nz = round(Int, frac*N)
X = rand(N);
idx = rand(1:N, Nz);
X[idx] .= 0.0;

@btime loop0($X)
@btime loop1($X)
@btime loop2($X)
```