# Nan to zero in hot loop?

I have a function like this:

```julia
f(x, a) = sum(x .+ x .* log.(a .* x)) # broadcasted for simplicity
```

Where `0 <= x[i] <= 1` and `a` is positive. When `x[i] == 0`, the `i`-th term is zero. The logarithm causes problems, because `0*Inf` is `NaN`, and one `NaN` messes up the entire sum. Right now I'm doing this with a loop and short-circuit evaluation:

```julia
res = 0
for i = 1:length(x)
    x[i] > 0 && (res += x[i] + x[i] * log(a[i] * x[i]))
end
```

This is the inner loop of my problem. Is there any way to improve it? And is this code pattern correct?

Without having tested it, it looks OK. But you should not initialize `res = 0`, since that will cause a type instability. Use `0.0` or some other appropriate float.

Also, use `eachindex(x)` instead of `1:length(x)`, since that is more general.

You could try adding `@inbounds` in front of the loop to see if that makes any difference.
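Putting those suggestions together (float accumulator, `eachindex`, `@inbounds`), the loop might look like the sketch below. This is untested against your real data, and the name `f_sum` is just illustrative:

```julia
# Sketch combining the suggestions above: float-typed accumulator,
# eachindex over both arrays, and @inbounds on the loop.
function f_sum(x, a)
    res = 0.0                       # float literal avoids an Int/Float64 instability
    @inbounds for i in eachindex(x, a)
        xi = x[i]
        # short-circuit: skip the log entirely when the term is zero anyway
        xi > 0 && (res += xi + xi * log(a[i] * xi))
    end
    return res
end
```

Note that `@inbounds` is only safe here because `eachindex(x, a)` guarantees the indices are valid for both arrays.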


Nice! For the initialization of `res`, I actually have `res = zero(eltype(x))`, and `eachindex(x)` is indeed useful and clear.

`zero(eltype(x))` is normally good. If the `eltype` of `x` is `Int`, though, you might still get a type instability. It's perhaps a bit pedantic to suggest this, but `zero(float(eltype(x)))`, or `log(one(eltype(x)) * one(eltype(a)))`, is even safer.

OK, maybe a bit much. I wonder how the `sum` function does this.
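For the curious, a small sketch of the difference (integer input is the problematic case):

```julia
x_int = [0, 1, 2]                        # eltype is Int

acc_unstable = zero(eltype(x_int))       # 0::Int -- the accumulator would switch
                                         # to Float64 on the first `+=` of a log term
acc_stable = zero(float(eltype(x_int)))  # 0.0::Float64 from the start
```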


I address those type instabilities with a function barrier before the hot loop, but I'm going to try some of these methods for the initial `res`.

Another solution could consist of adding an extremely small value to the argument of the `log`, so that it never goes to `-Inf`. The values of `x` for which this makes a difference are so small that I would expect the overall result to remain unchanged (as long as the result is not incredibly small itself).

```julia
function loop2(X)
    acc = zero(eltype(X))
    e = nextfloat(acc)
    for x in X
        acc += x*(1 + log(x + e))
    end
    acc
end
```

Let us benchmark this (see below for the complete copy-pastable code):

```julia
using BenchmarkTools

# array size
N = 1_000_000

# fraction of zero values in the array
frac = 0.1

# generate a random array with approximately the required fraction of zeros
X = rand(N);
Nz = round(Int, frac*N)
idx = rand(1:N, Nz);
X[idx] .= 0.0;

# No handling of 0 -> NaN
@btime loop0($X)

# Version with a test
@btime loop1($X)

# The version above
@btime loop2($X)
```

which gives, for a large fraction of zeros (10%):

```julia
julia> @btime loop0($X)
  9.627 ms (0 allocations: 0 bytes)
NaN

julia> @btime loop1($X)
  10.464 ms (0 allocations: 0 bytes)
225700.31035719207

julia> @btime loop2($X)
  16.853 ms (0 allocations: 0 bytes)
225700.31035719207
```

and for a smaller fraction of zeros (1%):

```julia
julia> @btime loop0($X)
  9.514 ms (0 allocations: 0 bytes)
NaN

julia> @btime loop1($X)
  10.643 ms (0 allocations: 0 bytes)
247655.2276365137

julia> @btime loop2($X)
  10.916 ms (0 allocations: 0 bytes)
247655.2276365137
```

My interpretation:

• performing the test has a cost, which is greatly reduced by the branch-prediction features of the CPU (one could even expect the benchmarked times to be slightly underestimated because of this; it might be a good idea to confirm this benchmark on real data in a real context);
• however, taking the short-circuit branch saves the costly computation of a logarithm, while adding a small value to `x` means the `log` is always computed, even when its value is useless;
• handling subnormal floating-point values comes with an associated cost, which might be why the benchmarked time increases with the fraction of zero values. Increasing the "epsilon" value (e.g. replacing `nextfloat(0.0)` with something like `1e-100`) helps improve performance, at the cost of a potential change in your results.

In any case, I would think that you have much to gain from vectorizing this loop, which (1) is not trivial because of the `log`, and (2) becomes even harder if you keep the version with tests.

If this is a very hot loop and you don't have too many zero values in your input, I would suggest using a SIMD-friendly math library such as SLEEF to vectorize your loop. There are various Julia projects helping you to do this, which you can find on GitHub (maybe @Elrod would like to add more details). If you go down this road, having a loop body without a test will probably help.

Complete benchmarking code
```julia
using BenchmarkTools

function loop0(X)
    acc = zero(eltype(X))
    @simd for x in X
        acc += x*(1 + log(x))
    end
    acc
end

function loop1(X)
    acc = zero(eltype(X))
    @simd for x in X
        if x > 0.0
            acc += x*(1 + log(x))
        end
    end
    acc
end

function loop2(X)
    acc = zero(eltype(X))
    e = nextfloat(acc)
    @simd for x in X
        acc += x*(1 + log(x + e))
    end
    acc
end

frac = 0.01
N = 1_000_000

Nz = round(Int, frac*N)

X = rand(N);
idx = rand(1:N, Nz);
X[idx] .= 0.0;

@btime loop0($X)
@btime loop1($X)
@btime loop2($X)
```

Two other things I’d try:

• use `ifelse` instead of `&&` to avoid branching
• replace `x*log(a*x)` by its (appropriately truncated) Taylor expansion at `x=0` to avoid `NaN`s entirely.

Edit: `x=0` doesn't work, as the derivative is undefined there. `x=1` is the common choice for anything involving logarithms.
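The `ifelse` idea can be sketched like this, using the single-array form of the problem from the benchmarks above. This is an illustrative, untested variant; `loop_ifelse` is my name, not something from the thread:

```julia
# Branchless variant: Base.ifelse evaluates both arguments unconditionally,
# so feed log() a harmless argument (1, where log is 0) whenever x == 0.
function loop_ifelse(X)
    acc = 0.0
    @simd for x in X
        safe = ifelse(x > 0, x, one(x))  # log(1) == 0, so no NaN is produced
        acc += x * (1 + log(safe))       # the x * factor makes x == 0 contribute 0
    end
    acc
end
```

Because the loop body contains no branch, `@simd` has a better chance of vectorizing it than the version with an `if`.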


Regarding `nextfloat(0.0)`: try to avoid it. This is a subnormal number, and many CPUs get very unhappy about subnormals (they fall back from in-silicon arithmetic units to microcode emulation). Example:

```julia
julia> using BenchmarkTools

julia> a = rand(10_000); b = rand(10_000); c = zeros(10_000);

julia> @btime c .= a .* b;
  8.534 μs (2 allocations: 48 bytes)

julia> @btime c .= a .+ b;
  8.493 μs (2 allocations: 48 bytes)

julia> @btime c .= a ./ b;
  18.532 μs (2 allocations: 48 bytes)

julia> a .= [rand() > 0.5 ? rand()*2 : nextfloat(0.0) for i = 1:10_000];

julia> b .= [rand() > 0.5 ? rand()*2 : nextfloat(0.0) for i = 1:10_000];

julia> @btime c .= a .* b;
  145.813 μs (2 allocations: 48 bytes)

julia> @btime c .= a .+ b;
  8.525 μs (2 allocations: 48 bytes)

julia> @btime c .= a ./ b;
  225.753 μs (2 allocations: 48 bytes)
```

This was on an Intel Broadwell. YMMV.
