Nan to zero in hot loop?

#1

I have a function like this :

f(x, a) = sum(x. + x .* log. (a. *x)) #broadcasted for simplicity 

Where 0<=x[i] <=1 and a is positive . When x[i] =0,f(x)[i] is zero. The logarithm generates problems, because 0*Inf is NaN, and one Nan messes up the entire sum. Right now I’m doing this with loop and short circuit evaluation :

res=0
for i = 1:length(x) 
x[i] >0 && res+=x[i] +x[i] *log(a[i] *x[i]) 
end

This kind of problem is my inner loop, there is any way to improve this? This code pattern is correct?

#2

Without having tested it, it looks OK. But you should not initialize r = 0, since that will cause a type instability. Use 0.0 or some other appropriate float.

Also, use eachindex(x) instead of 1:length(x), since that is more general.

You could try to add an @inbounds in front if the loop to see if that makes any difference.

1 Like
#3

Nice! For inicialization of res, I actually have res=zero(eltype(x)), and the eachindex(x) is indeed useful and clear

#4

zero(eltype(x)) is normally good. If the eltype of x is Int, though , you might still get a type instability. It’s perhaps a bit pedantic to suggest this, but zero(float(eltype(x))) or log(one(eltype(x)) * one(eltype(a))), is even more safe.

OK, maybe a bit much. I wonder how the sum function does this.

2 Likes
#5

I address those type instabilities with a function before the hot loop, but I’m gonna try some of the methods for the inicial res

#6

Another solution could consist in adding an extremely small value to the argument of the log, so that it never goes to -Inf. The values of x for which this makes a difference are so small, that I would expect the overall result to remain constant (as soon as it is not incredibly small itself).

function loop2(X)
    acc = zero(eltype(X))
    e = nextfloat(acc)
    for x in X
        acc += x*(1+log(x+e))
    end
    acc
end

Let us benchmark this (see below for the complete copy-pastable code):

using BenchmarkTools

# array size
N = 1_000_000

# fraction of zero values in the array
frac = 0.1

# generate a random array with approximately the required fraction of zeros
X = rand(N);
Nz = Int(round(frac*N))
idx = rand(1:N, Nz);
X[idx] .= 0.;

# No handling of 0 -> NaN
@btime loop0($X)

# Version with a test
@btime loop1($X)

# The version above
@btime loop2($X)

which gives, for a large fraction of zeros (10%)

julia> @btime loop0($X)
  9.627 ms (0 allocations: 0 bytes)
NaN

julia> @btime loop1($X)
  10.464 ms (0 allocations: 0 bytes)
225700.31035719207

julia> @btime loop2($X)
  16.853 ms (0 allocations: 0 bytes)
225700.31035719207

and for a smaller fraction of zeros (1%):

julia> @btime loop0($X)
  9.514 ms (0 allocations: 0 bytes)
NaN

julia> @btime loop1($X)
  10.643 ms (0 allocations: 0 bytes)
247655.2276365137

julia> @btime loop2($X)
  10.916 ms (0 allocations: 0 bytes)
247655.2276365137

My interpretation:

  • performing tests costs a price which is greatly reduced by the branch prediction features of the CPU (one could actually even expect benchmarked times to be a bit underestimated due to this; it might be a good idea to confirm this benchmark on real data in a real context);
  • however, taking the sort-circuit branch saves the costly computation of a logarithm, while adding a small value to x implies that the log is always computed, even when its value is useless;
  • handling subnormal floating-point values comes with an associated cost, which might be why the benchmarked time increases with the fraction of zero values. Increasing the “epsilon” value (e.g replacing nextfloat(0) with something like 1e-100) helps improving performance, at the cost of a potential change in your results.

In any case, I would think that you have much to gain from vectorizing this loop, which (1) is not trivial because of the log, and (2) becomes even harder if you keep the version with tests.

If this is a very hot loop and you don’t have too many zero values in your input, I would suggest using a SIMD-friendly math library such as SLEEF to vectorize your loop. There are various Julia projects helping you to do this, that you can find on github (maybe @Elrod would like to add more details). If you go down this road, having a loop body without test will probably help.



Complete benchmarking code
using BenchmarkTools

function loop0(X)
    acc = zero(eltype(X))
    @simd for x in X
        acc += x*(1+log(x))
    end
    acc
end

function loop1(X)
    acc = zero(eltype(X))
    @simd for x in X
        if x>0.
            acc += x*(1+log(x))
        end
    end
    acc
end

function loop2(X)
    acc = zero(eltype(X))
    e = nextfloat(acc)
    @simd for x in X
        acc += x*(1+log(x+e))
    end
    acc
end


using BenchmarkTools

frac = 0.01
N = 1_000_000

Nz = Int(round(frac*N))

X = rand(N);
idx = rand(1:N, Nz);
X[idx] .= 0.;

@btime loop0($X)
@btime loop1($X)
@btime loop2($X)
1 Like
#7

Two other things I’d try:

  • use ifelse instead of && to avoid branching
  • replace x*log(a*x) by its (appropriately truncated) Taylor expansion at x=0 to avoid NaNs entirely.

Edit: at x=0 doesn’t work as the derivative is undefined there. x=1 is the common choice for anything involving logarithms.

1 Like
#8

Regarding nextfloat(0.0): Try to avoid that. This is a subnormal number, and many CPU and instructions get very unhappy about that (fall back from in-silicon arithmetic units to microcode emulation). Example:

julia> using BenchmarkTools
julia> a = rand(10_000); b=rand(10_000); c=zeros(10_000);

julia> @btime c.=a.*b;
  8.534 μs (2 allocations: 48 bytes)
julia> @btime c.=a.+b;
  8.493 μs (2 allocations: 48 bytes)
julia> @btime c.=a./b;
  18.532 μs (2 allocations: 48 bytes)

julia> a.=[rand()>0.5 ? rand()*2 : nextfloat(0.0) for i=1:10_000];
julia> b.=[rand()>0.5 ? rand()*2 : nextfloat(0.0) for i=1:10_000];

julia> @btime c.=a.*b;
  145.813 μs (2 allocations: 48 bytes)
julia> @btime c.=a.+b;
  8.525 μs (2 allocations: 48 bytes)
julia> @btime c.=a./b;
  225.753 μs (2 allocations: 48 bytes)

This was on an intel broadwell. YMMV.

1 Like