Avoid negative values


I’m trying to solve a food web with many interacting differential equations. Because animals and plants can’t have negative population sizes, it would be great to restrict the solution to positive values only. Additionally, some of the equations involve power functions (u^(rand(0:0.0000001:1))), and imaginary values aren’t an option.
My code is too large for a comment, so here are only the callbacks:

function condition(out, u, t, integrator)
    out .= u .- 1e-6 # if a population size is less than 1e-6, the population is extinct
end

function affect!(integrator, event_index)
    u = integrator.u
    u[event_index] = 0.0 # an extinct population is set to 0
end

cb = VectorContinuousCallback(condition, affect!, N*N_I) # first callback
cb2 = PositiveDomain() # second callback: restricts the populations to the positive domain and should set negative values to 0
cbs = CallbackSet(cb, cb2) # collection of callbacks
prob = ODEProblem(func!, Pop_0, tspan, p, callback=cbs, reltol=1e-12) # make the tolerance small to avoid big jumps
solution = solve(prob) # solve the problem

When running this code I sometimes get a warning:

DomainError with -1.871257021144186e-15
Exponentiation yielding a complex result requires a complex argument

So I nonetheless get very small negative values, which destroy my whole calculation.
I already tried writing terms like du[i] = max(0,u[i])^(rand(0:0.0000001:1)), but that didn’t solve the problem.

Thanks for the help!

Note that sqrt(x) is going to be about 4x faster than x^0.5
(not that it will solve your problem).


Thanks for the fast response! The u^0.5 was just an example; the exponents are drawn randomly between 0 and 1.
I noticed that when I use only the callback cb2 = PositiveDomain() I get a solution. But I still need the 1e-6 part to work.

If your ODE is simple enough, you could use the substitution u = exp(z) and solve the ODE for z instead.
(I was working on a PR for ModelingToolkit to do that symbolically but didn’t find the time to finish it; my current state already works for ODEs.)
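To illustrate the idea on a toy problem (a hypothetical logistic equation du/dt = u(1 − u), integrated with a crude forward-Euler loop instead of a real solver; `integrate_logistic` is my own name, not from any package):

```julia
# u = exp(z) turns du/dt = f(u) into dz/dt = f(exp(z)) / exp(z).
# For the logistic f(u) = u*(1 - u) this gives dz/dt = 1 - exp(z),
# and u = exp(z) can never go negative, whatever the integrator does.
function integrate_logistic(z0, dt, nsteps)
    z = z0
    for _ in 1:nsteps
        z += dt * (1 - exp(z))  # forward-Euler step on the transformed ODE
    end
    return exp(z)               # recover u; positive by construction
end

u = integrate_logistic(log(1e-3), 1e-3, 20_000)  # u stays > 0 and approaches 1
```

The price is that populations can then never hit exactly zero (extinction), which is why this only fits if your 1e-6 cutoff can be handled separately.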

Are there any optimizations to be had for sqrt (or other functions) if you write a version just for the range [0, 1]? I Googled a bit to see, both in general and for Julia.

At least you can get a 33% speedup with:

julia> @btime @fastmath Float64(sqrt(Float32($x)))
  1.885 ns (0 allocations: 0 bytes)


julia> @btime sqrt($x);
  2.805 ns (0 allocations: 0 bytes)

julia> @btime $x^0.5;
  3.374 ns (0 allocations: 0 bytes)

By now, fastmath scares me a bit (certainly as a global setting), but it seemed OK for this one value x = 0.5 that I tried.
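One thing worth keeping in mind with that trick: the round-trip through Float32 caps the accuracy at single precision. A quick sketch of the relative error it introduces:

```julia
# Accuracy cost of the Float64 -> Float32 -> Float64 sqrt round-trip
x = 0.5
fast  = Float64(sqrt(Float32(x)))  # the faster single-precision route
exact = sqrt(x)                    # full double-precision result
relerr = abs(fast - exact) / exact # on the order of eps(Float32)/2 ≈ 6e-8
```

Whether ~7 significant digits is enough depends on the solver tolerances; with reltol=1e-12 it probably isn’t.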

I found a 2021 paper on square roots, but it may apply less to x86:

Our experimental results show that the proposed algorithms provide a fairly good trade-off between accuracy and latency after two iterations for numbers of type float, and after three iterations for numbers of type double when using fused multiply–add instructions—giving almost complete accuracy. […]
Polynomial methods of high order rely heavily on multiplications and need to store
polynomial coefficients in memory; they also require a range reduction step

Could the range reduction step be skipped, and the polynomial simplified?

This seems relevant on page 13 (also Alg. 6 and 11):

Algorithm 5. Proposed Sqrt31f algorithm (DC initial approximation)

As shown in Table 1, the proposed algorithms give significantly better performance
than the library functions on the Raspberry Pi, from 3.17 to 3.62 times faster, and for SP
numbers on ESP-32, 2.34 times faster for the reciprocal square root and approximately
1.78 times faster than the sqrtf (x) function.

Have you tried SLEEFPirates.jl?


It doesn’t seem to export sqrt; it has this line:

# sqrt without the domain checks which we don't need since we handle the checks ourselves

@inline _sqrt(x::T) where {T<:Union{Float32,Float64}} = Base.sqrt_llvm(x)

so I just tried (getting the same speed):

julia> @btime Float64(Base.sqrt_llvm(Float32($x)))
  1.885 ns (0 allocations: 0 bytes)

The benchmark may not be realistic, since it always uses the same value, so branching doesn’t get in the way; but branching would matter for the range reduction, so this could be faster in realistic use. At least there must be a reason they do this for the other exported functions. I guess they don’t export _sqrt because it doesn’t cover the whole range, which is dangerous if you don’t read the fine print.

That can be fixed quite easily though:

@benchmark Float64(sqrt(x)) setup=(r=rand(Float32))

I’ve omitted the conversion to Float32, but putting it back is just removing the Float32 from rand.

Note that you’re also benchmarking the conversions.

I get more or less the same time with:

julia> @benchmark Base.sqrt_llvm(r) setup=(r=rand(Float32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.883 ns … 32.506 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.959 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.992 ns ±  0.704 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
  2.08 ns        Histogram: frequency by time        1.97 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Note that I have r, not x, there, which is what I think you meant to write. Otherwise I get a much longer time, and $x isn’t really correct either.


Yeah, that’s a typo - the variable from the setup and inside the benchmark expression should of course be the same.

I was wrong.

EDIT: The speedup I benchmarked actually isn’t there (on x86); it’s 40% slower than the single-precision sqrt (while a bit faster than the double-precision one). Yes, the sqrt instructions on x86 likely can’t be beaten. I suppose this code can help on ARM, as the article explains.

The right way to see the speed:

julia> @btime Base.sqrt_llvm(r) setup=(r=rand(Float32));
  1.882 ns (0 allocations: 0 bytes)

julia> function Sqrt31f(x::Float32)
         i = reinterpret(UInt32, x)
         k = i & 0x00800000           # branch on the parity of the exponent
         if k != 0                    # the snippet was missing the else; restored here
           i = 0x5ed9e893 - (i >> 1)
           y = reinterpret(Float32, i)
           c = x*y
           y = 2.33130789f0*c*fma(y, -c, 1.07495356f0)
         else
           i = 0x5f19e8fd - (i >> 1)
           y = reinterpret(Float32, i)
           c = x*y
           y = 0.82421863f0*c*fma(y, -c, 2.1499474f0)
         end
         return y
       end
I also started translating Alg. 5 first but didn’t figure out what the array-looking part was:

julia> function Sqrt32f(x::Float32)
         y = [0x5ed9d098, 0x5f19d352,
              2.33139729f0, 1.07492042f0,
              0.82420468f0, 2.14996147f0] # (x);
         c = x*y
         r = fma(y, -c, 1.0f0)
         y = fma(0.5f0*c, r, c) # modified
         return y
       end
Sqrt32f (generic function with 1 method)

That can’t be right - sub-nanosecond benchmarks would mean your computer did all those calculations in less time than it takes for a single addition. Doesn’t seem possible with a CPU in the Ghz range.

What about filling an array of N values once and checking the results against a slower version?
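Something along those lines, for example (the magic constant 0x1fbd1df5 here is the classic direct-sqrt seed, used as my own stand-in rather than the paper’s algorithm):

```julia
# Fill an array of values, run a bit-hack sqrt approximation on each,
# and compare against Base.sqrt as the slow-but-trusted reference.
function approx_sqrt(x::Float32)
    # initial guess via exponent halving, then two Newton steps
    y = reinterpret(Float32, 0x1fbd1df5 + (reinterpret(UInt32, x) >> 1))
    y = 0.5f0 * (y + x / y)
    y = 0.5f0 * (y + x / y)
    return y
end

xs = 0.25f0 .+ rand(Float32, 10_000)  # test values, bounded away from 0
maxrel = maximum(abs(approx_sqrt(x) - sqrt(x)) / sqrt(x) for x in xs)
```

Running the whole array also keeps the benchmark from reusing one value, which sidesteps the branch-prediction issue mentioned above.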


Have you checked whether the LLVM intrinsic already calls a specific x86_64 instruction for square roots? It may be that you can’t beat hardware on this.
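A quick way to check that (a sketch; the exact assembly output is platform-dependent):

```julia
using InteractiveUtils  # for code_native

# Dump the native code for sqrt(Float32); on x86-64 this shows a single
# (v)sqrtss instruction, i.e. the hardware square root.
buf = IOBuffer()
code_native(buf, sqrt, (Float32,))
asm = String(take!(buf))
uses_hw_sqrt = occursin("sqrt", lowercase(asm))
```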


Hi everyone,

Thanks for the interesting discussion on sqrt().
I haven’t checked everything, because I changed multiple things in my code, but what finally worked was hinting that the problem is stiff. I totally forgot about that when writing this post.

function condition(out, u, t, integrator)
    out .= u .- 1e-6 # if a population size is less than 1e-6, the population is extinct
end

function affect!(integrator, event_index)
    u = integrator.u
    u[event_index] = 0.0 # an extinct population is set to 0
end

cb = VectorContinuousCallback(condition, affect!, N*N_I)
cb2 = PositiveDomain()
cbs = CallbackSet(cb, cb2)
prob = ODEProblem(func!, Pop_0, tspan, p, callback=cbs, reltol=1e-12)
solution = solve(prob, alg_hints = [:stiff]) # hint that the problem is stiff

I will do further checks and let you know if there are still problems.