NLsolve performance: 32 vs 64 bits

Hi,

For an application with NLsolve.jl I ran into an OutOfMemoryError(), so I decided to try single precision instead of double precision to reduce the memory footprint (similar to what is often done for machine learning on GPUs). However, this turns out to be slower and to make many more allocations than the 64-bit version. Is it possible that some internals of NLsolve.jl only use double precision, which in turn leads to a lot of type conversions? Or is it some other issue, like type assertions, that tanks the performance?

Below is an MWE that tries this with the example from the NLsolve homepage.

using NLsolve
using BenchmarkTools

function f!(F, x)
    F[1] = (x[1]+3)*(x[2]^3-7)+18
    F[2] = sin(x[2]*exp(x[1])-1)
    return F
end

let
    # Baseline: Float64 initial guess
    x0 = [0.1; 1.2]
    F = similar(x0)
    @assert f!(F, x0) isa Vector{Float64}
    res = nlsolve(f!, x0)
    @info "64-bits: $(res.f_calls) f_calls, $(res.g_calls) g_calls"
    display(@benchmark nlsolve($f!, $x0))
end

let
    # Same problem with a Float32 initial guess
    x0 = Float32.([0.1; 1.2])
    F = similar(x0)
    @assert f!(F, x0) isa Vector{Float32}
    res = nlsolve(f!, x0)
    @info "32-bits: $(res.f_calls) f_calls, $(res.g_calls) g_calls"
    display(@benchmark nlsolve($f!, $x0))
end

and the results:

[ Info: 64-bits: 5 f_calls, 5 g_calls
BenchmarkTools.Trial: 
  memory estimate:  5.45 KiB
  allocs estimate:  66
  --------------
  minimum time:     4.475 μs (0.00% GC)
  median time:      4.638 μs (0.00% GC)
  mean time:        5.269 μs (9.68% GC)
  maximum time:     694.599 μs (98.82% GC)
  --------------
  samples:          10000
  evals/sample:     7
[ Info: 32-bits: 30 f_calls, 4 g_calls
BenchmarkTools.Trial: 
  memory estimate:  746.59 KiB
  allocs estimate:  7971
  --------------
  minimum time:     570.446 μs (0.00% GC)
  median time:      578.320 μs (0.00% GC)
  mean time:        641.458 μs (8.87% GC)
  maximum time:     4.567 ms (86.10% GC)
  --------------
  samples:          7792
  evals/sample:     1
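
One quick way to test the promotion hypothesis is to instrument the residual function and log which element types the solver actually passes in at each call. This is just a diagnostic sketch; f_logged! is a throwaway wrapper I made up, not part of NLsolve's API:

using NLsolve

# Wrap f! to record the element types the solver hands us on every call.
function f_logged!(F, x)
    @show eltype(F), eltype(x)   # all Float32, or promoted to Float64 somewhere?
    F[1] = (x[1]+3)*(x[2]^3-7)+18
    F[2] = sin(x[2]*exp(x[1])-1)
    return F
end

nlsolve(f_logged!, Float32.([0.1; 1.2]))

If the vectors show up as Float64 partway through, repeated conversions would at least explain some of the extra allocations.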

SIAMFANLEquations.jl will let you do single precision linear algebra and double precision for everything else. There is an example in the Chapter2s notebook and in the docstrings. Storing the Jacobian in Float32 will cut the linear algebra work by a factor of two and usually cost nothing in nonlinear performance or in the accuracy of the solution.
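
Roughly, the call would look like the sketch below (based on the docstrings: nsol takes preallocated residual storage FS and Jacobian storage FPS as positional arguments, and the element type of FPS controls the precision of the Jacobian factorization):

using SIAMFANLEquations

function f!(FV, x)
    FV[1] = (x[1]+3)*(x[2]^3-7)+18
    FV[2] = sin(x[2]*exp(x[1])-1)
    return FV
end

x0 = [0.1; 1.2]                  # keep the iterate in Float64
FS = zeros(2)                    # Float64 residual storage
FPS32 = zeros(Float32, 2, 2)     # Float32 Jacobian storage -> single precision factorization
sol = nsol(f!, x0, FS, FPS32)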

Doing this pays off best when the Jacobian is large and dense (1000 x 1000 or more). I'm working on a Newton-GMRES implementation where this may save some work in the orthogonalization if the problem is large and you need many Krylov vectors per Newton iteration.

If you try that, please let me know how it turned out.