Accelerate non-linear function evaluation

Hi,

I’m working on an application with maximum-entropy networks. To determine some parameters, a system of non-linear equations F(x)=0 needs to be solved. For a network of n nodes, each node i has an associated equation that, in its most basic form, looks like this:
F_i(x) = \sum_{j\ne i}\dfrac{x_i x_j}{1+x_i x_j} - k_i, where k_i is known.

I’m using a matrix-free quasi-Newton method to solve the system. The actual evaluation of F(x) becomes a bottleneck for larger applications; for a real application, the number of equations could go up to 10^7.

I’ve gained some improvements by using multithreading and @inbounds, and by suppressing the i ≠ j check, as shown below. Could this be made faster still? I haven’t used pre-calculated values, to avoid using a lot of memory (the same motivation as for going matrix-free), since there are n(n-1)/2 unique products to calculate.

Any suggestions are welcome :slight_smile:

using BenchmarkTools

function ML_baseline(x::Array{Float64,1}, k::Array{Float64,1})
    n = length(x)
    F = zeros(Float64, n)
    for i = 1:n
        for j = 1:n
            if i ≠ j
                F[i] += x[i]*x[j] / (1 + x[i]*x[j])
            else
                F[i] -= k[i]    # subtract k[i] once, when j == i
            end
        end
    end
    return F
end

function ML_threaded_bounds_noif(x::Array{Float64,1}, k::Array{Float64,1})
    n = length(x)
    F = zeros(Float64, n)
    Threads.@threads for i = 1:n
        for j = 1:n
            @inbounds F[i] += x[i]*x[j] / (1 + x[i]*x[j])
        end
        # correct for the j == i term included above, and subtract k[i]
        @inbounds F[i] -= x[i]*x[i] / (1 + x[i]*x[i]) + k[i]
    end
    return F
end

# setup
n = Int(1e4)
x = rand(n)
k = round.(rand(n)*n);

@benchmark ML_baseline(x,k)
@benchmark ML_threaded_bounds_noif(x,k)

Baseline:

BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     231.226 ms (0.00% GC)
  median time:      242.516 ms (0.00% GC)
  mean time:        246.103 ms (0.00% GC)
  maximum time:     271.991 ms (0.00% GC)
  --------------
  samples:          21
  evals/sample:     1

Threaded, no if:

BenchmarkTools.Trial: 
  memory estimate:  80.95 KiB
  allocs estimate:  25
  --------------
  minimum time:     68.404 ms (0.00% GC)
  median time:      73.837 ms (0.00% GC)
  mean time:        81.320 ms (0.00% GC)
  maximum time:     254.281 ms (0.00% GC)
  --------------
  samples:          62
  evals/sample:     1

The above results were obtained using 4 threads. System info below for completeness.

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

Welcome! Have you checked that your method is generating type-stable code?

You could try

@code_warntype ML_baseline(x,k)

Another idea might be to pre-allocate F.

What about preallocating a sub-sample of the problem? Solve the equation in chunks, and preallocate those chunks.
A small help would be to save a multiplication by doing

xij = x[i]*x[j] 
F[i] += xij/(1+xij)
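
Spelled out inside the threaded version above, that might look like this sketch (the name ML_threaded_xij is made up here; hoisting xi and accumulating into a local variable are extra tweaks in the same spirit):

function ML_threaded_xij(x::Vector{Float64}, k::Vector{Float64})
    n = length(x)
    F = zeros(Float64, n)
    Threads.@threads for i = 1:n
        xi = x[i]                # hoist x[i] out of the inner loop
        acc = 0.0                # accumulate locally instead of into F[i]
        @inbounds for j = 1:n
            xij = xi * x[j]      # compute the product once, use it twice
            acc += xij / (1 + xij)
        end
        xii = xi * xi
        F[i] = acc - xii / (1 + xii) - k[i]
    end
    return F
end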

Thanks for the reply. I had done this for both beforehand, but did not mention it. Both look OK:

  • ML_baseline has two orange entries, but from what I understand from here, these are internal variables linked to the iteration.
  • The threaded one is all green and looks coherent.

With respect to pre-allocating F, how is this different from initialising F as all zeros and adding the values afterwards? (I haven’t mastered all the subtleties of the language yet.)

Very obvious, yet I did not think of that. Thanks!

Also, you can directly initialize F:

F = -(x .* x) ./ (1 .+ x .* x) .- k
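
Combined with the loop, that might look like the following sketch (ML_threaded_init is a made-up name; the broadcast pre-fills the negated diagonal term and -k, so the loop can sum over all j):

function ML_threaded_init(x::Vector{Float64}, k::Vector{Float64})
    # pre-fill with the negated j == i term minus k ...
    F = -(x .* x) ./ (1 .+ x .* x) .- k
    Threads.@threads for i in eachindex(x)
        # ... so the inner loop can sum over all j, including j == i
        @inbounds for j in eachindex(x)
            F[i] += x[i] * x[j] / (1 + x[i] * x[j])
        end
    end
    return F
end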

What about using the @simd macro too in the hot loop?

I get the impression that you are running this function many times. Since allocating memory (for F) is slow, you can save a lot of time by allocating it once and then reusing the same memory again and again.
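
A minimal sketch of that pattern (ML_inplace! is a hypothetical name; it is just the threaded version above writing into a caller-supplied buffer):

function ML_inplace!(F::Vector{Float64}, x::Vector{Float64}, k::Vector{Float64})
    fill!(F, 0.0)    # reset the caller's buffer instead of allocating a new one
    Threads.@threads for i in eachindex(x)
        @inbounds for j in eachindex(x)
            F[i] += x[i] * x[j] / (1 + x[i] * x[j])
        end
        @inbounds F[i] -= x[i] * x[i] / (1 + x[i] * x[i]) + k[i]
    end
    return F
end

F = zeros(length(x))    # allocate once...
ML_inplace!(F, x, k)    # ...then reuse it on every solver iteration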

You could also look into Krylov solvers, which don’t require you to calculate all components of F all the time. The O(N^2) scaling in this problem is just brutal.
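
For reference, Newton–Krylov methods only need Jacobian–vector products, which can be approximated matrix-free with a forward difference; a generic sketch (jvp is a made-up helper, not tied to any solver package):

# J(x) * v ≈ (F(x + εv) - F(x)) / ε, so the N×N Jacobian is never formed.
# Note: this evaluates F twice; in practice F(x) is cached per Newton step.
function jvp(Ffun, x::Vector{Float64}, v::Vector{Float64}; eps_fd = sqrt(eps(Float64)))
    return (Ffun(x .+ eps_fd .* v) .- Ffun(x)) ./ eps_fd
end

# e.g. jvp(x -> ML_baseline(x, k), x, rand(length(x)))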

I think this should give the correct result:

f(x) = x / (1 + x)
F = similar(x)
ML_map!(F, x, k) = map!(F, x, k) do xi, ki
    # sum over all j, then remove the j == i term and subtract k[i]
    return sum(xj -> f(xi * xj), x) - f(xi^2) - ki
end

Benchmark results:

julia> @benchmark ML_map!($F, $x, $k)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     131.697 ms (0.00% GC)
  median time:      132.760 ms (0.00% GC)
  mean time:        132.688 ms (0.00% GC)
  maximum time:     133.522 ms (0.00% GC)
  --------------
  samples:          38
  evals/sample:     1

julia> @benchmark ML_threaded_bounds_noif($x, $k)
BenchmarkTools.Trial: 
  memory estimate:  81.16 KiB
  allocs estimate:  26
  --------------
  minimum time:     106.273 ms (0.00% GC)
  median time:      116.451 ms (0.00% GC)
  mean time:        119.378 ms (0.00% GC)
  maximum time:     130.660 ms (0.00% GC)
  --------------
  samples:          42
  evals/sample:     1

julia> @benchmark ML_baseline($x, $k)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     453.683 ms (0.00% GC)
  median time:      456.925 ms (0.00% GC)
  mean time:        458.466 ms (0.00% GC)
  maximum time:     466.490 ms (0.00% GC)
  --------------
  samples:          11
  evals/sample:     1

Note that this is with 4 threads. By default, map!() is not multithreaded, but I think some implementations exist in packages.

I don’t think this code reproduces what the original code does. It squares x instead of summing over it.

It’s just there to remove the diagonal terms from the loop.

This is a great use case for https://github.com/chriselrod/LoopVectorization.jl. I see a ~5x speedup over the baseline on my system.

using LoopVectorization

function ML_avx(x, k)
    F = zero(x)
    @avx for i in axes(x, 1)
        for j in axes(x, 1)
            F[i] += x[i]*x[j]/(1+x[i]*x[j])
        end
        # remove the j == i term included above, and subtract k[i]
        F[i] -= x[i]*x[i]/(1+x[i]*x[i]) + k[i]
    end
    return F
end
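
A quick sanity check that this matches the baseline:

ML_avx(x, k) ≈ ML_baseline(x, k)  # should return true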

The compiler is smart enough to figure out that x[i]*x[j] is being re-used here. My timings:

Baseline

julia> @benchmark ML_baseline(x,k)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     252.660 ms (0.00% GC)
  median time:      253.723 ms (0.00% GC)
  mean time:        254.228 ms (0.00% GC)
  maximum time:     257.703 ms (0.00% GC)
  --------------
  samples:          20
  evals/sample:     1

Threaded no-if

@benchmark ML_threaded_bounds_noif(x,k)
BenchmarkTools.Trial: 
  memory estimate:  79.63 KiB
  allocs estimate:  15
  --------------
  minimum time:     100.016 ms (0.00% GC)
  median time:      101.370 ms (0.00% GC)
  mean time:        106.876 ms (0.00% GC)
  maximum time:     203.586 ms (0.00% GC)
  --------------
  samples:          47
  evals/sample:     1

@avx

BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     55.772 ms (0.00% GC)
  median time:      56.494 ms (0.00% GC)
  mean time:        56.427 ms (0.00% GC)
  maximum time:     58.250 ms (0.00% GC)
  --------------
  samples:          89
  evals/sample:     1

My system:

julia 1.4.2
LoopVectorization v0.8.7

julia> Sys.cpu_summary()
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz: 
       speed         user         nice          sys         idle          irq
#1  2300 MHz     257491 s          0 s     178748 s    3357439 s          0 s
#2  2300 MHz      75119 s          0 s      56061 s    3662464 s          0 s
#3  2300 MHz     257768 s          0 s     120962 s    3414916 s          0 s
#4  2300 MHz      60438 s          0 s      40995 s    3692210 s          0 s

julia> Sys.KERNEL
:Darwin

Interesting, I get almost no performance improvement from LoopVectorization when I use julia -O3:

julia> @benchmark ML_avx!($(similar(x)), $x, $k)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     122.705 ms (0.00% GC)
  median time:      123.716 ms (0.00% GC)
  mean time:        123.640 ms (0.00% GC)
  maximum time:     124.179 ms (0.00% GC)
  --------------
  samples:          41
  evals/sample:     1

julia> @benchmark ML_map!($(similar(x)), $x, $k)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     132.446 ms (0.00% GC)
  median time:      133.329 ms (0.00% GC)
  mean time:        133.825 ms (0.00% GC)
  maximum time:     137.926 ms (0.00% GC)
  --------------
  samples:          38
  evals/sample:     1

julia> Sys.cpu_summary()
AMD FX(tm)-6300 Six-Core Processor: 
       speed         user         nice          sys         idle          irq
#1  1951 MHz    2660670 s      22038 s     333236 s    4944166 s     143650 s
#2  2226 MHz    2686557 s      19279 s     283194 s     720676 s      26927 s
#3  1405 MHz    2837037 s      19918 s     310490 s     696377 s      29987 s
#4  1405 MHz    2676784 s      16408 s     279095 s     738176 s      26085 s
#5  1405 MHz    2791156 s      22128 s     310827 s     745150 s      29477 s
#6  1404 MHz    2726276 s      16260 s     275786 s     696377 s      25419 s

(I’m using a non-allocating version of the @avx code here to make it a fair comparison.)
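
For reference, the non-allocating variant might look like this (a sketch; resetting F with fill! outside the @avx loop):

using LoopVectorization

function ML_avx!(F, x, k)
    fill!(F, 0.0)    # reuse the caller's buffer; reset outside the @avx loop
    @avx for i in axes(x, 1)
        for j in axes(x, 1)
            F[i] += x[i]*x[j]/(1+x[i]*x[j])
        end
        F[i] -= x[i]*x[i]/(1+x[i]*x[i]) + k[i]
    end
    return F
end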

Many thanks to those who took the time to reply! Several suggestions were proposed:

  • pre-allocating F: a minor speedup, and useful since a lot of function calls may be required in the optimisation process (similar to f!(F,x) in NLsolve.jl);
  • using a map outside of the function for the core calculation: a good speedup. I had to modify the ML_map! suggestion from @josuagrw a bit to get it working with threading;
  • using a view combined with threading (max threads available) to calculate the function value. Because of the view, map! no longer worked as intended, hence the change;
  • I tried @avx as proposed by @platawiec and noticed a similar speedup to what they observed; however, if you want a pre-allocated ML_avx!(F, x, k) you need to reset each F[i], which was giving me an error upon macro expansion;
  • using @simd and @inbounds in the hot loop as proposed by @longemen3000 also improved performance.

All this led to the concoction below. There might still be some room for improvement (if you happen to be inspired, please do share), but this is already a speed-up by a factor of 7.5 (using 4 threads), with more threads allowing even more gains (I reached x18.5 on a workstation).

To do:

  • I might look into obtaining further gains by using Distributed instead of, or combined with, Threads.
  • As long as the vectors fit in memory, GPU acceleration could be worth looking into, if the speedup outweighs the transfer time between GPU RAM and CPU RAM (not sure, no experience).
  • Using a Krylov solver to solve the system is also on my list. I would like to compare and rank different methods for solving this. The problem can also be formulated as an optimisation problem, but JuMP and Ipopt (with MUMPS) were not having a good time with it…

using BenchmarkTools

f(x::Float64) = x / (1 + x)

function ff(xi::Float64, ki::Float64, x::Array{Float64,1})
    # sum over all j, then remove the j == i term and subtract k[i]
    return sum(xj -> f(xi * xj), x) - f(xi^2) - ki
end

# fill the chunk of F backed by the view, indexing x and k with global indices
function ML_mapbis!(F::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}, x::Array{Float64,1}, k::Array{Float64,1})
    @simd for i in F.indices[1]
        @inbounds F[i-F.offset1] = ff(x[i], k[i], x)
    end
end

function ML_mapbispar!(F::Array{Float64,1}, x::Array{Float64,1}, k::Array{Float64,1})
    n = length(x)
    # split 1:n into one contiguous chunk per thread
    inds = collect(Iterators.partition(1:n, ceil(Int, n/Threads.nthreads())))
    Threads.@threads for ind in inds
        ML_mapbis!(view(F, ind), x, k)
    end
    return F
end

Baseline:

BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     230.218 ms (0.00% GC)
  median time:      231.675 ms (0.00% GC)
  mean time:        233.626 ms (0.00% GC)
  maximum time:     272.767 ms (0.00% GC)
  --------------
  samples:          22
  evals/sample:     1

ML_mapbispar! (4 threads):

BenchmarkTools.Trial: 
  memory estimate:  3.06 KiB
  allocs estimate:  28
  --------------
  minimum time:     28.888 ms (0.00% GC)
  median time:      31.267 ms (0.00% GC)
  mean time:        31.588 ms (0.00% GC)
  maximum time:     42.551 ms (0.00% GC)
  --------------
  samples:          159
  evals/sample:     1
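
For what it’s worth, plugging the in-place version into a solver via NLsolve.jl’s f!(F, x) convention (mentioned above) might look like this sketch; x0 is an arbitrary initial guess, and :anderson is one of NLsolve’s Jacobian-free methods:

using NLsolve

x0 = ones(length(k))    # arbitrary initial guess
sol = nlsolve((F, xv) -> ML_mapbispar!(F, xv, k), x0; method = :anderson)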

Perhaps of interest, I have a package which is a convenient way to use both LoopVectorization and threads:

julia> using Tullio, LoopVectorization

julia> function ML_tullio(x,k)
           @tullio F[i] := x[i]*x[j]/(1+x[i]*x[j]) * (i!=j) # sum over j
           @tullio F[i] += -k[i]
       end

julia> ML_tullio(x,k) ≈ ML_baseline(x,k)
true

julia> @btime ML_tullio($x, $k); # threads + avx
  8.328 ms (1178 allocations: 122.70 KiB)

julia> @btime ML_avx($x, $k); # just @avx, above
  43.627 ms (2 allocations: 78.20 KiB)

julia> @btime ML_threaded_bounds_noif($x, $k); # just threads, above
  15.554 ms (65 allocations: 86.27 KiB)

julia> @btime ML_baseline($x, $k);
  196.572 ms (2 allocations: 78.20 KiB)

(This should work on the GPU too, but may not be quicker at this size.)


Very nice. Compared with my own creation, your approach is on par or slightly faster, but way more elegant :slight_smile:.


What about the actual nonlinear solver? In my very limited experience, LoopVectorization breaks when used with NLsolve (I didn’t dig deeper into NLsolve). Just wanted to know if @bdc managed to solve his problem with Tullio or one of the other solutions.