Accelerate non-linear function evaluation

Hi,

I’m working on an application with maximum-entropy networks. To determine some parameters, a system of non-linear equations F(x)=0 needs to be solved. For a network of n nodes, each node i has an associated equation that, in its most basic form, looks like this:
F_i(x) = \sum_{j\ne i}\dfrac{x_i x_j}{1+x_i x_j} - k_i, where k_i is known.

I’m using a matrix-free quasi-Newton method to solve the system. The actual evaluation of F(x) becomes a bottleneck for larger applications; for a real application, the number of equations could go up to 10^7.

I’ve gained some improvements by using multithreading and @inbounds, and by suppressing the i ≠ j check, as shown below. Could this be made faster still? I haven’t used pre-calculated values, to avoid using a lot of memory (the same motivation as for going matrix-free), since there are n(n-1)/2 unique products to calculate.

Any suggestions are welcome :slight_smile:

using BenchmarkTools

function ML_baseline(x::Array{Float64,1}, k::Array{Float64,1})
    n = length(x)
    F = zeros(Float64, n)
    for i = 1:n
        for j = 1:n
            if i ≠ j
                F[i] += x[i]*x[j] / (1 + x[i]*x[j])
            else
                F[i] -= k[i]    # subtract k[i] once, when j == i
            end
        end
    end
    return F
end

function ML_threaded_bounds_noif(x::Array{Float64,1}, k::Array{Float64,1})
    n = length(x)
    F = zeros(Float64, n)
    Threads.@threads for i = 1:n
        for j = 1:n
            @inbounds F[i] += x[i]*x[j] / (1 + x[i]*x[j])
        end
        # correct for the j == i term included above, and subtract k[i]
        @inbounds F[i] -= x[i]*x[i] / (1 + x[i]*x[i]) + k[i]
    end
    return F
end

# setup
n = Int(1e4)
x = rand(n)
k = round.(rand(n)*n);

@benchmark ML_baseline(x,k)
@benchmark ML_threaded_bounds_noif(x,k)

Baseline:

BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     231.226 ms (0.00% GC)
  median time:      242.516 ms (0.00% GC)
  mean time:        246.103 ms (0.00% GC)
  maximum time:     271.991 ms (0.00% GC)
  --------------
  samples:          21
  evals/sample:     1

Threaded, no if:

BenchmarkTools.Trial: 
  memory estimate:  80.95 KiB
  allocs estimate:  25
  --------------
  minimum time:     68.404 ms (0.00% GC)
  median time:      73.837 ms (0.00% GC)
  mean time:        81.320 ms (0.00% GC)
  maximum time:     254.281 ms (0.00% GC)
  --------------
  samples:          62
  evals/sample:     1

The above results were obtained using 4 threads. System info below for completeness.

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

Welcome! Have you checked that your method is generating type-stable code?

You could try

@code_warntype ML_baseline(x,k)

Another idea might be to pre-allocate F.

What about preallocating a sub-sample of the problem? Solve the equation in chunks, and preallocate those chunks.
A small help would be to save a multiplication by doing

xij = x[i]*x[j] 
F[i] += xij/(1+xij)
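
Spelled out inside the threaded version above, that might look like this sketch (the name ML_threaded_xij is made up here; hoisting xi and accumulating into a local variable are extra tweaks in the same spirit):

function ML_threaded_xij(x::Vector{Float64}, k::Vector{Float64})
    n = length(x)
    F = zeros(Float64, n)
    Threads.@threads for i = 1:n
        xi = x[i]                # hoist x[i] out of the inner loop
        acc = 0.0                # accumulate locally instead of into F[i]
        @inbounds for j = 1:n
            xij = xi * x[j]      # compute the product once, use it twice
            acc += xij / (1 + xij)
        end
        xii = xi * xi
        F[i] = acc - xii / (1 + xii) - k[i]
    end
    return F
end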

Thanks for the reply. I had done this for both beforehand, but did not mention it. Both look OK:

  • ML_baseline has two orange entries, but from what I understand from here, these are internal variables linked to the iteration.
  • The threaded one is all green and looks coherent.

With respect to pre-allocating F, how is this different from initialising F as all zeros and adding the values afterwards? (I haven’t mastered all the subtleties of the language yet.)

Very obvious, yet I did not think of that. Thanks!

Also, you can directly initialize F:

F = -(x .* x) ./ (1 .+ x .* x) .- k
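
Combined with the loop, that might look like the following sketch (ML_threaded_init is a made-up name; the broadcast pre-fills the negated diagonal term and -k, so the loop can sum over all j):

function ML_threaded_init(x::Vector{Float64}, k::Vector{Float64})
    # pre-fill with the negated j == i term minus k ...
    F = -(x .* x) ./ (1 .+ x .* x) .- k
    Threads.@threads for i in eachindex(x)
        # ... so the inner loop can sum over all j, including j == i
        @inbounds for j in eachindex(x)
            F[i] += x[i] * x[j] / (1 + x[i] * x[j])
        end
    end
    return F
end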

What about using the @simd macro too in the hot loop?

I get the impression that you are running this function many times. Since allocating memory (for F) is slow, you can save a lot of time by allocating it once and then reusing the same memory again and again.
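
A minimal sketch of that pattern (ML_inplace! is a hypothetical name; it is just the threaded version above writing into a caller-supplied buffer):

function ML_inplace!(F::Vector{Float64}, x::Vector{Float64}, k::Vector{Float64})
    fill!(F, 0.0)    # reset the caller's buffer instead of allocating a new one
    Threads.@threads for i in eachindex(x)
        @inbounds for j in eachindex(x)
            F[i] += x[i] * x[j] / (1 + x[i] * x[j])
        end
        @inbounds F[i] -= x[i] * x[i] / (1 + x[i] * x[i]) + k[i]
    end
    return F
end

F = zeros(length(x))    # allocate once...
ML_inplace!(F, x, k)    # ...then reuse it on every solver iteration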

You could also look into Krylov solvers, which don’t require you to calculate all components of F all the time. The O(N^2) scaling in this problem is just brutal.
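
For reference, Newton–Krylov methods only need Jacobian–vector products, which can be approximated matrix-free with a forward difference; a generic sketch (jvp is a made-up helper, not tied to any solver package):

# J(x) * v ≈ (F(x + εv) - F(x)) / ε, so the N×N Jacobian is never formed.
# Note: this evaluates F twice; in practice F(x) is cached per Newton step.
function jvp(Ffun, x::Vector{Float64}, v::Vector{Float64}; eps_fd = sqrt(eps(Float64)))
    return (Ffun(x .+ eps_fd .* v) .- Ffun(x)) ./ eps_fd
end

# e.g. jvp(x -> ML_baseline(x, k), x, rand(length(x)))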

I think this should give the correct result:

f(x) = x / (1 + x)
F = similar(x)
ML_map!(F, x, k) = map!(F, x, k) do xi, ki
    # sum over all j, then remove the j == i term and subtract k[i]
    return sum(xj -> f(xi * xj), x) - f(xi^2) - ki
end

Benchmark results:

julia> @benchmark ML_map!($F, $x, $k)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     131.697 ms (0.00% GC)
  median time:      132.760 ms (0.00% GC)
  mean time:        132.688 ms (0.00% GC)
  maximum time:     133.522 ms (0.00% GC)
  --------------
  samples:          38
  evals/sample:     1

julia> @benchmark ML_threaded_bounds_noif($x, $k)
BenchmarkTools.Trial: 
  memory estimate:  81.16 KiB
  allocs estimate:  26
  --------------
  minimum time:     106.273 ms (0.00% GC)
  median time:      116.451 ms (0.00% GC)
  mean time:        119.378 ms (0.00% GC)
  maximum time:     130.660 ms (0.00% GC)
  --------------
  samples:          42
  evals/sample:     1

julia> @benchmark ML_baseline($x, $k)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     453.683 ms (0.00% GC)
  median time:      456.925 ms (0.00% GC)
  mean time:        458.466 ms (0.00% GC)
  maximum time:     466.490 ms (0.00% GC)
  --------------
  samples:          11
  evals/sample:     1

Note that this is with 4 threads. By default, map!() is not multithreaded, but I think some implementations exist in packages.

I don’t think this code reproduces what the original code does. It squares x instead of summing over it.

It’s just there to remove the diagonal terms from the loop.

This is a great use case for https://github.com/chriselrod/LoopVectorization.jl. I see a ~5x speedup over the baseline on my system.

using LoopVectorization

function ML_avx(x, k)
    F = zero(x)
    @avx for i in axes(x, 1)
        for j in axes(x, 1)
            F[i] += x[i]*x[j]/(1+x[i]*x[j])
        end
        # remove the j == i term included above, and subtract k[i]
        F[i] -= x[i]*x[i]/(1+x[i]*x[i]) + k[i]
    end
    return F
end
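
A quick sanity check that this matches the baseline:

ML_avx(x, k) ≈ ML_baseline(x, k)  # should return true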

The compiler is smart enough to figure out that x[i]*x[j] is being re-used here. My timings:

Baseline

julia> @benchmark ML_baseline(x,k)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     252.660 ms (0.00% GC)
  median time:      253.723 ms (0.00% GC)
  mean time:        254.228 ms (0.00% GC)
  maximum time:     257.703 ms (0.00% GC)
  --------------
  samples:          20
  evals/sample:     1

Threaded no-if

@benchmark ML_threaded_bounds_noif(x,k)
BenchmarkTools.Trial: 
  memory estimate:  79.63 KiB
  allocs estimate:  15
  --------------
  minimum time:     100.016 ms (0.00% GC)
  median time:      101.370 ms (0.00% GC)
  mean time:        106.876 ms (0.00% GC)
  maximum time:     203.586 ms (0.00% GC)
  --------------
  samples:          47
  evals/sample:     1

@avx

BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     55.772 ms (0.00% GC)
  median time:      56.494 ms (0.00% GC)
  mean time:        56.427 ms (0.00% GC)
  maximum time:     58.250 ms (0.00% GC)
  --------------
  samples:          89
  evals/sample:     1

My system:

julia 1.4.2
LoopVectorization v0.8.7

julia> Sys.cpu_summary()
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz: 
       speed         user         nice          sys         idle          irq
#1  2300 MHz     257491 s          0 s     178748 s    3357439 s          0 s
#2  2300 MHz      75119 s          0 s      56061 s    3662464 s          0 s
#3  2300 MHz     257768 s          0 s     120962 s    3414916 s          0 s
#4  2300 MHz      60438 s          0 s      40995 s    3692210 s          0 s

julia> Sys.KERNEL
:Darwin

Interesting, I get almost no performance improvement from LoopVectorization when I use julia -O3:

julia> @benchmark ML_avx!($(similar(x)), $x, $k)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     122.705 ms (0.00% GC)
  median time:      123.716 ms (0.00% GC)
  mean time:        123.640 ms (0.00% GC)
  maximum time:     124.179 ms (0.00% GC)
  --------------
  samples:          41
  evals/sample:     1

julia> @benchmark ML_map!($(similar(x)), $x, $k)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     132.446 ms (0.00% GC)
  median time:      133.329 ms (0.00% GC)
  mean time:        133.825 ms (0.00% GC)
  maximum time:     137.926 ms (0.00% GC)
  --------------
  samples:          38
  evals/sample:     1

julia> Sys.cpu_summary()
AMD FX(tm)-6300 Six-Core Processor: 
       speed         user         nice          sys         idle          irq
#1  1951 MHz    2660670 s      22038 s     333236 s    4944166 s     143650 s
#2  2226 MHz    2686557 s      19279 s     283194 s     720676 s      26927 s
#3  1405 MHz    2837037 s      19918 s     310490 s     696377 s      29987 s
#4  1405 MHz    2676784 s      16408 s     279095 s     738176 s      26085 s
#5  1405 MHz    2791156 s      22128 s     310827 s     745150 s      29477 s
#6  1404 MHz    2726276 s      16260 s     275786 s     696377 s      25419 s

(I’m using a non-allocating version of the @avx code here to make it a fair comparison.)
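
For reference, the non-allocating variant might look like this (a sketch; resetting F with fill! outside the @avx loop):

using LoopVectorization

function ML_avx!(F, x, k)
    fill!(F, 0.0)    # reuse the caller's buffer; reset outside the @avx loop
    @avx for i in axes(x, 1)
        for j in axes(x, 1)
            F[i] += x[i]*x[j]/(1+x[i]*x[j])
        end
        F[i] -= x[i]*x[i]/(1+x[i]*x[i]) + k[i]
    end
    return F
end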

Many thanks to those who took the time to reply! Several suggestions were proposed:

  • pre-allocating F: a minor speedup, and useful since a lot of function calls may be required in the optimisation process (similar to f!(F,x) in NLsolve.jl);
  • using a map outside of the function for the core calculation: a good speedup. I had to modify the ML_map! suggestion from @josuagrw a bit to get it working with threading;
  • using a view combined with threading (max threads available) to calculate the function value. Because of the view, map! no longer worked as intended, hence the change;
  • I tried @avx as proposed by @platawiec and noticed a similar speedup to what they observed; however, if you want a pre-allocated ML_avx!(F, x, k) you need to reset each F[i], which was giving me an error upon macro expansion;
  • using @simd and @inbounds in the hot loop as proposed by @longemen3000 also improved performance.

All this led to the concoction below. There might still be some room for improvement (if you happen to be inspired, please do share), but this is already a speed-up by a factor of 7.5 (using 4 threads), with more threads allowing even more gains (I reached x18.5 on a workstation).

To do:

  • I might look into obtaining further gains by using Distributed instead of, or combined with, Threads.
  • As long as the vectors fit in memory, GPU acceleration could be worth looking into, if the speedup outweighs the transfer time between GPU RAM and CPU RAM (not sure, no experience).
  • Using a Krylov solver to solve the system is also on my list. I would like to compare and rank different methods for solving this. The problem can also be formulated as an optimisation problem, but JuMP and Ipopt (with MUMPS) were not having a good time with it…

using BenchmarkTools

f(x::Float64) = x / (1 + x)

function ff(xi::Float64, ki::Float64, x::Array{Float64,1})
    # sum over all j, then remove the j == i term and subtract k[i]
    return sum(xj -> f(xi * xj), x) - f(xi^2) - ki
end

# fill the chunk of F backed by the view, indexing x and k with global indices
function ML_mapbis!(F::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}, x::Array{Float64,1}, k::Array{Float64,1})
    @simd for i in F.indices[1]
        @inbounds F[i-F.offset1] = ff(x[i], k[i], x)
    end
end

function ML_mapbispar!(F::Array{Float64,1}, x::Array{Float64,1}, k::Array{Float64,1})
    n = length(x)
    # split 1:n into one contiguous chunk per thread
    inds = collect(Iterators.partition(1:n, ceil(Int, n/Threads.nthreads())))
    Threads.@threads for ind in inds
        ML_mapbis!(view(F, ind), x, k)
    end
    return F
end

Baseline:

BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     230.218 ms (0.00% GC)
  median time:      231.675 ms (0.00% GC)
  mean time:        233.626 ms (0.00% GC)
  maximum time:     272.767 ms (0.00% GC)
  --------------
  samples:          22
  evals/sample:     1

ML_mapbispar! (4 threads):

BenchmarkTools.Trial: 
  memory estimate:  3.06 KiB
  allocs estimate:  28
  --------------
  minimum time:     28.888 ms (0.00% GC)
  median time:      31.267 ms (0.00% GC)
  mean time:        31.588 ms (0.00% GC)
  maximum time:     42.551 ms (0.00% GC)
  --------------
  samples:          159
  evals/sample:     1
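
For what it’s worth, plugging the in-place version into a solver via NLsolve.jl’s f!(F, x) convention (mentioned above) might look like this sketch; x0 is an arbitrary initial guess, and :anderson is one of NLsolve’s Jacobian-free methods:

using NLsolve

x0 = ones(length(k))    # arbitrary initial guess
sol = nlsolve((F, xv) -> ML_mapbispar!(F, xv, k), x0; method = :anderson)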

Perhaps of interest, I have a package which is a convenient way to use both LoopVectorization and threads:

julia> using Tullio, LoopVectorization

julia> function ML_tullio(x,k)
           @tullio F[i] := x[i]*x[j]/(1+x[i]*x[j]) * (i!=j) # sum over j
           @tullio F[i] += -k[i]
       end

julia> ML_tullio(x,k) ≈ ML_baseline(x,k)
true

julia> @btime ML_tullio($x, $k); # threads + avx
  8.328 ms (1178 allocations: 122.70 KiB)

julia> @btime ML_avx($x, $k); # just @avx, above
  43.627 ms (2 allocations: 78.20 KiB)

julia> @btime ML_threaded_bounds_noif($x, $k); # just threads, above
  15.554 ms (65 allocations: 86.27 KiB)

julia> @btime ML_baseline($x, $k);
  196.572 ms (2 allocations: 78.20 KiB)

(This should work on the GPU too, but may not be quicker at this size.)


Very nice. Compared with my own creation, your approach is on par or slightly faster, but way more elegant :slight_smile:.


What about the actual nonlinear solver? In my very limited experience, LoopVectorization breaks when used with NLsolve (I didn’t dig deeper into NLsolve). Just wanted to know if @bdc managed to solve his problem with Tullio or one of the other solutions.