I’m working on an application with maximum entropy networks. To determine some parameters, a system of non-linear equations F(x)=0 needs to be solved. For a network of n nodes, each node i has an associated equation that, in its most basic form, looks like this: F_i(x) = \sum_{j\ne i}\dfrac{x_ix_j}{1+x_ix_j} - k_i, where k_i is known.
I’m using a matrix-free quasi-Newton method to solve the system. The actual function evaluation of F(x) becomes a bottleneck for larger applications: for a real application, the number of equations can go up to 10^7.
I’ve gained some improvements by using multithreading and @inbounds, and by eliminating the i ≠ j check, as shown below. Could this be made faster still? I haven’t used pre-calculated values, to avoid using a lot of memory (the same motivation as for going matrix-free), since there are (n^2 - n)/2 unique products to calculate.
Any suggestions are welcome.
using BenchmarkTools

function ML_baseline(x::Array{Float64,1}, k::Array{Float64,1})
    n = length(x)
    F = zeros(Float64, n)
    for i = 1:n
        for j = 1:n
            if i ≠ j
                F[i] += x[i]*x[j]/(1 + x[i]*x[j])
            else
                F[i] -= k[i]    # subtract k_i exactly once, when j == i
            end
        end
    end
    return F
end
function ML_threaded_bounds_noif(x::Array{Float64,1}, k::Array{Float64,1})
    n = length(x)
    F = zeros(Float64, n)
    Threads.@threads for i = 1:n
        for j = 1:n
            @inbounds F[i] += x[i]*x[j]/(1 + x[i]*x[j])
        end
        # compensate for the j == i term added above, and subtract k_i
        @inbounds F[i] -= x[i]*x[i]/(1 + x[i]*x[i]) + k[i]
    end
    return F
end
# setup
n = Int(1e4)
x = rand(n)
k = round.(rand(n)*n);
@benchmark ML_baseline(x,k)
@benchmark ML_threaded_bounds_noif(x,k)
ML_baseline:
BenchmarkTools.Trial:
memory estimate: 78.20 KiB
allocs estimate: 2
--------------
minimum time: 231.226 ms (0.00% GC)
median time: 242.516 ms (0.00% GC)
mean time: 246.103 ms (0.00% GC)
maximum time: 271.991 ms (0.00% GC)
--------------
samples: 21
evals/sample: 1
ML_threaded_bounds_noif:
BenchmarkTools.Trial:
memory estimate: 80.95 KiB
allocs estimate: 25
--------------
minimum time: 68.404 ms (0.00% GC)
median time: 73.837 ms (0.00% GC)
mean time: 81.320 ms (0.00% GC)
maximum time: 254.281 ms (0.00% GC)
--------------
samples: 62
evals/sample: 1
The above results were obtained using 4 threads. System info below for completeness.
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
What about pre-allocating a subsample of the problem? Solve the equations in chunks and pre-allocate those chunks.
A small help would be to save a multiplication by computing the product x[i]*x[j] once and reusing it in both the numerator and the denominator.
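A minimal sketch of the idea applied to the threaded version above (the name ML_hoisted is mine):

function ML_hoisted(x::Array{Float64,1}, k::Array{Float64,1})
    n = length(x)
    F = zeros(Float64, n)
    Threads.@threads for i = 1:n
        for j = 1:n
            @inbounds t = x[i]*x[j]        # one multiplication...
            @inbounds F[i] += t/(1 + t)    # ...reused in numerator and denominator
        end
        @inbounds F[i] -= x[i]*x[i]/(1 + x[i]*x[i]) + k[i]
    end
    return F
end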
With respect to pre-allocating F, how is this different from initialising F as all zeros and adding the values afterwards? (I haven’t mastered all the subtleties of the language yet.)
I get the impression that you are running this function many times repeatedly. Since allocating memory (for F) is slow, you can save a lot of time by allocating it once and then reusing the same memory again and again.
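A minimal sketch of such an in-place version (the f!(F, ...) signature follows the NLsolve.jl convention; the name is illustrative):

function ML_threaded_noif!(F::Array{Float64,1}, x::Array{Float64,1}, k::Array{Float64,1})
    n = length(x)
    fill!(F, 0.0)    # reset the reused buffer instead of allocating a new one
    Threads.@threads for i = 1:n
        for j = 1:n
            @inbounds F[i] += x[i]*x[j]/(1 + x[i]*x[j])
        end
        @inbounds F[i] -= x[i]*x[i]/(1 + x[i]*x[i]) + k[i]
    end
    return F
end

F = zeros(length(x))        # allocate once...
ML_threaded_noif!(F, x, k)  # ...then reuse the same buffer on every call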
You could also look into Krylov solvers, which don’t require you to calculate all components of F all the time. The O(N^2) scaling in this problem is just brutal.
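The key ingredient there is a matrix-free Jacobian-vector product, which only needs one extra function evaluation; a minimal finite-difference sketch (the name jvp and the step size are illustrative):

# Approximates J(x)*v without ever forming the Jacobian:
#   J(x)*v ≈ (F(x + eps*v) - F(x)) / eps
function jvp(Ffun, x, v; eps = 1e-7)
    return (Ffun(x .+ eps .* v) .- Ffun(x)) ./ eps
end

# e.g. jvp(x -> ML_threaded_bounds_noif(x, k), x, v)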
using LoopVectorization

function ML_avx(x, k)
    F = zero(x)
    @avx for i in axes(x, 1)
        for j in axes(x, 1)
            F[i] += x[i]*x[j]/(1 + x[i]*x[j])
        end
        F[i] -= x[i]*x[i]/(1 + x[i]*x[i]) + k[i]
    end
    return F
end
The compiler is smart enough to figure out that x[i]*x[j] is being reused here. My timings:
Baseline
julia> @benchmark ML_baseline(x,k)
BenchmarkTools.Trial:
memory estimate: 78.20 KiB
allocs estimate: 2
--------------
minimum time: 252.660 ms (0.00% GC)
median time: 253.723 ms (0.00% GC)
mean time: 254.228 ms (0.00% GC)
maximum time: 257.703 ms (0.00% GC)
--------------
samples: 20
evals/sample: 1
Threaded no-if
julia> @benchmark ML_threaded_bounds_noif(x,k)
BenchmarkTools.Trial:
memory estimate: 79.63 KiB
allocs estimate: 15
--------------
minimum time: 100.016 ms (0.00% GC)
median time: 101.370 ms (0.00% GC)
mean time: 106.876 ms (0.00% GC)
maximum time: 203.586 ms (0.00% GC)
--------------
samples: 47
evals/sample: 1
@avx
BenchmarkTools.Trial:
memory estimate: 78.20 KiB
allocs estimate: 2
--------------
minimum time: 55.772 ms (0.00% GC)
median time: 56.494 ms (0.00% GC)
mean time: 56.427 ms (0.00% GC)
maximum time: 58.250 ms (0.00% GC)
--------------
samples: 89
evals/sample: 1
My system:
julia 1.4.2
LoopVectorization v0.8.7
julia> Sys.cpu_summary()
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz:
speed user nice sys idle irq
#1 2300 MHz 257491 s 0 s 178748 s 3357439 s 0 s
#2 2300 MHz 75119 s 0 s 56061 s 3662464 s 0 s
#3 2300 MHz 257768 s 0 s 120962 s 3414916 s 0 s
#4 2300 MHz 60438 s 0 s 40995 s 3692210 s 0 s
julia> Sys.KERNEL
:Darwin
Interesting, I get almost no performance improvement from LoopVectorization when I use julia -O3:
julia> @benchmark ML_avx!($(similar(x)), $x, $k)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 122.705 ms (0.00% GC)
median time: 123.716 ms (0.00% GC)
mean time: 123.640 ms (0.00% GC)
maximum time: 124.179 ms (0.00% GC)
--------------
samples: 41
evals/sample: 1
julia> @benchmark ML_map!($(similar(x)), $x, $k)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 132.446 ms (0.00% GC)
median time: 133.329 ms (0.00% GC)
mean time: 133.825 ms (0.00% GC)
maximum time: 137.926 ms (0.00% GC)
--------------
samples: 38
evals/sample: 1
julia> Sys.cpu_summary()
AMD FX(tm)-6300 Six-Core Processor:
speed user nice sys idle irq
#1 1951 MHz 2660670 s 22038 s 333236 s 4944166 s 143650 s
#2 2226 MHz 2686557 s 19279 s 283194 s 720676 s 26927 s
#3 1405 MHz 2837037 s 19918 s 310490 s 696377 s 29987 s
#4 1405 MHz 2676784 s 16408 s 279095 s 738176 s 26085 s
#5 1405 MHz 2791156 s 22128 s 310827 s 745150 s 29477 s
#6 1404 MHz 2726276 s 16260 s 275786 s 696377 s 25419 s
(I’m using a nonallocating version of the @avx code here to make it a fair comparison.)
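The in-place definitions are not shown in this thread; a plausible nonallocating variant of the @avx code, and a guess at what ML_map! looked like:

using LoopVectorization

function ML_avx!(F, x, k)
    F .= 0    # reset the preallocated output outside the @avx loop
    @avx for i in axes(x, 1)
        for j in axes(x, 1)
            F[i] += x[i]*x[j]/(1 + x[i]*x[j])
        end
        F[i] -= x[i]*x[i]/(1 + x[i]*x[i]) + k[i]
    end
    return F
end

# ML_map! might have been along these lines:
function ML_map!(F, x, k)
    map!((xi, ki) -> sum(xj -> xi*xj/(1 + xi*xj), x) - xi*xi/(1 + xi*xi) - ki, F, x, k)
    return F
end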
Many thanks to those who took the time to reply! Different suggestions were proposed:
pre-allocating F: minor speedup, but useful since a lot of function calls may be required in the optimisation process (similar to f!(F, x) in NLsolve.jl);
using a map outside of the function for the core calculations: good speedup. I had to modify the ML_map! suggestion from @josuagrw a bit to get it working with threading;
using a view combined with threading (all available threads) to calculate the function value. Because of the view, map! no longer worked as intended, hence the change;
I tried @avx as proposed by @platawiec and noticed a speedup similar to what they observed; however, if you want to use a pre-allocating ML_avx!(F, x, k), you need to reset each F[i], which was giving me an error upon macro expansion;
using @simd and @inbounds in the hot loop, as proposed by @longemen3000, also improved performance.
All this led to the concoction below. There might still be some room for improvement (if you happen to be inspired, please do share), but this is already a speedup by a factor of 7.5 (using 4 threads), with more threads allowing even more gains (I reached ×18.5 on a workstation).
To do:
I might look into obtaining further gains by using Distributed instead of, or combined with, Threads.
As long as the vectors fit in memory, GPU acceleration could be worth looking into, if the speedup outweighs the transfer time between GPU RAM and CPU RAM (not sure, no experience).
Using a Krylov solver to solve the system is also on my list. I would like to compare and rank different methods for solving this. The problem can also be formulated as an optimisation problem, but JuMP and Ipopt (with MUMPS) were not having a good time with this…
using BenchmarkTools

f(x::Float64) = x / (1 + x)

function ff(xi::Float64, ki::Float64, x::Array{Float64,1})
    return sum(xj -> f(xi * xj), x) - f(xi^2) - ki
end

function ML_mapbis!(F::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}, x::Array{Float64,1}, k::Array{Float64,1})
    @simd for i in F.indices[1]
        @inbounds F[i - F.offset1] = ff(x[i], k[i], x)
    end
end

function ML_mapbispar!(F::Array{Float64,1}, x::Array{Float64,1}, k::Array{Float64,1})
    n = length(x)
    # split 1:n into one contiguous chunk per thread
    inds = collect(Iterators.partition(1:n, ceil(Int, n/Threads.nthreads())))
    Threads.@threads for ind in inds
        ML_mapbis!(view(F, ind), x, k)
    end
    return F
end
ML_baseline:
BenchmarkTools.Trial:
memory estimate: 78.20 KiB
allocs estimate: 2
--------------
minimum time: 230.218 ms (0.00% GC)
median time: 231.675 ms (0.00% GC)
mean time: 233.626 ms (0.00% GC)
maximum time: 272.767 ms (0.00% GC)
--------------
samples: 22
evals/sample: 1
ML_mapbispar!:
BenchmarkTools.Trial:
memory estimate: 3.06 KiB
allocs estimate: 28
--------------
minimum time: 28.888 ms (0.00% GC)
median time: 31.267 ms (0.00% GC)
mean time: 31.588 ms (0.00% GC)
maximum time: 42.551 ms (0.00% GC)
--------------
samples: 159
evals/sample: 1
What about the actual nonlinear solver? In my very limited experience, LoopVectorization breaks when I try to use it with NLsolve (I didn’t dig deeper into NLsolve). Just wanted to know if @bdc managed to solve his problem with Tullio or one of the other solutions.
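For reference, a Tullio version of F might look something like this minimal sketch (not shown in the thread; the name ML_tullio is illustrative):

using Tullio

function ML_tullio(x, k)
    @tullio F[i] := x[i]*x[j]/(1 + x[i]*x[j])   # the sum over j is implicit
    @. F = F - x*x/(1 + x*x) - k                # drop the j == i term and subtract k_i
    return F
end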