Unreasonable slow speed in numerical integration

I hope to compute an intergration of a function with intergration inside, however the efficiency is extremely slow.

using HCubature,QuadGK,NumericalIntegration
spinonsd(omega::Float64)::Float64=imag(hcubature(x->Float64(-1/pi*sqrt(3)/2*1/(2*pi)^2)*spinongf(omega,x[1],x[2]), [0.0,0.0], [2*pi,4*pi/sqrt(3)])[1])
@time @fastmath spinonsd(1.0)
0.000957 seconds (12.91 k allocations: 629.141 KiB)
@time @fastmath quadgk(x ->spinonsd(x), -3.0, 3.0)
357.872404 seconds (5.17 G allocations: 217.831 GiB, 7.37% gc time, 0.08% compilation time)
(0.9997873995051424, 1.385776717295205e-8)

Note that the first time running of @time @fastmath spinonsd(1.0) is about 2 seconds in my PC, very slow. Even I turn to use NumericalIntegration, that’s efficiency is still low, not comparable with Matlab and other programming language (one can check that Matlab only spend 1e-2 seconds on intergrating).

@time @fastmath integrate(-3.0:0.1:3.0,spinonsd.(-3.0:0.1:3.0))
  4.714930 seconds (66.61 M allocations: 2.798 GiB, 5.49% gc time)

How to optimize this problem?

1 Like

First, you should realize that the Matlab integrate function defaults to 1e-6 relative tolerance (and 1e-10 absolute tolerance), whereas quadgk and hcubature default to √Ρ β‰ˆ 1.5e-8 relative tolerance, so you are asking for 100x times more accurate an answer from Julia than from Matlab, or even more if the integral is small (so that the absolute tolerance comes into play).

Second you are doing nested numerical integration, which is generally a bad idea or at least something one should be extremely cautious about:

  1. Because your integrand is itself an adaptive numerical integral to 1e-8 relative tolerance, that effectively can mean it is β€œnoisy” at the 1e-8 level. This can cause problems if you also try to converge the β€œouter” integral to 1e-8 tolerance, because then it wastes a lot of function evaluations trying to converge the integral of the β€œnoisy” error in the interior integral. If you must use nested numerical integration, you should really be more careful of the tolerances (the outer integral should have a larger tolerance than the inner integral), and you may also need to set an absolute tolerance in order to avoid spending lots of time in regions where the integrand is small.

  2. Instead of nesting numerical integrations, it is sometimes much better to do a single multidimensional integral. (Here, a single 3d integral.)

Third, you realize that the first time you run Julia code, it spends a bunch of time compiling it, so that the second (and subsequent) calls are fast, right? See the performance tips on measuring time. You also need to be careful about benchmarking in global scope, since Julia is oriented towards performance-critical code in functions.

Note also that your argument-type and return-type declarations don’t impact performance and it is probably clearer to omit them. Similarly with the explicit calls to Float64(...).

(Also, I would recommend passing vectors to hcubature as StaticArrays so that the compiler knows the dimensionality of the integral.)

I looked into optimizing your code, but I find that I cannot replicate your results. I get:

julia> @time @fastmath spinonsd(1.0)
 21.642360 seconds (453.33 M allocations: 18.803 GiB, 6.77% gc time)

which is considerably different from the numbers you reported above. Did you miscopy something in your code?


Thanks for your reply. As your suggestions, I put everything into functions and reduce accuracy, however its speed is still uncomparable with Matlab (the same method, double integral first then 1d integral, 0.006396 seconds). And to use the same algorithm with dblquad, I try QuadGG now:

using QuadGG,HCubature


function test1()
@fastmath adaptsim(x ->spinonsd(x), -3.0, 3.0;tol=1e-4)
@time test1()
  4.544678 seconds (79.78 M allocations: 2.833 GiB, 5.15% gc time)

function test2()
    @fastmath integrate(-3.0:0.1:3.0,spinonsd.(-3.0:0.1:3.0))
@time test2()
  0.815915 seconds (14.36 M allocations: 524.357 MiB, 5.72% gc time)

In my problem, all intergrand are just scalar, therefore it’s unnecessary to call StaticArrays. I think the only possible reason is double integral step by HCubature.

I’m using Jupyter. Here is the screenshot.

julia> @time @fastmath spinonsd(1.0)
 30.272589 seconds (457.71 M allocations: 19.016 GiB, 6.22% gc time, 4.12% compilation time)

Also more than 10k times slower for me than for you…

Ran it a second time… same result.

julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 Γ— AMD Ryzen 5 3600 6-Core Processor
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 5 on 12 virtual cores
  JULIA_DEPOT_PATH = /var/local/dlakelan/dotjulia
  JULIA_PKG_SERVER = us-west.pkg.julialang.org

1 Like

Did you restart the notebook before running that cell? That would explain the discrepancy between the results here. Using @btime and @benchmark from BenchmarkTools.jl seems like a safer bet because one doesn’t have to remember to restart before every run.

@dlakelan @ToucheSir
Thx for your reply. In my pc, @btime doesn’t give drasticly different results.

@benchmark @fastmath spinonsd(1.0)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  262.000 ΞΌs …   7.371 ms  β”Š GC (min … max): 0.00% … 95.36%
 Time  (median):     270.600 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   294.829 ΞΌs Β± 338.330 ΞΌs  β”Š GC (mean Β± Οƒ):  6.01% Β±  5.03%

  β–‚β–†β–ˆβ–ˆβ–†β–…β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–  ▁▁                                          β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–…β–†β–†β–…β–†β–„β–†β–„β–ƒβ–„β–„β–„β–„β–„β–ƒβ–β–…β–„β–„β–„β–…β–†β–ƒβ–…β–…β–…β–…β–„β–…β–„β–„β–ƒβ–…β–…β–… β–ˆ
  262 ΞΌs        Histogram: log(frequency) by time        414 ΞΌs <

 Memory estimate: 194.69 KiB, allocs estimate: 5025.

function test1()
@fastmath adaptsim(x ->spinonsd(x), -3.0, 3.0;tol=1e-4)
@btime test1()
  4.602 s (79777904 allocations: 2.83 GiB)

function test2()
    @fastmath integrate(-3.0:0.1:3.0,spinonsd.(-3.0:0.1:3.0))
@btime test2()
  819.090 ms (14355788 allocations: 524.36 MiB)

function test3()
    @fastmath hcubature(x -> spinonsd(x[1]), [-3.0], [3.0],rtol=1e-6,atol=1e-10)
@btime test3()
  44.790 s (778124241 allocations: 27.69 GiB)
(0.9997873952011951, 9.872535800728156e-7)

You aren’t running the definition of spinonsd that you posted above. The version you posted at the start of the thread takes ~25 seconds to run on a very fast CPU.

Show us all the definitions you’re using and we can show you how to improve stuff.


I exactly run from the beginning and have never changed the definition of spinonsd. Here are the benchmark running after restarting kernels again.

using HCubature,QuadGG,NumericalIntegration,BenchmarkTools
@benchmark @fastmath spinonsd(1.0)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  261.500 ΞΌs …   4.174 ms  β”Š GC (min … max): 0.00% … 92.76%
 Time  (median):     269.700 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   290.769 ΞΌs Β± 232.041 ΞΌs  β”Š GC (mean Β± Οƒ):  5.13% Β±  6.00%

  β–…β–ˆβ–‡β–…β–„β–‚β–‚β–‚β–β– ▁                                                  β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–†β–…β–†β–…β–†β–…β–„β–„β–…β–…β–…β–„β–…β–…β–β–…β–„β–„β–„β–…β–β–ƒβ–ƒβ–ƒβ–„β–β–„β–ƒβ–ƒβ–β–„β–ƒβ–„β–ƒβ–ƒβ–„β–β–β–β–β–ƒβ–„β–β–ƒ β–ˆ
  262 ΞΌs        Histogram: log(frequency) by time        524 ΞΌs <

 Memory estimate: 194.69 KiB, allocs estimate: 5025.

Edit: This is onedrive link of the notebook for this problem.
Dos intergration

I get a comparable time running the this latest version of your code but please do note that this does not match what you originally posted. When I run the code you originally posted, @time @fastmath spinonsd(1.0) takes 34 seconds, similar to what @Mason said.


Okay, yes, I can reproduce those timings to within a factor of 2. Now, do you have a copy of the Matlab code available? When comparing the integral over spinonsd(omega), I want to be sure that I’m comparing apples to apples.


Here are Matlab codes. Actually it is also very slow at the first time running and the second time you can have the similar timing. Maybe it’s better to divide those 3 functions into multiple files.

close all;
function f=DOS_omega(omega0)
global omega
kxmin=-pi; kxmax=pi; kymin=-2*pi/sqrt(3); kymax=2*pi/sqrt(3);
for i=1:lomega 
function f=DOS_xy(kx0,ky0)
global omega 
for i=1:lx

Hm, I’m a bit confused by the for loop in DOS_xy. Isn’t kx0 a number, not an array here? i.e. Is this loop from 1:1?

I wonder if the fact that it’s so slow the first time you run it, but nearly instant the second time is indicative of it caching answers.

Since dblquad only receive intergrand arguments as vector and scalar respectively, I wrote double integral in this form. Probably my understanding is not correct and there is a better solution.

Ah, right, I forgot that’s how dblquad works. Yeah, I tried running your script on my machine and it was very slow indeed.

So in this problem, because of the way the integrals separate, it turns out to be much much more efficient to integrate first over omega and then integrate over kx and ky. I’m assuming you want your frequency integral to go over the range -Inf to Inf, but if you actually only wanted to integrate from -3 to 3 the timings are the same.

#+begin_src julia
using HCubature, QuadGK, BenchmarkTools
function spinonenergy(kx, ky)
    2 * 0.05 * (cos(kx)+cos(-kx/2+sqrt(3)/2*ky)+cos(kx/2+sqrt(3)/2*ky)) + 0.04

function Sigma(Ξ·)
    hcubature((0.0,0.0),(2*pi,4*pi/sqrt(3))) do (kx, ky)
        Ο΅ = spinonenergy(kx, ky)
        quadgk(-Inf, Inf) do Ο‰
            -1/pi*sqrt(3)/2*1/(2*pi)^2 * imag(inv(Ο‰ - Ο΅ + im*Ξ·))

@benchmark Sigma(0.001) # the complex offset you originally used

: BenchmarkTools.Trial: 10000 samples with 1 evaluation.
:  Range (min … max):  193.428 ΞΌs …  2.202 ms  β”Š GC (min … max): 0.00% … 89.97%
:  Time  (median):     194.439 ΞΌs              β”Š GC (median):    0.00%
:  Time  (mean Β± Οƒ):   196.813 ΞΌs Β± 44.045 ΞΌs  β”Š GC (mean Β± Οƒ):  0.49% Β±  2.01%
:   β–ƒβ–‡β–ˆβ–‡β–†β–…β–ƒβ–ƒβ–‚β–β–  ▂▃▃▄▄▄▃▃▂▂▁▁                                    β–‚
:   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–†β–†β–†β–…β–‡β–‡β–ˆβ–ˆβ–‡β–‡β–‡β–ˆβ–‡β–†β–…β–†β–…β–„β–…β–†β–„β–…β–…β–…β–†β–„β–…β–„β–„β–„β–‚β–… β–ˆ
:   193 ΞΌs        Histogram: log(frequency) by time       210 ΞΌs <
:  Memory estimate: 29.78 KiB, allocs estimate: 71.

So here, I’m able to do the whole integral over kx, ky, and omega in a tenth of a milisecond, and do it with the incredibly high default accuracies that QuadGK and HCubature default to.

Furthermore, we can use Richardson.jl to extrapolate this result to Ξ· = 0:

#+begin_src julia
using Richardson
@btime extrapolate(Sigma, 1.0)

:   105.489 ΞΌs (114 allocations: 18.56 KiB)
: (0.9999999999998697, 1.3034018309099338e-13)

which is actually faster than plugging in Ξ· = 0.001 explicitly, and more accurate.


Note that this is -\pi\delta(\omega - \varepsilon) for Ξ·=0⁺, so you can actually take the \eta \to 0^+ limit analytically and completely eliminate the Ο‰ integral.

(Actually, since the integrand doesn’t otherwise depend on Ο‰, you can do the integral exactly for any Ξ·: imag(inv(Ο‰ - Ο΅ + im*Ξ·)) integrates to exactly -Ο€. But I’m guessing in your real problem you’ll have some more complicated Ο‰ dependence. Still, in the limit, delta functions are easy to integrate!)

This is also why swapping the integral order makes it much faster: in your current form, the (kx,ky) integrand is literally a constant. And the Richardson extrapolation is literally extrapolating a constant β€” Sigma(Ξ·) β‰ˆ 1 independent of Ξ·! β€” which is pretty easy to extrapolate! Moral: take your limits analytically if you can.


Thanks for your help. I finally understand your approach. I haven’t known that do with nested integration can speed up functions until now and it reduce allocations drastically. In my practice, I found that defining two nested functions has the similar timing as your method:

using QuadGK,HCubature,BenchmarkTools,Richardson

function Sigma1(Ξ·)

    hcubature((0.0,0.0),(2*pi,4*pi/sqrt(3))) do (kx, ky)

        Ο΅ = spinonenergy(kx, ky)

        quadgk(-Inf, Inf) do Ο‰

            -1/pi*sqrt(3)/2*1/(2*pi)^2 * imag(inv(Ο‰ - Ο΅ + im*Ξ·))




function Sigma2(Ξ·)

    function intergrand1(kx,ky)

        Ο΅ = spinonenergy(kx, ky);

        -1/pi*sqrt(3)/2*1/(2*pi)^2 * quadgk(-Inf, Inf) do Ο‰

            imag(inv(Ο‰ - Ο΅ + im*Ξ·))



    int=hcubature((0.0,0.0),(2*pi,4*pi/sqrt(3))) do (kx, ky)






@btime Sigma1(0.001)

  219.100 ΞΌs (71 allocations: 29.78 KiB)


@btime Sigma2(0.001)

  213.100 ΞΌs (71 allocations: 29.78 KiB)


However, I still cannot understand why integration order influencing radically. Since I have to integrate frequency first then integrate over (kx,ky) in further codes. I guess you already know that this order add huge allocations, do you have some idea to reduce this?

function Sigma4(Ξ·)

    function intergrand1(Ο‰)

        -1/pi*sqrt(3)/2*1/(2*pi)^2 * hcubature((0.0,0.0),(2*pi,4*pi/sqrt(3)),rtol=1e-4,atol=1e-4) do (kx, ky)

            imag(inv(Ο‰ - spinonenergy(kx, ky) + im*Ξ·))



    int=quadgk(-Inf, Inf,rtol=1e-4,atol=1e-4) do Ο‰





@btime Sigma4(0.001)

  9.222 s (157371111 allocations: 5.71 GiB)