Optimise calculation with NxNxMxM matrix

I have been working on a simulation in Python where I try to solve the following integral:

f(x,y) = \iint A(x-\xi', y-\eta') g(\xi',\eta') h(x,\xi',y,\eta') \,d\xi' \,d\eta'

With multiprocessing on a server I could get the simulation to run in roughly 30 minutes for a 512x512x1024x1024 grid. Recently I was made aware by colleagues that Julia could improve the computation time further. I have tried to implement the code in Julia and optimize it as much as possible, but as shown below I can only halve the computation time compared to Python.

The calculation could be vectorised over a matrix of size NxNxMxM, but in my case I want to use N=512 and M=1024. Such a matrix of ComplexF64 would require ~4.4 terabytes of RAM (512² · 1024² · 16 bytes). Instead, my current method does a point-wise calculation of f(x,y) by iterating over the x and y coordinates with multi-threading, so each iteration requires only a small amount of RAM. An MWE of the code is as follows:

using Bessels, Profile, BenchmarkTools, LinearAlgebra
function run_simulation(freq_size, space_size)
    t = @elapsed begin
        space_grid = -1:2/space_size:(1-2/space_size)              #1d matrix for a square grid
        dx = step(space_grid)                                      #stepsize of the grid
        freq_grid = -1/2/dx:1/dx/freq_size:1/2/dx- 1/(dx*freq_size)#1d matrix for a square frequency grid
        dfx = step(freq_grid)                                      #simulated stepsize for the frequency grid

        tf_fft = rand(freq_size,freq_size) .+ im      # Constant function on 2d frequency grid
        quadratic = @. cis.(-pi * rand(freq_size, freq_size)) # constant function on frequency grid 
        airy_x = freq_grid .- rand(1)            # constant vector 

        results = Matrix{ComplexF64}(undef, (length(space_grid), length(space_grid)))
        ix = length(space_grid)
        iy = length(space_grid)

        # Loop over all space grid datapoints
        Threads.@threads for j = 1:iy 
            for i = 1:ix
                @inbounds results[i,j] = integration_process([space_grid[i],space_grid[j]],
                                                              freq_grid, dfx, tf_fft, quadratic, airy_x)

            end
        end
    end
    println("time elapsed is $t seconds for 4d set of  $space_size x $space_size x $freq_size x $freq_size ")
    return #results, x1_array
end

function integration_process(xy_in::Vector{Float64}, fx_in::StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, dfx_in::Float64, tf_in::Matrix{ComplexF64}, quadratic_in::Matrix{ComplexF64}, airy_in::Vector{Float64})::ComplexF64
    x, y = xy_in
    airy_x = @. x-airy_in
    airy_y = @. y-airy_in

    el2int = exp_element(x, y, fx_in) .* bessel_element(airy_x, airy_y) .* tf_in.* quadratic_in
    matrix_element = trapzoid_2D(el2int, dfx_in)
    return matrix_element
end

function bessel_element(airy_inx::Vector{Float64}, airy_iny::Vector{Float64})::Matrix{ComplexF64}
    # Takes 2 vectors and return a 2d matrix 
    r2 = @. abs2(airy_inx) + abs2(airy_iny')
    output = @. im * pi * cis.(r2) * Bessels.besselj1.(sqrt(r2))/sqrt(r2)
    return output
end

function exp_element(x_in::Float64, y_in::Float64, freq_in)::Matrix{ComplexF64}
    out = @. cis.(pi * (x_in .* freq_in) .+ (y_in .* freq_in'))
    return out
end

function trapzoid_2D(data::Matrix{ComplexF64}, dx::Float64)::ComplexF64
    # 2D trapezoidal rule: weight 1 in the interior, 1/2 on edges, 1/4 at corners
    @views I = (sum(data[2:end-1, 2:end-1]) + sum(data)) / 2
    I -= (data[1, 1] + data[1, end] + data[end, 1] + data[end, end]) / 4
    return I * dx^2
end

run_simulation(1024, 16)

Running this on a 48 core server with 48 threads in the REPL for different grids results in the following performance:
[image: table of measured run times for the different grid sizes]

I tried to improve the single-core performance using the general Julia performance tips, but I feel that I should be able to reduce the calculation time further. Are there improvements to be made that I missed?

One minor thing you can do that will speed things up is to use cispi(x) rather than cis(pi*x). That said, almost all of the time right now is spent in besselj1, so the most important optimization would be to check whether you are computing it on the same values multiple times (and to remove the redundancy if you are). Beyond that, you can get a free ~2x speedup by using Float32 rather than Float64. To get much more than this, you will likely need to totally change the algorithm and use a more efficient integration method (some form of quadrature), which will give you better accuracy without having to evaluate as many points.

Aside from performance, your code can be substantially cleaned up. Most of your type annotations are way narrower than they need to be, and you use a lot of broadcast dots that are redundant with the @. at the beginning of the line. I would probably write it something like this:

function run_simulation(T, freq_size, space_size)
    space_grid = -1:2/T(space_size):(1-2/T(space_size))            # 1d range for a square grid
    dx = step(space_grid)                                          # stepsize of the grid
    freq_grid = -1/(2*dx):1/(dx*freq_size):1/(2*dx) - 1/(dx*freq_size) # 1d range for a square frequency grid
    dfx = step(freq_grid)                                          # stepsize of the frequency grid
    tf_fft = rand(T, freq_size, freq_size) .+ im                   # constant function on 2d frequency grid
    quadratic = cispi.(-rand(T, freq_size, freq_size))             # constant function on frequency grid
    airy_x = freq_grid .- rand(T)                                  # constant vector

    results = Matrix{Complex{T}}(undef, (length(space_grid), length(space_grid)))
    ix = length(space_grid)
    iy = length(space_grid)

    # Loop over all space grid datapoints
    Threads.@threads for j = 1:iy
        for i = 1:ix
            results[i, j] = integration_process(space_grid[i], space_grid[j],
                                                freq_grid, dfx, tf_fft, quadratic, airy_x)
        end
    end
    return #results, x1_array
end

function integration_process(x, y, fx_in::AbstractVector, dfx_in::Number, tf_in::AbstractMatrix, quadratic_in::AbstractMatrix, airy_in::AbstractVector)
    airy_x = @. x-airy_in
    airy_y = @. y-airy_in

    el2int = exp_element(x, y, fx_in) .* bessel_element(airy_x, airy_y) .* tf_in.* quadratic_in
    matrix_element = trapzoid_2D(el2int, dfx_in)
    return matrix_element
end

function bessel_element(airy_inx::AbstractVector, airy_iny::AbstractVector)
    # Takes 2 vectors and return a 2d matrix 
    r2 = @. abs2(airy_inx) + abs2(airy_iny')
    output = @. im * pi * cis(r2) * Bessels.besselj1(sqrt(r2))/sqrt(r2)
    return output
end

function exp_element(x_in::Number, y_in::Number, freq_in::AbstractVector)
    out = @. cispi((x_in * freq_in) + (y_in * freq_in'))
    return out
end

function trapzoid_2D(data::AbstractMatrix, dx)
    # 2D trapezoidal rule: weight 1 in the interior, 1/2 on edges, 1/4 at corners
    @views I = (sum(data[2:end-1, 2:end-1]) + sum(data)) / 2
    I -= (data[1, 1] + data[1, end] + data[end, 1] + data[end, end]) / 4
    return I * dx^2
end

I agree with @Oscar_Smith. I haven’t studied it in detail, but I think the integrand might be smooth. If it is, you will probably get away with evaluating many fewer points (potentially exponentially so!) by using a p-adaptive quadrature scheme. Try Cubature.jl. With an adaptive scheme you’ll get the benefit of an error estimate too.

If this doesn’t work for you, there are many other integration schemes which are typically better than the trapezoidal rule.
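
For instance, a minimal sketch of what the Cubature.jl route looks like (the toy integrand, bounds, and tolerance are stand-ins of mine, not the thread's actual A·g·h):

using Cubature

# Toy smooth stand-in for the real integrand; bounds and tolerance are assumptions.
integrand(ξ) = cospi(ξ[1]^2 + ξ[2]^2) * exp(-(ξ[1]^2 + ξ[2]^2))

# h-adaptive cubature over [-4,4]²; returns the estimate and an error estimate.
val, err = hcubature(integrand, [-4.0, -4.0], [4.0, 4.0]; reltol = 1e-6)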


Thank you for the suggestions to rewrite my code, @Oscar_Smith; I will look into this for my actual simulation and see how I can improve it. Setting the types to Float32 might work as well, but I will have to test whether it reduces precision in the simulations too much.

I will also look into the other integration methods, @jondea. If I can reduce the number of datapoints without sacrificing the quality of the result, it will help a lot.

I did notice that if I run the code on a server with 48 physical cores that uses hyperthreading (so 96 virtual cores are available), I do not actually use half of the (virtual) cores: instead of 50%, only roughly 30-40% of the compute capacity is used. Do you know what could cause this? Maybe some overhead in the computations?

As for further improving the computational efficiency, I was contemplating separating the space_grid into blocks instead of doing the calculation pointwise in nested loops. The only reason I think this could work is that you allocate more memory at once and perhaps reduce part of the overhead in the computations. Please let me know if I am wrong, because I still have a hard time understanding memory allocation and usage in Julia.

If most of the time is spent in besselj1, then reordering the calculation to reduce calls to besselj1 might be crucial. In particular, the iteration goes over x and y and then calculates, for each x,y, a 2D matrix of besselj1 values (centered at x,y; see airy_x = @. x-airy_in).

Instead, iterating over the possible airy_x, airy_y values and reusing each besselj1 result for all the relevant x,y might save a lot of time.

Sorry for the low clarity; I haven’t run the code myself (yet). A simple check would be to wrap besselj1 with a function which prints/logs its arguments, and see if there is substantial repetition (a smaller-scale problem is advisable in this case). Even memoization might be helpful (see https://en.wikipedia.org/wiki/Memoization)
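
Something along these lines would show the repetition (a hedged sketch; besselj1_logged and J1_CALLS are names I made up):

using Bessels

# Count how often besselj1 sees each argument. Run single-threaded and on a
# small grid: the Dict is not thread-safe and grows with each distinct argument.
const J1_CALLS = Dict{Float64,Int}()
function besselj1_logged(x)
    J1_CALLS[x] = get(J1_CALLS, x, 0) + 1
    return Bessels.besselj1(x)
end

# After substituting besselj1_logged into bessel_element and doing a run:
repeat_fraction() = 1 - length(J1_CALLS) / sum(values(J1_CALLS))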

Yes, I think you are onto something here. What is important, and what I failed to mention, is that I need to be able to rerun the code for different tf_fft. If I can reuse the Bessel function output it should help a lot. The big issue is that if I cannot properly overlap the arguments of the Bessel functions, I would need a 512x512x1024x1024 matrix of ComplexF128 or ComplexF64, i.e. roughly a 9 or 4.4 terabyte matrix. If I can actually find overlap, it should reduce the computational complexity a lot, and storing the Bessel output would make rerunning the code much faster.

For me the big issue is that the actual frequency grid and the x,y grid are not directly related; they are based on physical parameters. But even so, because the frequency grid is equidistant, there should be some overlap when shifting over all x and y values.

I will have to think about how I can inspect the arguments passed to the Bessel function and see if I can find a pattern that reduces the computational complexity. Thank you for pointing this out!

EDIT:

dx = 3.45e-6
x = -2*dx:dx:2*dx

df = 4.697e-6
f = -3*df:df:3*df

grid = x .- f'
grid

A small example with the step sizes that I would have in my actual simulation. The output matrix looks like:
[image: the resulting 5×7 matrix of x .- f' values]
There is definitely symmetry that I should try to take advantage of.

Before you worry about code optimization, it seems like you should really re-think your algorithm. Why are you storing the integrand in an array at all, rather than just accumulating sums in loops? (Loops are fast in Julia — insisting on “vectorizing” everything as in Python or Matlab is often a mistake.) Why are you using a trapezoidal rule, which is a terribly inefficient way to do integrals? Why do you need equispaced points?
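
For the first point, here is a hedged sketch of integration_process from the MWE rewritten to accumulate the trapezoidal sum in loops, with no temporary matrices (it assumes the corrected cispi phase discussed further down):

using Bessels

# Same integral as integration_process in the MWE, but as one running sum.
function integration_loop(x, y, freq_grid, dfx, tf, quadratic, airy)
    n = length(freq_grid)
    s = zero(ComplexF64)
    for j in 1:n, i in 1:n
        r2 = abs2(x - airy[i]) + abs2(y - airy[j])
        bess = im * pi * cis(r2) * Bessels.besselj1(sqrt(r2)) / sqrt(r2)
        w = (1 < i < n ? 1.0 : 0.5) * (1 < j < n ? 1.0 : 0.5)  # trapezoid weights
        s += w * cispi(x * freq_grid[i] + y * freq_grid[j]) * bess * tf[i, j] * quadratic[i, j]
    end
    return s * dfx^2
end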


I agree that the algorithm needs to be redone, and that is one of the reasons why I posted the question. I am not sure I understand all your comments, but here is what I would answer:

  1. I am not storing the integrand, only the part of the integrand which is constant, e.g. g(\xi', \eta') from the equation that I posted. However, if I use @profview I see that most of the time (roughly 1/3) is spent evaluating the Bessel function in the nested loops. If the function is being evaluated on the same gridpoints repeatedly, as @Dan suggested, I might get something out of removing that redundancy.
  2. I do not actively try to vectorise everything, but I probably do it too much, as for the last few years I have only worked with Python or Matlab :wink: .
  3. Honestly, the trapezoidal rule is used because it was easy for me to implement. I do not have a lot of experience with numerical integration of datasets; most information I can find is about integrating functions, but in my case some functions are not expressed analytically. So: lacking knowledge.
  4. The grids that I use are based on physical properties of sensors. I am not sure how much I can alter the grids in the code and still get the same x, y grid in the end.

With the input from all the responses I hope to gain new insights on how to improve the algorithm, with regard to both computational complexity and avoiding pitfalls of Julia.

There are two grids here.

  1. One is the grid of \xi' and \eta' points that you use for computing the integral — in principle, integration of smooth functions can be done much more efficiently (fewer points for the same accuracy) using non-equispaced points. e.g. you can use h-adaptive quadrature schemes (like hcubature from HCubature.jl or Cubature.jl) or high-order Gaussian or Clenshaw–Curtis quadrature (like pcubature from Cubature.jl)

  2. Another is the set of output points x,y. These can be on a regular grid, independent of where you evaluate \xi' and \eta' in the integration. However, if f(x,y) is a smooth function, you might be better off evaluating x,y on a Chebyshev grid, forming a Chebyshev interpolating polynomial (see e.g. the FastChebInterp.jl package), and then evaluating your interpolating polynomial on whatever x,y points you need. The reason is that you may need many fewer integrals for the same accuracy (Chebyshev interpolation converges exponentially fast).
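
For point 2, a minimal sketch with FastChebInterp.jl (the toy function, bounds, and polynomial order are stand-ins of mine):

using FastChebInterp

# Evaluate the expensive f(x, y) only on a Chebyshev grid, build the
# interpolant, then query it anywhere on the regular output grid.
expensive_f(p) = cos(3p[1]) * exp(p[2])       # toy stand-in for one full integral
lb, ub = [-1.0, -1.0], [1.0, 1.0]
pts = chebpoints((32, 32), lb, ub)            # 33×33 Chebyshev points
interp = chebinterp(expensive_f.(pts), lb, ub)
interp([0.1234, -0.5678])                     # ≈ expensive_f at an arbitrary point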

What you need to look for is information on “numerical integration” (sometimes called “quadrature” or, in higher dimensions, “cubature”) — this is integration that only uses numerical values of the integrand to approximate the integral to any desired accuracy. There are plenty of sources of information on this — whole books, in fact. And there are algorithms that are exponentially better than the trapezoidal rule for smooth functions that you can evaluate at arbitrary points.

It’s really worth learning some of the basics here if you are running into a huge computational problem like this — you will probably gain much more from a better algorithm than from fine tuning the code. And you only need a high-level understanding, since the details of the algorithms are mostly already available in Julia packages.


Thank you for the elaborate response!

One important thing you mention is that this works for functions that you can evaluate at arbitrary points. For me this is not possible for g(\xi', \eta'), as it is not a function for which I have an expression; in my case it is a Fourier transform of a function on an (x',y') grid. For some functions an analytic expression of g(\xi', \eta') could be known, but unfortunately the goal is to be able to simulate an arbitrary field on the (x',y') grid.

Based on your answer and the documentation of Cubature.jl, I have doubts about whether the cubature methods will work here. Is my understanding wrong, and can the integration still be done the way you propose?

You can always make an interpolation for g that allows it to be evaluated at in-between points.
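
For example, a hedged sketch with Interpolations.jl (the grid and samples are toys):

using Interpolations

# Cubic B-spline interpolation of gridded g samples, so an adaptive
# integrator can query g between grid points.
ξs = range(-1, 1; length = 64)
ηs = range(-1, 1; length = 64)
g_grid = [exp(-ξ^2 - η^2) for ξ in ξs, η in ηs]
g_itp = scale(interpolate(g_grid, BSpline(Cubic(Line(OnGrid())))), ξs, ηs)
g_itp(0.1234, -0.5678)   # g at an off-grid point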

Is your original function only known on an (x',y') grid (e.g. from a finite-difference calculation)? If so, you are probably stuck. (You could interpolate as @Oscar_Smith suggests, but you generally can’t get high-order accuracy interpolating from a regular grid … unless it is periodic.) You could use Romberg integration on an equispaced grid, however.
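
For reference, a minimal 1-D Romberg sketch (my own, not from a package); applying it along each dimension in turn gives the 2-D version:

# Romberg integration for n = 2^k + 1 equispaced samples y on [a, b]:
# Richardson extrapolation of nested trapezoidal rules.
function romberg(y::AbstractVector, a::Real, b::Real)
    k = round(Int, log2(length(y) - 1))
    @assert length(y) == 2^k + 1 "need 2^k + 1 samples"
    R = zeros(float(eltype(y)), k + 1)
    h = float(b - a)
    R[1] = h * (y[1] + y[end]) / 2                           # coarsest trapezoid
    for i in 1:k
        h /= 2
        s = 2^(k - i)                                        # index stride at this level
        refined = R[1] / 2 + h * sum(@view y[1+s:2s:end-s])  # add the new midpoints
        prev, R[1] = R[1], refined
        for j in 2:i+1                                       # Richardson extrapolation
            refined = R[j-1] + (R[j-1] - prev) / (4^(j-1) - 1)
            prev, R[j] = R[j], refined
        end
    end
    return R[k+1]
end

romberg(sin.(range(0, pi; length = 2^6 + 1)), 0, pi)  # ≈ 2.0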

Comparing @Oscar_Smith’s and the original poster’s versions of exp_element, there seems to be a difference: in the original version, only (x_in .* freq_in) (but not (y_in .* freq_in')) is multiplied by pi. I’m guessing this is a bug in the original version, since it seems like these two terms should be treated symmetrically.


Another possible approach would be to customize the implementation of Bessels.besselj1 to achieve only as much accuracy as you actually need. I state this only as a possibility, because there may well be better ways to improve the performance; however, if you don’t manage to get improvements in other ways, tag me and I’ll try to help. Bessels.besselj1 is implemented using polynomial approximations, so the idea is to create custom polynomial approximations whose degrees are only as great as you need them to be.


This is correct, I made an error when writing the example version of the code.

This sounds very interesting, but I want to try the idea proposed by @Dan first. Because the input to the Bessel function is always r = \sqrt{(x-\xi')^2 + (y-\eta')^2}, I think it should be possible to significantly reduce the number of calls to besselj1, especially because the x-y and \xi'-\eta' grids are equidistant grids of identical sizes in their respective dimensions.

You might try a few trial runs that use more than 96 threads. I’ve read blog posts where people have gotten speedups by specifying more threads than they have cores (physical and hyperthreaded). Give 120, 144, or 192 threads a try.

As @stevengj said, equispaced quadrature is really not optimal for definite integration, but it is actually pretty good when the integrand is periodic, or when it decays quickly (e.g. when you’re approximating an integral over the whole space). From your variable names it sounds like you might be in one of these cases; if so, you might not gain much from switching to a fancier grid.

If computing the Bessel functions is the dominant cost, then yes, precomputing might be good. Also try playing with different implementations (e.g. GNU GSL is wrapped in Julia; see also “What is the fastest opensource implementation of Bessel functions computation?” on Computational Science Stack Exchange).

Otherwise, maybe there are tricks you can play. If I understand correctly, A and h are analytic, but g is data? What are the analytic expressions of A and h? Some questions I would ask: Is something like stationary phase relevant? A multipole expansion? It would be nice if h didn’t depend on x and y, because then you could do FFTs (see the sketch below); maybe h varies relatively slowly with x and y, so you can assume it’s constant over regions? Maybe you can write it in separable form as f(x)g(\xi), or as a sum of such terms? Maybe if g is analytic of a certain form you can perform the integral exactly, and if so, can you expand g over functions of that form? Etc. Generally speaking, any structure you notice in the problem, or simplifications in certain cases, or physical approximations, can potentially be used to speed up the numerical method.
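
Here is a hedged sketch of that FFT shortcut (A_grid, gh_grid and dξ are toy stand-ins):

using FFTW

# If h did not depend on (x, y), then f = A ⋆ (g·h) is a plain convolution,
# O(M² log M) instead of O(N²M²). Zero-pad both arrays first if you need
# linear rather than circular convolution.
M, dξ = 256, 1e-3
A_grid  = rand(ComplexF64, M, M)
gh_grid = rand(ComplexF64, M, M)
f_grid  = ifft(fft(A_grid) .* fft(gh_grid)) .* dξ^2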

This reminds me:

There are actually multiple possible approaches to the polynomial approximation. For example, instead of approximating Bessels.besselj1(x) itself with lower-degree polynomials, it might make sense to consider the entire expression above as one univariate function and approximate it directly. NB: even though the denominator can be close to zero, the function’s absolute value is bounded by 0.5 for all positive values.

It might even make sense to approximate π * Bessels.besselj1(sqrt(r2))/sqrt(r2) directly.

Another note: the domain of either function would certainly need to be split into multiple intervals for approximation, each with its own polynomial. If you’re going to be evaluating at many values of r2, it might make sense to batch by interval, so that values with r2 from the same interval are evaluated together.
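
As a hedged illustration of a single-interval fit (fitting in r rather than r2; the interval and degree are my choices), where a full replacement would tile the domain with such pieces:

using Bessels, FastChebInterp

# Chebyshev fit of besselj1(r)/r on one interval; check the worst-case error.
j1c(r) = iszero(r) ? 0.5 : Bessels.besselj1(r) / r  # smooth through r = 0
pts = chebpoints(16, 0.0, 8.0)                      # degree-16 fit on [0, 8]
fit = chebinterp(j1c.(pts), 0.0, 8.0)
maximum(abs, fit.(0:0.01:8) .- j1c.(0:0.01:8))      # worst-case fit error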