I have been working on a simulation in Python where I try to solve the following integral:

With multiprocessing on a server I can run the simulation in roughly 30 minutes for a 512x512x1024x1024 grid. Colleagues recently suggested that Julia could reduce the computation time further, so I implemented the code in Julia and optimized it as much as I could. However, as shown below, I can only halve the computation time with Julia compared to Python.
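
For scale, a single dense NxNxMxM ComplexF64 array already needs terabytes (the ~18 TB figure mentioned below presumably also counts the several intermediate arrays of that shape); a quick check of the arithmetic:

```julia
# Rough memory estimate for materializing one dense N×N×M×M ComplexF64 array
N, M = 512, 1024
bytes = N^2 * M^2 * sizeof(ComplexF64)  # 16 bytes per complex double
println(bytes / 2^40, " TiB")           # 4.0 TiB for a single such array
```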

The calculation could be vectorised over an array of size NxNxMxM, but in my case I want to use N=512 and M=1024, and storing that many ComplexF64 values would require ~18 TB of RAM. Instead, my current method does a point-wise calculation of f(x,y) by iterating over the x and y coordinates with multithreading, so each iteration only needs a small amount of RAM. A MWE of the code is as follows:

```
using Bessels, Profile, BenchmarkTools, LinearAlgebra

function run_simulation(freq_size, space_size)
    t = @elapsed begin
        space_grid = -1:2/space_size:(1 - 2/space_size)  # 1d range for a square spatial grid
        dx = step(space_grid)                            # step size of the spatial grid
        freq_grid = -1/2/dx:1/dx/freq_size:(1/2/dx - 1/(dx*freq_size))  # 1d range for a square frequency grid
        dfx = step(freq_grid)                            # step size of the frequency grid
        tf_fft = rand(freq_size, freq_size) .+ im        # constant function on the 2d frequency grid
        quadratic = cis.(-pi .* rand(freq_size, freq_size))  # constant function on the frequency grid
        airy_x = freq_grid .- rand()                     # constant vector
        results = Matrix{ComplexF64}(undef, (length(space_grid), length(space_grid)))
        ix = length(space_grid)
        iy = length(space_grid)
        # Loop over all spatial grid points
        Threads.@threads for j = 1:iy
            for i = 1:ix
                @inbounds results[i, j] = integration_process([space_grid[i], space_grid[j]],
                    freq_grid, dfx, tf_fft, quadratic, airy_x)
            end
        end
    end
    println("time elapsed is $t seconds for 4d set of $space_size x $space_size x $freq_size x $freq_size")
    return #results, x1_array
end
function integration_process(xy_in::Vector{Float64}, fx_in::AbstractRange, dfx_in::Float64,
        tf_in::Matrix{ComplexF64}, quadratic_in::Matrix{ComplexF64}, airy_in::Vector{Float64})::ComplexF64
    x, y = xy_in
    airy_x = @. x - airy_in
    airy_y = @. y - airy_in
    el2int = exp_element(x, y, fx_in) .* bessel_element(airy_x, airy_y) .* tf_in .* quadratic_in
    matrix_element = trapzoid_2D(el2int, dfx_in)
    return matrix_element
end
function bessel_element(airy_inx::Vector{Float64}, airy_iny::Vector{Float64})::Matrix{ComplexF64}
    # Takes 2 vectors and returns a 2d matrix
    r2 = abs2.(airy_inx) .+ abs2.(airy_iny')
    output = @. im * pi * cis(r2) * Bessels.besselj1(sqrt(r2)) / sqrt(r2)
    return output
end
function exp_element(x_in::Float64, y_in::Float64, freq_in)::Matrix{ComplexF64}
    # Plane-wave phase over the 2d frequency grid; pi scales both terms
    out = cis.(pi .* (x_in .* freq_in .+ y_in .* freq_in'))
    return out
end
function trapzoid_2D(data::Matrix{ComplexF64}, dx::Float64)::ComplexF64
    # Trapezoid-like rule: interior points get weight 1, boundary points 1/2
    @views I = sum(data[2:end-1, 2:end-1])
    I = (I + sum(data)) * dx * dx / 2
    return I
end
run_simulation(1024, 16)
```
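
Each grid point allocates and multiplies several M×M complex matrices. A self-contained micro-benchmark of just one such broadcast (hypothetical sketch, names not from the MWE; BenchmarkTools is already loaded there) gives a feel for the per-point cost:

```julia
using BenchmarkTools

M = 1024
phase = rand(M, M)
# One cis broadcast over an M×M matrix; every one of the space_size^2
# grid points performs several operations of this size per integration
@btime cis.(-pi .* $phase);
```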

Running this on a 48-core server with 48 threads in the REPL for different grids gives the following performance:
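
(Multithreading in the REPL only kicks in when Julia is started with threads enabled, e.g.:)

```shell
# Start the REPL with 48 threads (equivalently: export JULIA_NUM_THREADS=48)
julia -t 48
```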

I tried to improve single-core performance using the general Julia performance tips, but I feel the calculation time should be reducible further. Are there improvements to be made that I missed?