The bottleneck is in my time evolution function. I am using an explicit 8th order Runge-Kutta scheme (Dormand-Prince), solving the following ODE: \vec{\dot S}_\alpha(\vec R) = - \vec S_\alpha(\vec R)\times \sum_{\beta \vec R'} J_{\alpha\beta}(\vec R' - \vec R) \vec S_\beta(\vec R') .

At each site in the lattice, there is a spin, stored as a unit 3D vector. In the previous expression, “x” is the cross product of 3D vectors.

I applied common performance tips (from the Julia manual), including checking the ambiguous types with @code_warntype, avoiding cache misses with properly ordered nested loops and so one.

I currently achieve 170 ms per RK step on a system of 3 * 144^2 = 62K spins. I would like this number to go down as much as possible (it should apparently be possible to do it in a few ms).

I would suggest providing a MWE. If possible provide a MWE of that function working. You then will probably get lots of advices here.

For example, one line like this:

uold = v[:, s, i, j]

is allocating a new array inside a loop and probably one would advice to use a view or something else, but I have no idea if that is relevant for the overall performance of the code.

(edit: I see that you have pointed to the that specific function there now, still a MWE would be good)

Indeed there are many places in my code where I do not use views yet. I will change it soon. However, this is in the Monte Carlo part of my program, which is fast enough for now. The slow part is the semiclassical dynamics, which (I think) I made sure I use @views whenever needed.

For me the bottleneck is in my function makef. My function to do the RK8 step is dormandprince(f, v, dt), and accepts as first parameter a function f, taking a state vector, and returning its time derivative.
To construct this function for a given hamiltonian, I use makef:

"Build the function representing the time derivative of v"
function makef(H)
# give it a name for the profiler
function f(v)
Ns = size(v)[2]
L = size(v)[3]
ret = zeros(3, Ns, L, L)
for j in 1:L
for i in 1:L
for s in 1:Ns
@views ret[:, s, i, j] = localfield(H, v, i, j, s) × v[:, s, i, j]
end
end
end
ret
end
end

include("scmc.jl")
const H = loadhamiltonian("hamiltonians/kagome.dat", [1])
const L = 144
const f = makef(H)
v = randomstate(H.Ns, L)
@time dormandprince(f, v, 0.1) # advance 0.1s in time the state vector v

I won’t have time to test things now. But for that specific function, I see two possible improvements: if ret is a small array (not sure about the sizes of Ns and L, then you can try using a static array. If ret is a big array, is is very likely that preallocating it will provide a good speedup.

If that function is called many times, you should try to make it not allocate anything.

Looking more closely, it is likely that you want ret to be an array of static arrays of dimension 3. And v and H probably as well. That way the cross product will be non-allocating:

julia> function xp(x,y,z)
for i in eachindex(x)
z[i] = x[i] × y[i]
end
end
xp (generic function with 1 method)
julia> z = [ zeros(SVector{3,Float64}) for i in 1:10 ];
julia> x = [ rand(SVector{3,Float64}) for i in 1:10 ];
julia> y = [ rand(SVector{3,Float64}) for i in 1:10 ];
julia> @btime xp(x,y,z)
26.526 ns (0 allocations: 0 bytes)

The analogous of what you are doing is:

julia> function xp(x,y)
z = similar(x)
for i in eachindex(x)
z[i] = x[i] × y[i]
end
z
end
xp (generic function with 1 method)
julia> x = [ rand(3) for i in 1:10 ];
julia> y = [ rand(3) for i in 1:10 ];
julia> @btime xp(x,y);
296.653 ns (11 allocations: 1.25 KiB)

Note that the Dormand-Prince 8-7 is very suboptimal, and the 8-5-3 is still pretty suboptimal in comparison to some of the newer Verner schemes. In this range I would usually recommend Vern7 or Vern9 given the benchmarks.

I’ll probably change to another scheme later. I’ve chosen Dormand Prince since my goal is to reproduce results from a paper (https://arxiv.org/pdf/1403.7903.pdf) which uses DP.

I first tried to skip the allocation in the function f (and in dormandprince) though it did not have a significant speedup somehow. I’ve now rework the code to use static vectors, and it works !

I am not sure if I can suggest this because the new code with the branch became confusing to me. But I think you can define v (from randomstate) as a vector of static vectors as well, and that will probably accelerate that function.

Or, at least, use broadcasting there, because otherwise you might be allocating a vector on the right of

ret += c * v[s2, wrapindex(i + Δi, L), wrapindex(j + Δj, L)]

(I suppose v[i,j,k] here is returning a vector). If c is a scalar and both v[i,j,k] and ret are static arrays, this will be fast and non-allocating. If v[i,j,k] returns a standard array you might be allocating c*v at each iteration. You can at least avoid that by making ret a mutable array (MArray or simply a standard array of dimension 3) and broadcasting (ret .+= c .* v[i,j,k]).

After being sure that that loop is not allocating anything (suggestion) you may try to use inbounds, simd, to see if you can get any speedup there.