Optimizing semiclassical simulation for a lattice system

Joh11 · March 26, 2021, 1:11pm

Hi everyone,

I am currently working on implementing an implementation of semiclassical Monte Carlo for a lattice system (https://github.com/Joh11/semiclassical-monte-carlo). I would like to make my code faster.

The bottleneck is in my time evolution function. I am using an explicit 8th order Runge-Kutta scheme (Dormand-Prince), solving the following ODE: \vec{\dot S}_\alpha(\vec R) = - \vec S_\alpha(\vec R)\times \sum_{\beta \vec R'} J_{\alpha\beta}(\vec R' - \vec R) \vec S_\beta(\vec R') .

At each site in the lattice, there is a spin, stored as a unit 3D vector. In the previous expression, “x” is the cross product of 3D vectors.

I applied common performance tips (from the Julia manual), including checking the ambiguous types with @code_warntype, avoiding cache misses with properly ordered nested loops and so one.

I currently achieve 170 ms per RK step on a system of 3 * 144^2 = 62K spins. I would like this number to go down as much as possible (it should apparently be possible to do it in a few ms).

I would like to focus on optimizing the way I compute the time derivative (see https://github.com/Joh11/semiclassical-monte-carlo/blob/7660384bb2ff41ffa503484db80d3e2edad21610/scmc.jl#L62 ).

Do you have any advice on how to do it ? I wouldn’t mind totally reworking my code if it makes it faster.

lmiq · March 26, 2021, 1:22pm

I would suggest providing a MWE. If possible provide a MWE of that function working. You then will probably get lots of advices here.

For example, one line like this:

uold = v[:, s, i, j]

is allocating a new array inside a loop and probably one would advice to use a view or something else, but I have no idea if that is relevant for the overall performance of the code.

(edit: I see that you have pointed to the that specific function there now, still a MWE would be good)

Joh11 · March 26, 2021, 1:36pm

Thanks !

Indeed there are many places in my code where I do not use views yet. I will change it soon. However, this is in the Monte Carlo part of my program, which is fast enough for now. The slow part is the semiclassical dynamics, which (I think) I made sure I use @views whenever needed.

For me the bottleneck is in my function makef. My function to do the RK8 step is dormandprince(f, v, dt), and accepts as first parameter a function f, taking a state vector, and returning its time derivative.
To construct this function for a given hamiltonian, I use makef:

"Build the function representing the time derivative of v"
function makef(H)
    # give it a name for the profiler
    function f(v)
        Ns = size(v)[2]
        L = size(v)[3]
        
        ret = zeros(3, Ns, L, L)
        for j in 1:L
            for i in 1:L
                for s in 1:Ns
                    @views ret[:, s, i, j] = localfield(H, v, i, j, s) × v[:, s, i, j]
                end
            end
        end
        ret
    end
end

MWE (see also https://github.com/Joh11/semiclassical-monte-carlo/blob/master/complexity_rk8.jl for the benchmark I used):

include("scmc.jl")
const H = loadhamiltonian("hamiltonians/kagome.dat", [1])
const L = 144
const f = makef(H)

v = randomstate(H.Ns, L)
@time dormandprince(f, v, 0.1) # advance 0.1s in time the state vector v

lmiq · March 26, 2021, 1:40pm

I won’t have time to test things now. But for that specific function, I see two possible improvements: if ret is a small array (not sure about the sizes of Ns and L, then you can try using a static array. If ret is a big array, is is very likely that preallocating it will provide a good speedup.

If that function is called many times, you should try to make it not allocate anything.

Looking more closely, it is likely that you want ret to be an array of static arrays of dimension 3. And v and H probably as well. That way the cross product will be non-allocating:

julia> function xp(x,y,z)
         for i in eachindex(x)
           z[i] = x[i] × y[i]
         end
       end
xp (generic function with 1 method)

julia> z = [ zeros(SVector{3,Float64}) for i in 1:10 ];

julia> x = [ rand(SVector{3,Float64}) for i in 1:10 ];

julia> y = [ rand(SVector{3,Float64}) for i in 1:10 ];

julia> @btime xp(x,y,z)
  26.526 ns (0 allocations: 0 bytes)

The analogous of what you are doing is:

julia> function xp(x,y)
         z = similar(x)
         for i in eachindex(x)
           z[i] = x[i] × y[i]
         end
         z
       end
xp (generic function with 1 method)

julia> x = [ rand(3) for i in 1:10 ];

julia> y = [ rand(3) for i in 1:10 ];

julia> @btime xp(x,y);
  296.653 ns (11 allocations: 1.25 KiB)

As you see, the difference is huge.

ChrisRackauckas · March 26, 2021, 1:44pm

Note that the Dormand-Prince 8-7 is very suboptimal, and the 8-5-3 is still pretty suboptimal in comparison to some of the newer Verner schemes. In this range I would usually recommend Vern7 or Vern9 given the benchmarks.

Joh11 · March 26, 2021, 2:34pm

Thanks ! That’s good to know.

I’ll probably change to another scheme later. I’ve chosen Dormand Prince since my goal is to reproduce results from a paper (https://arxiv.org/pdf/1403.7903.pdf) which uses DP.

Joh11 · March 26, 2021, 4:26pm

Thanks !

In the end, Ns=3 and L=144.

I first tried to skip the allocation in the function f (and in dormandprince) though it did not have a significant speedup somehow. I’ve now rework the code to use static vectors, and it works !

For L=144, it went from 150 ms to 25 ms per RK8 step.
The code can be found in the other branch: https://github.com/Joh11/semiclassical-monte-carlo/blob/45d1f1ec1027658f0a65d352214f938a88ae0a7b/scmc.jl#L70

Would you have any other tips to go even faster ? In particular, I feel like the way I compute the localfield (https://github.com/Joh11/semiclassical-monte-carlo/blob/45d1f1ec1027658f0a65d352214f938a88ae0a7b/HamiltonianMod.jl#L95) is really inefficient, but I am not sure how to make it faster.

lmiq · March 26, 2021, 4:55pm

I am not sure if I can suggest this because the new code with the branch became confusing to me. But I think you can define v (from randomstate) as a vector of static vectors as well, and that will probably accelerate that function.

Or, at least, use broadcasting there, because otherwise you might be allocating a vector on the right of

               ret += c * v[s2, wrapindex(i + Δi, L), wrapindex(j + Δj, L)]

(I suppose v[i,j,k] here is returning a vector). If c is a scalar and both v[i,j,k] and ret are static arrays, this will be fast and non-allocating. If v[i,j,k] returns a standard array you might be allocating c*v at each iteration. You can at least avoid that by making ret a mutable array (MArray or simply a standard array of dimension 3) and broadcasting (ret .+= c .* v[i,j,k]).

After being sure that that loop is not allocating anything (suggestion) you may try to use inbounds, simd, to see if you can get any speedup there.

Topic		Replies	Views
Integrator AutoVern9(Rodas5) take a lot of time DifferentialEquations.jl Modelling & Simulations	8	507	August 9, 2022
DiffEq and SVector Modelling & Simulations	16	684	December 10, 2020
Request guidance in optimising the performance of my implementation (running time and allocation) General Usage question	9	982	April 30, 2018
Reduce allocations (increase the running speed) for large scale ODEs (136 equations) General Usage question	3	370	October 25, 2022
Reducing the memory allocation in ODEs Performance question	11	1334	January 22, 2022

Optimizing semiclassical simulation for a lattice system

Related topics