Performance optimisation: Julia vs C++

Did you do something else? (compilation options, etc?) I cannot reproduce that.

Ok, with everything constant (L, Nx, Ny), I also get the 6x speedup.

Yeah, sorry, I made a stupid mistake (I changed the number of iterations when converting it to a constant x_x). I edited my message. I still get a significant 6x speedup by declaring everything to be constant.


Inspired by your observation, @Kolaru, here is a version that runs in 404 ms (vs. 2 s for the original one) without having to make those variables constant. I did that (hopefully correctly) by putting the sizes in the type signature of the spin configuration:

(running with --check-bounds=no makes it run in 313 ms).

using Random
using BenchmarkTools

const rng = MersenneTwister()

# The lattice sizes Nx, Ny live in the type, so they are compile-time
# constants inside any function specialised on a concrete SpinConf.
struct SpinConf{Nx,Ny,T}
    m::Matrix{T}
end
Base.getindex(M::SpinConf, i, j) = @inbounds M.m[i, j]
Base.setindex!(M::SpinConf, x, i, j) = @inbounds (M.m[i, j] = x)
Base.sum(M::SpinConf) = sum(M.m)

# Accessor with periodic (wrap-around) boundary conditions.
function get(spin_conf::SpinConf{Nx,Ny,T}, x, y) where {Nx,Ny,T}
    return @inbounds spin_conf[(x-1+Nx)%Nx + 1, (y-1+Ny)%Ny + 1]
end

# Energy change if the spin at (i, j) were flipped.
function deltaE(spin_conf, i, j)
    return 2.0*get(spin_conf,i,j) * (get(spin_conf,i-1,j) + get(spin_conf,i+1,j) +
                                     get(spin_conf,i,j+1) + get(spin_conf,i,j-1))
end

# Total energy: sum over nearest-neighbour bonds (each bond counted once).
function energy(spin_conf::SpinConf{Nx,Ny,T}) where {Nx,Ny,T}
    res = 0
    for i = 1:Nx
        @simd for j = 1:Ny
            res += -get(spin_conf,i,j)*(get(spin_conf,i+1,j) + get(spin_conf,i,j+1))
        end
    end
    return res
end

magnetism(spin_conf) = sum(spin_conf)

# One Metropolis pass: attempt M random single-spin flips and record
# the total energy after each attempt (recomputed from scratch here).
function sweep!(rec, spin_conf::SpinConf{Nx,Ny,T}, beta, M) where {Nx,Ny,T}
    @inbounds @simd for i = 1:M
        x = rand(rng, 1:Nx)
        y = rand(rng, 1:Ny)
        r = rand(rng)
        dE = deltaE(spin_conf, x, y)
        if exp(-beta*dE) > r
            spin_conf[x,y] *= -1
        end
        rec[i] = energy(spin_conf)
    end
    return rec
end

function main()
    beta = 0.2
    L = 64
    total_iteration = 100_000

    spin_conf = SpinConf{L,L,Int64}(rand(rng, (-1,1), L, L))
    rec = Vector{Int64}(undef,total_iteration)

    sweep!(rec, spin_conf, beta, total_iteration)

    s = sum(last(rec, 1000))/1000
    s /= L^2
    return nothing
end

@btime main()

Note: with @inbounds in these lines, there is no need to run with --check-bounds=no (~310 ms):

Base.getindex(M::SpinConf,i,j) = @inbounds M.m[i,j]
Base.setindex!(M::SpinConf,x,i,j) = @inbounds (M.m[i,j] = x) 

Interesting that the type parameters for Nx, Ny help. As you say, there are still some bounds checks; Base.@propagate_inbounds Base.getindex(M::SpinConf,i,j) = M.m[i,j] helps, bringing it to 285.629 ms on my computer. Down to 243.265 ms if you switch the loop order in energy.
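
For concreteness, here is a sketch of what those two tweaks might look like together (the loop reordering matters because Julia matrices are column-major, so the first index should vary fastest):

# Let the caller's @inbounds reach the array access instead of forcing it:
Base.@propagate_inbounds Base.getindex(M::SpinConf, i, j) = M.m[i, j]

# energy with the loop order switched: i (the column-major index) innermost.
function energy(spin_conf::SpinConf{Nx,Ny,T}) where {Nx,Ny,T}
    res = 0
    for j = 1:Ny
        @simd for i = 1:Nx
            res += -get(spin_conf,i,j)*(get(spin_conf,i+1,j) + get(spin_conf,i,j+1))
        end
    end
    return res
end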


Some timing updates on my machine (i.e. comparable to the OP); all runs use --check-bounds=no and make L/Nx/Ny const (or use a SpinConf struct):

  baseline:                              295.933 ms (5 allocations: 821.34 KiB)
  + @unroll:                             217.009 ms (5 allocations: 821.34 KiB)
  + breaking up the square (@mcabbott):  158.804 ms (5 allocations: 821.34 KiB)
  + breaking up the square + @unroll:    154.410 ms (5 allocations: 821.34 KiB)

All that shows how hard it is to perform “fair” benchmarking. In this case the C++ code had a constraint on the parameters that the Julia code didn’t, and that made a huge difference. The cool thing is that the Julia code can now be run efficiently for different sizes of the array, since the functions get specialized versions for each size, even when run from the same “main”.
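
For instance, a small illustration of that point, reusing the SpinConf definitions above:

# Each (Nx, Ny) pair is a distinct type, so Julia compiles a separate,
# size-specialised version of energy, sweep!, etc. for each lattice size.
conf32 = SpinConf{32,32,Int64}(rand(rng, (-1, 1), 32, 32))
conf64 = SpinConf{64,64,Int64}(rand(rng, (-1, 1), 64, 64))
energy(conf32)  # specialised for {32,32}
energy(conf64)  # independently specialised for {64,64}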

My first attempt was to use a static matrix, but compilation times became way too high.


Have you tried changing the loop order in for i = 1:Nx; @simd for j = 1:Ny from i, j to j, i? Julia matrices are column-major, so the first index should be the innermost one.

Also, you can try making spin_conf an array of static arrays, so the loop should unroll.

UPD: Also, since the Nx constant is a power of two, you can change
from (x-1+Nx)%Nx + 1
into (x-1+Nx)&(Nx-1) + 1,
where (Nx-1) is also a constant.
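
Applied to the get function above, that could look like this (a sketch; valid only when Nx and Ny are powers of two, since n & (Nx-1) == n % Nx only holds then):

# Bitmask wrap-around for power-of-two lattice sizes.
function get(spin_conf::SpinConf{Nx,Ny,T}, x, y) where {Nx,Ny,T}
    return @inbounds spin_conf[(x - 1 + Nx) & (Nx - 1) + 1,
                               (y - 1 + Ny) & (Ny - 1) + 1]
end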

I’m not sure whether this is relevant to your apples-to-apples C++ vs Julia comparison, but…

You don’t need to recompute the energy at every step. You already compute the change in energy for each Ising flip, so just keep a running tally:

julia> function sweep!(rec, spin_conf, beta, M)
           (Nx, Ny) = size(spin_conf)
           E = Float64(energy(spin_conf))
           @inbounds @simd for i = 1:M
               x = rand(rng, 1:Nx)
               y = rand(rng, 1:Ny)
               r = rand(rng)
               dE = deltaE(spin_conf, x, y)
               if exp(-beta*dE) > r
                   spin_conf[x,y] *= -1
                   E += dE
               end
               rec[i] = E
           end
           energy(spin_conf) != E && error("mismatch -- expected energy $(E), found $(energy(spin_conf))")
           return rec
       end
julia> @btime main()
  3.458 ms (5 allocations: 821.34 KiB)

Typically you should first think about what you compute and how you do so. Then you should lay out your data (and the Ising config cries out for a BitMatrix instead of an integer matrix!). Only then should you care about @inbounds and loop vectorization.
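
The encoding itself is a one-liner; a tiny sketch, assuming ±1 integer spins and the rng as in the code above:

# Encode ±1 spins as bits: true ↦ +1, false ↦ -1.
spins = rand(rng, (-1, 1), 64, 64)   # the integer layout used earlier
bits  = spins .== 1                  # BitMatrix
@assert all(ifelse.(bits, 1, -1) .== spins)   # round-trips back to ±1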

(Also, the @simd in front of the above loop is magical thinking: how is the loop possibly supposed to vectorize? @simd only makes sense if subsequent iterations are independent; it is not magical performance pixie dust.)


It’s difficult to parallelize writes to a BitMatrix, even if the problem is embarrassingly parallel across indices, since many indices share the same memory location, and I guess there is often no atomic operation to set/clear a bit.

Relatedly, there may be situations where the BitArray is slower, again because of the bit-level operations that need to be performed. That’s all to say, it isn’t so obvious to me that going with a BitArray is beneficial.


The advantage of BitMatrix for this kind of Ising model is not the denser layout. The point is: given two ±1 vectors u, v of length 64, you can compute dot(u, v) = 64 - 2 * count_ones(xor(u_chunk, v_chunk)), where u_chunk::UInt64 is the bit representation of the vector (true for +1, false for -1). In other words, you get 64-wide SIMD for free, without even using fancy vector instructions of your CPU.

Out of this, you can build a really fast energy implementation.
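
A minimal sketch of that identity, checked against the explicit dot product (the bitdot name is mine):

using Random

# Agreeing bits contribute +1 to the dot product, differing bits -1;
# xor counts the d disagreements, so dot = (64 - d) - d = 64 - 2d.
bitdot(a::UInt64, b::UInt64) = 64 - 2 * count_ones(xor(a, b))

bits    = bitrand(64)          # random BitVector; exactly one UInt64 chunk
shifted = circshift(bits, 1)   # e.g. a neighbouring row
u = ifelse.(bits, 1, -1)       # decoded ±1 vectors
v = ifelse.(shifted, 1, -1)
@assert bitdot(bits.chunks[1], shifted.chunks[1]) == sum(u .* v)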

Depending on the problem, you may need to pad your matrix, possibly store both the matrix and its transpose, etc.

In other words, you don’t use BitMatrix as a black-box replacement for Matrix; instead you directly operate on the bitmatrix.chunks, and use the “BitArray” interface only for convenient low-effort slow operations that are either hard to write using the raw memory representation, or are outside of the hot path.

BitArray is a toolkit, not a black-box drop-in replacement for Array{Bool}.


Thanks for your comments! Indeed, I’m aware of faster 2D Ising model implementation strategies. AFAIK, the high-performance variants are eventually limited mostly by random number generation. The fact that this is not the case for the implementation here speaks for itself :) But, for this challenge/discussion, moving to BitArrays or changing the algorithm (summing up the dEs) would have to be done on the C++ side as well, which would give us a new timing to chase.

(Just to be clear, in a “real application”, I do agree that algorithmic improvements should come first, of course!)


Gotcha. Can the updates also be performed in terms of the underlying representation?

dE = deltaE(spin_conf, x, y)
if exp(-beta*dE) > r
    spin_conf[x,y] *= -1
    E += dE
end
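
For instance, could a single flip be done directly on the chunks, along these lines? (My guess; flipspin! is a made-up name.)

# Flip the bit for spin (x, y) by xor-ing its chunk; BitArray stores
# bits column-major in 64-bit chunks.
function flipspin!(conf::BitMatrix, x::Int, y::Int)
    k = (y - 1) * size(conf, 1) + x      # 1-based linear index
    chunk, offset = divrem(k - 1, 64)    # which UInt64, which bit
    conf.chunks[chunk + 1] ⊻= UInt64(1) << offset
    return conf
end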

I know we’re off topic, but I sure would like to see more of this 64-wide SIMD bit stuff. I played around with some different layouts here:

includes e.g. transposing an 8-by-8 array:

This talk has many more examples:
