Performance optimisation: Julia vs C++

Hi all!

This is for the performance tweaking addicts out there :smiley:

I had a conversation with a “C++ friend” today in which, upon request, I tried to show him that Julia can be as performant as C++ (say, within a 30% performance difference). I’ll spare you the entire back-and-forth but, eventually, we ended up with the following Julia code:

using Random
using BenchmarkTools

const rng = MersenneTwister()

# Spin at (x, y) with periodic (wrap-around) boundary conditions
function get(spin_conf, x, y)
    (Nx, Ny) = size(spin_conf)
    return @inbounds spin_conf[(x-1+Nx)%Nx + 1, (y-1+Ny)%Ny + 1]
end

# Energy change from flipping the spin at (i, j)
function deltaE(spin_conf, i, j)
    return 2.0*get(spin_conf,i,j) * (get(spin_conf,i-1,j) + get(spin_conf,i+1,j) +
                                     get(spin_conf,i,j+1) + get(spin_conf,i,j-1))
end

# Total energy of the configuration
function energy(spin_conf)
    (Nx, Ny) = size(spin_conf)
    res = 0
    for i = 1:Nx
        @simd for j = 1:Ny
            res += -get(spin_conf,i,j)*(get(spin_conf,i+1,j) + get(spin_conf,i,j+1))
        end
    end
    return res
end

magnetism(spin_conf) = sum(spin_conf)

# Metropolis sweep: M random single-spin updates, recording the energy after each
function sweep!(rec, spin_conf, beta, M)
    (Nx, Ny) = size(spin_conf)
    @inbounds @simd for i = 1:M
        x = rand(rng, 1:Nx)
        y = rand(rng, 1:Ny)
        r = rand(rng)
        dE = deltaE(spin_conf, x, y)
        if exp(-beta*dE) > r
            spin_conf[x,y] *= -1
        end
        rec[i] = energy(spin_conf)
    end
    return rec
end

function main()
    beta = 0.2
    L = 64
    total_iteration = 100_000

    spin_conf = rand(rng, (-1,1), L, L)
    rec = Vector{Int64}(undef,total_iteration)

    sweep!(rec, spin_conf, beta, total_iteration)

    s = sum(last(rec, 1000))/1000
    s /= L^2
    # println(s)
    return nothing
end

Timings on my machine:

julia> @btime main();
  1.836 s (5 allocations: 821.34 KiB)

Unfortunately, on my machine, this is still 5.56x slower than the C++ code, which takes 330 ms (see all the C++ details below).


Anticipating that the C++ version unrolls some loops, I tried to optimise the energy function using explicit loop unrolling. (Note that, unfortunately, LoopVectorization.jl’s @avx doesn’t seem to work here.)

# Copied from https://github.com/StephenVavasis/Unroll.jl
include("Unroll.jl")
using .Unroll

function energy(spin_conf)
    (Nx, Ny) = size(spin_conf)
    res = 0
    for i = 1:Nx
        @unroll for j = 1:64 # THIS LINE HAS CHANGED!
            res += -get(spin_conf,i,j)*(get(spin_conf,i+1,j) + get(spin_conf,i,j+1))
        end
    end
    return res
end

Timings on my machine:

julia> @btime main();
  429.526 ms (5 allocations: 821.34 KiB)

This gets the Julia code into the same ballpark, at the cost of using a macro from an old, unmaintained package and hard-coding the iterator length 64. However, after adding an explicit #pragma GCC unroll 64 statement to the C++ code as well (to be fair), the latter takes only 150 ms, so the Julia code is, again, ~2.9x slower.
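
(A dependency-free alternative I haven’t benchmarked: ntuple with a Val length is generated as straight-line code, so it also replicates the body 64 times. The names row_terms and energy_unrolled below are hypothetical, just for illustration.)

# Dependency-free unrolling sketch (hypothetical, not benchmarked):
# ntuple(f, Val(N)) is generated as straight-line code, so the 64 row terms
# are emitted without a loop, much like the @unroll version above.
row_terms(spin_conf, i) = ntuple(j -> -get(spin_conf, i, j) *
    (get(spin_conf, i+1, j) + get(spin_conf, i, j+1)), Val(64))

function energy_unrolled(spin_conf)
    res = 0
    for i in 1:size(spin_conf, 1)
        res += sum(row_terms(spin_conf, i))
    end
    return res
end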

So, the mission, should you choose to accept it, is to tweak this Julia code to C++ speed (a 30% performance difference is the goal). While restructuring the code is perfectly fine, algorithmic improvements are not allowed. The imaginary Julia and C++ programmers should be similarly smart :smiley:

Of course, I would also appreciate hints/tips/tricks/comments of any kind! :smiley: Thanks in advance, guys!

Best,
Carsten

Unroll.jl
module Unroll


copy_and_substitute_tree(e, varname, newtext) = e

copy_and_substitute_tree(e::Symbol, varname, newtext) = 
    e == varname ? newtext : e

function copy_and_substitute_tree(e::Expr, varname, newtext)
    e2 = Expr(e.head)
    for subexp in e.args
        push!(e2.args, copy_and_substitute_tree(subexp, varname, newtext))
    end
    if e.head == :if
        newe = e2
        try
            u = Core.eval(@__MODULE__, e2.args[1])
            if u == true
                newe = e2.args[2]
            elseif u == false
                if length(e2.args) == 3
                    newe = e2.args[3]
                else
                    newe = :nothing
                end
            end
        catch
        end
        e2 = newe
    end
    e2 
end


macro unroll(expr)
    if expr.head != :for || length(expr.args) != 2 ||
        expr.args[1].head != :(=) || 
        typeof(expr.args[1].args[1]) != Symbol ||
        expr.args[2].head != :block
        error("Expression following unroll macro must be a for-loop as described in the documentation")
    end
    varname = expr.args[1].args[1]
    ret = Expr(:block)
    for k in Core.eval(@__MODULE__, expr.args[1].args[2])
        e2 = copy_and_substitute_tree(expr.args[2], varname, k)
        push!(ret.args, e2)
    end
    esc(ret)
end

macro tuplegen(expr)
    if expr.head != :comprehension || length(expr.args) != 2 ||
        expr.args[2].head != :(=) ||
        typeof(expr.args[2].args[1]) != Symbol
        error("Expression following tuplegen macro must be a comprehension as described in the documentation")
    end
    varname = expr.args[2].args[1]
    ret = Expr(:tuple)
    for k in Core.eval(@__MODULE__, expr.args[2].args[2])
        e2 = copy_and_substitute_tree(expr.args[1], varname, k)
        push!(ret.args,e2)
    end
    esc(ret)
end

export @unroll
export @tuplegen

end
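
(For reference, a minimal usage sketch: the macro expands the loop at macro-expansion time into copies of the body with the loop variable substituted by each literal value.)

include("Unroll.jl")
using .Unroll

out = Int[]
@unroll for j = 1:4   # expands into four copies of the body with j = 1, 2, 3, 4
    push!(out, j)
end
@assert out == [1, 2, 3, 4]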

C++ code, compilation & timings

Ising2D.cpp:

#include <Eigen/Dense>
#include <random>
#include <fstream>
#include <iostream>

class Ising2D
{
private:
	int64_t L_;
	Eigen::Array<int64_t,Eigen::Dynamic,Eigen::Dynamic> confs_;

public:
	template<class RandomEngine>
	Ising2D(int64_t L, RandomEngine& re)
		: L_{L}, confs_(L, L)
	{
		std::uniform_int_distribution<int64_t> uid(0, 1);

		for(int64_t i = 0; i < L; i++)
		{
			for(int64_t j = 0; j < L; j++)
			{
				confs_(i,j) = 2*uid(re)-1;
			}
		}
	}

	__attribute__((always_inline)) inline int64_t get(int64_t i, int64_t j) const
	{
		return confs_((i+L_)%L_, (j+L_)%L_);
	}

	int64_t H() const
	{
		int64_t res = 0;
		for(int64_t i = 0; i < L_; i++)
		{
			for(int64_t j = 0; j < L_; j++)
			{
				res += -get(i,j)*(get(i+1,j) + get(i,j+1));
			}
		}
		return res;
	}
	
	//energy diff when flipping (i,j)
	inline double dH(int64_t i, int64_t j) const 
	{
		return 2.0*get(i, j)*(get(i+1,j) + get(i-1,j) + get(i,j+1) + get(i, j-1));
	}
	
	template<class RandomEngine>
	Eigen::ArrayXi sweep(RandomEngine& re, const double beta, const int64_t steps)
	{
		Eigen::ArrayXi res(steps);
		std::uniform_int_distribution<int64_t> uid(0, L_-1);
		std::uniform_real_distribution<double> urd(0., 1.);

		for(int64_t k = 0; k < steps; k++)
		{
			int64_t x = uid(re);
			int64_t y = uid(re);

			double p = std::min(std::exp(-beta*dH(x, y)),1.);
			if(p > urd(re))
			{
				confs_(x,y) *= -1;
			}
			res(k) = H();
		}
		return res;
	}

	double magnetization() const
	{
		return double(confs_.sum())/(L_*L_);
	}

	void swapConf(Ising2D& other)
	{
		confs_.swap(other.confs_);
	}

};

int main()
{
	const double beta = 0.2;
	const int64_t L = 64;
	const int64_t total_iteration = 100'000;

	std::uniform_real_distribution<double> urd(0., 1.);

	std::random_device rd;
	std::default_random_engine re{rd()};

	Ising2D ising(L, re);
	auto rec = ising.sweep(re, beta, total_iteration);

	printf("%f\n", (double)rec.tail(1000).sum()/1000.0/(L*L));
	return 0;
}

Dependency: Eigen (simply git clone https://gitlab.com/libeigen/eigen.git and add it to the include path with -I)

Naive compilation

g++ -O3 -march=native -I ~/.local/lib/eigen/ Ising2D.cpp -o Ising2D

Timings on my machine: 8.16s user 0.01s system 96% cpu 8.465 total

-funroll-loops

g++ -O3 -march=native -funroll-loops -I ~/.local/lib/eigen/ Ising2D.cpp -o Ising2D

Timings on my machine: 5.61s user 0.01s system 96% cpu 5.801 total

-fwhole-program

g++ -O3 -march=native -fwhole-program -I ~/.local/lib/eigen/ Ising2D.cpp -o Ising2D

Timings on my machine: 0.42s user 0.00s system 61% cpu 0.687 total

-fwhole-program + -funroll-loops

g++ -O3 -march=native -funroll-loops -fwhole-program -I ~/.local/lib/eigen/ Ising2D.cpp -o Ising2D

Timings on my machine: 0.33s user 0.00s system 99% cpu 0.332 total

#pragma GCC unroll 64

Adding explicit loop unrolling by putting #pragma GCC unroll 64 in front of the j loop in H() (which is energy in the Julia code):

g++ -O3 -march=native -funroll-loops -fwhole-program -I ~/.local/lib/eigen/ Ising2D_pragma.cpp

Timings on my machine: 0.15s user 0.00s system 98% cpu 0.158 total

7 Likes

const rng = MersenneTwister()

One thing to consider is changing your random number generator. https://sunoru.github.io/RandomNumbers.jl/stable/man/benchmark/#Speed-Test-1 has some benchmarks of alternatives.
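
(For example, a minimal swap, assuming RandomNumbers.jl is installed; the rest of the code stays unchanged since these generators are AbstractRNGs:)

using RandomNumbers.Xorshifts

const rng = Xoroshiro128Star()  # drop-in replacement for MersenneTwister()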

3 Likes

Thanks for the hint! I had the same idea (but didn’t fully try it). I tried VectorizedRNG.jl, but it doesn’t seem to allow sampling from (-1,1), for example, AFAICS. Any particular RNG from RandomNumbers.jl in mind?

    @inbounds @simd for i = 1:M
        x = rand(rng, 1:Nx)
        y = rand(rng, 1:Ny)

You’re allocating two 64-element vectors 100k times; you could allocate them once before the loop and use rand! to fill them with random elements.

1 Like

No, I just draw two scalar random numbers from the ranges. There are only 5 allocations for main(), as you can see from the @btime output.
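
(Indeed, drawing a scalar from a UnitRange is allocation-free; a quick way to convince yourself:)

using BenchmarkTools
@btime rand($rng, 1:64);  # should report 0 allocations: the range is never materialised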

Uh, sorry, you’re right!

1 Like

Xoroshiro128Star is the fastest, but I’d be very surprised if that sped things up by more than 15%.

Deleting the %Nx in the indexing speeds things up by a factor of 2 for me. I thought the usual trick is to pad the array slightly, so that you can always read blindly, and then, when writing near the edge, be sure to update both copies. (But you read at least N^2 times per write here.)
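
(A minimal sketch of that padding idea, with hypothetical helper names not from the thread: keep the L×L lattice inside an (L+2)×(L+2) array whose border duplicates the opposite edge, so that reads never need %:)

# Hypothetical ghost-cell padding: the inner site (x, y) lives at padded[x+1, y+1],
# and the border rows/columns mirror the opposite edge of the lattice.
function make_padded(spin_conf)
    (Nx, Ny) = size(spin_conf)
    padded = Matrix{Int64}(undef, Nx + 2, Ny + 2)
    padded[2:end-1, 2:end-1] .= spin_conf
    padded[1, 2:end-1]   .= spin_conf[end, :]  # top ghost row    = bottom edge
    padded[end, 2:end-1] .= spin_conf[1, :]    # bottom ghost row = top edge
    padded[2:end-1, 1]   .= spin_conf[:, end]  # left ghost col   = right edge
    padded[2:end-1, end] .= spin_conf[:, 1]    # right ghost col  = left edge
    return padded
end

# After flipping inner site (x, y), keep the ghost copies in sync.
function flip!(padded, x, y)
    padded[x+1, y+1] *= -1
    Nx, Ny = size(padded, 1) - 2, size(padded, 2) - 2
    x == 1  && (padded[Nx+2, y+1] = padded[2, y+1])
    x == Nx && (padded[1, y+1]    = padded[Nx+1, y+1])
    y == 1  && (padded[x+1, Ny+2] = padded[x+1, 2])
    y == Ny && (padded[x+1, 1]    = padded[x+1, Ny+1])
    return padded
end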

Commenting out the rand calls didn’t change times much for me.

1 Like

I think you need to change the order of the loops somehow, because Julia differs from C++ in which array dimension is contiguous in memory (Julia is column-major). In some spots, get seems unnecessary because the indices are already within the array bounds.
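
(To illustrate the column-major point with a tiny stand-alone example: the first index is the one that is contiguous in memory, so it belongs in the innermost loop.)

# Julia arrays are column-major: A[i, j] and A[i+1, j] are adjacent in memory,
# so the inner loop should run over the first index.
function colmajor_sum(A)
    (Nx, Ny) = size(A)
    s = zero(eltype(A))
    for j in 1:Ny            # outer loop over columns
        @simd for i in 1:Nx  # inner loop walks contiguous memory
            @inbounds s += A[i, j]
        end
    end
    return s
end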

1 Like

Wouldn’t changing the RNG be an algorithmic improvement? What RNG is the C++ code using?

1 Like

Profiling shows the RNG is not a bottleneck at all.
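
(For anyone who wants to check this themselves, a sketch using the Profile stdlib:)

using Profile
main()            # run once so compilation isn't profiled
@profile main()
Profile.print()   # check what share of samples land in the rand/RNG frames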

2 Likes

Yeah, the mod operations certainly are at the core of this, I think. See also @Elrod’s comment in the LoopVectorization issue.

Note, though, that the C++ code does the same (no padding there either).
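
(For completeness, one mod-free alternative, shown here only as a sketch that was not tried in the thread: a branchless wrap via ifelse, which suffices because the nearest-neighbour offsets leave the valid range by at most one period.)

# Branchless periodic wrap without %; assumes x is off by at most one period,
# as is the case for the ±1 neighbour offsets in the stencil.
wrap(x, N) = ifelse(x < 1, x + N, ifelse(x > N, x - N, x))

function get_wrapped(spin_conf, x, y)
    (Nx, Ny) = size(spin_conf)
    return @inbounds spin_conf[wrap(x, Nx), wrap(y, Ny)]
end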

I just tried const rng = Xoroshiro128Star() and, in fact, for the version without @unroll this even slows things down a bit for me:

julia> @btime main();
  1.936 s (5 allocations: 821.34 KiB)

(compared to 1.8 s in the OP)

1 Like

I’ve also seen this in something I’m working on. When profiling, the mod in the indexing dominates my timings.

Thanks for the comment. However, the loop order doesn’t really seem to matter here (I tested it explicitly at some point). I guess that’s because we are accessing different columns and rows in the “kernel” anyway.

Using my (unregistered and recent) package FastModulo.jl, I can get a 2x speedup. Here I have just declared Nx and Ny as constants, since we are always computing mod 64. It would be the same if you created modNx, modNy in main and passed them through to get(...), though.

using Random
using BenchmarkTools
using FastModulo
const rng = MersenneTwister()
const modNx = Mod(64)
const modNy = Mod(64)
function get(spin_conf, x, y)
    (Nx, Ny) = size(spin_conf)
    return @inbounds spin_conf[fastmod(x-1+Nx,modNx) + 1, fastmod(y-1+Ny,modNy) + 1]
end

function deltaE(spin_conf, i, j)
    return 2.0*get(spin_conf,i,j) * (get(spin_conf,i-1,j) + get(spin_conf,i+1,j) +
                                     get(spin_conf,i,j+1) + get(spin_conf,i,j-1))
end

function energy(spin_conf)
    (Nx, Ny) = size(spin_conf)
    res = 0
    for i = 1:Nx
        @simd for j = 1:Ny
            res += -get(spin_conf,i,j)*(get(spin_conf,i+1,j) + get(spin_conf,i,j+1))
        end
    end
    return res
end

magnetism(spin_conf) = sum(spin_conf)

function sweep!(rec, spin_conf, beta, M)
    (Nx, Ny) = size(spin_conf)
    @inbounds @simd for i = 1:M
        x = rand(rng, 1:Nx)
        y = rand(rng, 1:Ny)
        r = rand(rng)
        dE = deltaE(spin_conf, x, y)
        if exp(-beta*dE) > r
            spin_conf[x,y] *= -1
        end
        rec[i] = energy(spin_conf)
    end
    return rec
end

function main()
    beta = 0.2
    L = 64
    total_iteration = 100_000

    spin_conf = rand(rng, (-1,1), L, L)
    rec = Vector{Int64}(undef,total_iteration)

    sweep!(rec, spin_conf, beta, total_iteration)

    s = sum(last(rec, 1000))/1000
    s /= L^2
    # println(s)
    return nothing
end

@btime main()

edit: if you don’t need the modulo, then never mind, of course
edit2: it appears that this speedup comes from the constant declaration

1 Like

For me the speedup is from 3 seconds to 0.2 seconds (compilation included) :flushed:

Even without explicit padding, just breaking up the square by hand helps a lot. Maybe the C++ compiler figures this out?

@inline function unsafe_get(spin_conf, x, y)
    # interior version: no wrap-around needed, so the %Nx, %Ny (and the ±Nx, ±Ny
    # shifts that went with them) are simply dropped
    return @inbounds spin_conf[x, y]
end
function energy(spin_conf)
    (Nx, Ny) = size(spin_conf)
    res = 0
    for j = 2:Ny-1  # loop order changed, as @tomaklutfu suggested
        @simd for i in 2:Nx-1 
            res += -unsafe_get(spin_conf,i,j)*(unsafe_get(spin_conf,i+1,j) + unsafe_get(spin_conf,i,j+1))
        end
    end
    @simd for i in 1:Nx
        j = 1
        res += -get(spin_conf,i,j)*(get(spin_conf,i+1,j) + get(spin_conf,i,j+1))
        j = Ny
        res += -get(spin_conf,i,j)*(get(spin_conf,i+1,j) + get(spin_conf,i,j+1))
    end
    @simd for j in 2:Ny-1
        i = 1
        res += -get(spin_conf,i,j)*(get(spin_conf,i+1,j) + get(spin_conf,i,j+1))
        i = Nx
        res += -get(spin_conf,i,j)*(get(spin_conf,i+1,j) + get(spin_conf,i,j+1))
    end
    return res
end
@btime main(); # 165.784 ms (5 allocations: 821.34 KiB)
# first version: 628.393 ms (5 allocations: 821.34 KiB)

# with @unroll for i in 2:63:  143.385 ms (5 allocations: 821.34 KiB)
3 Likes

using VectorizedRNG, VectorizationBase
x = Vector{Float64}(undef, 1024);
rand!(local_rng(), x, StaticInt(0), StaticInt(-1), StaticInt(2))

Could also replace the statics (except for the first one) with -1.0 and 2.0.
The arguments are a scale multiplier for x, an offset, and a scale multiplier for the generated random number; i.e. the call computes x .= 0 .* x .+ (-1) .+ 2 .* u with u uniform on [0, 1), which is uniform on [-1, 1). The scale multiplier for x should be StaticInt(0) because x can contain NaN, Inf, etc.; also, it won’t even load from x when the first argument is a StaticInt{0}.

julia> using VectorizedRNG, VectorizationBase

julia> @benchmark rand!(local_rng(), $x, StaticInt(0), StaticInt(-1), StaticInt(2))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     185.411 ns (0.00% GC)
  median time:      185.461 ns (0.00% GC)
  mean time:        185.701 ns (0.00% GC)
  maximum time:     209.802 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     671

julia> @benchmark @. $x = 2rand() - 1
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.904 μs (0.00% GC)
  median time:      3.920 μs (0.00% GC)
  mean time:        3.935 μs (0.00% GC)
  maximum time:     5.703 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark begin
           rand!($x)
           @. $x = 2 * $x - 1
       end
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     986.429 ns (0.00% GC)
  median time:      990.286 ns (0.00% GC)
  mean time:        992.971 ns (0.00% GC)
  maximum time:     2.135 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     14
3 Likes


Simply defining const L = 64 outside of the main function gave me a 10x speed up. Defining const Nx = 64 and const Ny = 64 and removing all the size calls improved it by another 5x, for a total of 50x speed up compared to your original version.

Erratum: only a 6x speed up in total, because I am not very smart.

I suspect defining L as a constant in the C++ program is what helped a lot.
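
(A minimal sketch of that change, assuming the lattice size is fixed at 64. Since 64 is a power of two and the operands here are non-negative, the compiler can lower the % by a constant to cheap bit operations instead of a division:)

const Nx = 64  # lattice size known at compile time
const Ny = 64

function get(spin_conf, x, y)
    # Nx and Ny are compile-time constants here, so %Nx no longer needs a division
    return @inbounds spin_conf[(x-1+Nx)%Nx + 1, (y-1+Ny)%Ny + 1]
end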

6 Likes