Compare Julia sum to a C++ implementation - Julia is extremely slow?!

Hello, I am trying to compare the speed of a simple summation function over a 2D array in C++ and Julia.
However, I see that the Julia code gets slower as the array grows, while the C++ code stays constant in time. Could someone test it and maybe find my mistake in the C++ time measurement, or whatever else is going on…

julia> using BenchmarkTools

julia> for L ∈ [10, 100, 1000, 10000]
               @show L
               A = fill(Int8(1), L, L)
               @btime sum($A)
               @show sum(A)
       end
L = 10
  120.125 ns (0 allocations: 0 bytes)
sum(A) = 100
L = 100
  7.543 μs (0 allocations: 0 bytes)
sum(A) = 10000
L = 1000
  727.194 μs (0 allocations: 0 bytes)
sum(A) = 1000000
L = 10000
  75.382 ms (0 allocations: 0 bytes)
sum(A) = 100000000
And here is the C++ code:

#include <iostream>
// #include <unistd.h> 

#include <vector>
#include <chrono>


template <typename T>
int sum(std::vector<std::vector<T>> &A)
{
	int S = 0;
	for (auto x = 0; x < size(A); ++x)
		for (auto y = 0; y < size(A[0]); ++y)
			// usleep(0.000001);
			S += A[x][y];
	return S;
}


int main()
{
	for (auto L : {10, 100, 1000, 10000})
	{
		std::vector<std::vector<int>> A(L, std::vector<int>(L,1));
		for (auto x = 0; x < size(A); ++x)
			for (auto y = 0; y < size(A[0]); ++y)
				A[x][y] = 1;

		auto N = 10;
		auto t_total = std::chrono::duration<double>(0);
		for (auto i = 1; i <= N; ++i)
		{
			auto t_start = std::chrono::steady_clock::now();
			sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
		}

		t_total /= N;
		std::cout << "L = " << L << std::endl;
		std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count()  << " ns\n" << std::endl;
		std::cout << "sum = " << sum(A) << std::endl;
	}
	return 0;
}

Compiled with g++ -std=c++17 -O3 -march=native main.cpp -o main, this gives me:

L = 10
  t = 4078 ns

sum = 100
L = 100
  t = 4442 ns

sum = 10000
L = 1000
  t = 4414 ns

sum = 1000000
L = 10000
  t = 4274 ns

sum = 100000000

Note that when I replace the summation with usleep() (as commented out in the C++ code), I get longer times for larger arrays, which shows that the time measurement works as expected…

The C++ compiler is allowed to do the computation at compile time, which might be happening here. Try passing L as a command-line argument via argv.

That’s clearly not computing at runtime since the run time is constant.

Passing L as an argument does not change the times, unfortunately.

Here are my numbers:

L = 10
  t = 99 ns

sum = 100
L = 100
  t = 5409 ns

sum = 10000
L = 1000
  t = 387184 ns

sum = 1000000
L = 10000
  t = 28921359 ns

sum = 100000000

Hm, strange. What kind of optimization could my compiler be doing here to get constant time?

It can calculate the sums at compile time

Even after passing L as an argument?

I will try with random numbers.

Could you show the new code?

Here is the version with L passed as an argument:

#include <iostream>
#include <cstdlib>

// #include <unistd.h> 

#include <vector>
#include <chrono>


template <typename T>
int sum(std::vector<std::vector<T>> &A)
{
	int S = 0;
	for (auto x = 0; x < size(A); ++x)
		for (auto y = 0; y < size(A[0]); ++y)
			// usleep(0.000001);
			S += A[x][y];
	return S;
}


int main(int argc, char *argv[])
{
	auto L = atoi(argv[1]);
	std::vector<std::vector<int>> A(L, std::vector<int>(L, 1));
	for (auto x = 0; x < size(A); ++x)
		for (auto y = 0; y < size(A[0]); ++y)
			A[x][y] = 1;

	auto N = 10;
	auto t_total = std::chrono::duration<double>(0);
	for (auto i = 1; i <= N; ++i)
	{
		auto t_start = std::chrono::steady_clock::now();
		sum(A);
		auto t_end = std::chrono::steady_clock::now();
		t_total += t_end - t_start;
		std::cout << i << std::endl;
	}

	t_total /= N;
	std::cout << "L = " << L << std::endl;
	std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count() << " ns\n" << std::endl;
	std::cout << "sum = " << sum(A) << std::endl;
	return 0;
}

Even using random arrays, which clearly cannot be computed at compile time (or can they?), does not solve the problem. Here is the example:

#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <functional>

std::mt19937::result_type seed = 0;
auto uniform_rand = std::bind(std::uniform_real_distribution<double>(0,1), std::mt19937(seed));


template <typename T>
int sum(std::vector<std::vector<T>> &A)
{
	int S = 0;
	for (auto x = 0; x < size(A); ++x)
		for (auto y = 0; y < size(A[0]); ++y)
			S += A[x][y];
	return S;
}


int main()
{
	for (auto L : {10, 100, 1000, 10000})
	{

		auto N = 10;
		auto t_total = std::chrono::duration<double>(0);
		for (auto i = 1; i <= N; ++i)
		{
			std::vector<std::vector<double>> A(L, std::vector<double>(L, 1));
			for (auto x = 0; x < size(A); ++x)
				for (auto y = 0; y < size(A[0]); ++y)
					A[x][y] = uniform_rand();
			auto t_start = std::chrono::steady_clock::now();
			sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
		}

		t_total /= N;
		std::cout << "L = " << L << std::endl;
		std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count()  << " ns\n" << std::endl;
	}
	return 0;
}

returns:

L = 10
  t = 4106 ns

L = 100
  t = 4805 ns

L = 1000
  t = 5978 ns

L = 10000
  t = 5391 ns

If the compiler is clever, it will not evaluate the sum at all, since the result is never used.

Try printing the sum
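
Or, if you don't want to print on every iteration, you can write the result to a volatile sink so the optimizer cannot discard the call. A minimal sketch of that variation (my own illustration, not code from the thread):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Naive double loop over a vector of vectors, result returned to the caller.
int sum(const std::vector<std::vector<int>> &A)
{
	int S = 0;
	for (std::size_t x = 0; x < A.size(); ++x)
		for (std::size_t y = 0; y < A[x].size(); ++y)
			S += A[x][y];
	return S;
}

int main()
{
	const int L = 1000;
	std::vector<std::vector<int>> A(L, std::vector<int>(L, 1));

	volatile int sink = 0;  // a store to a volatile object cannot be optimized away
	auto t_start = std::chrono::steady_clock::now();
	sink = sum(A);          // the result is now observable, so the call must run
	auto t_end = std::chrono::steady_clock::now();

	std::cout << "t = "
		<< std::chrono::duration_cast<std::chrono::nanoseconds>(t_end - t_start).count()
		<< " ns (sum = " << sink << ")\n";
	return 0;
}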

Yeah, that was the case here. The following example works:

#include <iostream>
#include <cstdlib>

#include <vector>
#include <chrono>

template <typename T>
int sum(std::vector<std::vector<T>> &A)
{
	int S = 0;
	for (auto x = 0; x < size(A); ++x)
		for (auto y = 0; y < size(A[0]); ++y)
			S += A[x][y];
	return S;
}


int main()
{
	for (auto L : {10, 100, 1000, 10000})
	{
		std::vector<std::vector<short int>> A(L, std::vector<short int>(L,1));
		for (auto x = 0; x < size(A); ++x)
			for (auto y = 0; y < size(A[0]); ++y)
				A[x][y] = 1;

		auto N = 100;
		auto t_total = std::chrono::duration<double>(0);
		for (auto i = 1; i <= N; ++i)
		{
			int S = 0;
			auto t_start = std::chrono::steady_clock::now();
			S = sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
			std::cout << S << std::endl;
		}

		t_total /= N;
		std::cout << "L = " << L << std::endl;
		std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count()  << " ns\n" << std::endl;
	}
	return 0;
}

And produces (after removing the printed sums from the output :wink:):

L = 10
  t = 4514 ns

L = 100
  t = 12138 ns

L = 1000
  t = 474792 ns

L = 10000
  t = 53492100 ns

This is still somewhat faster than Julia, but now the times are in the same range…

You are summing Int8 in Julia and Int32 in cpp

Ok, so this manually defined version of the sum is much faster in Julia:

using BenchmarkTools

function mysum(A) 
	S = zero(eltype(A))
	@simd for i ∈ eachindex(A) 
		@inbounds S += A[i] 
	end
	return S 
end

for L ∈ [10, 100, 1000, 10000]
	@show L
	A = fill(Int8(1), L, L)

	@btime sum($A)
	@btime mysum($A)
end

results in

L = 10
  120.125 ns (0 allocations: 0 bytes)
  24.963 ns (0 allocations: 0 bytes)
L = 100
  7.543 μs (0 allocations: 0 bytes)
  478.913 ns (0 allocations: 0 bytes)
L = 1000
  724.961 μs (0 allocations: 0 bytes)
  116.496 μs (0 allocations: 0 bytes)
L = 10000
  75.416 ms (0 allocations: 0 bytes)
  17.051 ms (0 allocations: 0 bytes)

And using int8_t in the C++ code gives:

L = 10
  t = 4237 ns

L = 100
  t = 8573 ns

L = 1000
  t = 354.543 μs

L = 10000
  t = 34.411147 ms
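
The int8_t code itself is not shown above; it is essentially the previous program with the element type swapped, roughly like the following sketch (here with a volatile sink instead of printing every sum):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

// Same benchmark, but with 8-bit elements to match the Julia Int8 array.
// The accumulator stays int, so the totals do not overflow.
int sum(const std::vector<std::vector<std::int8_t>> &A)
{
	int S = 0;
	for (std::size_t x = 0; x < A.size(); ++x)
		for (std::size_t y = 0; y < A[x].size(); ++y)
			S += A[x][y];
	return S;
}

int main()
{
	for (auto L : {10, 100, 1000, 10000})
	{
		std::vector<std::vector<std::int8_t>> A(L, std::vector<std::int8_t>(L, 1));

		const int N = 100;
		auto t_total = std::chrono::duration<double>(0);
		volatile int sink = 0;  // keeps the optimizer from dropping the calls
		for (int i = 1; i <= N; ++i)
		{
			auto t_start = std::chrono::steady_clock::now();
			sink = sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
		}
		t_total /= N;

		std::cout << "L = " << L << ", sum = " << sink << ", t = "
			<< std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count()
			<< " ns\n";
	}
	return 0;
}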

I think the Julia summation algorithm is more sophisticated and tries to reduce rounding errors.
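
I believe Base.sum uses pairwise summation on arrays: the range is split in half, each half is summed recursively, and the two partial sums are added, so the rounding error grows roughly like log(n) instead of n. A rough C++ sketch of the idea (just an illustration, not Julia's actual implementation):

#include <cstddef>
#include <iostream>
#include <vector>

// Pairwise (recursive) summation: split the range in half, sum each half,
// then add the two partial sums. Small blocks fall back to a plain loop.
double pairwise_sum(const std::vector<double> &v, std::size_t lo, std::size_t hi)
{
	if (hi - lo <= 128)
	{
		double s = 0.0;
		for (std::size_t i = lo; i < hi; ++i)
			s += v[i];
		return s;
	}
	const std::size_t mid = lo + (hi - lo) / 2;
	return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid, hi);
}

int main()
{
	std::vector<double> v(1000000, 0.1);
	std::cout << pairwise_sum(v, 0, v.size()) << "\n";  // close to 100000
	return 0;
}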

That could well be. However, in a hot loop, sometimes the faster way is the way to go :wink: