Compare Julia sum to a C++ implementation - Julia is extremely slow?!

Hello, I am trying to compare the speed of a simple summation function over a 2D array in C++ and Julia.
However, I see that the Julia code gets slower as the array grows, while the C++ code stays constant in time. Could someone test it and maybe find my mistake in the C++ time measurement, or whatever else is going on…

julia> using BenchmarkTools

julia> for L ∈ [10, 100, 1000, 10000]
               @show L
               A = fill(Int8(1), L, L)
               @btime sum($A)
               @show sum(A)
       end
L = 10
  120.125 ns (0 allocations: 0 bytes)
sum(A) = 100
L = 100
  7.543 μs (0 allocations: 0 bytes)
sum(A) = 10000
L = 1000
  727.194 μs (0 allocations: 0 bytes)
sum(A) = 1000000
L = 10000
  75.382 ms (0 allocations: 0 bytes)
sum(A) = 100000000
And here is the C++ code:

#include <iostream>
// #include <unistd.h> 

#include <vector>
#include <chrono>


template <typename T>
int sum(std::vector<std::vector<T>> &A)
{
	int S = 0;
	for (auto x = 0; x < size(A); ++x)
		for (auto y = 0; y < size(A[0]); ++y)
			// usleep(0.000001);
			S += A[x][y];
	return S;
}


int main()
{
	for (auto L : {10, 100, 1000, 10000})
	{
		std::vector<std::vector<int>> A(L, std::vector<int>(L,1));
		for (auto x = 0; x < size(A); ++x)
			for (auto y = 0; y < size(A[0]); ++y)
				A[x][y] = 1;

		auto N = 10;
		auto t_total = std::chrono::duration<double>(0);
		for (auto i = 1; i <= N; ++i)
		{
			auto t_start = std::chrono::steady_clock::now();
			sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
		}

		t_total /= N;
		std::cout << "L = " << L << std::endl;
		std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count()  << " ns\n" << std::endl;
		std::cout << "sum = " << sum(A) << std::endl;
	}
	return 0;
}

Compiled with g++ -std=c++17 -O3 -march=native main.cpp -o main, this gives me:

L = 10
  t = 4078 ns

sum = 100
L = 100
  t = 4442 ns

sum = 10000
L = 1000
  t = 4414 ns

sum = 1000000
L = 10000
  t = 4274 ns

sum = 100000000

Note that when I replace the summation with usleep() (as commented out in the C++ code), I get longer times for larger arrays, which shows that the time measurement works as expected…

The C++ compiler is allowed to do the computation at compile time, which might be happening here. Try passing L as a command-line argument via argv.

That’s clearly not computing at runtime since the run time is constant.

Passing L as an argument does not change the times, unfortunately.

Here are my numbers:

L = 10
  t = 99 ns

sum = 100
L = 100
  t = 5409 ns

sum = 10000
L = 1000
  t = 387184 ns

sum = 1000000
L = 10000
  t = 28921359 ns

sum = 100000000

Hm, strange. What kind of optimization could my compiler be doing here to get constant time?

It can calculate the sums at compile time

Even after passing L as an argument?

I will try with random numbers.

Could you show the new code?

Here is the version with L passed as an argument:

#include <iostream>
#include <cstdlib>

// #include <unistd.h> 

#include <vector>
#include <chrono>


template <typename T>
int sum(std::vector<std::vector<T>> &A)
{
	int S = 0;
	for (auto x = 0; x < size(A); ++x)
		for (auto y = 0; y < size(A[0]); ++y)
			// usleep(0.000001);
			S += A[x][y];
	return S;
}


int main(int argc, char *argv[])
{
	auto L = atoi(argv[1]);
	std::vector<std::vector<int>> A(L, std::vector<int>(L, 1));
	for (auto x = 0; x < size(A); ++x)
		for (auto y = 0; y < size(A[0]); ++y)
			A[x][y] = 1;

	auto N = 10;
	auto t_total = std::chrono::duration<double>(0);
	for (auto i = 1; i <= N; ++i)
	{
		auto t_start = std::chrono::steady_clock::now();
		sum(A);
		auto t_end = std::chrono::steady_clock::now();
		t_total += t_end - t_start;
		std::cout << i << std::endl;
	}

	t_total /= N;
	std::cout << "L = " << L << std::endl;
	std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count() << " ns\n" << std::endl;
	std::cout << "sum = " << sum(A) << std::endl;
	return 0;
}

Even using random arrays, which clearly cannot be computed at compile time (or can they?), does not solve the problem. Here is the example:

#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <functional>

std::mt19937::result_type seed = 0;
auto uniform_rand = std::bind(std::uniform_real_distribution<double>(0,1), std::mt19937(seed));


template <typename T>
int sum(std::vector<std::vector<T>> &A)
{
	int S = 0;
	for (auto x = 0; x < size(A); ++x)
		for (auto y = 0; y < size(A[0]); ++y)
			S += A[x][y];
	return S;
}


int main()
{
	for (auto L : {10, 100, 1000, 10000})
	{

		auto N = 10;
		auto t_total = std::chrono::duration<double>(0);
		for (auto i = 1; i <= N; ++i)
		{
			std::vector<std::vector<double>> A(L, std::vector<double>(L, 1));
			for (auto x = 0; x < size(A); ++x)
				for (auto y = 0; y < size(A[0]); ++y)
					A[x][y] = uniform_rand();
			auto t_start = std::chrono::steady_clock::now();
			sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
		}

		t_total /= N;
		std::cout << "L = " << L << std::endl;
		std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count()  << " ns\n" << std::endl;
	}
	return 0;
}

returns:

L = 10
  t = 4106 ns

L = 100
  t = 4805 ns

L = 1000
  t = 5978 ns

L = 10000
  t = 5391 ns

If the compiler is clever, it will not evaluate the sum at all, since the result is never used.

Try printing the sum
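
Or, if you don't want to print on every iteration, you can write the result to a volatile sink so the optimizer cannot discard the call. A minimal sketch of that variation (my own illustration, not code from the thread):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Naive double loop over a vector of vectors, result returned to the caller.
int sum(const std::vector<std::vector<int>> &A)
{
	int S = 0;
	for (std::size_t x = 0; x < A.size(); ++x)
		for (std::size_t y = 0; y < A[x].size(); ++y)
			S += A[x][y];
	return S;
}

int main()
{
	const int L = 1000;
	std::vector<std::vector<int>> A(L, std::vector<int>(L, 1));

	volatile int sink = 0;  // a store to a volatile object cannot be optimized away
	auto t_start = std::chrono::steady_clock::now();
	sink = sum(A);          // the result is now observable, so the call must run
	auto t_end = std::chrono::steady_clock::now();

	std::cout << "t = "
		<< std::chrono::duration_cast<std::chrono::nanoseconds>(t_end - t_start).count()
		<< " ns (sum = " << sink << ")\n";
	return 0;
}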

Yeah, that was the case here. The following example works:

#include <iostream>
#include <cstdlib>

#include <vector>
#include <chrono>

template <typename T>
int sum(std::vector<std::vector<T>> &A)
{
	int S = 0;
	for (auto x = 0; x < size(A); ++x)
		for (auto y = 0; y < size(A[0]); ++y)
			S += A[x][y];
	return S;
}


int main()
{
	for (auto L : {10, 100, 1000, 10000})
	{
		std::vector<std::vector<short int>> A(L, std::vector<short int>(L,1));
		for (auto x = 0; x < size(A); ++x)
			for (auto y = 0; y < size(A[0]); ++y)
				A[x][y] = 1;

		auto N = 100;
		auto t_total = std::chrono::duration<double>(0);
		for (auto i = 1; i <= N; ++i)
		{
			int S = 0;
			auto t_start = std::chrono::steady_clock::now();
			S = sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
			std::cout << S << std::endl;
		}

		t_total /= N;
		std::cout << "L = " << L << std::endl;
		std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count()  << " ns\n" << std::endl;
	}
	return 0;
}

And produces (after removing the printed sums from the output :wink:):

L = 10
  t = 4514 ns

L = 100
  t = 12138 ns

L = 1000
  t = 474792 ns

L = 10000
  t = 53492100 ns

This is still somewhat faster than Julia, but now the times are in the same range…

You are summing Int8 in Julia and Int32 in cpp

Ok, so this manually defined version of the sum is much faster in Julia:

using BenchmarkTools

function mysum(A) 
	S = zero(eltype(A))
	@simd for i ∈ eachindex(A) 
		@inbounds S += A[i] 
	end
	return S 
end

for L ∈ [10, 100, 1000, 10000]
	@show L
	A = fill(Int8(1), L, L)

	@btime sum($A)
	@btime mysum($A)
end

results in

L = 10
  120.125 ns (0 allocations: 0 bytes)
  24.963 ns (0 allocations: 0 bytes)
L = 100
  7.543 μs (0 allocations: 0 bytes)
  478.913 ns (0 allocations: 0 bytes)
L = 1000
  724.961 μs (0 allocations: 0 bytes)
  116.496 μs (0 allocations: 0 bytes)
L = 10000
  75.416 ms (0 allocations: 0 bytes)
  17.051 ms (0 allocations: 0 bytes)

And using int8_t in the C++ code gives:

L = 10
  t = 4237 ns

L = 100
  t = 8573 ns

L = 1000
  t = 354.543 μs

L = 10000
  t = 34.411147 ms
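
The int8_t code itself is not shown above; it is essentially the previous program with the element type swapped, roughly like the following sketch (here with a volatile sink instead of printing every sum):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

// Same benchmark, but with 8-bit elements to match the Julia Int8 array.
// The accumulator stays int, so the totals do not overflow.
int sum(const std::vector<std::vector<std::int8_t>> &A)
{
	int S = 0;
	for (std::size_t x = 0; x < A.size(); ++x)
		for (std::size_t y = 0; y < A[x].size(); ++y)
			S += A[x][y];
	return S;
}

int main()
{
	for (auto L : {10, 100, 1000, 10000})
	{
		std::vector<std::vector<std::int8_t>> A(L, std::vector<std::int8_t>(L, 1));

		const int N = 100;
		auto t_total = std::chrono::duration<double>(0);
		volatile int sink = 0;  // keeps the optimizer from dropping the calls
		for (int i = 1; i <= N; ++i)
		{
			auto t_start = std::chrono::steady_clock::now();
			sink = sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
		}
		t_total /= N;

		std::cout << "L = " << L << ", sum = " << sink << ", t = "
			<< std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count()
			<< " ns\n";
	}
	return 0;
}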

I think the Julia summation algorithm is more sophisticated and tries to reduce rounding errors.
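
I believe Base.sum uses pairwise summation on arrays: the range is split in half, each half is summed recursively, and the two partial sums are added, so the rounding error grows roughly like log(n) instead of n. A rough C++ sketch of the idea (just an illustration, not Julia's actual implementation):

#include <cstddef>
#include <iostream>
#include <vector>

// Pairwise (recursive) summation: split the range in half, sum each half,
// then add the two partial sums. Small blocks fall back to a plain loop.
double pairwise_sum(const std::vector<double> &v, std::size_t lo, std::size_t hi)
{
	if (hi - lo <= 128)
	{
		double s = 0.0;
		for (std::size_t i = lo; i < hi; ++i)
			s += v[i];
		return s;
	}
	const std::size_t mid = lo + (hi - lo) / 2;
	return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid, hi);
}

int main()
{
	std::vector<double> v(1000000, 0.1);
	std::cout << pairwise_sum(v, 0, v.size()) << "\n";  // close to 100000
	return 0;
}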

That could well be. However, in a hot loop, sometimes the faster way is the way to go :wink: