Compare julia sum to a cpp implementation - julia is extremely slow?!

Not for integers though.

That said, while comparisons are always interesting, summing an array is an O(1) operation that most mature compilers optimize to death, and interpreted languages have built-ins for. So it may not be the most informative benchmark for real-life use.

True, then its interesting, why Julia is slower.

Have you tried starting julia with O3 as well?

When this happens, your instinct should be to start question the C++ code, not the Julia code… Clearly, if the time measured to sum something doesn’t change with the size of the thing you sum, something is off.

They don’t give the same answer though. Julia widens in sum to prevent overflow:

julia> mysum(A)
-117 # Int8

julia> sum(A)
1376139 # Int64

see the docstring for sum:

The return type is Int for signed integers of less than system word size, and UInt for unsigned integers of less than system word size.

Keeping things in the Int8-domain means that you can fit a lot more numbers in each simd lane which will dramatically speed things up.

When benchmarking:
1 ) Be careful you are actually measuring what you want to measure.
2) Make sure that the things you compare compute the same thing.

6 Likes

Summing an array is O(n) though?

3 Likes

Sorry, my fault. With S = 0 as a starting point I now get the following results:

L = 10
  139.358 ns (0 allocations: 0 bytes)
  102.533 ns (0 allocations: 0 bytes)
L = 100
  7.543 μs (0 allocations: 0 bytes)
  6.099 μs (0 allocations: 0 bytes)
L = 1000
  722.445 μs (0 allocations: 0 bytes)
  635.842 μs (0 allocations: 0 bytes)
L = 10000
  75.421 ms (0 allocations: 0 bytes)
  65.100 ms (0 allocations: 0 bytes)

Yeah, thanks, I’ve tried but failed at this particular points :wink:

Yes, I am always with -O3 flag…

I challenge you to find any workload that will be faster with -O3 than the default. Since the SLP optimization pass got moved from -O3 to -O2 (default) there is almost no point of -O3.

3 Likes

Ok, nice to know.

Yes, this is also a question for me here.

sum dispatches to a generic mapreduce that does some blocking in the evaluation which has a tiny bit of overhead:

julia> @btime sum($A)
  30.160 μs (0 allocations: 0 bytes)
1377411

julia> @btime mysum($A)
  29.863 μs (0 allocations: 0 bytes)
1377411
1 Like

Could you re-post/consolidate the MWE where Julia is still summing slower the C++? It’s gone through so many iterations that it’s no longer clear if or how things are still slower.

2 Likes

Right, thanks for the correction.

Yes, I finally got some time to test it further. I simplified to 1d array as it turns out to be the case for them as well.

The julia code is:

using BenchmarkTools

function mysum(A) 
	S = 0
	@simd for i ∈ eachindex(A) 
		@inboundsS += A[i] 
	end
	return S 
end

for L ∈ [1_000, 10_000, 100_000, 1000_000]
	@show L
	A = fill(Int8(1), L)

	@assert sum(A) == mysum(A)
	@btime sum($A)
	@btime mysum($A)
end

and the cpp code is:

#include <iostream>
#include <cstdlib>

#include <vector>
#include <chrono>

template <typename T>
int sum(std::vector<T> &A)
{
	int S = 0;
	for (auto x = 0; x < size(A); ++x)
			S += A[x];
	return S;
}


int main()
{
	for (auto L : {1000, 10000, 100000, 1000000})
	{
		std::vector<int8_t> A(L, 1);
			for (auto x = 0; x < size(A); ++x)
					A[x] = 1;

		auto N = 1000;
		auto t_total = std::chrono::duration<double>(0);
		for (auto i = 1; i <= N; ++i)
		{
			int S = 0;
			auto t_start = std::chrono::steady_clock::now();
			S = sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
			std::cerr << S << std::endl;
		}

		t_total /= N;
		std::cout << "L = " << L << std::endl;
		std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count()  << " ns\n" << std::endl;
	}
	return 0;
}

This gives the following times on my machine (julia version 1.3.0-rc2, gcc 9.1.0)

function julia sum julia mysum cpp
L = 1E3 0.647 μs 0.632 μs 0.467 μs
L = 1E4 7.543 μs 6.099 μs 6.655 μs
L = 1E5 73.753 μs 63.695 μs 30.370 μs
L = 1E6 725.797 μs 638.634 μs 469.401 μs

It is clearly that cpp outperforms for larger L but sum and mysum are close while mysum is a bit faster as expected from @smid macro.

I hope this helps…

1 Like

Interesting. On my machine with both g++ (9.2.0) and clang++ (10.0.1) with flags -std=c++11 -O3, I’m seeing Julia faster across the board with your script.

function julia sum julia mysum cpp (clang) cpp (g++)
L = 1E3 0.061 μs 0.064 μs 0.136 μs 0.184 μs
L = 1E4 0.728 μs 0.599 μs 0.909 μs 1.336 μs
L = 1E5 6.637 μs 5.707 μs 8.418 μs 12.773 μs
L = 1E6 65.766 μs 58.442 μs 70.558 μs 102.826 μs

Edit: although note that @btime prints the minimum time whereas your C++ program is computing the average time. Changing it to mean(@benchmark(...)) only increases the Julia numbers by about 10%, leaving the conclusion the same.

What CPU are you on? This is on an i7-9750H.

1 Like

Julia Int and C int are different. Consider using either long int or Int32.

2 Likes

Yeah, again I was too hasty about the comparison. Now, whne using long int and taking the mean value of the time in julia as

I finally have these results:

function julia sum julia mysum cpp
L = 1E3 0.678 μs 0.662 μs 0.582 μs
L = 1E4 8.396 μs 6.706 μs 18.32 μs
L = 1E5 77.10 μs 67.13 μs 142.2 μs
L = 1E6 751.0 μs 670.3 μs 1435 μs

This is nice to see for a julia user but I now wonder why cpp is so much slower here? :wink: I don’t see a reason for it but maybe someone still sees some ā€œbugsā€ in the cpp code…

The times are measured on a Intel(R) Pentium(R) CPU N3530 @ 2.16GHz.