Compare julia sum to a cpp implementation - julia is extremely slow?!

Not for integers though.

That said, while comparisons are always interesting, summing an array is an O(1) operation that most mature compilers optimize to death, and interpreted languages have built-ins for. So it may not be the most informative benchmark for real-life use.

True, then it's interesting why Julia is slower.

Have you tried starting Julia with -O3 as well?

When this happens, your instinct should be to start questioning the C++ code, not the Julia code… Clearly, if the time measured to sum something doesn't change with the size of the thing you sum, something is off.

They don't give the same answer though. Julia widens in sum to prevent overflow:

julia> mysum(A)
-117 # Int8

julia> sum(A)
1376139 # Int64

see the docstring for sum:

The return type is Int for signed integers of less than system word size, and UInt for unsigned integers of less than system word size.

Keeping things in the Int8 domain means that you can fit a lot more numbers in each SIMD lane, which will dramatically speed things up.

When benchmarking:
1) Be careful that you are actually measuring what you want to measure.
2) Make sure that the things you compare compute the same thing.

6 Likes

Summing an array is O(n) though?

3 Likes

Sorry, my fault. With S = 0 as a starting point I now get the following results:

L = 10
  139.358 ns (0 allocations: 0 bytes)
  102.533 ns (0 allocations: 0 bytes)
L = 100
  7.543 μs (0 allocations: 0 bytes)
  6.099 μs (0 allocations: 0 bytes)
L = 1000
  722.445 μs (0 allocations: 0 bytes)
  635.842 μs (0 allocations: 0 bytes)
L = 10000
  75.421 ms (0 allocations: 0 bytes)
  65.100 ms (0 allocations: 0 bytes)

Yeah, thanks, I've tried but failed at this particular point :wink:

Yes, I am always using the -O3 flag…

I challenge you to find any workload that will be faster with -O3 than the default. Since the SLP vectorization pass got moved from -O3 to -O2 (the default), there is almost no point to -O3.

3 Likes

Ok, nice to know.

Yes, this is also a question for me here.

sum dispatches to a generic mapreduce that does some blocking in the evaluation, which has a tiny bit of overhead:

julia> @btime sum($A)
  30.160 μs (0 allocations: 0 bytes)
1377411

julia> @btime mysum($A)
  29.863 μs (0 allocations: 0 bytes)
1377411
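The blocking mentioned here comes from Julia's sum being built on a pairwise (divide-and-conquer) reduction, which for floating-point inputs also keeps rounding error small. A minimal C++ sketch of the idea — not Julia's actual implementation, and the block size of 128 is an arbitrary choice:

```cpp
#include <cstddef>
#include <vector>

// Pairwise reduction: below the block size, sum sequentially; above it,
// split the range in half and recurse, then add the two partial sums.
double pairwise_sum(const std::vector<double>& A, std::size_t lo, std::size_t hi,
                    std::size_t block = 128)
{
    if (hi - lo <= block) {
        double s = 0.0;
        for (std::size_t i = lo; i < hi; ++i)
            s += A[i];
        return s;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    return pairwise_sum(A, lo, mid, block) + pairwise_sum(A, mid, hi, block);
}
```

Called as `pairwise_sum(A, 0, A.size())`; the recursive bookkeeping is the small overhead measured above.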
1 Like

Could you re-post/consolidate the MWE where Julia is still summing slower than C++? It's gone through so many iterations that it's no longer clear if or how things are still slower.

2 Likes

Right, thanks for the correction.

Yes, I finally got some time to test it further. I simplified to a 1D array, since the same behavior shows up there as well.

The julia code is:

using BenchmarkTools

function mysum(A)
	S = 0
	@simd for i ∈ eachindex(A)
		@inbounds S += A[i]
	end
	return S
end

for L ∈ [1_000, 10_000, 100_000, 1_000_000]
	@show L
	A = fill(Int8(1), L)

	@assert sum(A) == mysum(A)
	@btime sum($A)
	@btime mysum($A)
end

and the cpp code is:

#include <iostream>
#include <cstddef>
#include <cstdint>

#include <vector>
#include <chrono>

template <typename T>
int sum(std::vector<T> &A)
{
	int S = 0;
	for (std::size_t x = 0; x < A.size(); ++x)
		S += A[x];
	return S;
}


int main()
{
	for (auto L : {1000, 10000, 100000, 1000000})
	{
		std::vector<int8_t> A(L, 1);

		auto N = 1000;
		auto t_total = std::chrono::duration<double>(0);
		for (auto i = 1; i <= N; ++i)
		{
			int S = 0;
			auto t_start = std::chrono::steady_clock::now();
			S = sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
			std::cerr << S << std::endl;
		}

		t_total /= N;
		std::cout << "L = " << L << std::endl;
		std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count()  << " ns\n" << std::endl;
	}
	return 0;
}

This gives the following times on my machine (julia version 1.3.0-rc2, gcc 9.1.0)

| function | julia sum | julia mysum | cpp |
|---|---|---|---|
| L = 1E3 | 0.647 μs | 0.632 μs | 0.467 μs |
| L = 1E4 | 7.543 μs | 6.099 μs | 6.655 μs |
| L = 1E5 | 73.753 μs | 63.695 μs | 30.370 μs |
| L = 1E6 | 725.797 μs | 638.634 μs | 469.401 μs |

It is clear that cpp outperforms for larger L, while sum and mysum are close, with mysum a bit faster, as expected from the @simd macro.

I hope this helps…

1 Like

Interesting. On my machine with both g++ (9.2.0) and clang++ (10.0.1) with flags -std=c++11 -O3, I'm seeing Julia faster across the board with your script.

| function | julia sum | julia mysum | cpp (clang) | cpp (g++) |
|---|---|---|---|---|
| L = 1E3 | 0.061 μs | 0.064 μs | 0.136 μs | 0.184 μs |
| L = 1E4 | 0.728 μs | 0.599 μs | 0.909 μs | 1.336 μs |
| L = 1E5 | 6.637 μs | 5.707 μs | 8.418 μs | 12.773 μs |
| L = 1E6 | 65.766 μs | 58.442 μs | 70.558 μs | 102.826 μs |

Edit: although note that @btime prints the minimum time whereas your C++ program is computing the average time. Changing it to mean(@benchmark(...)) only increases the Julia numbers by about 10%, leaving the conclusion the same.

What CPU are you on? This is on an i7-9750H.

1 Like

Julia Int and C int are different. Consider using either long int or Int32.

2 Likes

Yeah, again I was too hasty about the comparison. Now, when using long int and taking the mean value of the time in Julia, I finally have these results:

| function | julia sum | julia mysum | cpp |
|---|---|---|---|
| L = 1E3 | 0.678 μs | 0.662 μs | 0.582 μs |
| L = 1E4 | 8.396 μs | 6.706 μs | 18.32 μs |
| L = 1E5 | 77.10 μs | 67.13 μs | 142.2 μs |
| L = 1E6 | 751.0 μs | 670.3 μs | 1435 μs |

This is nice to see for a Julia user, but I now wonder why cpp is so much slower here? :wink: I don't see a reason for it, but maybe someone still sees some "bugs" in the cpp code…

The times are measured on an Intel(R) Pentium(R) CPU N3530 @ 2.16GHz.