Compare julia sum to a cpp implementation - julia is extremely slow?!

Not for integers though.

That said, while comparisons are always interesting, summing an array is an O(1) operation that most mature compilers optimize to death, and interpreted languages have built-ins for. So it may not be the most informative benchmark for real-life use.

True, then it's interesting why Julia is slower.

Have you tried starting Julia with -O3 as well?

When this happens, your instinct should be to start questioning the C++ code, not the Julia code… Clearly, if the time measured to sum something doesn’t change with the size of the thing you sum, something is off.

They don’t give the same answer though. Julia widens in sum to prevent overflow:

julia> mysum(A)
-117 # Int8

julia> sum(A)
1376139 # Int64

see the docstring for sum:

The return type is Int for signed integers of less than system word size, and UInt for unsigned integers of less than system word size.

Keeping things in the Int8 domain means that you can fit a lot more numbers in each SIMD lane, which will dramatically speed things up.

When benchmarking:
1) Be careful you are actually measuring what you want to measure.
2) Make sure that the things you compare compute the same thing.


Summing an array is O(n) though?


Sorry, my fault. With S = 0 as a starting point I now get the following results:

L = 10
  139.358 ns (0 allocations: 0 bytes)
  102.533 ns (0 allocations: 0 bytes)
L = 100
  7.543 μs (0 allocations: 0 bytes)
  6.099 μs (0 allocations: 0 bytes)
L = 1000
  722.445 μs (0 allocations: 0 bytes)
  635.842 μs (0 allocations: 0 bytes)
L = 10000
  75.421 ms (0 allocations: 0 bytes)
  65.100 ms (0 allocations: 0 bytes)

Yeah, thanks, I’ve tried but failed at this particular point :wink:

Yes, I am always using the -O3 flag…

I challenge you to find any workload that will be faster with -O3 than the default. Since the SLP optimization pass got moved from -O3 to -O2 (the default), there is almost no point to -O3.


Ok, nice to know.

Yes, this is also a question for me here.

sum dispatches to a generic mapreduce that does some blocking in the evaluation which has a tiny bit of overhead:

julia> @btime sum($A)
  30.160 μs (0 allocations: 0 bytes)

julia> @btime mysum($A)
  29.863 μs (0 allocations: 0 bytes)

Could you re-post/consolidate the MWE where Julia is still summing more slowly than the C++? It’s gone through so many iterations that it’s no longer clear if or how things are still slower.


Right, thanks for the correction.

Yes, I finally got some time to test it further. I simplified to a 1D array, as it turns out to be the case for them as well.

The julia code is:

using BenchmarkTools

function mysum(A)
	S = 0
	@simd for i ∈ eachindex(A)
		@inbounds S += A[i]
	end
	return S
end

for L ∈ [1_000, 10_000, 100_000, 1_000_000]
	@show L
	A = fill(Int8(1), L)

	@assert sum(A) == mysum(A)
	@btime sum($A)
	@btime mysum($A)
end

and the cpp code is:

#include <iostream>
#include <cstdlib>
#include <cstdint>
#include <cstddef>

#include <vector>
#include <chrono>

template <typename T>
int sum(std::vector<T> &A) {
	int S = 0;
	for (std::size_t x = 0; x < A.size(); ++x)
		S += A[x];
	return S;
}

int main() {
	for (auto L : {1000, 10000, 100000, 1000000}) {
		std::vector<int8_t> A(L, 1);
		for (std::size_t x = 0; x < A.size(); ++x)
			A[x] = 1;

		auto N = 1000;
		auto t_total = std::chrono::duration<double>(0);
		for (auto i = 1; i <= N; ++i) {
			int S = 0;
			auto t_start = std::chrono::steady_clock::now();
			S = sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
			std::cerr << S << std::endl;
		}

		t_total /= N;
		std::cout << "L = " << L << std::endl;
		std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count() << " ns\n" << std::endl;
	}
	return 0;
}

This gives the following times on my machine (Julia version 1.3.0-rc2, gcc 9.1.0):

function:  julia sum    julia mysum   cpp
L = 1E3    0.647 μs     0.632 μs      0.467 μs
L = 1E4    7.543 μs     6.099 μs      6.655 μs
L = 1E5    73.753 μs    63.695 μs     30.370 μs
L = 1E6    725.797 μs   638.634 μs    469.401 μs

It is clear that cpp outperforms for larger L, but sum and mysum are close, with mysum a bit faster, as expected from the @simd macro.

I hope this helps…


Interesting. On my machine with both g++ (9.2.0) and clang++ (10.0.1) with flags -std=c++11 -O3, I’m seeing Julia faster across the board with your script.

function:  julia sum    julia mysum   cpp (clang)   cpp (g++)
L = 1E3    0.061 μs     0.064 μs      0.136 μs      0.184 μs
L = 1E4    0.728 μs     0.599 μs      0.909 μs      1.336 μs
L = 1E5    6.637 μs     5.707 μs      8.418 μs      12.773 μs
L = 1E6    65.766 μs    58.442 μs     70.558 μs     102.826 μs

Edit: although note that @btime prints the minimum time whereas your C++ program is computing the average time. Changing it to mean(@benchmark(...)) only increases the Julia numbers by about 10%, leaving the conclusion the same.

What CPU are you on? This is on an i7-9750H.


Julia Int and C int are different. Consider using either long int or Int32.


Yeah, again I was too hasty with the comparison. Now, when using long int and taking the mean value of the time in Julia, I finally have these results:

function:  julia sum    julia mysum   cpp
L = 1E3    0.678 μs     0.662 μs      0.582 μs
L = 1E4    8.396 μs     6.706 μs      18.32 μs
L = 1E5    77.10 μs     67.13 μs      142.2 μs
L = 1E6    751.0 μs     670.3 μs      1435 μs

This is nice to see for a Julia user, but now I wonder why cpp is so much slower here? :wink: I don’t see a reason for it, but maybe someone still sees some “bugs” in the cpp code…

The times are measured on an Intel(R) Pentium(R) CPU N3530 @ 2.16GHz.