Compare julia sum to a cpp implementation - julia is extremely slow?!

Not for integers though.

That said, while comparisons are always interesting, summing an array is an O(1) operation that most mature compilers optimize to death, and interpreted languages have built-ins for. So it may not be the most informative benchmark for real-life use.

True, then it's interesting why Julia is slower.

Have you tried starting Julia with -O3 as well?

When this happens, your instinct should be to start questioning the C++ code, not the Julia code… Clearly, if the time measured to sum something doesn’t change with the size of the thing you sum, something is off.

They don’t give the same answer though. Julia widens in sum to prevent overflow:

julia> mysum(A)
-117 # Int8

julia> sum(A)
1376139 # Int64

see the docstring for sum:

The return type is Int for signed integers of less than system word size, and UInt for unsigned integers of less than system word size.

Keeping things in the Int8 domain means that you can fit a lot more numbers in each SIMD lane, which will dramatically speed things up.

When benchmarking:
1) Be careful you are actually measuring what you want to measure.
2) Make sure that the things you compare compute the same thing.


Summing an array is O(n) though?


Sorry, my fault. With S = 0 as a starting point I now get the following results:

L = 10
  139.358 ns (0 allocations: 0 bytes)
  102.533 ns (0 allocations: 0 bytes)
L = 100
  7.543 μs (0 allocations: 0 bytes)
  6.099 μs (0 allocations: 0 bytes)
L = 1000
  722.445 μs (0 allocations: 0 bytes)
  635.842 μs (0 allocations: 0 bytes)
L = 10000
  75.421 ms (0 allocations: 0 bytes)
  65.100 ms (0 allocations: 0 bytes)

Yeah, thanks, I’ve tried but failed at this particular point :wink:

Yes, I am always using the -O3 flag…

I challenge you to find any workload that will be faster with -O3 than the default. Since the SLP optimization pass got moved from -O3 to -O2 (the default), there is almost no point to -O3.


Ok, nice to know.

Yes, this is also a question for me here.

sum dispatches to a generic mapreduce that does some blocking in the evaluation which has a tiny bit of overhead:

julia> @btime sum($A)
  30.160 μs (0 allocations: 0 bytes)

julia> @btime mysum($A)
  29.863 μs (0 allocations: 0 bytes)

Could you re-post/consolidate the MWE where Julia is still summing more slowly than the C++? It’s gone through so many iterations that it’s no longer clear if or how things are still slower.


Right, thanks for the correction.

Yes, I finally got some time to test it further. I simplified to a 1D array, as it turns out to be the case for them as well.

The julia code is:

using BenchmarkTools

function mysum(A)
	S = 0
	@simd for i ∈ eachindex(A)
		@inbounds S += A[i]
	end
	return S
end

for L ∈ [1_000, 10_000, 100_000, 1_000_000]
	@show L
	A = fill(Int8(1), L)

	@assert sum(A) == mysum(A)
	@btime sum($A)
	@btime mysum($A)
end

and the cpp code is:

#include <iostream>
#include <cstdlib>
#include <cstdint>
#include <cstddef>

#include <vector>
#include <chrono>

template <typename T>
int sum(std::vector<T> &A) {
	int S = 0;
	for (std::size_t x = 0; x < A.size(); ++x)
		S += A[x];
	return S;
}

int main() {
	for (auto L : {1000, 10000, 100000, 1000000}) {
		std::vector<int8_t> A(L, 1);
		for (std::size_t x = 0; x < A.size(); ++x)
			A[x] = 1;

		auto N = 1000;
		auto t_total = std::chrono::duration<double>(0);
		for (auto i = 1; i <= N; ++i) {
			int S = 0;
			auto t_start = std::chrono::steady_clock::now();
			S = sum(A);
			auto t_end = std::chrono::steady_clock::now();
			t_total += t_end - t_start;
			std::cerr << S << std::endl;
		}

		t_total /= N;
		std::cout << "L = " << L << std::endl;
		std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count() << " ns\n" << std::endl;
	}
	return 0;
}

This gives the following times on my machine (Julia version 1.3.0-rc2, gcc 9.1.0):

function:  julia sum    julia mysum   cpp
L = 1E3    0.647 μs     0.632 μs      0.467 μs
L = 1E4    7.543 μs     6.099 μs      6.655 μs
L = 1E5    73.753 μs    63.695 μs     30.370 μs
L = 1E6    725.797 μs   638.634 μs    469.401 μs

It is clear that cpp outperforms for larger L, but sum and mysum are close, with mysum a bit faster, as expected from the @simd macro.

I hope this helps…


Interesting. On my machine with both g++ (9.2.0) and clang++ (10.0.1) with flags -std=c++11 -O3, I’m seeing Julia faster across the board with your script.

function:  julia sum    julia mysum   cpp (clang)   cpp (g++)
L = 1E3    0.061 μs     0.064 μs      0.136 μs      0.184 μs
L = 1E4    0.728 μs     0.599 μs      0.909 μs      1.336 μs
L = 1E5    6.637 μs     5.707 μs      8.418 μs      12.773 μs
L = 1E6    65.766 μs    58.442 μs     70.558 μs     102.826 μs

Edit: although note that @btime prints the minimum time whereas your C++ program is computing the average time. Changing it to mean(@benchmark(...)) only increases the Julia numbers by about 10%, leaving the conclusion the same.

What CPU are you on? This is on an i7-9750H.


Julia Int and C int are different. Consider using either long int or Int32.


Yeah, again I was too hasty with the comparison. Now, when using long int and taking the mean value of the time in Julia, I finally have these results:

function:  julia sum    julia mysum   cpp
L = 1E3    0.678 μs     0.662 μs      0.582 μs
L = 1E4    8.396 μs     6.706 μs      18.32 μs
L = 1E5    77.10 μs     67.13 μs      142.2 μs
L = 1E6    751.0 μs     670.3 μs      1435 μs

This is nice to see for a Julia user, but now I wonder why cpp is so much slower here? :wink: I don’t see a reason for it, but maybe someone still sees some “bugs” in the cpp code…

The times are measured on an Intel(R) Pentium(R) CPU N3530 @ 2.16GHz.