That said, while comparisons are always interesting, summing an array is an O(n) operation that most mature compilers optimize to death, and that interpreted languages have built-ins for. So it may not be the most informative benchmark for real-life use.
When this happens, your instinct should be to start questioning the C++ code, not the Julia code… Clearly, if the measured time to sum something doesn't change with the size of the thing you sum, something is off.
They don't give the same answer, though. Julia widens in sum to prevent overflow:
The return type is Int for signed integers of less than system word size, and UInt for unsigned integers of less than system word size.
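For example, a minimal illustration of the difference (the values here are chosen just for demonstration, they are not from the benchmark above): summing an Int8 vector in Julia accumulates into and returns an Int, whereas arithmetic kept in Int8 wraps around:

A = fill(Int8(100), 3)

sum(A)                              # 300 -- widened to Int, no overflow
typeof(sum(A))                      # Int64 on a 64-bit machine
Int8(100) + Int8(100) + Int8(100)   # 44 -- Int8 arithmetic wraps around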
Keeping things in the Int8 domain means that you can fit a lot more numbers in each SIMD register, which will dramatically speed things up.
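As a sketch of what that would look like on the Julia side (mysum_narrow is a hypothetical name, not code from this thread): an Int8 accumulator lets the compiler pack 16 (SSE) or 32 (AVX2) values per register instead of 2 or 4 Int64 values, at the price of wrap-around if the true sum does not fit in Int8.

# Hypothetical narrow-accumulator variant: only valid when the exact sum
# fits in Int8, or when modular (wrap-around) arithmetic is acceptable.
function mysum_narrow(A::AbstractVector{Int8})
    S = zero(Int8)                # accumulator stays in the Int8 domain
    @simd for i ∈ eachindex(A)
        @inbounds S += A[i]       # Int8 + Int8 wraps instead of widening
    end
    return S
end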
When benchmarking:
1) Be careful that you are actually measuring what you want to measure.
2) Make sure that the things you compare compute the same thing.
I challenge you to find any workload that will be faster with -O3 than with the default. Since the SLP optimization pass was moved from -O3 to -O2 (the default), there is almost no point in using -O3.
Could you re-post/consolidate the MWE where Julia is still summing slower than C++? It's gone through so many iterations that it's no longer clear if or how things are still slower.
Yes, I finally got some time to test it further. I simplified it to a 1D array, since the same behavior turns out to show up there as well.
The Julia code is:
using BenchmarkTools

function mysum(A)
    S = 0
    @simd for i ∈ eachindex(A)
        @inbounds S += A[i]
    end
    return S
end

for L ∈ [1_000, 10_000, 100_000, 1_000_000]
    @show L
    A = fill(Int8(1), L)
    @assert sum(A) == mysum(A)
    @btime sum($A)
    @btime mysum($A)
end
and the C++ code is:
#include <iostream>
#include <cstdlib>
#include <vector>
#include <chrono>

template <typename T>
int sum(std::vector<T> &A)
{
    int S = 0;
    for (std::size_t x = 0; x < A.size(); ++x)
        S += A[x];
    return S;
}

int main()
{
    for (auto L : {1000, 10000, 100000, 1000000})
    {
        std::vector<int8_t> A(L, 1);
        for (std::size_t x = 0; x < A.size(); ++x)
            A[x] = 1;

        auto N = 1000;
        auto t_total = std::chrono::duration<double>(0);
        for (auto i = 1; i <= N; ++i)
        {
            int S = 0;
            auto t_start = std::chrono::steady_clock::now();
            S = sum(A);
            auto t_end = std::chrono::steady_clock::now();
            t_total += t_end - t_start;
            std::cerr << S << std::endl; // use S so the compiler cannot optimize the sum away
        }
        t_total /= N;

        std::cout << "L = " << L << std::endl;
        std::cout << "  t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count() << " ns\n" << std::endl;
    }
    return 0;
}
This gives the following times on my machine (Julia 1.3.0-rc2, gcc 9.1.0):
function    julia sum      julia mysum    cpp
L = 1E3     0.647 μs       0.632 μs       0.467 μs
L = 1E4     7.543 μs       6.099 μs       6.655 μs
L = 1E5     73.753 μs      63.695 μs      30.370 μs
L = 1E6     725.797 μs     638.634 μs     469.401 μs
It is clear that C++ outperforms Julia for larger L, but sum and mysum are close, with mysum a bit faster, as expected from the @simd macro.
Interesting. On my machine, with both g++ (9.2.0) and clang++ (10.0.1) and the flags -std=c++11 -O3, I'm seeing Julia faster across the board with your script.
function    julia sum      julia mysum    cpp (clang)    cpp (g++)
L = 1E3     0.061 μs       0.064 μs       0.136 μs       0.184 μs
L = 1E4     0.728 μs       0.599 μs       0.909 μs       1.336 μs
L = 1E5     6.637 μs       5.707 μs       8.418 μs       12.773 μs
L = 1E6     65.766 μs      58.442 μs      70.558 μs      102.826 μs
Edit: although note that @btime prints the minimum time whereas your C++ program is computing the average time. Changing it to mean(@benchmark(...)) only increases the Julia numbers by about 10%, leaving the conclusion the same.
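For reference, a rough sketch of that change (the array A and its length here are placeholders, not the exact benchmark setup):

using BenchmarkTools, Statistics

A = fill(Int8(1), 10_000)
t = @benchmark sum($A)   # collects the full distribution of sample times
minimum(t)               # what @btime reports
mean(t)                  # average time, comparable to the C++ timing loop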
Yeah, again I was too hasty about the comparison. Now, when using long int in the C++ code and taking the mean value of the time in Julia as mean(@benchmark(...)),
I finally have these results:
function    julia sum      julia mysum    cpp
L = 1E3     0.678 μs       0.662 μs       0.582 μs
L = 1E4     8.396 μs       6.706 μs       18.32 μs
L = 1E5     77.10 μs       67.13 μs       142.2 μs
L = 1E6     751.0 μs       670.3 μs       1435 μs
This is nice to see for a Julia user, but now I wonder why C++ is so much slower here? I don't see a reason for it, but maybe someone still sees some "bugs" in the C++ code…
The times were measured on an Intel(R) Pentium(R) CPU N3530 @ 2.16GHz.