Hello, I am testing the speed of a simple summation function over a 2D array in C++ and Julia.
However, I see that the Julia code becomes slower as the array grows, while the C++ code stays constant in time. Could someone test it and maybe find my mistake in the C++ time measurement, or whatever else is going on…
julia> using BenchmarkTools
julia> for L ∈ [10, 100, 1000, 10000]
           @show L
           A = fill(Int8(1), L, L)
           @btime sum($A)
           @show sum(A)
       end
L = 10
120.125 ns (0 allocations: 0 bytes)
sum(A) = 100
L = 100
7.543 μs (0 allocations: 0 bytes)
sum(A) = 10000
L = 1000
727.194 μs (0 allocations: 0 bytes)
sum(A) = 1000000
L = 10000
75.382 ms (0 allocations: 0 bytes)
sum(A) = 100000000
#include <iostream>
// #include <unistd.h>
#include <vector>
#include <chrono>

template <typename T>
int sum(std::vector<std::vector<T>> &A)
{
    int S = 0;
    for (auto x = 0; x < size(A); ++x)
        for (auto y = 0; y < size(A[0]); ++y)
            // usleep(0.000001);
            S += A[x][y];
    return S;
}

int main()
{
    for (auto L : {10, 100, 1000, 10000})
    {
        std::vector<std::vector<int>> A(L, std::vector<int>(L, 1));
        for (auto x = 0; x < size(A); ++x)
            for (auto y = 0; y < size(A[0]); ++y)
                A[x][y] = 1;
        auto N = 10;
        auto t_total = std::chrono::duration<double>(0);
        for (auto i = 1; i <= N; ++i)
        {
            auto t_start = std::chrono::steady_clock::now();
            sum(A);
            auto t_end = std::chrono::steady_clock::now();
            t_total += t_end - t_start;
        }
        t_total /= N;
        std::cout << "L = " << L << std::endl;
        std::cout << " t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count() << " ns\n" << std::endl;
        std::cout << "sum = " << sum(A) << std::endl;
    }
    return 0;
}
Compiled with g++ -std=c++17 -O3 -march=native main.cpp -o main, this gives me:
L = 10
t = 4078 ns
sum = 100
L = 100
t = 4442 ns
sum = 10000
L = 1000
t = 4414 ns
sum = 1000000
L = 10000
t = 4274 ns
sum = 100000000
Note that when I replace the summation with usleep(), as commented out in the C++ code, I get longer times for larger arrays, which shows that the time measurement itself works as expected…
Yeah, this was the case here: the compiler optimized away the call to sum(A) because its result was never used. The following example works
#include <iostream>
#include <cstdlib>
#include <vector>
#include <chrono>

template <typename T>
int sum(std::vector<std::vector<T>> &A)
{
    int S = 0;
    for (auto x = 0; x < size(A); ++x)
        for (auto y = 0; y < size(A[0]); ++y)
            S += A[x][y];
    return S;
}

int main()
{
    for (auto L : {10, 100, 1000, 10000})
    {
        std::vector<std::vector<short int>> A(L, std::vector<short int>(L, 1));
        for (auto x = 0; x < size(A); ++x)
            for (auto y = 0; y < size(A[0]); ++y)
                A[x][y] = 1;
        auto N = 100;
        auto t_total = std::chrono::duration<double>(0);
        for (auto i = 1; i <= N; ++i)
        {
            int S = 0;
            auto t_start = std::chrono::steady_clock::now();
            S = sum(A);
            auto t_end = std::chrono::steady_clock::now();
            t_total += t_end - t_start;
            std::cout << S << std::endl;
        }
        t_total /= N;
        std::cout << "L = " << L << std::endl;
        std::cout << " t = " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_total).count() << " ns\n" << std::endl;
    }
    return 0;
}
And produces (after removing the printed sums):
L = 10
t = 4514 ns
L = 100
t = 12138 ns
L = 1000
t = 474792 ns
L = 10000
t = 53492100 ns
OK, so this manually defined version of a sum is much faster in Julia:
using BenchmarkTools

function mysum(A)
    S = zero(eltype(A))
    @simd for i ∈ eachindex(A)
        @inbounds S += A[i]
    end
    return S
end

for L ∈ [10, 100, 1000, 10000]
    @show L
    A = fill(Int8(1), L, L)
    @btime sum($A)
    @btime mysum($A)
end
results in
L = 10
120.125 ns (0 allocations: 0 bytes)
24.963 ns (0 allocations: 0 bytes)
L = 100
7.543 μs (0 allocations: 0 bytes)
478.913 ns (0 allocations: 0 bytes)
L = 1000
724.961 μs (0 allocations: 0 bytes)
116.496 μs (0 allocations: 0 bytes)
L = 10000
75.416 ms (0 allocations: 0 bytes)
17.051 ms (0 allocations: 0 bytes)
And using int8_t in the C++ code gives:
L = 10
t = 4237 ns
L = 100
t = 8573 ns
L = 1000
t = 354.543 μs
L = 10000
t = 34.411147 ms