I find that summing elements with a for loop is more than 1000 times slower for a matrix than for a vector. Consider the following two functions, which sum the elements of a 100×100 matrix and of a vector with the same number of elements (10,000):
julia> VERSION
v"0.6.0-pre.beta.112"
julia> function sumelem_mat(X)
           s = 0.0
           for j = 1:100, i = 1:100
               @inbounds s += X[i,j]
           end
           return s
       end
sumelem_mat (generic function with 1 method)
julia> function sumelem_vec(x)
           s = 0.0
           for n = 1:10000
               @inbounds s += x[n]
           end
           return s
       end
sumelem_vec (generic function with 1 method)
If these two functions are applied to a 100×100 matrix and to the vector obtained by linearizing that matrix, they return the same value:
julia> X = rand(100,100); x = X[:];
julia> sumelem_mat(X)
5005.445543677584
julia> sumelem_vec(x)
5005.445543677584
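As a sanity check that both loops visit the elements in the same order, here is a minimal sketch. It assumes Julia's column-major storage, so that `X[i,j]` of an m×n matrix sits at linear index `i + (j-1)*m` of `X[:]`. The helper name `same_order` is mine and only for illustration:

```julia
# Hypothetical helper: checks that walking X with i as the inner index
# (as in sumelem_mat) matches walking x = X[:] linearly (as in sumelem_vec).
function same_order(X, x)
    m, n = size(X)
    for j = 1:n, i = 1:m          # i varies fastest, as in sumelem_mat
        X[i, j] == x[i + (j - 1) * m] || return false
    end
    return true
end

X = rand(100, 100); x = X[:];
same_order(X, x)
```

If this holds, the two functions add the same numbers in the same order, which would be consistent with the sums above agreeing to the last digit.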
However, the matrix version is more than 1000 times slower than the vector version:
julia> using BenchmarkTools
julia> @btime sumelem_mat($X)
3.777 μs (0 allocations: 0 bytes)
julia> @btime sumelem_vec($x)
3.046 ns (0 allocations: 0 bytes)
Note the difference in units in the results above: the matrix version is timed in microseconds and the vector version in nanoseconds, a ratio of roughly 1240.
Here are my questions:

What is the origin of such a large difference in performance between the two functions?

Is there a way to make the matrix version as fast as the vector version?