Have you seen this comparison of Python, Julia, R, Matlab and IDL. There are interesting tables with results:
Oh yeah, this guy who we had to email to tell him that Julia is column major and not row major (I think it was @IsmaelVC who ended up emailing him back in January). He says he fixed that , but if you open up even just problem 3, youâ€™ll see:
function regularTimeStep(u::Matrix)
(n, m) = size(u)
err = 0.0
for i in 2:n1
for j in 2:m1
tmp = u[i,j]
u[i,j] = ((u[i1, j] + u[i+1, j] +
u[i, j1] + u[i, j+1])*4.0 +
u[i1,j1] + u[i1,j+1] +
u[i+1,j1] + u[i+1,j+1])/20.0
diff = u[i,j]  tmp
err = err + diff*diff
end
end
return u, sqrt(err)
end
Hell, itâ€™s not just getting the indexing ordering wrong, itâ€™s also keeping bounds checks on when itâ€™s unnecessary. Even the simplest problem, Problem 1 is missing it:
function foo1(A)
N = size(A, 1)
for j = 1:N, i = 1:N
A[i,j,1] = A[i,j,2]
A[i,j,3] = A[i,j,1]
A[i,j,2] = A[i,j,3]
end
end
Iâ€™m sure anyone whoâ€™s been using Julia for more than a week would see that @inbounds
is missing:
function foo2(A)
N = size(A, 1)
@inbounds for j = 1:N, i = 1:N
A[i,j,1] = A[i,j,2]
A[i,j,3] = A[i,j,1]
A[i,j,2] = A[i,j,3]
end
end
A = rand(100,100,100)
using BenchmarkTools
@btime foo1($A)
14.051 ÎĽs (0 allocations: 0 bytes)
@btime foo1($A)
10.538 ÎĽs (0 allocations: 0 bytes)
So take it as a lower bound for Juliaâ€™s performance since any user whoâ€™s been using it for a week will get better results. Hell, I would take this whole set of benchmarks with a grain of salt if you havenâ€™t checked the codes yourself given the author didnâ€™t even get row vs column ordering before first posting the results (twice! After he read the email he updated the table to say Julia is columnmajor, but didnâ€™t update the code ) .
These benchmarks seem pretty flawed to me:

In the first problem, they permute an NxNx3 matrix in all languages, despite the fact that that some are columnmajor and some are rowmajor so the cache characteristics are different. And they benchmark the Julia â€śvectorizedâ€ť version in global scope, with outofplace slicing (as opposed to
@views
). 
The second benchmark just measures BLAS
dgemm
performance. i.e. all that matters is which BLAS library is linked. 
In the third benchmark (GaussJacobi solver), the Julia implementationâ€™s loops are in the wrong order for columnmajor (unlike the Fortran and Numba versions, which both have the correct loop order).

The fourth benchmark includes a plotting library in the timing, which seems weird. (And they â€ścouldnâ€™t build the plotting libraryâ€ť in Julia.) A big part of the benchmark seems to be reading NetCDF files, which all use the same external C library. A bunch of the Julia code seems to be really suboptimal (allocating temporary arrays all over the place), but itâ€™s hard to tell if it matters because itâ€™s really unclear to me what this benchmark is measuring.
(This is putting aside the opportunity for lots of microoptimizations, like @inbounds
or @simd
annotations. There is also likely an inconsistency in how many threads were used.)