Dear all, thanks in advance for any reply.
Coming from MATLAB and new to Julia, I was impressed by the benchmarks reported for Julia and thought of rewriting some of my code in Julia. Before doing so, I wanted to run a simple test myself.

I am running a simulation in which an N-by-3 array needs to be updated in each step by a matrix multiplication and the addition of a random noise array. Simply put, I’m running a loop as follows:

for nIdx in 1:50000
    pos = pos + L*pos + randn(N, 3)
end

where L is a 100x100 matrix and pos is preallocated. All variables have the same type, so I believe there is no type instability.
A comparison of this loop (using @time) with equivalent MATLAB code yields similar timings for both (about 1 s).

Am I missing something in the implementation? Can I speed up this loop in Julia?
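For what it’s worth, one pitfall when timing Julia this way is compilation: the first call to a function includes compile time, so a fair measurement wraps the loop in a function and times a second call. A minimal sketch of how I measured (the name update! and the warm-up call are my own, not from the original code):

```julia
# Allocating version of the update loop, wrapped in a function so @time
# measures the loop itself rather than global-scope overhead.
function update!(pos, L, nsteps)
    N = size(pos, 1)
    for _ in 1:nsteps
        pos = pos + L*pos + randn(N, 3)   # allocates three temporaries per step
    end
    return pos
end

pos = rand(100, 3)
L   = rand(100, 100)
update!(pos, L, 1)            # warm-up call: triggers compilation
@time update!(pos, L, 1_000)  # now reports run time and allocations only
```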

I also wouldn’t expect this particular code to be much faster than MATLAB anyway, since it’s probably spending most of its time in L*pos or other library functions.

Thanks @yuyichao.
You’re right that about 75% of the time is spent on L*pos.
I am preallocating the pos array; I just didn’t include that in the code above.
In addition, is it normal to have about 800 MB of memory allocation for this kind of loop? I measured it with @time by running a function wrapped around the loop, with pos as an input.

That’s what I meant by “it’s not preallocating”: none of the preallocation is actually used.

Each of the operators returns a new matrix, so the code above allocates multiple arrays. Writing it in a way that fully utilizes preallocation will make the code less readable. We are adding syntactic sugar on 0.6/master so that pos[:] .+= L*pos .+ randn(N, 3) should only allocate L*pos and randn(N, 3). Doing this on the current release requires either writing out the loop manually (that shouldn’t be particularly hard, but it does make the code longer) or using the broadcast! function: broadcast!(+, pos, pos, L*pos, randn(N, 3)). You can also use A_mul_B! and randn! to make L*pos and randn(N, 3) use preallocated buffers.
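To make that concrete, here is a sketch of a fully in-place update step using preallocated buffers. I’m writing it with the modern names mul! and randn! from LinearAlgebra/Random (on 0.6 the in-place multiply was A_mul_B!, as mentioned above); the function name step! and the buffer names are my own:

```julia
using LinearAlgebra, Random

# One step of pos = pos + L*pos + noise with no per-step allocation.
# buf and noise are preallocated N×3 scratch arrays.
function step!(pos, L, buf, noise)
    mul!(buf, L, pos)       # buf = L*pos, written into the preallocated buffer
    randn!(noise)           # refill the noise buffer in place
    pos .+= buf .+ noise    # fused broadcast: updates pos with no temporaries
    return pos
end

pos   = rand(100, 3)
L     = rand(100, 100)
buf   = similar(pos)
noise = similar(pos)
step!(pos, L, buf, noise)
```

All the allocation happens once, before the loop; calling step! 50,000 times then allocates nothing.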

Probably you already figured this out, but if you rewrite the code entirely with scalar ops, then you won’t require any memory allocation and deallocation. There may be some loss of performance because of poorer use of the memory hierarchy than built-in matrix multiplication, but I think overall it will be a win since your matrices and vectors are not so large. Explicit loops reduce the readability of the code, but this can be partly hidden using macro packages like Devectorize or Einsum.

In my finite element code, I use explicit loops and scalar ops for all dense matrix-vector operations on smaller matrices and vectors. I use matrix/vector ops for operations on big matrices like the global stiffness matrix. I’m reasonably happy with the performance.
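As a concrete sketch of the scalar-ops approach (my own illustration, assuming pos is N×3 with a preallocated N×3 scratch buffer buf):

```julia
# One update step written entirely with scalar ops: a hand-written
# matrix multiply into buf, then the in-place update with noise.
# Zero allocations per call.
function step_scalar!(pos, L, buf)
    N = size(L, 1)
    @inbounds for j in 1:3, i in 1:N
        s = 0.0
        for k in 1:N
            s += L[i, k] * pos[k, j]   # buf[i, j] = (L*pos)[i, j]
        end
        buf[i, j] = s
    end
    @inbounds for j in 1:3, i in 1:N
        pos[i, j] += buf[i, j] + randn()
    end
    return pos
end
```

Whether this beats BLAS for a 100x100 multiply depends on the machine; for small matrices the avoided allocations and call overhead can make it competitive.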

This is probably about the fastest you can get easily:

function f!(pos, L, N)
    pos2 = similar(pos)
    for n = 1:N
        A_mul_B!(pos2, L, pos)
        @inbounds for i in eachindex(pos)
            pos[i] += pos2[i] + randn()
        end
    end
    return pos
end

Indeed @yuyichao, the broadcast! function reduces the memory usage by up to a factor of 3; the timing remains roughly the same. I’ll try to implement the other suggestions you gave me and see what I come up with.

True @juthohaegeman, it shaves off about 0.1 s and about 120 MB in allocations. It seems insignificant now, but when I increase the size of the pos array and the number of steps it will save me some precious seconds. Thanks!

@juthohaegeman tried both. It doesn’t seem to go below 0.8 s; I’m guessing differences in architecture might explain the gap between your timings and mine?
I should probably report timings relative to MATLAB or C: with your f! function I’m getting a 1:1 ratio with MATLAB.

I don’t want to be too greedy, but this piece of code is the basis for a stochastic simulation, which in time should include several thousand particles, interactions, and so forth. So I wanted to see the gain in time compared to the MATLAB code I’m currently using (any fold increase in speed means fewer days/weeks of waiting for results), and in particular I wanted to know whether I’m doing it right.

I would also be interested in seeing a comparison with GPU programming, which I now see Julia supports (another thread, maybe). Specifically, the rand() function in the benchmarks seems to be nearly 100 times faster on a GPU, and hopefully the L*pos operation would be as well. Any experience with that?

One complication in doing timing comparisons with Matlab is that a lot of vectorized operations in Matlab default to using multiple threads, whereas the corresponding Julia code does not. If you start Matlab with the -singleCompThread option and do BLAS.set_num_threads(1) in Julia, you can compare single-threaded performance to single-threaded performance. Once you are satisfied with the serial performance, then you can parallelize.
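Concretely, the setup for a like-for-like serial comparison looks like this (on Julia 0.6 the BLAS module lived under Base.LinAlg; the get_num_threads check used below only exists in newer Julia versions, so treat that part as an assumption):

```julia
using LinearAlgebra   # BLAS lives here on Julia ≥ 0.7

BLAS.set_num_threads(1)   # pin Julia's BLAS to a single thread

# ...and on the MATLAB side, start it with:
#   matlab -singleCompThread
```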

This question was asked some time ago, but I wanted to add some new and important updates. JuliaPro v0.6 binaries are now available with MKL support. This is an important update because comparisons with MATLAB are now fair, as both use MKL for matrix multiplication. I ran the same code in MATLAB and Julia as follows:

tic
N = 100;
pos = rand(N,3);
L = rand(N,N);
for i = 1:50000
    pos = pos + L*pos + randn(N,3);
end
toc

and Julia:

function fm(pos, L, n)
    Lpos = similar(pos)
    rr = similar(pos)
    for i = 1:n
        pos .= pos .+ A_mul_B!(Lpos, L, pos) .+ randn!(rr)
    end
end
@time fm(rand(100,3), rand(100,100), 50_000)

So, the difference between OpenBLAS and MKL shows up again in this test, but more importantly, Julia shines here when compared to MATLAB even though both call the same library routines.

function fgm(pos, L, n)
    rr = similar(pos)
    for i = 1:n
        randn!(rr)
        # gemm! overwrites rr in place: rr = 1.0*L*pos + 1.0*rr
        pos .= pos .+ Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, L, pos, 1.0, rr)
    end
end

EDIT: with OpenBLAS, gemm!’s advantage was relatively smaller but absolutely bigger (542.5 → 533 ms).
This was a different install on the same computer. Both are Julia v0.6: the MKL version from JuliaPro, and OpenBLAS built from source. I also built an MKL version from source, but that one is currently on v0.7, where BenchmarkTools is broken. There, @time tends to give both versions just under 187.8 ms, with gemm! likely a little (negligibly) faster.