I recently did some benchmark testing between Julia and Matlab, focusing on basic for-loops and vectorized numerical operations. To my surprise, Julia performed significantly worse than Matlab in some very simple scenarios. I’m trying to determine:
Is this due to my code implementation (missing flags, wrong usage, etc.)?
Or is this an actual limitation of Julia’s current handling of such math-heavy element-wise loops?
Here’s the test I ran in Julia:
using BenchmarkTools
arr = randn(Int(1e7))
vals = similar(arr)
# for loop
@btime @inbounds @simd for i in eachindex(arr)
    vals[i] = sin(arr[i])
end
# for loop with multi-threading (JULIA_NUM_THREADS = 8)
println("threads: $(Threads.nthreads())")
@btime Threads.@threads for i in eachindex(arr)
    vals[i] = sin(arr[i])
end
# vectorized
@btime vals2 = sin.(arr);
The outputs are:
for loop: 922.618 ms (39998981 allocations: 610.34 MiB)
for loop with multi-threading: 204.304 ms (39999023 allocations: 610.34 MiB)
vectorized: 87.596 ms (4 allocations: 76.29 MiB)
Matlab version:
clear;
clc;
arr = randn(1e7,1);
vals = zeros(size(arr));
tic;
for ii = 1:length(arr)
    vals(ii) = sin(arr(ii));
end
t1 = toc;
tic;
vals2 = sin(arr);
t2 = toc;
The outputs are:
for loop: t1 = 0.1735 s
vectorized: t2 = 0.0285 s
I expected Julia’s loop to be competitive, especially with @inbounds, @simd, and @threads. But in my test, even Matlab’s plain loop beat Julia’s multi-threaded version, and Matlab’s vectorized call was roughly 3x faster than Julia’s broadcast.
I’m evaluating Julia for scientific computing (currently using Matlab heavily), and would appreciate any insights or correction of my assumptions!
Welcome! Yes, this is an artifact of how you’re benchmarking. Julia — unlike Matlab — is very function-centric. You get significantly better performance by putting your code inside functions and passing the data as function arguments. And the way you’re using @btime there is treating arr and vals like non-constant globals — as though you wrote that loop outside a function.
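For example, the function-centric version looks something like this (the name sin_loop! is just illustrative):
using BenchmarkTools

function sin_loop!(out, x)
    # Inside a function, the compiler knows the concrete types of out
    # and x, so this loop compiles to a tight, specialized kernel.
    @inbounds @simd for i in eachindex(out, x)
        out[i] = sin(x[i])
    end
    return out
end

sin_loop!(vals, arr)  # the first call compiles; later calls run at full speed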
Alternatively, you can tell @btime to treat these like function arguments by interpolating them with $. This interpolation syntax is specific to BenchmarkTools' macros, not something you'd write in ordinary Julia code.
julia> @btime @inbounds @simd for i in eachindex(arr)
           vals[i] = sin(arr[i])
       end
  918.916 ms (39998981 allocations: 610.34 MiB)

julia> @btime for i in eachindex($arr)
           $vals[i] = sin($arr[i])
       end
  90.637 ms (0 allocations: 0 bytes)
No need for @inbounds and @simd here: the cost of sin dominates each iteration, and the scalar sin call can't be SIMD-vectorized by @simd anyway, so those annotations make no measurable difference.
I’m not sure what Matlab is doing in its vectorized version, but I’d suspect threads. Broadcast doesn’t use threads by default, but it does provide a speedup:
julia> Threads.nthreads()
6

julia> @btime Threads.@threads for i in eachindex($arr)
           $vals[i] = sin($arr[i])
       end
  16.317 ms (32 allocations: 3.20 KiB)
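Relatedly, the 4 allocations / 76.29 MiB in your vectorized timing are just the freshly allocated output array (10^7 Float64s ≈ 76 MiB). If you want the equivalent of Matlab's vals2 = sin(arr) without allocating, you can broadcast in place with .= (a sketch):
# Fused in-place broadcast: writes into the existing array, no allocation.
vals .= sin.(arr)

@btime $vals .= sin.($arr);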
Also keep in mind that when you call code like this in Matlab, it is directly running a highly optimized C or Fortran implementation, possibly multithreaded. So it’s not Julia vs Matlab, but Julia vs C/Fortran. You should not normally expect Julia to outperform Matlab in those cases.
Keep in mind that adding $ is not a performance tip. It is only a trick to substitute something into the @btime macro by value instead of by name. If anything, I would suggest writing as follows if you don’t want to write a separate function:
@btime (@inbounds for i in eachindex(arr)
    vals[i] = sin(arr[i])
end) setup=(arr=$arr; vals=$vals)
Typically, you want to benchmark function calls:
function set_sin!(target, args)
    @inbounds for i in eachindex(target, args)
        target[i] = sin(args[i])
    end
end
@btime set_sin!(vals, arr)
As for why Matlab is still faster than single-threaded Julia, there are two possible explanations, which may well come into play together.
First, as already mentioned, Matlab may use multiple threads by default.
Second, the algorithm for element-wise sine may be tailored to array inputs: when computing many sines at once, there can be a faster arrangement of the arithmetic (for instance, SIMD kernels that process several elements per instruction) than evaluating each one sequentially.
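If you want that same kind of batched kernel in Julia, one option is LoopVectorization.jl (a sketch, assuming the package is installed; set_sin_turbo! is just an illustrative name). Its @turbo macro vectorizes the loop and substitutes SIMD-friendly implementations of functions like sin:
using BenchmarkTools
using LoopVectorization

function set_sin_turbo!(target, args)
    # @turbo processes several elements per instruction, similar in
    # spirit to what an optimized vendor library does for sin(array).
    @turbo for i in eachindex(target, args)
        target[i] = sin(args[i])
    end
end

@btime set_sin_turbo!($vals, $arr)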
As I wrote in another thread, comparing to optimized library routines (like a vectorized sin(array) call) is not particularly interesting except to verify that Julia can produce good compiled code. Where you hope to get significant performance improvements is in cases where you have to write new performance-critical code, not simply call one or two library functions.
The reason is that in the loop example, Julia had to access untyped global variables on every iteration, so each time through the loop it had no idea what vals or arr might be. In the broadcast example, vals and arr are unknown only once, at the very start, before they are passed into the broadcast machinery; from that point on the inner loop runs knowing exactly what they are.
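One way to see this for yourself (a sketch; the names carr and cvals are just illustrative): declare the globals const, so their types are fixed and known to the compiler, and the loop becomes type-stable even without $ interpolation:
using BenchmarkTools

const carr = randn(10^7)    # a const global has a known, fixed type
const cvals = similar(carr)

# No $ needed: the compiler can infer carr::Vector{Float64} and
# cvals::Vector{Float64}, so the per-iteration overhead disappears.
@btime for i in eachindex(carr)
    cvals[i] = sin(carr[i])
end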