I was surprised to find a fairly simple broadcasting operation on somewhat large arrays be considerably slower in Julia relative to MATLAB. Is there some unattainable level of optimization that folks at MathWorks have figured out or can I improve my Julia code to match/beat it?
Here’s an example. To give Julia the best chance, I’m going to use mutating functions to keep the number of allocations to zero. I tried it with allocating functions too – they’re predictably slower.
function f!(D,A,B,C)
D .= max.(A.*B.+C, 0.2)
end
A = rand(100,100,1);
B = rand(1,100,100);
C = rand(100,1,100);
D = zeros(100,100,100);
@btime f!($D,$A,$B,$C);
gives me 1.822 ms (0 allocations: 0 bytes)
. The MATLAB equivalent is 0.35 ms, 5x slower (see MATLAB code below).
I tried decomposing f! into the arithmetic operation and the max to see if one is particularly slower. Not really:
function g!(D,A,B,C)
D .= A.*B.+C
end
function h!(D)
D .= max.(D,0.2)
end
@btime g!($D,$A,$B,$C);
@btime h!($D);
and I get
1.062 ms (0 allocations: 0 bytes)
1.146 ms (0 allocations: 0 bytes)
so each takes about the same amount of time.
Compare to MATLAB:
A = rand(100,100,1);
B = rand(1,100,100);
C = rand(100,1,100);
tf = 0; tg = 0; th = 0;
N = 10;
for i=1:N
tic; tmp = f(A,B,C); ti = toc; tf = tf+ti;
tic; tmp = g(A,B,C); ti = toc; tg = tg+ti;
tic; tmp = h(tmp); ti = toc; th = th+ti;
end
fprintf(' %0.3f ms\n', tg/N*1000);
fprintf(' %0.3f ms\n', th/N*1000);
fprintf(' %0.3f ms\n', tf/N*1000);
function D = f(A,B,C)
D = max(A.*B+C, 0.2);
end
function D = g(A,B,C)
D = A.*B+C;
end
function D = h(D)
D = max(D, 0.2);
end
which prints
0.339 ms
0.483 ms
0.350 ms
Notice that:
- Component functions g and h are at least 2x faster in MATLAB (0.339 ms vs 1.06 ms and 0.483 ms vs 1.146 ms)
- The combined function f is about the same speed as eachcomponent in MATLAB (0.35 ms ~ 0.339 and 0.483), while the corresponding function in Julia is just a little faster than the sum of the components (1.062 + 1.146 ~ 1.822).
Clearly, MATLAB must be doing some low-level optimization here. I don’t think BLAS is involved since it’s all element operations being broadcasted (there’s no matrix math), but I’m out of my depth when it comes to low-level stuff like this.
So my question is – can I speed up the Julia code? In my application it’s in a hot loop and the arrays are big, so there’s a large payoff to shaving microseconds off the MWE’s running time, if possible.