Arithmetic broadcasting in Julia 5x slower than MATLAB

I was surprised to find that a fairly simple broadcasting operation on somewhat large arrays is considerably slower in Julia than in MATLAB. Is there some unattainable level of optimization that the folks at MathWorks have figured out, or can I improve my Julia code to match/beat it?

Here’s an example. To give Julia the best chance, I’m going to use mutating functions to keep the number of allocations to zero. I tried it with allocating functions too – they’re predictably slower.

function f!(D,A,B,C)
	D .= max.(A.*B.+C, 0.2)
end
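
For reference, the allocating version was essentially this; it allocates a fresh 100×100×100 output array on every call:

function f(A,B,C)
	max.(A.*B.+C, 0.2)
end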

A = rand(100,100,1);
B = rand(1,100,100);
C = rand(100,1,100);
D = zeros(100,100,100);

using BenchmarkTools
@btime f!($D,$A,$B,$C);

gives me 1.822 ms (0 allocations: 0 bytes). The MATLAB equivalent takes 0.35 ms, making Julia about 5x slower (see MATLAB code below).

I tried decomposing f! into the arithmetic operation and the max to see if one of them is particularly slow. Not really:

function g!(D,A,B,C)
	D .= A.*B.+C
end

function h!(D)
	D .= max.(D,0.2)
end

@btime g!($D,$A,$B,$C);
@btime h!($D);

and I get

  1.062 ms (0 allocations: 0 bytes)
  1.146 ms (0 allocations: 0 bytes)

so each takes about the same amount of time.

Compare to MATLAB:

A = rand(100,100,1);
B = rand(1,100,100);
C = rand(100,1,100);

tf = 0; tg = 0; th = 0;
N = 10;
for i=1:N
	tic; tmp = f(A,B,C); ti = toc; tf = tf+ti;
	tic; tmp = g(A,B,C); ti = toc; tg = tg+ti;
	tic; tmp = h(tmp); ti = toc; th = th+ti;
end
fprintf(' %0.3f ms\n', tg/N*1000);
fprintf(' %0.3f ms\n', th/N*1000);
fprintf(' %0.3f ms\n', tf/N*1000);

function D = f(A,B,C)
	D = max(A.*B+C, 0.2);
end

function D = g(A,B,C)
	D = A.*B+C;
end

function D = h(D)
	D = max(D, 0.2);
end

which prints

 0.339 ms
 0.483 ms
 0.350 ms

Notice that:

  • Component functions g and h are at least 2x faster in MATLAB (0.339 ms vs 1.062 ms and 0.483 ms vs 1.146 ms).
  • The combined function f is about as fast as each individual component in MATLAB (0.350 ms ~ 0.339 and 0.483), while the corresponding Julia function is only a little faster than the sum of its components (1.062 + 1.146 ~ 1.822).

Clearly, MATLAB must be doing some low-level optimization here. I don’t think BLAS is involved, since these are all element-wise operations being broadcast (there’s no matrix math), but I’m out of my depth when it comes to low-level stuff like this.

So my question is – can I speed up the Julia code? In my application it’s in a hot loop and the arrays are big, so there’s a large payoff to shaving microseconds off the MWE’s running time, if possible.

1 Like

Is MATLAB automatically threading this? You can probably make this faster by using LoopVectorization. The following runs 5x faster than your original f! for me:

using LoopVectorization

function f2!(D,A,B,C)
	@turbo D .= max.(A.*B.+C, 0.2)
end

and using @tturbo (the threaded version of @turbo) gives another 4x speedup:

function f3!(D,A,B,C)
	@tturbo D .= max.(A.*B.+C, 0.2)
end
7 Likes

Thanks! Good call on the multi-threading. I keep forgetting MATLAB does it by default for a lot of functions. I restarted it with matlab -singleCompThread and re-ran my tests; the g() and h() times are now on par with the original Julia times. But f() is still as fast as g() and h() are individually, which still makes it faster than Julia:

 1.140 ms
 1.324 ms
 1.301 ms

I tried LoopVectorization with @turbo and did get >2x speed-ups in each component function, as well as the interaction effect that makes f() about as fast as each component (just like in MATLAB). This yields about a 4x speedup compared to the original – you got 5x, but that could be a hardware difference or sampling error.

Two questions:

  1. For @tturbo to get further speed-ups, I need to run Julia with more threads, right? Your 4x speed-up is a function of the number of threads you ran it with, right?
  2. What does LoopVectorization do? I just looked at the documentation and I honestly couldn’t figure it out.

  1. Yes. Launch Julia with julia --threads=4 to use 4 threads.
  2. LoopVectorization essentially replaces Julia’s compiler with a smarter one for whatever loop or broadcast expression you put @turbo or @tturbo in front of. It has a cost model and can rewrite loops or vectorized code to take better advantage of the processor’s vector units.
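
For intuition, @turbo also works on explicit loops; a hand-rolled equivalent of f! (f_loops! is an illustrative name) could look like this:

using LoopVectorization

function f_loops!(D,A,B,C)
	@turbo for k in axes(D,3), j in axes(D,2), i in axes(D,1)
		D[i,j,k] = max(A[i,j,1]*B[1,j,k] + C[i,1,k], 0.2)
	end
end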
3 Likes

Have you tried using MKL?

Is this really the way to measure the time? To me it seems like you want tf to sum up all the ti, so it should be tf = tf + ti; right now it seems like you would just get tf = 2*toc, though then it is a little strange that the versions are so equal. I’m not well versed in MATLAB, so I might have missed something here.

2 Likes

I have not. I just looked into it, and it seems like it’s kind of a global change – the documentation says you need to load MKL.jl before any other package, meaning it has to go in my startup script instead of at the top of whatever module I’m writing.

Do you think it’s worth it for this problem? Do you think it’s worth it elsewhere to the point that I should use it as a default?

You’re right, of course. I copied my code from the wrong window, where I first wrote it with the mistake you pointed out, but the reported times are still right. I’ll edit the code above.

I think it’s more accurate to use the timeit function when benchmarking in MATLAB. It does something similar to Julia’s @btime.

2 Likes

It doesn’t seem to matter for me:

julia> @btime f!($D,$A,$B,$C);
  2.270 ms (0 allocations: 0 bytes)

julia> using MKL

julia> @btime f!($D,$A,$B,$C);
  2.269 ms (0 allocations: 0 bytes)

but this is somewhat CPU-dependent, I believe, and the computer I’m on now is ancient. Also, since Julia 1.7, MKL is no longer a fixed global setting; you can switch the BLAS backend dynamically at runtime.
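
For example, a minimal REPL sketch (BLAS.get_config requires Julia 1.7+):

julia> using LinearAlgebra

julia> BLAS.get_config()   # reports the active backend, OpenBLAS by default

julia> using MKL

julia> BLAS.get_config()   # should now report MKL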

1 Like

@tturbo inline=true seems to speed things up by another 20-30% or so relative to @tturbo.
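
Concretely, something like this (a variant of f3! above):

function f4!(D,A,B,C)
	@tturbo inline=true D .= max.(A.*B.+C, 0.2)
end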

1 Like

BLAS isn’t involved in broadcasting, afaik.

3 Likes

Could you share your versioninfo()?
Your CPU may have AVX512, which LLVM won’t really take advantage of by default, but LoopVectorization will. This would explain a 2x speedup.
You can also check if you have AVX512 via:

julia> LoopVectorization.VectorizationBase.has_feature(Val(:x86_64_avx512f))
static(true)

Returns static(true) if LV will try to use AVX512, static(false) otherwise.

If you do have AVX512, you can try starting julia with julia -C"native,-prefer-256-bit" -t4.
-t4 is shorthand for using 4 threads.
-prefer-256-bit disables LLVM’s prefer-256-bit option, i.e. removes the preference for 256-bit vectors over the full 512 bits AVX512 offers.

the corresponding function in Julia is just a little faster than the sum of the components (1.062 + 1.146 ~ 1.822)

Note that much of the cost of these operations is actually the process of iterating over the arrays, both fetching the memory and loading them into registers.
If the arrays are small, loading them into registers can easily be more expensive than the actual arithmetic.
If the arrays are so large that they do not fit in the cache, memory bandwidth will dominate.

Hence, “fusing” the operations can give a speed up, as it saves you from looping over D – by far your largest array – twice.
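
To make that concrete, here is a sketch of the two strategies (the names are illustrative):

# One pass over D: every element is written exactly once.
function f_fused!(D,A,B,C)
	D .= max.(A.*B.+C, 0.2)
end

# Two passes over D: it is written, then read and rewritten.
function f_twopass!(D,A,B,C)
	D .= A.*B.+C
	D .= max.(D, 0.2)
end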

BLAS isn’t involved in broadcasting, afaik.

Yes, BLAS is not involved in broadcasting.

@tturbo inline=true seems to speed things up by another 20-30% or so relative to @tturbo.

Interesting! I should adjust the heuristic so that it does this automatically for that problem.

1 Like

Oh, also this is critically important to note when using LoopVectorization here:

A = rand(100,100,1);
B = LowDimArray{(false, true, true)}(rand(1,100,100));  # LowDimArray comes from LoopVectorization
C = rand(100,1,100);

You need to wrap an array in LowDimArray when its contiguous dimension (the first dimension, for column-major arrays) has size 1!
Otherwise, you’ll likely get a segfault, or at least wrong answers!

It probably wouldn’t hurt to give A and C the same treatment.
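
Something like this, assuming (as with B above) that false marks the size-1 axes:

A = LowDimArray{(true, true, false)}(rand(100,100,1));
B = LowDimArray{(false, true, true)}(rand(1,100,100));
C = LowDimArray{(true, false, true)}(rand(100,1,100));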

I have a plan of how to fix this, but do not have the time myself.
If anyone else wants to dive in, I’d be happy to explain how.
Otherwise, that’d be something to expect eventually with the rewrite.

1 Like

versioninfo() returns

Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)        
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz
  WORD_SIZE: 64    
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, haswell)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

But has_feature() returns static(false), so it looks like that’s not it.

Not sure if I should’ve mentioned I’m still on Julia 1.6. But this is an old project with lots of existing code and I’m worried stuff will break if I upgrade.

I understand that, but I was just surprised that MATLAB seems to fuse “better” than Julia, i.e. time(f) / [time(g) + time(h)] was smaller in MATLAB than in Julia – a finding that seems separate from the relative speeds of the individual components.

Heh, good to know. Thanks. Is this documented somewhere? I’m happy that @tturbo sped things up, but it still feels more like a random piece of code I copied from a forum than a feature I understand how and where to use.

If it’s so smart, why isn’t it the default? Is it sometimes suboptimal, or are the loops it applies to rare enough that checking for its use cases everywhere would slow down compilation too much?

  1. LoopVectorization is not safe for all loops.
  2. The compilation cost is higher: the first execution takes > 5 s in a fresh REPL (later calls are faster).
  3. We don’t have it in Base.
3 Likes

a rewrite of it is currently in the works that will make it much easier to incorporate as a compiler pass

7 Likes