Element-wise vector multiplication and fusing dot

Hi, sorry for asking this frequently-asked question once again.

I’m using v0.5 and wondering what the current status is for element-wise vector multiplication (aside from the newly merged PR https://github.com/JuliaLang/julia/pull/17623, which will only affect future versions).

I ran the following code:

foo!(x, y, z) = z .= x .* y

x = rand(1000)
y = rand(1000)
z = similar(y)

for k = 1:5
  @time foo!(x, y, z)
  @time z .= x .* y
  @time z .= .*(x, y)
  @time z = x .* y
  @time z = .*(x, y)
  println("-----------------------------------------------------------------")
end

and here’s the output:

  0.000016 seconds (3 allocations: 7.984 KB)
  0.000003 seconds (3 allocations: 7.984 KB)
  0.000003 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
-----------------------------------------------------------------
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000004 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
-----------------------------------------------------------------
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
-----------------------------------------------------------------
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000001 seconds (3 allocations: 7.984 KB)
  0.000001 seconds (3 allocations: 7.984 KB)
-----------------------------------------------------------------
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000002 seconds (3 allocations: 7.984 KB)
  0.000001 seconds (3 allocations: 7.984 KB)
-----------------------------------------------------------------

So it seems that I don’t have to use .= and .*(x, y)? Instead, I can feel free to use the regular equals sign = and the auto-fusing dot x .* y? Am I right? Any comments are appreciated!

Here are a few old threads that I have referred to in the past on this topic:

https://groups.google.com/forum/?hl=en#!topic/julia-users/rvsLd8bdTHU
http://julia-programming-language.2336112.n4.nabble.com/Does-new-quot-dot-quot-syntax-fuse-td48830.html#a48840

Thanks a lot!!

Don’t do timing measurements in global scope. I would also strongly recommend the BenchmarkTools package, and using @benchmark myfunc(...) rather than @time myfunc(...), since @benchmark will run a timing loop for you and gather statistics.
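For example, with BenchmarkTools installed (Pkg.add("BenchmarkTools") on 0.5), the $ interpolates the global arguments into the benchmark expression so that their being untyped globals doesn’t skew the measurement:

using BenchmarkTools
@benchmark foo!($x, $y, $z)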

In Julia 0.5 and older versions, the .* operator is defined, but it is just a function call and is not fusing. In particular, it does not fuse with the assignment in .=, so z .= x .* y is equivalent to:

tmp = x .* y  # allocate a new temporary array for elementwise x * y
z .= tmp      # write the tmp into z, equivalent to copy!(z, tmp)

If you want the fusing version of this operation in 0.5, you need z .= (*).(x,y)
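For same-length vectors, that fused call ends up being a single in-place loop, roughly like this sketch (not the exact code broadcast! generates):

for i in eachindex(z, x, y)    # one pass over the data, writing straight into z
    z[i] = x[i] * y[i]         # no temporary array is allocated
end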

With bar!(x, y, z) = z .= (*).(x, y), I get:

julia> @benchmark foo!($x, $y, $z)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     77
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  7.98 kb
  allocs estimate:  3
  minimum time:     795.00 ns (0.00% GC)
  median time:      1.23 μs (0.00% GC)
  mean time:        1.95 μs (25.90% GC)
  maximum time:     36.09 μs (67.13% GC)

julia> @benchmark bar!($x, $y, $z)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     413
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  32.00 bytes
  allocs estimate:  1
  minimum time:     241.00 ns (0.00% GC)
  median time:      248.00 ns (0.00% GC)
  mean time:        289.79 ns (0.58% GC)
  maximum time:     4.35 μs (92.72% GC)

Thank you so much for the hints. I have very limited knowledge in this area, so bear with me for all kinds of dumb questions.

Why is this?

Thanks for the tips!

So under the hood, the dot . is just simple looping in Julia 0.5 and older versions?

[quote=“stevengj, post:2, topic:1228, full:true”]
In particular, it does not fuse with the assignment in .=, so z .= x .* y is equivalent to:

tmp = x .* y  # allocate a new temporary array for elementwise x * y
z .= tmp      # write the tmp into z, equivalent to copy!(z, tmp)
[/quote]
I think my understanding of the term "fusing" is very vague. I googled but couldn't find a good article explaining what fusing is. Could you please point me to a good explanation of fusing?

[quote="stevengj, post:2, topic:1228, full:true"]
If you want the fusing version of this operation in 0.5, you need `z .= (*).(x,y)`
[/quote]
Thanks! Is it the same for addition, subtraction, division, etc.?

Sorry for being annoying. Thanks!!

More precisely, you can do @benchmark or @time in the global scope, but the actual code that is being benchmarked should always be in a function, i.e. @benchmark foo(...), not @benchmark ...do something...

For one reason, see the performance tips in the manual on global variables. There are other more subtle reasons having to do with anonymous functions (produced by fusion) sometimes forcing a recompile as part of your timing measurement.
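As a rough illustration (with x, y, z, and foo! as defined above; exact numbers will vary), the contrast is between timing the expression at top level, where the arguments are untyped globals, and timing a call to a function whose arguments have concrete types:

@time z .= x .* y      # top level: the compiler knows nothing about the types of x, y, z

foo!(x, y, z)          # call once first so compilation isn't included in the timing
@time foo!(x, y, z)    # inside foo!, the code is specialized for the argument types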

Technically, everything is “simple looping” under the hood. All vectorized operations must have a loop somewhere. (In other languages like Matlab, that loop might have to occur in a low-level language like C or Fortran, buried in a library routine, but it is still there.)

The question is how many loops you have, and how many arrays you allocate in the meantime. In Julia 0.5 and earlier, x .* y is a call to a function .*(x,y) that does not “know” about the context in which you are calling it, so it has to perform its own loop and allocate its own array. This is true even if you are evaluating z .= x .* y, in which case the caller discards the x .* y temporary array after copying it to z.
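Written out as loops, the unfused z .= x .* y of 0.5 therefore amounts to roughly the following sketch (two loops and one temporary array, versus the single loop and zero array allocations of the fused form):

tmp = similar(x)                 # .*(x, y) allocates its own result array
for i in eachindex(tmp, x, y)    # first loop, inside .*
    tmp[i] = x[i] * y[i]
end
for i in eachindex(z, tmp)       # second loop, from the .= assignment
    z[i] = tmp[i]
end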

As it happens, I’m working on a blog post on this very topic; you can read a draft of the article here if you want.

Yes. To be honest, if I were you I wouldn’t worry about it at this point. Fusion is just an optimization, and it sounds like you are worrying about optimizations prematurely. Element-wise operations (.*, .^, etcetera) work fine in Julia 0.5 and earlier, and have performance comparable to vectorized operations in other languages. In Julia 0.6, they will be faster (because they will combine/fuse with surrounding vectorized operations), that’s all.
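For example, the same pre-0.6 workaround fuses any operator or function into the assignment:

z .= (+).(x, y)    # element-wise addition, fused with the assignment
z .= (-).(x, y)    # subtraction
z .= (/).(x, y)    # division
z .= f.(x, y)      # any two-argument function f, likewise fused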


Thanks. I see now.

Thanks for the explanation. It’s much clearer now.

Thanks for the pointer to the blog article, which is really helpful to me.

Indeed, I think I don’t have to worry much about fusion. The existing element-wise operations are already good enough for me. I asked about fusion simply because I thought fusion and element-wise operations were pretty much the same thing. Apparently, I was wrong.

Thanks for all your answers, which have taught me a lot.

I thought the last section “The importance of higher-order inlining” was very interesting. I’ve been wondering whether there was a tension between loop fusion and SIMD optimizations. It seemed to me that A.*B.+C could end up faster than f(a, b, c) = a*b+c; f.(A, B, C) if the first version used SIMD. I thought the second version wouldn’t be able to do that, but now I see how inlining solves that problem.

One thing I’m unclear about: if f contains some operations without SIMD support (erf?), will SIMD occur at all, or is this impossible? I’d think it could operate chunk by chunk, doing SIMD operations on what it can, then finishing serially, but this is probably tricky.

You could potentially divide your computations into blocks that fit in the cache and then do vectorized operations on those blocks to make SIMD easier. That could potentially be faster than fusing the whole loop.

I don’t follow. First, SIMD and cache locality are rather different concerns. Second, fusing the whole loop (for elementwise operations) already maximizes locality — you don’t get any more locality by dividing the loop into blocks.

Oh, I see what you mean: if you have an operation like X .= 2 .* erf.(X) where erf doesn’t use SIMD, you would compute X .= erf.(X) in-place in X for blocks that fit into cache, then, while the block is still in cache, compute X .= 2 .* X.
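A rough sketch of that blocked idea (the helper name and block size are made up for illustration; erf is in Base on 0.5/0.6 and in SpecialFunctions on later versions):

function blocked_scale_erf!(X, blocksize = 64)
    n = length(X)
    i = 1
    while i <= n
        j = min(i + blocksize - 1, n)
        for k = i:j                  # scalar pass: erf has no SIMD version
            @inbounds X[k] = erf(X[k])
        end
        @simd for k = i:j            # second pass over the same, still-cached block
            @inbounds X[k] = 2 * X[k]
        end
        i = j + 1
    end
    return X
end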

To actually gain from this, you would probably have to do it with registers playing the role of the cache: e.g. you load 4 elements of X into registers, compute erf for them, then load the 4 results into a SIMD register and compute 2x on them all at once.

(Ideally, this would be something that compilers can do for you, but it doesn’t look like LLVM is capable of this.)

In practice, however, it is probably rarely worth it to do this optimization by hand. The functions that don’t work with SIMD are mostly going to be relatively complicated functions like erf, which will be expensive enough that the relative improvement of using SIMD for the remaining scalar operations is probably negligible unless you have a lot of them.


Yes, that is what I meant. You are likely right it is seldom worth it in practice.

A quick question: how can I have the results printed for multiple uses of @benchmark? I’m running the following file, but only the result for the last @benchmark is printed at the console.

using BenchmarkTools

x = rand(1000)
y = rand(1000)
z = similar(y)

foo!(x, y, z) = z .= x .* y
bar!(x, y, z) = z .= (*).(x, y)
function foobar!(x, y, z)
  for j = 1:length(x)
    z[j] = x[j] * y[j]
  end
end

@benchmark foo!($x, $y, $z)
@benchmark bar!($x, $y, $z)
@benchmark foobar!($x, $y, $z)

Thanks!

println(@benchmark foo!($x, $y, $z))
println(@benchmark bar!($x, $y, $z))
println(@benchmark foobar!($x, $y, $z))
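Alternatively, wrapping each call in display should also work in a script; it prints the full report that the REPL would show for each trial, e.g.:

display(@benchmark foo!($x, $y, $z))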

Thanks for the prompt answer!