Performance of loops

With the functions as in the OP I get

julia> using Random

julia> x = Random.rand(Int64, 2)
2-element Vector{Int64}:
  -260682396598444552
 -7056639715971435629

julia> loop_with_map(x);

julia> x
2-element Vector{Int64}:
  -260682396598444552
 -7056639715971435629
# x still the same

but

julia> loop_with_map_dot(x);

julia> x
2-element Vector{Int64}:
 -521364793196889104
 4333464641766680358
# x has mutated

Similarly for loop_via_broadcast: x is mutated there as well.
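
For reference, here are hypothetical definitions that are consistent with the behavior above; the OP's actual code may differ:

# Hypothetical reconstructions, consistent with the observed behavior;
# the OP's actual definitions may differ.
loop_with_map(x)      = map(v -> 2v, x)         # returns a new array; x is untouched
loop_with_map_dot(x)  = (x .= map(v -> 2v, x))  # .= writes into x, so x mutates
loop_via_broadcast(x) = (x .= 2 .* x)           # broadcast assignment also mutates x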

A meta comment is that you almost never want a loop to perform an operation as simple as multiplying elements by 2 if performance matters. Your granularity is too small. You want to combine loops so that you make as few passes over your arrays as possible and do lots of work per iteration.

Coarse-graining your loops also helps you to eliminate temporary arrays.

(Languages like Matlab and Python teach people the bad habit of decomposing every algorithm into sequences of simple β€œvectorized” steps, to work around the performance limitations of their loops.)
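
As an illustration (a minimal sketch with made-up steps, not the OP's code), compare a sequence of vectorized passes with one coarse-grained loop:

# Three "vectorized" steps: three passes over the data, two temporaries.
y = 2 .* x          # pass 1, allocates y
z = y .+ 1          # pass 2, allocates z
s = sum(abs2, z)    # pass 3

# One coarse-grained loop: a single pass, no temporary arrays.
function one_pass(x)
    s = zero(eltype(x))
    for v in x
        s += abs2(2v + 1)
    end
    return s
end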

5 Likes

Yes, only the mutating functions should have a trailing ! in the name. If you see x = ..., it’s assignment, not mutation.
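
A minimal sketch of the difference (hypothetical function names):

# `x = ...` rebinds the local name; the caller's array is unchanged.
function double_rebind(x)
    x = 2 .* x      # new array bound to the local `x`
    return x
end

# `x .= ...` writes into the existing array; the caller sees the change.
function double!(x)
    x .= 2 .* x     # mutates in place, hence the trailing !
    return x
end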

So it should be

function loop_with_map_dot!(x) ...

but

function loop_with_map(x) ...

is fine, as it does not mutate?

1 Like

Broadcasting is more generic in that it also works well on GPU arrays. As a general rule, I’d recommend x .*= 2 for that reason.
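
For example (a sketch assuming CUDA.jl and an NVIDIA GPU; other GPU array packages behave similarly):

using CUDA

xd = CUDA.rand(Float32, 1024)  # array living on the GPU
xd .*= 2                       # runs as a single GPU kernel
# A CPU-style for loop indexing xd[i] would instead trigger slow
# scalar indexing, which CUDA.jl disallows by default.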

In light of the OP’s context:

It doesn’t sound to me like GPU computing matters here, and then broadcasting risks being slower, is generally less expressive than loops, and is possibly harder to read.

If this were really true, then broadcasting would have no point at all. In many cases broadcasting is just as fast, safer, and definitely easier to read. Not always, but often.

I see that my short statement might convey the wrong message.
I assumed that the OP likely has operations that are generally more complicated than multiplying all elements of an array by 2. Broadcasting is great for smallish and rather simple operations, where it is also very readable. However, in general, explicit loops offer the most expressive power (i.e., there is loop code that you cannot reasonably express as a broadcast; see the sketch below). So a for loop is the most basic construct for efficient looping, while broadcasting can make things nicer to read and perhaps a bit easier to write, if the problem is suited for broadcasting.
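
For example, a loop with a dependency between iterations, such as a running sum, cannot reasonably be written as a single broadcast (a minimal sketch):

function running_sum!(y, x)
    acc = zero(eltype(x))
    for i in eachindex(x, y)
        acc += x[i]      # each iteration depends on the previous one
        y[i] = acc
    end
    return y
end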

And for performance, I was referring to my earlier point that broadcasting risks being slower.

So especially for beginners, I would recommend writing correct loop-based code first and then learning to see opportunities where broadcasting can simplify things. However, GPU computing might be an exception.

1 Like

Writing loop-based code can in some cases be significantly more challenging; in particular, you need to pre-allocate and figure out correct element types, which can be highly non-trivial. Broadcasting is also likely to be very familiar to beginners coming from the Python/Matlab/R world.
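
For instance (a small sketch of the element-type pitfall):

x = [1, 2, 3]
y = x ./ 2                 # broadcast infers Vector{Float64} for you

out = similar(x)           # Vector{Int}: the wrong eltype for division
# out .= x ./ 2            # would throw InexactError storing 0.5 into an Int
out = similar(x, Float64)  # a loop needs the element type chosen up front
for i in eachindex(x)
    out[i] = x[i] / 2
end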

I would definitely recommend both styles from the very beginning.

2 Likes

Thanks for your comments and advice. The original code misses exclamation marks in the function signatures, and the use of map! opens up aliasing issues if the mutated argument shares memory with any other argument. So I conclude that map and map! are indeed slow and not recommended right now; promoting map! on my end will sooner or later lead to an aliasing issue once the subtleties of the documentation are forgotten.

It is better to use broadcast or broadcast! as a replacement for map and map!. An allocation for the result of the calculation can be avoided if a pre-existing collection can be overwritten using .=. From the documentation: β€œA special syntax exists for broadcasting: f.(args…) is equivalent to broadcast(f, args…)”, so the dot notation may as well be used on the right-hand side instead of typing broadcast in full. Combining these two gives several dots, and the @. macro can be used to keep track of the dots automatically.

Also, broadcasting may allocate more memory than expected. With a for loop it is unlikely that performance or clarity is lost.
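
Putting those pieces together (a small sketch of the equivalences mentioned above):

x = rand(3); y = similar(x)

broadcast!(v -> 2v + 1, y, x)  # explicit broadcast! writing into y
y .= 2 .* x .+ 1               # the same thing with dot syntax
@. y = 2x + 1                  # @. inserts all the dots automatically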

One thing I learned while thinking about this is that you really do have to explicitly broadcast the multiplication; otherwise the non-dotted 2x on the right-hand side allocates a full temporary array before .= copies it into x, and the following forms are slow.

julia> function f(x)
           x .= 2x
       end
f (generic function with 1 method)

julia> @benchmark f($x)
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
 Range (min … max):  507.515 ns … 61.053 ΞΌs  β”Š GC (min … max):  0.00% … 98.65%
 Time  (median):     817.332 ns              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):     1.165 ΞΌs Β±  1.813 ΞΌs  β”Š GC (mean Β± Οƒ):  27.94% Β± 16.65%

  β–ƒβ–ˆβ–‡β–…                                                     ▁   ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–…β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β–ˆ
  508 ns        Histogram: log(frequency) by time      10.5 ΞΌs <

 Memory estimate: 8.00 KiB, allocs estimate: 1.

julia> function f(x)
           x .= 2*x
       end
f (generic function with 1 method)

julia> @benchmark f($x)
BenchmarkTools.Trial: 10000 samples with 193 evaluations.
 Range (min … max):  506.047 ns … 61.707 ΞΌs  β”Š GC (min … max):  0.00% … 98.48%
 Time  (median):     823.187 ns              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):     1.195 ΞΌs Β±  1.886 ΞΌs  β”Š GC (mean Β± Οƒ):  28.36% Β± 16.65%

  β–‚β–ˆβ–†β–…                                                         ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–†β–‡β–ˆβ–ˆβ–ˆβ–‡β–ˆ β–ˆ
  506 ns        Histogram: log(frequency) by time        11 ΞΌs <

 Memory estimate: 8.00 KiB, allocs estimate: 1.

Compare that with the explicitly broadcast forms using .* or @.:

julia> function f(x)
           x .= 2 .* x
       end
f (generic function with 1 method)

julia> @benchmark f($x)
BenchmarkTools.Trial: 10000 samples with 944 evaluations.
 Range (min … max):   99.664 ns … 220.824 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     101.076 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   101.127 ns Β±   2.302 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–…β–‚     β–†β–ˆβ–„β–‚β– ▁▁  ▁▁▃▁▁    ▁▂▁                                β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–‡β–‡β–‡β–†β–‡β–†β–‡β–†β–‡β–†β–†β–„β–…β–…β–„β–‡β–„β–…β–„β–„β–ƒβ–„β–„β–„β–β–†β–… β–ˆ
  99.7 ns       Histogram: log(frequency) by time        109 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> function f(x)
           @. x = 2x
       end
f (generic function with 1 method)

julia> @benchmark f($x)
BenchmarkTools.Trial: 10000 samples with 946 evaluations.
 Range (min … max):   99.673 ns … 129.185 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     101.040 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   101.023 ns Β±   1.653 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  ▇▆▁▂       β–ƒβ–ˆβ–…β–„β–ƒ            ▁           β–‚                     ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–…β–…β–…β–†β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–†β–‡β–†β–‡β–†β–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–†β–ˆβ–…β–‡β–†β–ˆβ–ˆβ–ˆβ–‡β–…β–†β–…β–†β–†β–†β–†β–…β–…β–…β–„β–†β–…β–†β–…β–‡β–…β–…β–† β–ˆ
  99.7 ns       Histogram: log(frequency) by time        106 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
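
A quick way to confirm the allocation difference without BenchmarkTools is Base’s @allocated (a sketch; the exact byte count depends on the length of x):

x = rand(1024)
g!(x) = (x .= 2x;     x)   # non-dotted 2x allocates a temporary
h!(x) = (x .= 2 .* x; x)   # fully fused, allocation-free
g!(x); h!(x)               # warm up so compilation isn't measured
@allocated g!(x)           # one ~8 KiB temporary
@allocated h!(x)           # 0 bytes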

6 Likes