What is idiomatic way of vectorizing longer operations in Julia?

Hello.

I have recently tried writing some code and need help with vectorization. I like that Julia’s loops are efficient, but sometimes I want to write less verbose code. How to do it in this case? What are the best practices? Below is an example of what I mean:

Imagine you have a vector a and you are interested in knowing the sum of a[i]/(1 + a[i]). If you want to get it done, you might be tempted to write:

function f1(a)
    return sum( a ./ (1. .+ a))
end

But you might also write

function f2(a)
    Val = 0.0
    for i in each index(a)
        Val += a[i] / (1. + a[i])
    end
    return value
end

Ignoring any typing issues, that is, a is an array of Float64.

Then you have

using BenchmarkTools

a = rand(9000000)

@btime f1(a)
# 22.670 ms (3 allocations: 68.66 MiB)

@btime f2(a)
# 15.136 ms (1 allocation: 16 bytes)

Which is a slight difference, but it quickly adds up.

How to make the f1 as efficient memory and time-wise?

julia> @btime mapreduce(x -> x/(1+x), +, a)
  2.042 ms (1 allocation: 16 bytes)
2.7611581397048105e6

mapreduce seems pretty good. probably there is more wizardry to be done for another factor of 2 or so if you’re willing to try a lot harder

1 Like

For your particular example:

function f3(a)
  return sum(x -> (x / (1+x)), a)
end

@btime f3($a)
2.043 ms (0 allocations: 0 bytes)

But your question is probably more general. Tullio.jl is one way of writing complex loops in a very compact format.

What??? Very cool, thank you!

Why is mapreduce faster than the for loop?

Are there other functions like that I should know about?

Also I feel it loses a bit of the readability that the vectorization has

mapreduce likely uses optimizations along the lines of

julia> function f2(a)
           vOut = 0.0
           @simd for a1 in a
               vOut += a1 / (1.0 + a1)
           end
           return vOut
       end
f2 (generic function with 1 method)

julia> @btime f2($a)
  2.114 ms (0 allocations: 0 bytes)
2.76175963830546e6

Now it’s the same speed at f3

1 Like

If you want to stick to one liners and vectorized operations, I suggest using LoopVectorization and LazyArrays. When you have to sum a broadcasted operation, LazyArrays is usually the best option, it achieves almost the same speed as mapreduce.

Specifically:

using BenchmarkTools
a = rand(1_000)

#baseline
f1(a) =  sum( a ./ (1. .+ a))
@btime f1($a)


#with a generator
f2(a) = sum(x / (1. + x) for x in a)
@btime f2($a)

# with lazyarray (notice @~ in the broadcasted part)
using LazyArrays
f3(a) =  sum(@~ a ./ (1. .+ a))
@btime f3($a)


#with @turbo
using LoopVectorization
f4(a) = sum(@turbo a ./ (1. .+ a))
@btime f4($a)

# with mapreduce
f5(a) = mapreduce(x -> x/(1+x), +, a)
@btime f5($a)

with time

  693.377 ns (1 allocation: 7.94 KiB) # baseline
  889.583 ns (0 allocations: 0 bytes) # generator
  517.188 ns (0 allocations: 0 bytes) # Lazy Array option
  551.875 ns (1 allocation: 7.94 KiB) # LoopVectorization
  454.315 ns (0 allocations: 0 bytes) # mapreduce

I wrote a week ago how to keep a balance between speed and clear code, with emphasis on vectorized operations. See here