Is there a maximum(f, op, itrs...) in Julia?

Sometimes I see the following pattern in code. How can I write this in a more-efficient/elegant way? Without writing explicit loops, of course.

a = rand(5,5);
b = rand(5,5);
delta = maximum(abs, i-j for (i,j) in zip(a,b))

Ideally, I imagine something like:

 delta = maximum(abs, -, a, b) 

Is this possible in Julia?

julia> a = rand(5,5);
julia> b = rand(5,5);
julia> delta = maximum(abs, i-j for (i,j) in zip(a,b))
0.8146851910185748
julia> maximum(abs, a .- b)
0.8146851910185748
3 Likes

Yes, delta = mapreduce(absāˆ˜-, max, a, b). However, this doesnā€™t (yet) have a fast implementation, so the zip may be better.

4 Likes
maximum(x->abs(x[1]-x[2]), zip(a, b))
3 Likes

@jzr, the dot seems to be unnecessary? Same result with: maximum(abs, a - b)

1 Like
maximum(Base.splat(absāˆ˜-), zip(a, b))
2 Likes

What does a ā€œfast implementationā€ mean in this context? Is mapreduce implemented in a not very efficient way and there is potential to improve it?

2 Likes

No, mapreduce is pretty well optimized, but only for specific cases since we can dispatch on different functions for those (functions have their own type after all). absāˆ˜- is unlikely to be one of those cases so far, so itā€™ll fall back to a generic implementation (which may not SIMD etc.) Depending on the functions and iterators in use, mapreduce may be specialized to those and achieve a greater speedup than the fallback implementation can provide.

2 Likes

Fastest Iā€™ve seen so far:

jl> using Tullio

jl> @tullio (max) maxval := abs(a[i] - b[i])

@mcabbott must there be an output variable inside the macro call? Could one hypothetically write

jl> maxval = @tullio (max) abs(a[i] - b[i])

?

(Oops: fixed missing abs)

2 Likes

For now there must be, although it could be changed I suppose. You can write @tullio (max) _ := abs(a[i] - b[i]) which at least saves thinking up a name.

What does a ā€œfast implementationā€ mean in this context?

If you do @less mapreduce(absāˆ˜-, max, a, b) you will see this:

mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
    reduce(op, map(f, A...); kw...)

i.e. it performs map which allocates the array, then reduce. I think this is a placeholder implementation, to allow the syntax now and make it fast later. Itā€™s very much like maximum(abs.(a.-b)), while with one array maximum(abs, a) is mapreduce(abs, max, a) and this does not allocate abs.(a). The issue about the multiple-argument case is #38558.

1 Like

Thank you all for your contributions, here I present some benchmarks for different alternatives. Tullio seems to be the only alternative with the same performance as explicit loops.

function max_abs(a,b)
    mx = 0.0
    for i in eachindex(a,b)
        tmp = abs(a[i]-b[i]) 
        tmp > mx && (mx = tmp)
    end 
    mx 
end 

N = 500
a = rand(N,N)
b = rand(N,N)

@btime max_abs($a,$b)                            # 188.500 Ī¼s (0 allocations: 0 bytes)
@btime maximum(abs(i-j) for (i,j) in zip($a,$b)) # 442.200 Ī¼s (0 allocations: 0 bytes)
@btime maximum(abs, $a - $b)                     # 568.800 Ī¼s (2 allocations: 1.91 MiB)
@btime maximum(Base.splat(absāˆ˜-), zip($a, $b))   # 442.300 Ī¼s (0 allocations: 0 bytes)
@btime mapreduce(absāˆ˜-, max, $a, $b)             # 572.900 Ī¼s (2 allocations: 1.91 MiB)
using Tullio
@btime @tullio (max) delta := abs($a[i] - $b[i]) # 188.800 Ī¼s (1 allocation: 16 bytes)
using LoopVectorization
@btime @tullio (max) delta := abs($a[i] - $b[i]) #  75.300 Ī¼s (1 allocation: 16 bytes)
2 Likes

Not sure if it will make a difference, but I would use

mx = max(mx, tmp) 

to avoid the branch, which could in principle prevent vectorization. Itā€™s also clearer.

This is nicer undoubtedly, but the performance drops to the generator version, it needs a @fastmath to keep the same performance in this case.

An extended version of Seif Sheblā€™s excellent benchmark test

I tried the benchmark tests for N = 100, 200, 500, 1000, and 2000.

The results look different for each N.

Julia script:

# maxabstest.jl

using BenchmarkTools

N = isempty(ARGS) ? 500 : parse(Int, ARGS[1])

function max_abs(a, b)
    m = zero(promote_type(eltype(a), eltype(b)))
    for i in eachindex(a, b)
        tmp = abs(a[i] - b[i]) 
        tmp > m && (m = tmp)
    end 
    m
end

a = randn(N, N)
b = randn(N, N)

@show VERSION
@show Threads.nthreads()
println()
@show N
print("simple for loop:     ")
@btime max_abs($a, $b)
print("mapreduce absāˆ˜- max: ")
@btime mapreduce(absāˆ˜-, max, $a, $b)
print("maximum(abs, a - b): ")
@btime maximum(abs, $a - $b)
print("maximum(generator):  ")
@btime maximum(abs(i - j) for (i, j) in zip($a, $b))
print("maximum splat(absāˆ˜-):")
@btime maximum(Base.splat(absāˆ˜-), zip($a, $b))
using Tullio
print("Tullio:              ")
@btime @tullio (max) _ := abs($a[i] - $b[i])
print("Tullio (LoopVect.):  ")
using LoopVectorization
@btime @tullio (max) _ := abs($a[i] - $b[i])

function max_abs_turbo(a, b)
    m = zero(promote_type(eltype(a), eltype(b)))
    @turbo for i in eachindex(a, b)
        m = max(m, abs(a[i] - b[i]))
    end 
    m
end

function max_abs_tturbo(a, b)
    m = zero(promote_type(eltype(a), eltype(b)))
    @tturbo for i in eachindex(a, b)
        m = max(m, abs(a[i] - b[i]))
    end 
    m
end

print("LoopVect. @turbo:    ")
@btime max_abs_turbo($a, $b)
print("LoopVect. @tturbo:   ")
@btime max_abs_tturbo($a, $b)

Results for N = 100, 200, 500, 1000, 2000

$ julia maxabstest.jl 100
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 100
simple for loop:       8.733 Ī¼s (0 allocations: 0 bytes)
maximum(abs, a - b):   10.100 Ī¼s (2 allocations: 78.20 KiB)
maximum(generator):    18.100 Ī¼s (0 allocations: 0 bytes)
maximum splat(absāˆ˜-):  18.100 Ī¼s (0 allocations: 0 bytes)
mapreduce absāˆ˜- max:   11.300 Ī¼s (2 allocations: 78.20 KiB)
Tullio:                8.767 Ī¼s (1 allocation: 16 bytes)
Tullio (LoopVect.):    1.290 Ī¼s (1 allocation: 16 bytes)
LoopVect. @turbo:      1.240 Ī¼s (0 allocations: 0 bytes)
LoopVect. @tturbo:     722.689 ns (0 allocations: 0 bytes)
$ julia maxabstest.jl 200
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 200
simple for loop:       34.900 Ī¼s (0 allocations: 0 bytes)
maximum(abs, a - b):   34.500 Ī¼s (2 allocations: 312.58 KiB)
maximum(generator):    71.600 Ī¼s (0 allocations: 0 bytes)
maximum splat(absāˆ˜-):  71.700 Ī¼s (0 allocations: 0 bytes)
mapreduce absāˆ˜- max:   42.300 Ī¼s (2 allocations: 312.58 KiB)
Tullio:                34.900 Ī¼s (1 allocation: 16 bytes)
Tullio (LoopVect.):    8.400 Ī¼s (1 allocation: 16 bytes)
LoopVect. @turbo:      9.600 Ī¼s (0 allocations: 0 bytes)
LoopVect. @tturbo:     2.010 Ī¼s (0 allocations: 0 bytes)
$ julia maxabstest.jl 500
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 500
simple for loop:       213.400 Ī¼s (0 allocations: 0 bytes)
maximum(abs, a - b):   444.500 Ī¼s (2 allocations: 1.91 MiB)
maximum(generator):    447.000 Ī¼s (0 allocations: 0 bytes)
maximum splat(absāˆ˜-):  447.100 Ī¼s (0 allocations: 0 bytes)
mapreduce absāˆ˜- max:   454.600 Ī¼s (2 allocations: 1.91 MiB)
Tullio:                213.400 Ī¼s (1 allocation: 16 bytes)
Tullio (LoopVect.):    55.900 Ī¼s (1 allocation: 16 bytes)
LoopVect. @turbo:      56.100 Ī¼s (0 allocations: 0 bytes)
LoopVect. @tturbo:     12.700 Ī¼s (0 allocations: 0 bytes)
$ julia maxabstest.jl 1000
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 1000
simple for loop:       1.053 ms (0 allocations: 0 bytes)
mapreduce absāˆ˜- max:   2.354 ms (2 allocations: 7.63 MiB)
maximum(abs, a - b):   2.240 ms (2 allocations: 7.63 MiB)
maximum(generator):    1.939 ms (0 allocations: 0 bytes)
maximum splat(absāˆ˜-):  1.937 ms (0 allocations: 0 bytes)
Tullio:                263.000 Ī¼s (93 allocations: 4.73 KiB)
Tullio (LoopVect.):    257.600 Ī¼s (93 allocations: 4.73 KiB)
LoopVect. @turbo:      410.200 Ī¼s (0 allocations: 0 bytes)
LoopVect. @tturbo:     207.600 Ī¼s (0 allocations: 0 bytes)
$ julia maxabstest.jl 2000
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 2000
simple for loop:       4.193 ms (0 allocations: 0 bytes)
mapreduce absāˆ˜- max:   10.075 ms (2 allocations: 30.52 MiB)
maximum(abs, a - b):   11.004 ms (2 allocations: 30.52 MiB)
maximum(generator):    7.903 ms (0 allocations: 0 bytes)
maximum splat(absāˆ˜-):  7.896 ms (0 allocations: 0 bytes)
Tullio:                2.388 ms (199 allocations: 10.17 KiB)
Tullio (LoopVect.):    2.855 ms (200 allocations: 10.20 KiB)
LoopVect. @turbo:      2.540 ms (0 allocations: 0 bytes)
LoopVect. @tturbo:     2.275 ms (0 allocations: 0 bytes)
2 Likes

Just to mention another option Home Ā· Transducers.jl supports map/reduce operations.