Is there a maximum(f, op, itrs...) in Julia?

Sometimes I see the following pattern in code. How can I write this in a more-efficient/elegant way? Without writing explicit loops, of course.

a = rand(5,5);
b = rand(5,5);
delta = maximum(abs, i-j for (i,j) in zip(a,b))

Ideally, I imagine something like:

 delta = maximum(abs, -, a, b) 

Is this possible in Julia?

julia> a = rand(5,5);
julia> b = rand(5,5);
julia> delta = maximum(abs, i-j for (i,j) in zip(a,b))
0.8146851910185748
julia> maximum(abs, a .- b)
0.8146851910185748

Yes, delta = mapreduce(abs∘-, max, a, b). However, this doesn’t (yet) have a fast implementation, so the zip may be better.

maximum(x->abs(x[1]-x[2]), zip(a, b))

@jzr, the dot seems to be unnecessary? Same result with: maximum(abs, a - b)

maximum(Base.splat(abs∘-), zip(a, b))

What does a ā€œfast implementationā€ mean in this context? Is mapreduce implemented in a not very efficient way and there is potential to improve it?

No, mapreduce is pretty well optimized, but only for specific cases since we can dispatch on different functions for those (functions have their own type after all). abs∘- is unlikely to be one of those cases so far, so it’ll fall back to a generic implementation (which may not SIMD etc.) Depending on the functions and iterators in use, mapreduce may be specialized to those and achieve a greater speedup than the fallback implementation can provide.

Fastest I’ve seen so far:

jl> using Tullio

jl> @tullio (max) maxval := abs(a[i] - b[i])

@mcabbott must there be an output variable inside the macro call? Could one hypothetically write

jl> maxval = @tullio (max) abs(a[i] - b[i])

?

(Oops: fixed missing abs)

For now there must be, although it could be changed I suppose. You can write @tullio (max) _ := abs(a[i] - b[i]) which at least saves thinking up a name.

What does a ā€œfast implementationā€ mean in this context?

If you do @less mapreduce(abs∘-, max, a, b) you will see this:

mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
    reduce(op, map(f, A...); kw...)

i.e. it performs map which allocates the array, then reduce. I think this is a placeholder implementation, to allow the syntax now and make it fast later. It’s very much like maximum(abs.(a.-b)), while with one array maximum(abs, a) is mapreduce(abs, max, a) and this does not allocate abs.(a). The issue about the multiple-argument case is #38558.

Thank you all for your contributions, here I present some benchmarks for different alternatives. Tullio seems to be the only alternative with the same performance as explicit loops.

function max_abs(a,b)
    mx = 0.0
    for i in eachindex(a,b)
        tmp = abs(a[i]-b[i]) 
        tmp > mx && (mx = tmp)
    end 
    mx 
end 

N = 500
a = rand(N,N)
b = rand(N,N)

@btime max_abs($a,$b)                            # 188.500 μs (0 allocations: 0 bytes)
@btime maximum(abs(i-j) for (i,j) in zip($a,$b)) # 442.200 μs (0 allocations: 0 bytes)
@btime maximum(abs, $a - $b)                     # 568.800 μs (2 allocations: 1.91 MiB)
@btime maximum(Base.splat(abs∘-), zip($a, $b))   # 442.300 μs (0 allocations: 0 bytes)
@btime mapreduce(abs∘-, max, $a, $b)             # 572.900 μs (2 allocations: 1.91 MiB)
using Tullio
@btime @tullio (max) delta := abs($a[i] - $b[i]) # 188.800 μs (1 allocation: 16 bytes)
using LoopVectorization
@btime @tullio (max) delta := abs($a[i] - $b[i]) #  75.300 μs (1 allocation: 16 bytes)

Not sure if it will make a difference, but I would use

mx = max(mx, tmp) 

to avoid the branch, which could in principle prevent vectorization. It’s also clearer.

This is nicer undoubtedly, but the performance drops to the generator version, it needs a @fastmath to keep the same performance in this case.

An extended version of Seif Shebl’s excellent benchmark test

I tried the benchmark tests for N = 100, 200, 500, 1000, and 2000.

The results look different for each N.

Julia script:

# maxabstest.jl

using BenchmarkTools

N = isempty(ARGS) ? 500 : parse(Int, ARGS[1])

function max_abs(a, b)
    m = zero(promote_type(eltype(a), eltype(b)))
    for i in eachindex(a, b)
        tmp = abs(a[i] - b[i]) 
        tmp > m && (m = tmp)
    end 
    m
end

a = randn(N, N)
b = randn(N, N)

@show VERSION
@show Threads.nthreads()
println()
@show N
print("simple for loop:     ")
@btime max_abs($a, $b)
print("mapreduce abs∘- max: ")
@btime mapreduce(abs∘-, max, $a, $b)
print("maximum(abs, a - b): ")
@btime maximum(abs, $a - $b)
print("maximum(generator):  ")
@btime maximum(abs(i - j) for (i, j) in zip($a, $b))
print("maximum splat(abs∘-):")
@btime maximum(Base.splat(abs∘-), zip($a, $b))
using Tullio
print("Tullio:              ")
@btime @tullio (max) _ := abs($a[i] - $b[i])
print("Tullio (LoopVect.):  ")
using LoopVectorization
@btime @tullio (max) _ := abs($a[i] - $b[i])

function max_abs_turbo(a, b)
    m = zero(promote_type(eltype(a), eltype(b)))
    @turbo for i in eachindex(a, b)
        m = max(m, abs(a[i] - b[i]))
    end 
    m
end

function max_abs_tturbo(a, b)
    m = zero(promote_type(eltype(a), eltype(b)))
    @tturbo for i in eachindex(a, b)
        m = max(m, abs(a[i] - b[i]))
    end 
    m
end

print("LoopVect. @turbo:    ")
@btime max_abs_turbo($a, $b)
print("LoopVect. @tturbo:   ")
@btime max_abs_tturbo($a, $b)

Results for N = 100, 200, 500, 1000, 2000

$ julia maxabstest.jl 100
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 100
simple for loop:       8.733 μs (0 allocations: 0 bytes)
maximum(abs, a - b):   10.100 μs (2 allocations: 78.20 KiB)
maximum(generator):    18.100 μs (0 allocations: 0 bytes)
maximum splat(abs∘-):  18.100 μs (0 allocations: 0 bytes)
mapreduce abs∘- max:   11.300 μs (2 allocations: 78.20 KiB)
Tullio:                8.767 μs (1 allocation: 16 bytes)
Tullio (LoopVect.):    1.290 μs (1 allocation: 16 bytes)
LoopVect. @turbo:      1.240 μs (0 allocations: 0 bytes)
LoopVect. @tturbo:     722.689 ns (0 allocations: 0 bytes)
$ julia maxabstest.jl 200
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 200
simple for loop:       34.900 μs (0 allocations: 0 bytes)
maximum(abs, a - b):   34.500 μs (2 allocations: 312.58 KiB)
maximum(generator):    71.600 μs (0 allocations: 0 bytes)
maximum splat(abs∘-):  71.700 μs (0 allocations: 0 bytes)
mapreduce abs∘- max:   42.300 μs (2 allocations: 312.58 KiB)
Tullio:                34.900 μs (1 allocation: 16 bytes)
Tullio (LoopVect.):    8.400 μs (1 allocation: 16 bytes)
LoopVect. @turbo:      9.600 μs (0 allocations: 0 bytes)
LoopVect. @tturbo:     2.010 μs (0 allocations: 0 bytes)
$ julia maxabstest.jl 500
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 500
simple for loop:       213.400 μs (0 allocations: 0 bytes)
maximum(abs, a - b):   444.500 μs (2 allocations: 1.91 MiB)
maximum(generator):    447.000 μs (0 allocations: 0 bytes)
maximum splat(abs∘-):  447.100 μs (0 allocations: 0 bytes)
mapreduce abs∘- max:   454.600 μs (2 allocations: 1.91 MiB)
Tullio:                213.400 μs (1 allocation: 16 bytes)
Tullio (LoopVect.):    55.900 μs (1 allocation: 16 bytes)
LoopVect. @turbo:      56.100 μs (0 allocations: 0 bytes)
LoopVect. @tturbo:     12.700 μs (0 allocations: 0 bytes)
$ julia maxabstest.jl 1000
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 1000
simple for loop:       1.053 ms (0 allocations: 0 bytes)
mapreduce abs∘- max:   2.354 ms (2 allocations: 7.63 MiB)
maximum(abs, a - b):   2.240 ms (2 allocations: 7.63 MiB)
maximum(generator):    1.939 ms (0 allocations: 0 bytes)
maximum splat(abs∘-):  1.937 ms (0 allocations: 0 bytes)
Tullio:                263.000 μs (93 allocations: 4.73 KiB)
Tullio (LoopVect.):    257.600 μs (93 allocations: 4.73 KiB)
LoopVect. @turbo:      410.200 μs (0 allocations: 0 bytes)
LoopVect. @tturbo:     207.600 μs (0 allocations: 0 bytes)
$ julia maxabstest.jl 2000
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 2000
simple for loop:       4.193 ms (0 allocations: 0 bytes)
mapreduce abs∘- max:   10.075 ms (2 allocations: 30.52 MiB)
maximum(abs, a - b):   11.004 ms (2 allocations: 30.52 MiB)
maximum(generator):    7.903 ms (0 allocations: 0 bytes)
maximum splat(abs∘-):  7.896 ms (0 allocations: 0 bytes)
Tullio:                2.388 ms (199 allocations: 10.17 KiB)
Tullio (LoopVect.):    2.855 ms (200 allocations: 10.20 KiB)
LoopVect. @turbo:      2.540 ms (0 allocations: 0 bytes)
LoopVect. @tturbo:     2.275 ms (0 allocations: 0 bytes)

Just to mention another option Home Ā· Transducers.jl supports map/reduce operations.