Is there a maximum(f, op, itrs...) in Julia?

Seif_Shebl · July 1, 2021, 2:23am

Sometimes I see the following pattern in code. How can I write this in a more-efficient/elegant way? Without writing explicit loops, of course.

a = rand(5,5);
b = rand(5,5);
delta = maximum(abs, i-j for (i,j) in zip(a,b))

Ideally, I imagine something like:

 delta = maximum(abs, -, a, b)

Is this possible in Julia?

jzr · July 1, 2021, 3:34am

julia> a = rand(5,5);
julia> b = rand(5,5);
julia> delta = maximum(abs, i-j for (i,j) in zip(a,b))
0.8146851910185748
julia> maximum(abs, a .- b)
0.8146851910185748

mcabbott · July 1, 2021, 3:34am

Yes, delta = mapreduce(abs∘-, max, a, b). However, this doesn’t (yet) have a fast implementation, so the zip may be better.

DNF · July 1, 2021, 5:14am

maximum(x->abs(x[1]-x[2]), zip(a, b))

rafael.guerra · July 1, 2021, 10:56am

@jzr, the dot seems to be unnecessary? Same result with: maximum(abs, a - b)

genkuroki · July 1, 2021, 11:11am

maximum(Base.splat(abs∘-), zip(a, b))

aplavin · July 1, 2021, 11:12am

What does a “fast implementation” mean in this context? Is mapreduce implemented in a not very efficient way and there is potential to improve it?

Sukera · July 1, 2021, 11:22am

No, mapreduce is pretty well optimized, but only for specific cases since we can dispatch on different functions for those (functions have their own type after all). abs∘- is unlikely to be one of those cases so far, so it’ll fall back to a generic implementation (which may not SIMD etc.) Depending on the functions and iterators in use, mapreduce may be specialized to those and achieve a greater speedup than the fallback implementation can provide.

DNF · July 1, 2021, 11:27am

Fastest I’ve seen so far:

jl> using Tullio

jl> @tullio (max) maxval := abs(a[i] - b[i])

@mcabbott must there be an output variable inside the macro call? Could one hypothetically write

jl> maxval = @tullio (max) abs(a[i] - b[i])

?

(Oops: fixed missing abs)

mcabbott · July 1, 2021, 11:48am

For now there must be, although it could be changed I suppose. You can write @tullio (max) _ := abs(a[i] - b[i]) which at least saves thinking up a name.

What does a “fast implementation” mean in this context?

If you do @less mapreduce(abs∘-, max, a, b) you will see this:

mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
    reduce(op, map(f, A...); kw...)

i.e. it performs map which allocates the array, then reduce. I think this is a placeholder implementation, to allow the syntax now and make it fast later. It’s very much like maximum(abs.(a.-b)), while with one array maximum(abs, a) is mapreduce(abs, max, a) and this does not allocate abs.(a). The issue about the multiple-argument case is #38558.

Seif_Shebl · July 1, 2021, 4:00pm

Thank you all for your contributions, here I present some benchmarks for different alternatives. Tullio seems to be the only alternative with the same performance as explicit loops.

function max_abs(a,b)
    mx = 0.0
    for i in eachindex(a,b)
        tmp = abs(a[i]-b[i]) 
        tmp > mx && (mx = tmp)
    end 
    mx 
end 

N = 500
a = rand(N,N)
b = rand(N,N)

@btime max_abs($a,$b)                            # 188.500 μs (0 allocations: 0 bytes)
@btime maximum(abs(i-j) for (i,j) in zip($a,$b)) # 442.200 μs (0 allocations: 0 bytes)
@btime maximum(abs, $a - $b)                     # 568.800 μs (2 allocations: 1.91 MiB)
@btime maximum(Base.splat(abs∘-), zip($a, $b))   # 442.300 μs (0 allocations: 0 bytes)
@btime mapreduce(abs∘-, max, $a, $b)             # 572.900 μs (2 allocations: 1.91 MiB)
using Tullio
@btime @tullio (max) delta := abs($a[i] - $b[i]) # 188.800 μs (1 allocation: 16 bytes)
using LoopVectorization
@btime @tullio (max) delta := abs($a[i] - $b[i]) #  75.300 μs (1 allocation: 16 bytes)

DNF · July 1, 2021, 5:15pm

Not sure if it will make a difference, but I would use

mx = max(mx, tmp)

to avoid the branch, which could in principle prevent vectorization. It’s also clearer.

Seif_Shebl · July 1, 2021, 5:41pm

This is nicer undoubtedly, but the performance drops to the generator version, it needs a @fastmath to keep the same performance in this case.

genkuroki · July 2, 2021, 7:04am

An extended version of Seif Shebl’s excellent benchmark test

I tried the benchmark tests for N = 100, 200, 500, 1000, and 2000.

The results look different for each N.

Julia script:

# maxabstest.jl

using BenchmarkTools

N = isempty(ARGS) ? 500 : parse(Int, ARGS[1])

function max_abs(a, b)
    m = zero(promote_type(eltype(a), eltype(b)))
    for i in eachindex(a, b)
        tmp = abs(a[i] - b[i]) 
        tmp > m && (m = tmp)
    end 
    m
end

a = randn(N, N)
b = randn(N, N)

@show VERSION
@show Threads.nthreads()
println()
@show N
print("simple for loop:     ")
@btime max_abs($a, $b)
print("mapreduce abs∘- max: ")
@btime mapreduce(abs∘-, max, $a, $b)
print("maximum(abs, a - b): ")
@btime maximum(abs, $a - $b)
print("maximum(generator):  ")
@btime maximum(abs(i - j) for (i, j) in zip($a, $b))
print("maximum splat(abs∘-):")
@btime maximum(Base.splat(abs∘-), zip($a, $b))
using Tullio
print("Tullio:              ")
@btime @tullio (max) _ := abs($a[i] - $b[i])
print("Tullio (LoopVect.):  ")
using LoopVectorization
@btime @tullio (max) _ := abs($a[i] - $b[i])

function max_abs_turbo(a, b)
    m = zero(promote_type(eltype(a), eltype(b)))
    @turbo for i in eachindex(a, b)
        m = max(m, abs(a[i] - b[i]))
    end 
    m
end

function max_abs_tturbo(a, b)
    m = zero(promote_type(eltype(a), eltype(b)))
    @tturbo for i in eachindex(a, b)
        m = max(m, abs(a[i] - b[i]))
    end 
    m
end

print("LoopVect. @turbo:    ")
@btime max_abs_turbo($a, $b)
print("LoopVect. @tturbo:   ")
@btime max_abs_tturbo($a, $b)

Results for N = 100, 200, 500, 1000, 2000

$ julia maxabstest.jl 100
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 100
simple for loop:       8.733 μs (0 allocations: 0 bytes)
maximum(abs, a - b):   10.100 μs (2 allocations: 78.20 KiB)
maximum(generator):    18.100 μs (0 allocations: 0 bytes)
maximum splat(abs∘-):  18.100 μs (0 allocations: 0 bytes)
mapreduce abs∘- max:   11.300 μs (2 allocations: 78.20 KiB)
Tullio:                8.767 μs (1 allocation: 16 bytes)
Tullio (LoopVect.):    1.290 μs (1 allocation: 16 bytes)
LoopVect. @turbo:      1.240 μs (0 allocations: 0 bytes)
LoopVect. @tturbo:     722.689 ns (0 allocations: 0 bytes)

$ julia maxabstest.jl 200
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 200
simple for loop:       34.900 μs (0 allocations: 0 bytes)
maximum(abs, a - b):   34.500 μs (2 allocations: 312.58 KiB)
maximum(generator):    71.600 μs (0 allocations: 0 bytes)
maximum splat(abs∘-):  71.700 μs (0 allocations: 0 bytes)
mapreduce abs∘- max:   42.300 μs (2 allocations: 312.58 KiB)
Tullio:                34.900 μs (1 allocation: 16 bytes)
Tullio (LoopVect.):    8.400 μs (1 allocation: 16 bytes)
LoopVect. @turbo:      9.600 μs (0 allocations: 0 bytes)
LoopVect. @tturbo:     2.010 μs (0 allocations: 0 bytes)

$ julia maxabstest.jl 500
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 500
simple for loop:       213.400 μs (0 allocations: 0 bytes)
maximum(abs, a - b):   444.500 μs (2 allocations: 1.91 MiB)
maximum(generator):    447.000 μs (0 allocations: 0 bytes)
maximum splat(abs∘-):  447.100 μs (0 allocations: 0 bytes)
mapreduce abs∘- max:   454.600 μs (2 allocations: 1.91 MiB)
Tullio:                213.400 μs (1 allocation: 16 bytes)
Tullio (LoopVect.):    55.900 μs (1 allocation: 16 bytes)
LoopVect. @turbo:      56.100 μs (0 allocations: 0 bytes)
LoopVect. @tturbo:     12.700 μs (0 allocations: 0 bytes)

$ julia maxabstest.jl 1000
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 1000
simple for loop:       1.053 ms (0 allocations: 0 bytes)
mapreduce abs∘- max:   2.354 ms (2 allocations: 7.63 MiB)
maximum(abs, a - b):   2.240 ms (2 allocations: 7.63 MiB)
maximum(generator):    1.939 ms (0 allocations: 0 bytes)
maximum splat(abs∘-):  1.937 ms (0 allocations: 0 bytes)
Tullio:                263.000 μs (93 allocations: 4.73 KiB)
Tullio (LoopVect.):    257.600 μs (93 allocations: 4.73 KiB)
LoopVect. @turbo:      410.200 μs (0 allocations: 0 bytes)
LoopVect. @tturbo:     207.600 μs (0 allocations: 0 bytes)

$ julia maxabstest.jl 2000
VERSION = v"1.6.1"
Threads.nthreads() = 12

N = 2000
simple for loop:       4.193 ms (0 allocations: 0 bytes)
mapreduce abs∘- max:   10.075 ms (2 allocations: 30.52 MiB)
maximum(abs, a - b):   11.004 ms (2 allocations: 30.52 MiB)
maximum(generator):    7.903 ms (0 allocations: 0 bytes)
maximum splat(abs∘-):  7.896 ms (0 allocations: 0 bytes)
Tullio:                2.388 ms (199 allocations: 10.17 KiB)
Tullio (LoopVect.):    2.855 ms (200 allocations: 10.20 KiB)
LoopVect. @turbo:      2.540 ms (0 allocations: 0 bytes)
LoopVect. @tturbo:     2.275 ms (0 allocations: 0 bytes)

jzr · July 2, 2021, 7:36am

Just to mention another option Home · Transducers.jl supports map/reduce operations.

Topic		Replies	Views
Perf improvement suggestion for simple extrema Performance proposal	4	648	February 18, 2020
Maximum does not work on UInt General Usage	7	256	January 30, 2024
What is idiomatic way of vectorizing longer operations in Julia? Performance performance , loopvectorization	5	419	September 21, 2023
Parallel reductions Julia at Scale	23	5150	June 19, 2022
Julia Beginner (from Python): Numba outperforms Julia in rewrite. Any tips to improve performance? Performance benchmark , python , tullio , loopvectorization	56	6171	August 18, 2021

Is there a maximum(f, op, itrs...) in Julia?

Related topics