I have to run the function isapprox(A, A_, rtol=1e-4) (possibly) millions of times, where A and A_ are two matrices. To improve efficiency, I defined a custom isapprox() using a more efficient norm:
using LoopVectorization

function is_approx(A::Array{Float64,2}, A_::Array{Float64,2}, rtol::Float64)
    # similar to built-in isapprox() but faster
    return norm(A - A_) < rtol * max(norm(A), norm(A_))
end

function norm(A::Array{Float64,2})
    # fast L(2,2) norm
    # discussion: https://discourse.julialang.org/t/performance-of-norm-function/14709/13
    x = zero(eltype(A))
    @avx for i in eachindex(A)
        @fastmath x += A[i] * A[i]
    end
    @fastmath sqrt(x)
end
If you want, you can find more context about what I use this for here.
Despite this being considerably more efficient than isapprox(), is_approx() is still the main performance bottleneck of my code. Do you have any ideas on how to make is_approx() more efficient?
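For reference, comparing against the built-in can be done with something like this (a rough sketch using BenchmarkTools; the 10×10 matrices are just an example size, matching the profiling snippet below):

using BenchmarkTools

A, A_ = rand(10, 10), rand(10, 10)

@btime isapprox($A, $A_, rtol=1e-4)   # built-in, norm-based comparison
@btime is_approx($A, $A_, 1e-4)       # the custom version above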
If it is relevant, profiling the function,
using StatProfilerHTML
A, A_ = rand(10,10), rand(10,10)
@profilehtml for i in 1:10^6 is_approx(A,A_,1e-4) end
shows that is_approx calls norm three times (for A - A_, A, and A_).
Fusing the norm computation into is_approx lets you calculate the three norms together in a single pass (keeping three accumulators for the relevant squared sums). That reduces memory access by roughly 3x and avoids materializing A - A_, which should lead to a significant performance improvement.
Thanks for helping. This seems to make it just slightly faster.
function is_approx2(A::Array{Float64,2}, A_::Array{Float64,2}, rtol::Float64)
    # similar to built-in isapprox() but faster; compares squared norms to avoid sqrt
    return norm2(A - A_) < abs2(rtol) * max(norm2(A), norm2(A_))
end

function norm2(A::Array{Float64,2})
    # fast squared L(2,2) norm (no sqrt)
    # discussion: https://discourse.julialang.org/t/performance-of-norm-function/14709/13
    x = zero(eltype(A))
    @avx for i in eachindex(A)
        @fastmath x += A[i] * A[i]
    end
    return x
end
@btime is_approx($A,$A_,$1e-4)
96.901 ns (1 allocation: 896 bytes)
@btime is_approx2($A,$A_,$1e-4)
93.640 ns (1 allocation: 896 bytes)
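For what it’s worth, the single allocation in both timings should just be the temporary array created by A - A_, which is exactly what fusing the loop avoids. A quick way to see it (a sketch, reusing A and A_ from the profiling snippet):

# the A - A_ subtraction materializes a new 10×10 matrix on every call,
# which accounts for the one allocation reported by @btime above
@btime $A - $A_;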
function is_approx3(A::Array{Float64,2}, A_::Array{Float64,2}, rtol::Float64)
    # assumes A and A_ have the same size; add a check here if needed
    s1, s2, s3 = 0.0, 0.0, 0.0
    for i in eachindex(A)
        a, a_ = A[i], A_[i]
        s1 += (a - a_)^2
        s2 += a^2
        s3 += a_^2
    end
    return s1 < rtol^2 * max(s2, s3)
end
LoopVectorization.@avx is the old name for LoopVectorization.@turbo. I thought it changed several years ago. If @avx is still around, I suggest you use the newer, official name, @turbo.
I’m curious: can you share some information about your application? It’s pretty rare for isapprox to be a performance-critical function. Even if you’re using it as a convergence test in an iterative solver of some kind, other parts of the iteration will usually dominate.
Yes, use @turbo. @turbo replaced @avx early in the 0.12 series, and I haven’t made a breaking release since then, which is why @avx still works. We’re now at 0.12.162.
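For completeness, here is a sketch that combines the two suggestions above, the fused single-pass loop of is_approx3 together with @turbo (the name is_approx4 is mine, and like is_approx3 it assumes A and A_ have the same size):

using LoopVectorization

function is_approx4(A::Array{Float64,2}, A_::Array{Float64,2}, rtol::Float64)
    # one fused pass over both matrices, with three accumulators:
    # squared norm of the difference, of A, and of A_
    s1, s2, s3 = 0.0, 0.0, 0.0
    @turbo for i in eachindex(A)
        a = A[i]
        a_ = A_[i]
        d = a - a_
        s1 += d * d
        s2 += a * a
        s3 += a_ * a_
    end
    return s1 < rtol^2 * max(s2, s3)
end

With the loop fused there is no temporary array, so the call should be allocation-free; @btime is_approx4($A, $A_, 1e-4) on your own matrix sizes is an easy way to check.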