# Faster isapprox()

I need to run the function isapprox(A, A_, rtol=1e-4) possibly millions of times, where A and A_ are two matrices. To improve efficiency, I defined a custom is_approx() function using a more efficient norm:

``````
using LoopVectorization

function is_approx(A::Array{Float64,2}, A_::Array{Float64,2}, rtol::Float64)
    # similar to built-in isapprox() but faster
    return norm(A - A_) < rtol * max(norm(A), norm(A_))
end

function norm(A::Array{Float64,2})
    # fast L(2,2) norm
    # discussion: https://discourse.julialang.org/t/performance-of-norm-function/14709/13
    x = zero(eltype(A))
    @avx for i in eachindex(A)
        @fastmath x += A[i] * A[i]
    end
    @fastmath sqrt(x)
end
``````

If you want, you can find more context about what I use this for here.

Despite this being considerably more efficient than isapprox(), is_approx() is still the main performance bottleneck of my code. Do you have any ideas on how to make is_approx() more efficient?

If it is relevant, profiling the function,

``````
using StatProfilerHTML
A, A_ = rand(10, 10), rand(10, 10)
@profilehtml for i in 1:10^6 is_approx(A, A_, 1e-4) end
``````

gives the following report:

`is_approx` calls `norm` three times (on `A - A_`, `A`, and `A_`).
Fusing `norm` into `is_approx` lets all three norms be computed in a single pass (keeping three accumulators for the relevant squared sums), which cuts memory traffic by 3x and should give a significant performance improvement.


What does fusing `norm` with `is_approx` mean? Would you be able to propose an example for your solution? Thanks for your help!

Given that you are comparing norms, can you skip all the sqrts and square your rtol instead?
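For example, since both sides of the comparison are nonnegative, squaring preserves the result, so the `sqrt`s can be dropped. A quick sanity check (`sumsq` is just a helper name for this sketch):

```julia
# Both sides of the comparison are nonnegative, so squaring preserves it:
# sqrt(u) < rtol * sqrt(v)  ⟺  u < rtol^2 * v.
sumsq(A) = sum(abs2, A)   # squared Frobenius norm, no sqrt

A, B = rand(10, 10), rand(10, 10)
rtol = 1e-4
with_sqrt = sqrt(sumsq(A - B)) < rtol * max(sqrt(sumsq(A)), sqrt(sumsq(B)))
without   = sumsq(A - B) < rtol^2 * max(sumsq(A), sumsq(B))
@assert with_sqrt == without
```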


In particular, you should skip computing `A - A_`, which also allocates memory (slow). This calculation should not allocate at all.
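You can see the allocation directly with Base's `@allocated` (a small sketch; `f` is just an illustrative helper):

```julia
# A - A_ materializes a brand-new matrix on every call; a fused loop avoids this.
A, A_ = rand(10, 10), rand(10, 10)
f(A, A_) = A - A_
f(A, A_)                      # warm up so compilation is not measured
bytes = @allocated f(A, A_)
@assert bytes > 0             # the subtraction allocates a fresh 10×10 matrix
```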


Thanks for helping. This seems to make it just slightly faster.

``````
function is_approx2(A::Array{Float64,2}, A_::Array{Float64,2}, rtol::Float64)
    # similar to built-in isapprox() but faster
    return norm2(A - A_) < abs2(rtol) * max(norm2(A), norm2(A_))
end

function norm2(A::Array{Float64,2})
    # fast squared L(2,2) norm
    # discussion: https://discourse.julialang.org/t/performance-of-norm-function/14709/13
    x = zero(eltype(A))
    @avx for i in eachindex(A)
        @fastmath x += A[i] * A[i]
    end
    return x
end

@btime is_approx($A, $A_, $1e-4)
  96.901 ns (1 allocation: 896 bytes)

@btime is_approx2($A, $A_, $1e-4)
  93.640 ns (1 allocation: 896 bytes)
``````

The main bottleneck of your code is allocation when you create the new matrix `A - A_`.

Something efficient might look like:

``````
function is_approx3(A::Array{Float64,2}, A_::Array{Float64,2}, rtol::Float64)
    # make sure sizes of A and A_ are the same here (?)
    s1, s2, s3 = 0.0, 0.0, 0.0
    for i in eachindex(A)
        a, a_ = A[i], A_[i]
        s1 += (a - a_)^2
        s2 += a^2
        s3 += a_^2
    end
    return s1 < rtol^2 * max(s2, s3)
end
``````

With LoopVectorization:

``````
julia> function is_approx3(A::Matrix{T1}, B::Matrix{T2}, rtol) where {T1,T2}
           normA = zero(T1)
           normB = zero(T2)
           normdiff = zero(promote_type(T1, T2))
           @avx for i in eachindex(A, B)
               a, b = A[i], B[i]
               normA += abs2(a)
               normB += abs2(b)
               normdiff += abs2(a - b)
           end
           return normdiff < abs2(rtol) * max(normA, normB)
       end
is_approx3 (generic function with 1 method)

julia> A, B = rand(10, 10), rand(10, 10);

julia> @btime is_approx($A, $B, $1e-4);
  147.392 ns (1 allocation: 896 bytes)

julia> @btime is_approx2($A, $B, $1e-4);
  144.150 ns (1 allocation: 896 bytes)

julia> @btime is_approx3($A, $B, $1e-4);
  17.524 ns (0 allocations: 0 bytes)
``````

Note that `eachindex(A, B)` makes sure that the sizes are compatible.
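A quick illustration of that size check (with made-up dimensions):

```julia
# eachindex over multiple arrays throws when their shapes are incompatible,
# so the fused loop can never silently read out of bounds.
A, C = rand(10, 10), rand(10, 9)
threw = try
    eachindex(A, C)
    false
catch err
    err isa DimensionMismatch
end
@assert threw   # mismatched shapes are rejected before the loop runs
```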


Your solution, together with `@avx`, makes it 5x faster!

``````
function is_approx3(A::Array{Float64,2}, A_::Array{Float64,2}, rtol::Float64)
    s1, s2, s3 = 0.0, 0.0, 0.0
    @avx for i in eachindex(A)
        a, a_ = A[i], A_[i]
        s1 += (a - a_)^2
        s2 += a^2
        s3 += a_^2
    end
    return s1 < rtol^2 * max(s2, s3)
end

@btime is_approx3($A, $A_, $1e-4)
  18.829 ns (0 allocations: 0 bytes)
``````

`LoopVectorization.@avx` is the old name for `LoopVectorization.@turbo`. I thought it changed several years ago. If `@avx` is still around, I suggest you use the newer, official name, `@turbo`.
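For reference, the fused kernel above with the current macro name would look like this (a sketch; `is_approx_turbo` is a made-up name, and the semantics are identical to the `@avx` version):

```julia
using LoopVectorization

function is_approx_turbo(A::Matrix{Float64}, B::Matrix{Float64}, rtol::Float64)
    s1 = s2 = s3 = 0.0
    @turbo for i in eachindex(A, B)   # @turbo is the current name of @avx
        a, b = A[i], B[i]
        s1 += (a - b)^2               # squared norm of the difference
        s2 += a^2                     # squared norm of A
        s3 += b^2                     # squared norm of B
    end
    return s1 < rtol^2 * max(s2, s3)
end
```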


I’m curious if you can share information about your application? It’s pretty rare for `isapprox` to be a performance-critical function. Even if you’re using it as a convergence test in an iterative solver of some kind, usually there are other parts of your iteration that will dominate.


Thanks for letting me know!

Sure! I wrote the code example in the link specifically for this question.

Yes, use `@turbo`. I haven’t made a breaking release since then, which is why `@avx` still works.
`@turbo` replaced `@avx` early in the 0.12 series. We’re now at 0.12.162.