Explain Performance of a Ylm Code

The code below evaluates the standard complex spherical harmonics for a batch of inputs given in terms of cos, sin and the associated legendre functions (which are already precomputed - in the “real” scenario they would be computed first). I assumed that the inner-most loop over the batch of inputs could be made very fast, but not so:

(1) if I use LoopVectorizations and annotate the for i loops with @avx I get

┌ Warning: #= /Users/ortner/gits/Polynomials4ML.jl/profile/temp.jl:37 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ Main ~/.julia/packages/LoopVectorization/k62bU/src/condense_loopset.jl:985

Why is that? Is that related to the complex product? It seems that if I rewrite it in terms of real operations and real arrays only then @avx can indeed be used by improves the performance only marginally.

(2) Profiling the code suggests that abou 1/2 to 2/3 of the time is spent in iterate, checking where the i-loop has finished (see screenshot). What is the explanation and can it be prevented? It seems this should not be the bottleneck?

Thank you.

Code:

using BenchmarkTools

sizeY(maxL) = (maxL + 1) * (maxL + 1)
index_y(l, m) = m + l + (l*l) + 1
sizeP(maxL) = div((maxL + 1) * (maxL + 2), 2)
index_p(l, m) = m + div(l*(l+1), 2) + 1


function _cYlm!(Y, L, cosθ, sinθ, P, t)
	nS = length(cosθ)
	@assert size(P) == (nS, sizeP(L))
	@assert size(Y) == (nS, sizeY(L))
   @assert length(sinθ) == length(t) == nS 

   fill!(t, 1 / sqrt(2) + im * 0)

	@inbounds begin 
		for l = 0:L 
			i_yl0 = index_y(l, 0)
			i_pl0 = index_p(l, 0)
			for i = 1:nS
				Y[i, i_yl0] = P[i, i_pl0] * t[i] 
			end
		end

		sig = 1
		for m in 1:L
			sig *= -1
		   for i = 1:nS
				t[i] *= cosθ[i] + im * sinθ[i]
			end

		   for l in m:L
				i_plm = index_p(l,m)
				i_ylm⁺ = index_y(l,  m)
				i_ylm⁻ = index_y(l, -m)
				for i = 1:nS
					p = P[i, i_plm]
					Y[i, i_ylm⁻] = (sig * p) * conj(t[i])
					Y[i, i_ylm⁺] = t[i] * p  
				end
			end
		end
	end 

	return Y
end

##

L = 8
nX = 1024 
cosθ = [ cos(π * rand()) for _=1:nX ]   # input 
sinθ = [ sin(π * rand()) for _=1:nX ] 
t = zeros(ComplexF64, nX)   # temporary array 
P = rand(nX, sizeP(L)) # in the real code, these are the ALPs
Y = zeros(ComplexF64, nX, sizeY(L))   # allocate output

##

@btime _cYlm!($Y, $L, $cosθ, $sinθ, $P, $t)

## 

@profview let Y= Y, L=L, cosθ=cosθ, sinθ=sinθ, P=P, t =t 
   for n = 1:50_000 
      _cYlm!(Y, L, cosθ, sinθ, P, t)
   end
end

It should improve the performance for code calling sin and cos by far more than “marginally”.
Have a reproducer?

What’s your architecture?
LoopVectorization.jl doesn’t work that well on the M1.

LoopVectorization.jl works at a very high level of abstraction, so it only works on code already using primitive types.
The rewrite will work at a much lower level (i.e. LLVM), so it’ll work much better on code using abstractions like complex or dual numbers (or w/e arbitrary Julia types).

3 Likes

Thank you for the respOnes.

Reproducer?

Indeed I have an M1 - I should have mentioned. I will test on other architectures

Re rewrite : is there a timescale?

Of @turbo not helping much.
I’d of course suggest sincos over sin or cos.
The warning also only shows up once, so I was wondering if maybe the possible is still there and you just missed it.

Maybe a year? You can track progress here: GitHub - JuliaSIMD/LoopModels
Dependence analysis is mostly there. I will start cost modeling soon.
Other things it needs:

  • LLVM parsing to build its internal representation.
  • code generation/emitting LLVM based on it’s decision.
  • A Julia front end.

I’m able to spend about 2 days a week on this, but pushing for more/spending a lot of free time on this.

4 Likes

Thank you - I will reproduce and post. Note I don’t actually call sincos- this is in an earlier stage of the computation, a preconputation if you will

Here is now the real-ified version. Without avx it actually runs a bit slower (~54us vs 37us) with avx it runs a bit faster (~ 31us) so the speedup over the un-avxed real code is not bad though not quite as significant as I had hoped. The speedup against the complex code is moderate.

In all of this what still bugs me most is that the bottleneck seems to be iterate and not the arithmetic!

using BenchmarkTools, LoopVectorization

sizeY(maxL) = (maxL + 1) * (maxL + 1)
index_y(l, m) = m + l + (l*l) + 1
sizeP(maxL) = div((maxL + 1) * (maxL + 2), 2)
index_p(l, m) = m + div(l*(l+1), 2) + 1


function _cYlm!(Y, L, cosθ, sinθ, P, t)
	nS = length(cosθ)
	@assert size(P) == (nS, sizeP(L))
	@assert size(Y) == (nS, sizeY(L))
   @assert length(sinθ) == length(t) == nS 

   fill!(t, 1 / sqrt(2) + im * 0)

	@inbounds begin 
		for l = 0:L 
			i_yl0 = index_y(l, 0)
			i_pl0 = index_p(l, 0)
			for i = 1:nS
				Y[i, i_yl0] = P[i, i_pl0] * t[i] 
			end
		end

		sig = 1
		for m in 1:L
			sig *= -1
		   for i = 1:nS
				t[i] *= cosθ[i] + im * sinθ[i]
			end

		   for l in m:L
				i_plm = index_p(l,m)
				i_ylm⁺ = index_y(l,  m)
				i_ylm⁻ = index_y(l, -m)
				for i = 1:nS
					p = P[i, i_plm]
					Y[i, i_ylm⁻] = (sig * p) * conj(t[i])
					Y[i, i_ylm⁺] = t[i] * p  
				end
			end
		end
	end 

	return Y
end


function _rYlm!(Yr, Yi, L, cosθ, sinθ, P, tr, ti)
	nS = length(cosθ)
	@assert size(P) == (nS, sizeP(L))
	@assert size(Yr) == size(Yi) == (nS, sizeY(L))
   @assert length(sinθ) == length(tr) == length(ti) == nS 

   fill!(t, 1 / sqrt(2) + im * 0)

	@inbounds begin 
		for l = 0:L 
			i_yl0 = index_y(l, 0)
			i_pl0 = index_p(l, 0)
			@avx for i = 1:nS
				Yr[i, i_yl0] = P[i, i_pl0] * tr[i] 
				Yi[i, i_yl0] = P[i, i_pl0] * ti[i] 
			end
		end

		sig = 1
		for m in 1:L
			sig *= -1
		   @avx for i = 1:nS
				tr[i] = tr[i] * cosθ[i] - ti[i] * sinθ[i]
            ti[i] = ti[i] * cosθ[i] + tr[i] * sinθ[i]
			end

		   for l in m:L
				i_plm = index_p(l,m)
				i_ylm⁺ = index_y(l,  m)
				i_ylm⁻ = index_y(l, -m)
				@avx for i = 1:nS
					p = P[i, i_plm]
					Yr[i, i_ylm⁻] = (sig * p) * tr[i] # conj(t[i])
					Yi[i, i_ylm⁻] = - (sig * p) * ti[i] # conj(t[i])
					Yr[i, i_ylm⁺] = tr[i] * p  
					Yi[i, i_ylm⁺] = ti[i] * p  
				end
			end
		end
	end 

	return Y
end

##

L = 8
nX = 1024 
cosθ = [ cos(π * rand()) for _=1:nX ]   # input 
sinθ = [ sin(π * rand()) for _=1:nX ] 
t = zeros(ComplexF64, nX)   # temporary array 
tr = real.(t) 
ti = imag.(t)
P = rand(nX, sizeP(L)) # in the real code, these are the ALPs
Y = zeros(ComplexF64, nX, sizeY(L))   # allocate output
Yr = real.(Y)
Yi = imag.(Y)


##

@btime _cYlm!($Y, $L, $cosθ, $sinθ, $P, $t)
# 37.541 μs (0 allocations: 0 bytes)
  
@btime _rYlm!($Yr, $Yi, $L, $cosθ, $sinθ, $P, $tr, $ti)
# 31.042 μs (0 allocations: 0 bytes)
#   ~ ca 54 us without avx. 

##

# with view into the original array 
Yre = reinterpret(Float64, Y)
Yre_r = @view Yre[1:2:end, :]
Yre_i = @view Yre[2:2:end, :]

@btime _rYlm!($Yre_r, $Yre_i, $L, $cosθ, $sinθ, $P, $tr, $ti)
# 43.333 μs (0 allocations: 0 bytes)


## 

# @profview let Y= Y, L=L, cosθ=cosθ, sinθ=sinθ, P=P, t =t 
#    for n = 1:50_000 
#       _cYlm!(Y, L, cosθ, sinθ, P, t)
#    end
# end

# @profview let Yr = Yr, Yi = Yi, L = L, cosθ = cosθ, sinθ = sinθ, P = P, tr = tr, ti = ti
#    for n = 1:66_000 
#       _rYlm!(Yr, Yi, L, cosθ, sinθ, P, tr, ti)
#    end
# end

Same benchmark on an AMD EPYC:

[ Info: Complex Code
  54.522 μs (0 allocations: 0 bytes)
[ Info: Real Code - no avx 
  54.933 μs (1 allocation: 32 bytes)
[ Info: Real Code - with avx 
  29.611 μs (1 allocation: 32 bytes)