It seems something isn't working the way you describe. My kernel is quite a bit faster with two separate calls to sin and cos than with cis. Furthermore, when using cis there are references to double in the IR that aren't there when using sin and cos.
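For reference, here's roughly how I'm comparing the two (a minimal sketch, not my actual kernel — the names and launch configuration are illustrative, and I'm assuming CUDA.jl's `@device_code_llvm` for inspecting the generated IR):

```julia
using CUDA

# Variant 1: separate sin and cos calls (illustrative kernel).
function kernel_sincos!(out_s, out_c, x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        @inbounds out_s[i] = sin(x[i])
        @inbounds out_c[i] = cos(x[i])
    end
    return nothing
end

# Variant 2: a single cis call (illustrative kernel).
function kernel_cis!(out, x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        @inbounds out[i] = cis(x[i])
    end
    return nothing
end

x = CUDA.rand(Float32, 1024)
out = CUDA.zeros(ComplexF32, 1024)

# Dump the device LLVM IR; this is where the stray double references show up.
@device_code_llvm @cuda threads=256 blocks=4 kernel_cis!(out, x)
```

Grepping the IR dump of the cis variant for `double` is how I noticed the difference.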
I think I roughly understand what you're saying about range reduction, but it doesn't quite make sense to me. If I'm doing my computation in Float32, then I'm accepting the accuracy trade-off by choosing it. Transparently converting to higher precision and back behind the scenes seems bad, especially on co-processors like CUDA devices, where double-precision throughput is often much lower.
Interesting thought about writing my own kernel for sin and cos, but that seems like overkill, and it's a bit silly to have to do that just to match C++ performance.
Thanks for your thoughtful replies!