CUDA.jl kernel is half as fast as C++ kernel

Base.Math.rem_pio2_kernel(::Float32), which I believe is used by all trigonometric functions for range reduction, uses Float64 for some of its computation. So this might be the cause of the conversions you’re seeing.

This means two things (if I’m right):

  1. you’d be better off using cis instead of separate sin and cos calls, because then you only pay for a single range reduction instead of one each for sin and cos (see the sketch after this list). This is partly why sincos and cis exist in the first place.
  2. Range reduction can be implemented in various ways (it’s even possible to do it with integer arithmetic, as far as I remember), but in all cases there’s a trade-off between accuracy and speed, and standard library code must always prefer accuracy. My point is: you might be able to speed up your code by taking care of range reduction yourself. For some problems range reduction can even be eliminated, so it might be worth your time to think about it a bit: what do you actually require of range reduction (i.e. what are the domains of your arguments), and which implementation strategy (if any is needed at all) is best suited for your hardware?
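To make point 1 concrete, here is a rough sketch of what the two variants might look like as CUDA.jl kernels. The kernel and array names (twiddle_two_calls!, twiddle_cis!, phases, out) are made up for illustration, I’m assuming your kernel really does need both the sine and the cosine of the same argument, and I haven’t checked whether cis ends up in the libdevice sincos or in Base’s implementation on the device, so treat this as a starting point rather than a benchmarked answer:

```julia
using CUDA

# Two separate calls: sin and cos each pay for their own range reduction.
function twiddle_two_calls!(out, phases)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(phases)
        x = phases[i]
        out[i] = ComplexF32(cos(x), sin(x))
    end
    return nothing
end

# A single cis call: sin and cos share one range reduction via sincos.
function twiddle_cis!(out, phases)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(phases)
        out[i] = cis(phases[i])
    end
    return nothing
end

phases = CUDA.rand(Float32, 1024)
out    = CUDA.zeros(ComplexF32, 1024)
@cuda threads=256 blocks=cld(length(phases), 256) twiddle_cis!(out, phases)
```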

PS: if you do end up taking care of the range reduction yourself, it might also be fun to fit your own polynomials (with Float32 coefficients) for computing range-reduced sin and cos to the required accuracy. Then you could base your sin_kernel and cos_kernel (or even a sincos_kernel directly) on those polynomials, along the lines of the sketch below.
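Just to illustrate the shape of such a thing (only a sketch, nothing I’ve tuned or benchmarked): assuming your arguments are modest in magnitude, a single-constant reduction to roughly [-π/4, π/4] plus short polynomials and the usual quadrant fix-up could look as follows. All the names (reduce_pio2, sin_poly, cos_poly, my_sincos) are made up, the coefficients are just truncated Taylor series rather than proper minimax fits, and the reduction loses accuracy as the multiple of π/2 grows, which is exactly why Base splits π/2 into several pieces:

```julia
# Crude single-constant reduction: fine for modest |x|, inaccurate for large |x|.
@inline function reduce_pio2(x::Float32)
    n = round(x * 0.63661975f0)        # nearest multiple of π/2 (0.63661975 ≈ 2/π)
    r = muladd(-n, 1.5707964f0, x)     # x - n·π/2, roughly in [-π/4, π/4]
    return r, Int32(n) & Int32(3)      # reduced argument and quadrant
end

# Horner evaluation of x - x³/6 + x⁵/120 - x⁷/5040 (truncated Taylor series).
@inline function sin_poly(x::Float32)
    x2 = x * x
    p = muladd(x2, -1.9841270f-4, 8.3333333f-3)   # -x²/5040 + 1/120
    p = muladd(x2, p, -0.16666667f0)
    return x * muladd(x2, p, 1.0f0)
end

# Horner evaluation of 1 - x²/2 + x⁴/24 - x⁶/720 (truncated Taylor series).
@inline function cos_poly(x::Float32)
    x2 = x * x
    p = muladd(x2, -1.3888889f-3, 4.1666667f-2)   # -x²/720 + 1/24
    p = muladd(x2, p, -0.5f0)
    return muladd(x2, p, 1.0f0)
end

# Quadrant fix-up for a sincos-style kernel.
@inline function my_sincos(x::Float32)
    r, q = reduce_pio2(x)
    s, c = sin_poly(r), cos_poly(r)
    q == 0 && return (s, c)
    q == 1 && return (c, -s)
    q == 2 && return (-s, -c)
    return (-c, s)
end
```

Whether something like this actually beats the library calls on your GPU (and whether the accuracy is acceptable) is something only a benchmark against your real kernel can tell you.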
