CUDA.jl kernel is half as fast as C++ kernel

Base.Math.rem_pio2_kernel(::Float32), which I believe is used by all trigonometric functions for range reduction, uses Float64 for some of its computation. So this might be the cause of the conversions you’re seeing.

This means two things (if I’m right):

  1. you’d be better off using cis instead of separate sin and cos calls, because then you only pay for a single range reduction instead of one each for sin and cos (see the sketch after this list). This is partly why sincos and cis exist in the first place.
  2. Range reduction can be implemented in various ways (it’s even possible to do it with integer arithmetic, as far as I remember), but in all cases there’s a trade-off between accuracy and speed, and standard library code must always prefer accuracy. My point is: you might be able to speed up your code by taking care of range reduction yourself. For some problems range reduction can even be eliminated, so it might be worth your time to think about it a bit: what do you actually require of range reduction (i.e. what are the domains of your arguments), and which implementation strategy (if any is needed at all) is best suited for your hardware?
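To make point 1 concrete, here is a rough sketch of what the two variants might look like as CUDA.jl kernels. The kernel and array names (twiddle_two_calls!, twiddle_cis!, phases, out) are made up for illustration, I’m assuming your kernel really does need both the sine and the cosine of the same argument, and I haven’t checked whether cis ends up in the libdevice sincos or in Base’s implementation on the device, so treat this as a starting point rather than a benchmarked answer:

```julia
using CUDA

# Two separate calls: sin and cos each pay for their own range reduction.
function twiddle_two_calls!(out, phases)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(phases)
        x = phases[i]
        out[i] = ComplexF32(cos(x), sin(x))
    end
    return nothing
end

# A single cis call: sin and cos share one range reduction via sincos.
function twiddle_cis!(out, phases)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(phases)
        out[i] = cis(phases[i])
    end
    return nothing
end

phases = CUDA.rand(Float32, 1024)
out    = CUDA.zeros(ComplexF32, 1024)
@cuda threads=256 blocks=cld(length(phases), 256) twiddle_cis!(out, phases)
```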

PS: if you do end up taking care of the range reduction yourself, it might also be fun to fit your own polynomials (with Float32 coefficients) for computing range-reduced sin and cos to the required accuracy. Then you could base your sin_kernel and cos_kernel (or even a sincos_kernel directly) on those polynomials, along the lines of the sketch below.
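Just to illustrate the shape of such a thing (only a sketch, nothing I’ve tuned or benchmarked): assuming your arguments are modest in magnitude, a single-constant reduction to roughly [-π/4, π/4] plus short polynomials and the usual quadrant fix-up could look as follows. All the names (reduce_pio2, sin_poly, cos_poly, my_sincos) are made up, the coefficients are just truncated Taylor series rather than proper minimax fits, and the reduction loses accuracy as the multiple of π/2 grows, which is exactly why Base splits π/2 into several pieces:

```julia
# Crude single-constant reduction: fine for modest |x|, inaccurate for large |x|.
@inline function reduce_pio2(x::Float32)
    n = round(x * 0.63661975f0)        # nearest multiple of π/2 (0.63661975 ≈ 2/π)
    r = muladd(-n, 1.5707964f0, x)     # x - n·π/2, roughly in [-π/4, π/4]
    return r, Int32(n) & Int32(3)      # reduced argument and quadrant
end

# Horner evaluation of x - x³/6 + x⁵/120 - x⁷/5040 (truncated Taylor series).
@inline function sin_poly(x::Float32)
    x2 = x * x
    p = muladd(x2, -1.9841270f-4, 8.3333333f-3)   # -x²/5040 + 1/120
    p = muladd(x2, p, -0.16666667f0)
    return x * muladd(x2, p, 1.0f0)
end

# Horner evaluation of 1 - x²/2 + x⁴/24 - x⁶/720 (truncated Taylor series).
@inline function cos_poly(x::Float32)
    x2 = x * x
    p = muladd(x2, -1.3888889f-3, 4.1666667f-2)   # -x²/720 + 1/24
    p = muladd(x2, p, -0.5f0)
    return muladd(x2, p, 1.0f0)
end

# Quadrant fix-up for a sincos-style kernel.
@inline function my_sincos(x::Float32)
    r, q = reduce_pio2(x)
    s, c = sin_poly(r), cos_poly(r)
    q == 0 && return (s, c)
    q == 1 && return (c, -s)
    q == 2 && return (-s, -c)
    return (-c, s)
end
```

Whether something like this actually beats the library calls on your GPU (and whether the accuracy is acceptable) is something only a benchmark against your real kernel can tell you.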
