CUDA.jl kernel is half as fast as C++ kernel

Hi Maleadt. I looked at the notebook you sent me, and there is indeed an exception, though I don’t understand why. It has to do with the trunc(Int32, range_bin_float) line.

Here is the output:

; ┌ @ float.jl:781 within `trunc`
   call fastcc void @ijl_box_float32(float %106)
   call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception130 to i64))
   call fastcc void @gpu_signal_exception([1 x i64] %state)
   call void asm sideeffect "exit;", ""() #3
   unreachable

I want to take the Float32 value and truncate it to an integer. Is there a different way to do this that avoids the exception branch?
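
To make the question concrete, here is a minimal CPU-side sketch of the conversion I mean. The value assigned to range_bin_float is a made-up stand-in, not what my kernel actually computes. As far as I can tell, the checked trunc throws an InexactError when the input is NaN, Inf, or outside the Int32 range, which may be where the exception branch comes from. Base also has unsafe_trunc, which skips that check; I am only assuming it is appropriate inside a kernel, so please correct me if not.

```julia
# Hypothetical stand-in for the value computed in the kernel.
range_bin_float = 3.7f0

# Checked conversion: truncates toward zero, but throws an InexactError
# if the value is NaN, Inf, or does not fit in an Int32.
checked = trunc(Int32, range_bin_float)

# Unchecked conversion from Base: same result for in-range inputs.
# For NaN or out-of-range inputs the returned value is arbitrary.
unchecked = unsafe_trunc(Int32, range_bin_float)

# The checked version rejects values that cannot be represented.
threw = try
    trunc(Int32, NaN32)
    false
catch err
    err isa InexactError
end

println((checked, unchecked, threw))
```

If the unchecked version is the way to go, I would have to make sure myself that the input is always finite and within range.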