Continuous callback does not work correctly with Ensemble GPUArray()

Hello, I am solving a central potential problem with EnsembleGPUArray() for multiple initial conditions, when I add the ContinuousCallback the first 800 trajectories ignore the affect!(integrator) and the last 200 do the cut correctly. If I solve for 10_000 trajectories, the first 8000 trajectories ignore the cut and the last 2000 do it correctly, similarly for 1_000_000 trajectories. I did several tests but I didn’t find what the problem is.

The code is the following

using CUDA
using DiffEqGPU
using NPZ
using OrdinaryDiffEq
using Plots
using PlotlyBase
using PlotlyKaleido

plotly()
Plots.PlotlyBackend()
CUDA.device!(1)

CuDevice(1): NVIDIA A30

"""
     pot_central!(du,u,p,t)
     u=[x,dx,y,dy,# of steps]
     p=[k,m,dt]
"""
function pot_central!(du,u,p,t)
        r3 = ( u[1]^2 + u[3]^2 )^(3/2)
     du[1] = u[2]                                     # u[2]= dx
     du[2] = -( p[1]*u[1] ) / ( p[2]*r3 )   
     du[3] = u[4]                                     # u[4]= dy
     du[4] = -( p[1]*u[3] ) / ( p[2]*r3 )
     du[5] = 1/p[3]
end

T=100.0
trajectories=1_000
dt=0.01

u_rand = convert(Array{Float64},npzread("IO_GPU/IO_u0.npy"))
    u0 = [2.0, 2.0, 1.0, 1.5,dt]
     p = [1.0,1.0,dt]               #[k,m,dt]
 tspan = [0.0,T]
    R2 = [4.5,5_000.0]              # R2=[Rmin2,Rmax2]

prob= ODEProblem{true}(pot_central!,u0,tspan,p)
prob_func= (prob,i,repeat) -> remake(prob,u0 = [u_rand[i,:];dt].*u0 + [1.0,1.0,1.0,1.0,dt])
Ensemble_Problem=EnsembleProblem(prob,prob_func=prob_func,safetycopy=false) 

function condition(u,t,integrator)
        r2 = u[1]*u[1] + u[3]*u[3]
        (R2[2] - r2)*(r2 - R2[1])
end

affect!(integrator) = terminate!(integrator)
gpu_cb =  ContinuousCallback(condition, affect!;save_positions=(true,false),rootfind=true,interp_points=0,abstol=1e-7,reltol=0)

ContinuousCallback{typeof(condition), typeof(affect!), typeof(affect!), typeof(SciMLBase.INITIALIZE_DEFAULT), typeof(SciMLBase.FINALIZE_DEFAULT), Float64, Int64, Rational{Int64}, Nothing, Int64}(condition, affect!, affect!, SciMLBase.INITIALIZE_DEFAULT, SciMLBase.FINALIZE_DEFAULT, nothing, SciMLBase.LeftRootFind, 0, Bool[1, 0], 1, 1.0e-7, 0, 1//100)

CUDA.@time sol= solve(Ensemble_Problem,
                                Tsit5(),
                                EnsembleGPUArray(),
                                dt=0.01,
                                trajectories=trajectories,
                                #batch_size=10_000,
                                callback = gpu_cb,
                                adaptive=false,
                                save_everystep=false
                                ) 

0.067974 seconds (780.84 k CPU allocations: 36.133 MiB) (52 GPU allocations: 551.062 KiB, 0.66% memmgmt time)
EnsembleSolution Solution of length 1000 with uType:
ODESolution{Float64, 2, uType, Nothing, Nothing, Vector{Float64}, rateType, P, Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, IType, DiffEqBase.DEStats, Nothing} where {uType, rateType, P, IType}

The solutions for some trajectories are as follows

sol[1]

retcode: Success
Interpolation: 1st order linear
t: 2-element Vector{Float64}:
0.0
100.0
u: 2-element Vector{SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}:
[1.6916643844888346, 1.3952481788899702, 1.4257929742644806, 2.3254414019025376, 0.0101]
[130.3632437338083, 1.283605087637319, 221.2362706355311, 2.193291036113921, 10000.010099999996]

sol[800]

retcode: Success
Interpolation: 1st order linear
t: 2-element Vector{Float64}:
0.0
100.0
u: 2-element Vector{SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}:
[2.3286232359143177, 2.0849960853408054, 1.5481395680210261, 1.0878883702422468, 0.0101]
[198.03489179352627, 1.9510332210035648, 102.76307791913189, 1.008911077728196, 10000.010099999996]

sol[801]

retcode: Terminated
Interpolation: 1st order linear
t: 2-element Vector{Float64}:
0.0
28.22912660730258
u: 2-element Vector{Vector{Float64}}:
[1.3698092340031855, 2.018364836988968, 1.9808709009626209, 1.5912143969170325, 0.0101]
[55.503066919039185, 1.9067772301987758, 43.81106666791687, 1.4723417477028182, 2822.922760730258]

This histogram shows the distances r^2 for 1_000_000 trajectories.

callback_trajectories_r2

The first column is the trajectories that end at Rmin2=4.5, the second at Rmax2=5_000 and the rest those that ignore the Callback. This structure is repeated for any number of trajectories.

From already thank you very much.