CUDA.jl mystery: VSCode + Julia extension works fine but command-line run fails

Hi,
I have run into a mysterious issue with CUDA.jl. When I run the code from within the file (in VSCode with the Julia extension), everything seems fine and I get the expected answer. But if I run the same code from the command line, I get the following error:

```
  Activating project at `~/jianguoyun/Nutstore/RigorousCoupledWaveAnalysis.jl-master`
eigenmodes:
etm_propagate:
ERROR: LoadError: CUBLASError: an invalid value was used as an argument (code 7, CUBLAS_STATUS_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUBLAS.cublasStatus_t)
    @ CUDA.CUBLAS ~/.julia/packages/CUDA/DfvRa/lib/cublas/error.jl:50
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cublas/error.jl:63 [inlined]
  [3] cublasZgemv_v2(handle::Ptr{Nothing}, trans::Char, m::Int64, n::Int64, alpha::Bool, A::CUDA.CuArray{ComplexF64, 2, CUDA.Mem.DeviceBuffer}, lda::Int64, x::CUDA.CuArray{ComplexF64, 1, CUDA.Mem.DeviceBuffer}, incx::Int64, beta::Bool, y::CUDA.CuArray{ComplexF64, 1, CUDA.Mem.DeviceBuffer}, incy::Int64)
    @ CUDA.CUBLAS ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
  [4] gemv!
    @ ~/.julia/packages/CUDA/DfvRa/lib/cublas/wrappers.jl:331 [inlined]
  [5] gemv_dispatch!(Y::CUDA.CuArray{ComplexF64, 1, CUDA.Mem.DeviceBuffer}, A::CUDA.CuArray{ComplexF64, 2, CUDA.Mem.DeviceBuffer}, B::CUDA.CuArray{ComplexF64, 1, CUDA.Mem.DeviceBuffer}, alpha::Bool, beta::Bool)
    @ CUDA.CUBLAS ~/.julia/packages/CUDA/DfvRa/lib/cublas/linalg.jl:179
  [6] mul!
    @ ~/.julia/packages/CUDA/DfvRa/lib/cublas/linalg.jl:188 [inlined]
  [7] mul!
    @ ~/julia/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:275 [inlined]
  [8] *(A::CUDA.CuArray{ComplexF64, 2, CUDA.Mem.DeviceBuffer}, x::CUDA.CuArray{ComplexF64, 1, CUDA.Mem.DeviceBuffer})
    @ LinearAlgebra ~/julia/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:51
  [9] etm_propagate_gpu(sup::RigorousCoupledWaveAnalysis.Common.Halfspace, sub::RigorousCoupledWaveAnalysis.Common.Halfspace, ems_gpu::Vector{RigorousCoupledWaveAnalysis.Common.Eigenmodes}, ψin::CUDA.CuArray{ComplexF64, 1, CUDA.Mem.DeviceBuffer}, get_r::Bool)
    @ RigorousCoupledWaveAnalysis.ETM ~/jianguoyun/Nutstore/RigorousCoupledWaveAnalysis.jl-master/src/ETM/ETM.jl:133
 [10] etm_propagate
    @ ~/jianguoyun/Nutstore/RigorousCoupledWaveAnalysis.jl-master/src/ETM/ETM.jl:43 [inlined]
 [11] etm_propagate(sup::RigorousCoupledWaveAnalysis.Common.Halfspace, sub::RigorousCoupledWaveAnalysis.Common.Halfspace, em::Vector{RigorousCoupledWaveAnalysis.Common.Eigenmodes}, ψin::CUDA.CuArray{ComplexF64, 1, CUDA.Mem.DeviceBuffer}, grd::RigorousCoupledWaveAnalysis.Common.RCWAGrid)
    @ RigorousCoupledWaveAnalysis.ETM ~/jianguoyun/Nutstore/RigorousCoupledWaveAnalysis.jl-master/src/ETM/ETM.jl:38
 [12] top-level scope
    @ ~/jianguoyun/Nutstore/RigorousCoupledWaveAnalysis.jl-master/examples/test.augel2018.jl:56
in expression starting at /home/dabajabaza/jianguoyun/Nutstore/RigorousCoupledWaveAnalysis.jl-master/examples/test.augel2018.jl:56
```
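
For context, the failing frame (`cublasZgemv_v2`) is a plain dense complex matrix-vector product. A quick way to check whether that code path fails at all from the command line is an isolated test like this (my own sketch, not code from the package):

```julia
using CUDA, LinearAlgebra

# Exercise the same path as the stack trace: * -> mul! -> CUBLAS gemv!
A = CuArray(rand(ComplexF64, 4, 4))  # CuMatrix{ComplexF64} on the device
x = CuArray(rand(ComplexF64, 4))     # CuVector{ComplexF64} on the device
y = A * x                            # should call cublasZgemv_v2 internally
println(collect(y))                  # copy back to the host and print
```

If this runs cleanly, the problem is more likely in how the package's arrays are produced than in CUBLAS itself.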

Does anyone know what is going wrong?

I don't have any issues when I run the code line by line in VSCode; I get the expected results.

Perhaps this is related to stream synchronization, because when I profile the code I find that one line is invoked many times and accounts for a lot of the total time.
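
As a coarse cross-check of where the time goes (a generic sketch of mine, not the profile from my run), `CUDA.@time` reports both CPU- and GPU-side time and allocations, which helps separate allocation pressure from compute time:

```julia
using CUDA

A = CuArray(rand(ComplexF64, 256, 256))
x = CuArray(rand(ComplexF64, 256))
CUDA.@time A * x  # prints elapsed time plus host and device allocations
```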

It is perhaps related to stream-ordered allocations.
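
If stream-ordered allocations are the culprit, one diagnostic I can try (an assumption on my part, not a confirmed fix) is to disable CUDA.jl's memory pool via the `JULIA_CUDA_MEMORY_POOL` environment variable, which must be set before CUDA.jl initializes:

```julia
# Disable the stream-ordered memory pool as a diagnostic (assumption, not a fix).
# Must take effect before CUDA.jl initializes; setting it before `using CUDA`
# is safest, or set it in the shell: JULIA_CUDA_MEMORY_POOL=none julia script.jl
ENV["JULIA_CUDA_MEMORY_POOL"] = "none"

using CUDA
CUDA.versioninfo()  # prints toolkit/driver details to confirm the environment
```

If the command-line run succeeds with the pool disabled, that would point at the allocator rather than the multiplication itself.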

Still working on this issue…
I have pinned the error down to the following:

Inside a function f(cu_A, cu_v), multiplying the CuVector cu_v by the CuArray cu_A causes the error. Calling synchronize() does not help. Running the same code line by line in VSCode does not trigger the error.
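
A minimal sketch of that pattern (f, cu_A, and cu_v are the placeholder names from my description; the array contents here are random stand-ins, so this reduction may or may not reproduce the error outside the package):

```julia
using CUDA, LinearAlgebra

# Placeholder reduction of the failing pattern; in the real code cu_A and
# cu_v come out of the RCWA eigenmode computation, not rand().
function f(cu_A::CuMatrix{ComplexF64}, cu_v::CuVector{ComplexF64})
    return cu_A * cu_v  # the multiplication that raises CUBLAS_STATUS_INVALID_VALUE
end

cu_A = CuArray(rand(ComplexF64, 8, 8))
cu_v = CuArray(rand(ComplexF64, 8))
CUDA.synchronize()  # inserting a synchronize() before the call does not help
w = f(cu_A, cu_v)
```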