Hi, and welcome to the Julia community!
Here you correctly noted that blockIdx().x starts at 1, so that you need to subtract 1 for the offset calculation. Similarly, threadIdx().x also begins at 1, meaning that to end up with a 1-indexed i, you don’t need to subtract (or add) 1. So the correct code would be
i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
In general, these index manipulations can be confusing, so I would advise trying out the boundary cases by hand. E.g. the ‘first’ thread has threadIdx().x == 1 == blockIdx().x and should end up with i = 1.
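To make the boundary check concrete, here is a minimal CPU-only sketch that plugs plain integers into the formula, so it runs without a GPU. The helper name global_index and the launch sizes (4 threads per block, 3 blocks) are assumptions chosen for illustration; they are not part of CUDA.jl.

```julia
# Hypothetical helper: the same formula as in the kernel, with plain
# integers standing in for threadIdx().x, blockIdx().x and blockDim().x.
global_index(thread_idx, block_idx, block_dim) =
    thread_idx + (block_idx - 1) * block_dim

block_dim = 4  # threads per block (assumed for illustration)
grid_dim = 3   # number of blocks (assumed for illustration)

# First thread of the first block must map to i = 1.
@assert global_index(1, 1, block_dim) == 1
# Last thread of the first block.
@assert global_index(block_dim, 1, block_dim) == block_dim
# First thread of the second block continues right after it.
@assert global_index(1, 2, block_dim) == block_dim + 1
# Last thread of the last block.
@assert global_index(block_dim, grid_dim, block_dim) == block_dim * grid_dim

# All (block, thread) pairs together cover 1:12 exactly once, in order.
indices = [global_index(t, b, block_dim) for b in 1:grid_dim for t in 1:block_dim]
@assert indices == collect(1:block_dim * grid_dim)
```

If you subtract 1 from the thread index as well, the first assertion fails because the first thread lands on i = 0, which is out of bounds for a 1-indexed array.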
As a side note, the literal 1 is of type Int64 (on a 64-bit system), which you typically want to avoid on the GPU. Instead, you should use a 32-bit literal, e.g. 1i32, which becomes available after importing it via using CUDA: i32. See the Performance Tips for more information.
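The promotion behind this can be seen in plain Julia, without a GPU. In the sketch below, tid is a hypothetical stand-in for threadIdx().x, which I am assuming to be an Int32, and Int32(1) plays the role of 1i32.

```julia
# Hypothetical stand-in for threadIdx().x (assumed to be an Int32).
tid = Int32(5)

# Mixing an Int32 with the default integer literal promotes the result
# to the wider type, which is Int64 on a 64-bit system.
@assert typeof(tid - 1) === promote_type(Int32, Int)

# Using a 32-bit one keeps the whole computation in Int32.
@assert typeof(tid - Int32(1)) === Int32
@assert typeof(tid - one(tid)) === Int32
```

Inside a kernel, the index line would then read i = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x.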