They are not the same (you can explore this with `@edit` or the debugger). The first creates a temporary array for `A .> 0` and then collects the indices of its `true` entries. Both of these operations are very fast.
The second one goes through the generic iterator path. Perhaps it could be optimized further; I am sure PRs doing this would be welcome.
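Here is a minimal sketch of the two forms side by side. It assumes the comparison is between `findall(A .> 0)` and the predicate form `findall(x -> x > 0, A)`; the exact code from the question is not shown here, so the second call is an assumption. Both return the same indices. `@which` from the `InteractiveUtils` standard library prints the `findall` method each call dispatches to, which is a quick first look before reaching for `@edit`. The method chosen for the predicate form may differ between Julia versions.

```julia
using InteractiveUtils  # provides @which (already loaded in the REPL)

A = randn(10_000)

# Form 1: broadcasting builds a temporary BitVector, and findall then
# dispatches to the specialized method for bit arrays.
mask = A .> 0
idx1 = findall(mask)

# Form 2 (assumed): pass the predicate directly, which may take the
# generic iterator-based fallback depending on the Julia version.
idx2 = findall(x -> x > 0, A)

# Same result, potentially different code paths.
@assert idx1 == idx2

# Show which findall method each call resolves to.
println(@which findall(mask))
println(@which findall(x -> x > 0, A))
```

For timing, `@btime` from the BenchmarkTools.jl package (e.g. `@btime findall($A .> 0)`) gives more reliable numbers than `@time`, which also counts compilation on the first call.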
I don’t think there is an inherent reason for the performance difference here (if there were, it would of course warrant documentation); it is just waiting for someone to optimize it.