Benchmarking tiny functions is often tricky to get right, because Julia is so good at optimizing your code. Times of ~0.01 ns mean the compiler has replaced your expression with a constant, so you end up benchmarking nothing at all. (1 nanosecond is ~3 CPU clock cycles on a 3 GHz machine, so 0.01 ns is not enough time to do anything.)
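To see what "replacing your expression with a constant" looks like, here is a small sketch using only the standard library. The functions folded and unfolded are toy names made up for illustration, and the exact printed form of the typed code varies between Julia versions.

```julia
using InteractiveUtils  # stdlib; provides @code_typed outside the REPL

# Hypothetical toy function: both operands are literals, so the compiler
# can do the multiplication at compile time.
folded() = 2.0 * 3.0

# Same operation, but the operand is only known at run time.
unfolded(x) = 2.0 * x

# Print the optimized code for each. The body of `folded` should contain
# no multiplication, only a returned constant; `unfolded` should still
# contain a floating-point multiply.
display(@code_typed folded())
display(@code_typed unfolded(3.0))
```

Timing something like folded() measures only the cost of returning a constant, which is how the impossibly small numbers come about.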
To get around that, I usually try to structure the benchmark expression so that the compiler can’t constant-fold it or cheat in any other way. In the example below, I also chose a vector large enough that the CPU’s branch predictor can’t learn the branching pattern.
julia> v = rand(100_000);
julia> a = similar(v);
julia> @btime $a .= sin.($v);
626.487 μs (0 allocations: 0 bytes)
100k calls in 626 μs works out to around 6.26 ns per call to sin (when called through broadcast).
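The same per-call arithmetic can be reproduced without BenchmarkTools. This is only a rough sketch built on Base's time_ns; per_call_ns and its samples keyword are names made up for illustration, and taking the minimum over a handful of runs is a cruder estimate than what @btime reports.

```julia
# Hypothetical helper: best observed wall time (in ns) per element
# for `a .= sin.(v)`, plus the output vector so the work cannot be discarded.
function per_call_ns(v::Vector{Float64}; samples::Int = 50)
    a = similar(v)
    a .= sin.(v)                      # warm-up run, so compilation is not timed
    best = typemax(UInt64)
    for _ in 1:samples
        t0 = time_ns()
        a .= sin.(v)                  # in-place broadcast, no allocation
        best = min(best, time_ns() - t0)
    end
    return best / length(v), a        # total ns divided by number of calls
end

v = rand(100_000)
ns_per_call, a = per_call_ns(v)
println("≈ ", round(ns_per_call; digits = 2), " ns per call to sin")
```

Because the input is random data created at run time and the result is written into a and returned, there is nothing for the compiler to fold away. The number you get depends on your machine and Julia version.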
And yes, the example in the BenchmarkTools manual is now broken, and it would be great if they could talk more about these difficulties. Cf. this issue: