Strange: even after adding alignment information (using Base.pointerref(..., 1, 8)), which results in identical LLVM and native code, the performance discrepancy remains.
EDIT: on 0.6, both implementations are equally slow (i.e. the same as the slow time from the OP).
I think there’s something off with your testing, because the code generated (at least on master) is identical.
First, when sizeof(str) does exactly what you need here and is generic, why do you want to peek at the internals?
Also, why are you calling Base.unsafe_convert, which is for converting something, when you really just need to reinterpret the pointer, i.e. reinterpret(Ptr{UInt}, pointer(a)-8)?
unsafe_convert(T, x)
Convert x to a C argument of type T where the input x must be the return value of cconvert(T, …).
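To make the comparison concrete, here is a minimal sketch of both approaches side by side. It assumes a Julia version that provides GC.@preserve (0.7 or later) and relies on an implementation detail: that a String stores its byte length as a UInt word immediately before its data. The variable names are illustrative only, and the sketch uses sizeof(UInt) where the post above writes 8 (the two agree on 64-bit builds).

```julia
# Sketch only: the "peek" relies on String's internal layout, which is an
# implementation detail and may change between Julia versions.
str = "hello, world"

# Generic, documented way: the number of bytes in the string.
n_generic = sizeof(str)

# Internal peek: step back one machine word from the data pointer and
# reinterpret that address as a pointer to the length word.
# GC.@preserve keeps `str` rooted while the raw pointer is in use.
n_peek = GC.@preserve str begin
    p = reinterpret(Ptr{UInt}, pointer(str) - sizeof(UInt))
    unsafe_load(p)
end

println("sizeof: ", n_generic, "  peek: ", Int(n_peek))
```

If the layout assumption holds, both values agree; sizeof(str) remains the portable choice because it does not depend on that layout.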
In this case time_ns does indeed show consistent results (of ~19ns, but that’s to be expected since BenchmarkTools does multiple evals/sample). However, I’d advise against recommending it, because BenchmarkTools protects against many other pitfalls that are common with newcomers. @btime is a vastly better tool.
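For readers who want to see what the manual approach involves, here is a rough sketch of hand-rolled timing with time_ns that runs several evaluations per sample and keeps the fastest sample. The helper manual_bench and its keyword names are hypothetical (they are not part of Base or BenchmarkTools), and the sketch does not reproduce what BenchmarkTools does internally.

```julia
# Hypothetical helper: time `f(x)` by hand with time_ns.
# Caveat: for a cheap, pure call the compiler may hoist or eliminate the work
# inside the inner loop, which is one of the pitfalls mentioned above.
function manual_bench(f, x; samples = 1_000, evals = 100)
    f(x)                          # warm up so compilation is not timed
    best = typemax(UInt64)
    for _ in 1:samples
        t0 = time_ns()
        for _ in 1:evals
            f(x)
        end
        best = min(best, time_ns() - t0)   # keep the fastest sample
    end
    return best / evals           # approximate nanoseconds per evaluation
end

ns = manual_bench(sizeof, "hello, world")
println("approx. ns per call: ", ns)
```

With BenchmarkTools the equivalent measurement is typically written as @btime sizeof($str), where the $ interpolation avoids timing access to a global variable.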
I’ve bisected the issue to 1669d532de7434108f1092f34361166737706ba5 from #24362, confirming @kristoffer.carlsson’s hunch.
I wasn’t intending to recommend it for novice users. In my case, though, I’ve had 30+ years of extensive benchmarking experience, and for that reason I like to get all of the raw data and munge it myself (which Julia makes much nicer and easier than any other language I’ve worked in before!).