Can Julia achieve fine-grained control of performance without sacrificing ease of use?

This will always be true to some extent, and not only in Julia. You can find many (exhaustive) performance comparisons of Julia with other languages (and between other languages) in which the remaining differences come down to specific compiler flags, which intrinsic math function is being called, and so on. Generally that kind of tuning is only needed in very localized portions of the code, and the overall ease of implementing good algorithms matters much more for performance as a whole.

Where I think Julia is somewhat slower than C++ (specifically) is when dynamic dispatch is required. From what I’ve seen here, trying to match C++ performance in that case can be cumbersome, and I don’t remember there being a standard go-to solution.
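To make the dynamic dispatch case concrete, here is a minimal sketch of one workaround that comes up in these discussions: branching manually on the concrete types stored in an abstractly typed container. The `Shape`, `Circle`, `Square`, `area`, and `total_area*` names are all made up for illustration, and I am not claiming this is the standard solution. Whether it actually helps depends on the Julia version and on how many subtypes there are, so benchmark before relying on it.

```julia
# Hypothetical types and functions, for illustration only.
abstract type Shape end

struct Circle <: Shape
    r::Float64
end

struct Square <: Shape
    s::Float64
end

area(c::Circle) = pi * c.r^2
area(s::Square) = s.s^2

# Abstractly typed container: the concrete type of each element is only
# known at run time, so `area(s)` may go through dynamic dispatch.
function total_area(shapes::Vector{Shape})
    total = 0.0
    for s in shapes
        total += area(s)
    end
    return total
end

# Manual branching: inside each `isa` branch the compiler knows the
# concrete type of `s`, so the call can be resolved statically.
function total_area_split(shapes::Vector{Shape})
    total = 0.0
    for s in shapes
        if s isa Circle
            total += area(s)
        elseif s isa Square
            total += area(s)
        else
            total += area(s)  # fallback: ordinary dynamic dispatch
        end
    end
    return total
end

shapes = Shape[Circle(1.0), Square(2.0)]
a_dynamic = total_area(shapes)
a_split = total_area_split(shapes)
```

Both functions return the same value; the second one just spells out the type checks by hand, which is the kind of cumbersome code I mean.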

Large stack-allocated arrays, for example, would be (will be?) a nice addition to the language, but one can work around their absence rather easily with preallocation.
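As a rough sketch of what I mean by preallocation (the function names and the tiny input are just illustrative assumptions): allocate the work buffer once, outside the hot code, and pass it to a function that writes into it in place.

```julia
# Hypothetical example; names and sizes are illustrative only.

# Allocating version: creates a fresh temporary array on every call.
function sum_of_squares_alloc(x::Vector{Float64})
    tmp = x .^ 2          # heap-allocates a new vector each time
    return sum(tmp)
end

# Preallocated version: the caller owns the buffer and reuses it.
function sum_of_squares!(buf::Vector{Float64}, x::Vector{Float64})
    buf .= x .^ 2         # fused broadcast writes into buf in place
    return sum(buf)
end

x = collect(1.0:4.0)
buf = similar(x)          # allocated once, outside any hot loop
result = sum_of_squares!(buf, x)
```

The buffer still lives on the heap, but since it is created only once, the per-call allocation cost goes away, which is usually what matters in a hot loop.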