First, about speed. What makes Julia (or any language) fast is an optimizing compiler. Note that compilation itself can be slow (see the "time-to-first-plot" problem), but that has workarounds and is mostly solved by now.
Even a minimally optimizing compiler, or one with no optimization at all, produces faster code than you can get from an interpreter.
There's always going to be a trade-off: interpreters can feel "faster", i.e. more interactive (though with Python that is partly an illusion, since Python code itself is actually compiled to bytecode, and much of the work is done by compiled C code it calls). Julia, like Python, has no separate compilation phase (by default), so it looks interpreted, but it isn't. You can tune Julia to do less compilation where appropriate. C# has tiered compilation, something that would also be useful for Julia, to mostly eliminate the trade-off.
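As a small sketch of that tuning (using startup flags available in current Julia versions; `script.jl` stands in for your own file):

```
# Run with minimal compilation and no optimization: lower latency
# (less compile time up front), at the cost of slower generated code.
julia --compile=min -O0 script.jl
```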
In short, it's not hard to make a language inherently fast (meaning the code you get after compilation), at least faster than an interpreted implementation, which is what's usually associated with dynamic languages. Any language can be compiled (or interpreted); the question is whether the language was designed for it, without speed impediments. E.g. Python can be compiled (and by default it is, partially, in current versions, just not to machine code), but it can't always be compiled to full speed, because it's too dynamic (see the performance section of Julia's manual for the main differences). Julia has no inherent speed impediments (at least in theory, and in practice it's often faster than C, C++ or Fortran); e.g. all bounds checks can be turned off (by default they aren't), and see my note below on the GC.
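As a small illustration (my own sketch, not from the paper or manual): bounds checks can be removed locally with `@inbounds`, or globally with the `--check-bounds=no` startup flag:

```julia
# Sum a vector; @inbounds tells the compiler to skip bounds checks in the
# loop (only safe here because eachindex guarantees valid indices).
function sum_fast(v::AbstractVector{Float64})
    s = 0.0
    @inbounds for i in eachindex(v)
        s += v[i]
    end
    return s
end

sum_fast(rand(10^6))
# Or turn off all bounds checks for the whole session (use with care):
#   julia --check-bounds=no script.jl
```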
There's a paper on the Julia language, but I also recommend reading Jeff Bezanson's PhD thesis on GitHub: https://github.com/JeffBezanson/phdthesis/blob/master/main.pdf
At least read pages 25-28, the section "2.1.1 Case study: Vandermonde matrices" (I hesitate to summarize it further, let alone the whole 133 pages).
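To give a flavor of that case study (my own minimal sketch, not the thesis code verbatim): a single generic Vandermonde constructor works for any element type, and the compiler specializes it for each concrete type it's called with:

```julia
# One generic method; Julia compiles a specialized, fast version per
# concrete element type (Int, Float64, Complex, ...), with no hand-written
# per-type kernels needed.
function vander(x::AbstractVector{T}, n::Integer = length(x)) where {T}
    m = length(x)
    V = Matrix{T}(undef, m, n)
    for j in 1:n, i in 1:m
        V[i, j] = x[i]^(j - 1)
    end
    return V
end

vander([1, 2, 3])        # Matrix{Int64}
vander([1.0, 2.0, 3.0])  # Matrix{Float64}, same source code
```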
It's also not too hard to make dynamic ("easy") languages, e.g. Lisp, Scheme, Python or Ruby. All of these languages, and Julia (and non-dynamic ones like Java), have a garbage collector (GC), one of the features that make Julia easy. A GC can make programs faster in practice (at least in theory, even counting the time it runs), but it can also slow them down, or make the language feel slower. In Julia it's not so much that the GC is advanced or fast (it is generational and reasonably fast); it's mostly that you can avoid it, and you need to for some hot code. Doing so is easy, but otherwise allocations that trigger the GC can be a performance pitfall.
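A small sketch of that pitfall and the usual fix (standard `@time` output; exact numbers will vary):

```julia
# Allocating version: creates a fresh array on every call, keeping the GC busy.
add_alloc(a, b) = a .+ b

# In-place version: writes into a preallocated buffer, so the hot path
# performs no allocations and hence triggers no GC work.
add_inplace!(out, a, b) = (out .= a .+ b; out)

a, b = rand(10^6), rand(10^6)
out = similar(a)
add_alloc(a, b); add_inplace!(out, a, b)  # call once first so compilation isn't timed
@time add_alloc(a, b)           # reports several MiB allocated per call
@time add_inplace!(out, a, b)   # reports 0 allocations
```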
Dynamic [typing] is a technical term, and what makes Julia dynamic is simply that the language is defined that way, just without the usual speed trap. That's Julia's magic. The optimizing part, LLVM, Julia's backend, is the same as for other languages; it's used for C and C++ too.
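You can see that backend at work yourself: Julia will show the LLVM IR and native code it generated for the concrete argument types of a call (in the REPL these macros are available by default; in a script you need the import shown):

```julia
using InteractiveUtils  # loaded automatically in the REPL

f(x, y) = x * y + 1

@code_llvm f(2, 3)        # LLVM IR specialized for (Int64, Int64)
@code_native f(2.0, 3.0)  # native machine code for (Float64, Float64)
```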
Only a few language implementations have whole-program optimization and/or profile-guided optimization (PGO), another two technical terms. For C++, link-time optimization (LTO) typically provides the last ~10% speed increase. LTO doesn't really apply to Julia, since it doesn't have separate compilation and link steps. In practice this step has often been skipped in C++ (until ThinLTO, a solution to the compile-time problem), because it makes even parallel builds really slow. Julia doesn't use PGO as far as I know (maybe add-on packages already do? At least I don't see PGO ruled out for Julia, and I don't know what the usual gain is; it's rarely used, last I checked). Julia may effectively already get the "10%" boost LTO provides, while I'm pretty sure it doesn't have the parallel-build support C++ has with [Thin]LTO.
https://clang.llvm.org/docs/ThinLTO.html