# Why I love Julia

I just gave a talk at my company attempting to explain, in one hour, why we are all so excited about Julia. I tried to focus on the essence of it, borrowing material from the manual, Stefan Karpinski’s “The Unreasonable Effectiveness of Multiple Dispatch” talk, and others.

Thought I’d share the Nextjournal notebook I used to present, for others to fork and reuse.

Notifications of inaccuracies and omissions welcome.


I am expecting to give a similar talk soon so this will be very useful. Thanks!


Hi,

Nice work. For your `summed` case you could write it like this:

``````
function summed(a)
    result = 0.0  # note 0.0
    @simd for x in a  # note @simd
        result += x
    end
    return result
end
``````

This is as fast as the built-in `sum` on 1.2 (for me, at least). The built-in `sum` is more generic and handles more corner cases. I first tried just changing `0` to `0.0` to match the element type of `a`, but that alone didn’t speed it up (so 1.2 must be smarter than previous versions).
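Here’s a minimal way to compare the two (a sketch assuming BenchmarkTools.jl is installed; the array size is just an illustrative choice):

``````
using BenchmarkTools

a = rand(10^6)      # hypothetical test input: a Vector{Float64}

@btime summed($a)   # the @simd loop above
@btime sum($a)      # Base's built-in pairwise sum
``````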

Glen


When I give the talk I explain that this is a naive implementation of `sum`, and I show that, for example, a better init for `result` would be `zero(eltype(a))`.
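For example, a quick REPL check of how that init stays eltype-agnostic:

``````
julia> zero(eltype([1.0, 2.0]))  # Float64 zero
0.0

julia> zero(eltype([1, 2]))      # Int zero
0
``````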

I tried slapping `@simd` in front of the for loop, but it doesn’t speed things up in the notebook environment, which runs 1.2. What array size did you try this with to see a speed-up? I’m guessing the speed-up from naively putting `@simd` in front of the for loop depends on `size(a)`, whereas the recursive binary divide-and-conquer in the `Base.sum` implementation gives you a speed-up over a broad range of sizes, probably? Anyone have some insight here?


The divide-and-conquer algorithm in Base for floating-point summation is used to reduce accumulation error, as far as I know. Naive linear summation has O(n) additions whereas the one in Base does O(log n).


Isn’t the number of non-leaf nodes in a full binary tree with n leaves n − 1? It’s still linear.

But maybe doing the additions in bottom-up binary-tree order leads to less error?

http://kristofferc.github.io/post/intrinsics/ is a good read. Still not entirely sure how the divide-and-conquer strategy helps.

Yes, of course, my mistake: it’s purely to reduce summation error, not necessarily to be faster.

Thanks for the explanation, pairwise summation it is!
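To make the idea concrete for myself, here’s a rough sketch (a hypothetical `pairwise_sum`, not the actual `Base` code, which as I understand it also falls back to a blocked `@simd` loop below a size cutoff):

``````
# Recursively split the range in half and sum each half bottom-up:
# the number of additions is still n - 1, but the rounding error
# grows like O(log n) instead of O(n).
function pairwise_sum(a, lo=firstindex(a), hi=lastindex(a))
    hi < lo && return zero(eltype(a))   # empty range
    hi == lo && return a[lo]            # single element
    mid = (lo + hi) >>> 1
    return pairwise_sum(a, lo, mid) + pairwise_sum(a, mid + 1, hi)
end
``````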

The remaining mysteries are why prepending `@simd` does not speed up my naive loop when it does speed up yours, and why the `@simd` divide-and-conquer for loop in `Base.sum` does give me a speed-up…

I’ll try playing with it until I figure it out.

Hi kolia,

I used the same input as you. My `versioninfo()` returns:

``````
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
``````

Maybe the reason you don’t see a speed-up is related to your CPU or OS. Here’s my `@code_native` to compare with yours:

``````
julia> @code_native summed(a)
	.text
; ┌ @ REPL[2]:3 within `summed'
; │┌ @ simdloop.jl:71 within `macro expansion'
; ││┌ @ simdloop.jl:51 within `simd_inner_length'
; │││┌ @ REPL[2]:2 within `length'
	movq	8(%rdi), %rax
; │└└└
; │┌ @ int.jl:49 within `macro expansion'
	testq	%rax, %rax
; └└
; ┌ @ simdloop.jl:72 within `summed'
	jle	L26
	movq	(%rdi), %rcx
; │ @ simdloop.jl:75 within `summed'
	cmpq	$16, %rax
	jae	L34
	vxorpd	%xmm0, %xmm0, %xmm0
	xorl	%edx, %edx
	jmp	L124
L26:
	vxorps	%xmm0, %xmm0, %xmm0
; └
; ┌ @ REPL[2]:6 within `summed'
	vzeroupper
	retq
; │ @ REPL[2]:3 within `summed'
; │┌ @ simdloop.jl:75 within `macro expansion'
L34:
	movq	%rax, %rdx
	andq	$-16, %rdx
	leaq	96(%rcx), %rsi
	vxorpd	%xmm0, %xmm0, %xmm0
; ││ @ simdloop.jl:78 within `macro expansion'
; ││┌ @ int.jl:53 within `+'
	movq	%rdx, %rdi
	vxorpd	%xmm1, %xmm1, %xmm1
	vxorpd	%xmm2, %xmm2, %xmm2
	vxorpd	%xmm3, %xmm3, %xmm3
; ││└
; ││ @ simdloop.jl:77 within `macro expansion' @ REPL[2]:4
; ││┌ @ float.jl:395 within `+'
L64:
; │└└
; │┌ @ int.jl:53 within `macro expansion'
	subq	$-128, %rsi
	jne	L64
; │└
; │┌ @ simdloop.jl:77 within `macro expansion' @ REPL[2]:4
; ││┌ @ float.jl:395 within `+'
	vextractf128	$1, %ymm0, %xmm1
	cmpq	%rdx, %rax
; └└└
; ┌ @ simdloop.jl:75 within `summed'
	je	L158
L124:
	subq	%rdx, %rax
	leaq	(%rcx,%rdx,8), %rcx
	nopw	%cs:(%rax,%rax)
; └
; ┌ @ REPL[2]:3 within `summed'
; │┌ @ simdloop.jl:77 within `macro expansion' @ REPL[2]:4
; ││┌ @ float.jl:395 within `+'
L144:
; ││└
; ││ @ simdloop.jl:75 within `macro expansion'
; ││┌ @ int.jl:49 within `<'
; ││└
	jne	L144
; │└
; │ @ REPL[2]:6 within `summed'
L158:
	vzeroupper
	retq
	nopw	%cs:(%rax,%rax)
; └
``````

Glen

Note that `zero` also accepts values, not just types: for an element `x` you can do `zero(typeof(x))` or more simply `zero(x)`.

Nice presentation, @kolia!

I gave a talk (Nextjournal link) a few months ago to a less technical audience (based on some of the elements of your notebook, I assume your audience has more of a computer-science background).

My audience was actuaries, who have more of a math background and mild programming experience via data science.

I also included a star-rating comparison with R and Python, based on a combination of opinion and feedback from the Julia Slack.

Probably some updates are needed after the intervening months (e.g., 2.5/3 stars for parallelism with 1.3).


tbf, I wouldn’t call Julia ‘written’ in Julia, although the `stdlib` is mostly written in Julia.

I didn’t know that R is written in Rust


I’m pretty sure most of R is written in C - then again a lot may have changed since I last looked.


Indeed, it is mostly R, C, and Fortran.

Packages also use C++ and other languages. There is a project with Rust, but it appears to be dormant.

Yea, I think I meant to say “R itself” per its Wikipedia page.

Finally got around to figuring it out; it was very simple.

Turns out that adding `@simd` without fixing the type mismatch between the `result = 0` init and `eltype(a)` keeps `@simd` from doing its thing.

This does give the speed-up, and it’s nice and clean and eltype-agnostic:

``````
function sumsimd(a)
    result = zero(eltype(a))
    @simd for x in a
        result += x
    end
    return result
end
``````
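To double-check that `@simd` actually kicked in, you can look for vector types like `<4 x double>` in the generated LLVM IR (a quick REPL check):

``````
julia> a = rand(10^6);

julia> @code_llvm sumsimd(a)   # vectorized if the output contains e.g. `fadd <4 x double>`
``````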

I added this `@simd for x in a` version to the Nextjournal presentation notebook, and added a link to the Wikipedia page on pairwise summation explaining why `Base.sum` does what it does.


As you guessed, my audience was software engineers who work mostly with Scala and Python, and data scientists.

Julia is an easy sell to data science folks, so I targeted the software folks. Julia is just a good general-purpose programming language, and I set out to explain why.

The presentation was met with a lot of nods, some “what’s the catch?”, and some “when can we switch?”. The main dissent was that some were horrified by the lack of privacy in being able to define new methods on functions like `Base.print` whenever you define a new datatype, which led to a discussion of what type piracy is and why it hasn’t been an issue. Not everyone was convinced by the argument that the Julia community has been doing this for a while without the sky coming crashing down, and that we see more benefits from it than downsides…
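For context, a minimal illustration of the distinction we discussed (the `Point` type is hypothetical):

``````
struct Point
    x::Float64
    y::Float64
end

# Extending Base.print for a type you own: standard practice, not piracy.
Base.print(io::IO, p::Point) = print(io, "(", p.x, ", ", p.y, ")")

# Type piracy would be adding or overwriting a method where you own
# neither the function nor the argument types, e.g.:
#     Base.:+(a::Int, b::Int) = 0    # don't do this!
``````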


Not to be a wet blanket, but given the R packages `parallel` and `foreach` at the native R level and `RcppParallel` at the Rcpp level, I have a little trouble giving R only one star for parallel support.

I will say that it is a bit easier in Julia. But truth be told, I found `RcppParallel` used multiple threads more efficiently than Julia on the same machine with the same algorithm. I have no insight into why that is, and user error is a very real possibility.

Of course, the fact that you have to think about two different languages for parallelization in R instead of one points directly to the two-language problem that Julia answers.

Hi Kolia,

Good to hear you figured it out. I like that you explained why Julia’s `sum` is so much more complicated; when I looked at it the first time I was totally confused, and it didn’t give me the impression that Julia is easy to use.

Glen