Julia programs now shown on benchmarks game website

FASTA is one line per gene, if I remember correctly, so each line here could be, say, 5-40KB and the whole file could be a couple of gigs. So you can’t just slurp the whole file into RAM. Well, these days you can, but 20 or more years ago, when the format was invented, you couldn’t.

in any case it’s not reading 80 chars at a time

You are right, it’s 60 chars at a time (from a 1GB file) :wink:

And at least for this benchmark you pretty much have to read it all into memory (look at the memory use), because you have to reverse the whole thing and you can’t really do that before you have read in the end.
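To illustrate why the whole record has to be in memory first, here is a minimal sketch of an in-place reverse-complement. The names `COMPLEMENT` and `revcomp!` are made up for this example, and the table only covers the four plain bases (the real benchmark also handles ambiguity codes and lowercase input). The first output byte is computed from the last input byte, so nothing can be written before the end of the sequence has been read.

```julia
# Complement table for the four plain bases only (an assumption made
# for brevity; the benchmark's table is larger).
const COMPLEMENT = Dict{UInt8,UInt8}(
    UInt8('A') => UInt8('T'), UInt8('T') => UInt8('A'),
    UInt8('C') => UInt8('G'), UInt8('G') => UInt8('C'))

# Reverse-complement a sequence in place, walking inwards from both ends.
function revcomp!(seq::Vector{UInt8})
    i, j = 1, length(seq)
    while i <= j
        a, b = seq[i], seq[j]
        seq[i] = COMPLEMENT[b]   # front position gets the complement of the back byte
        seq[j] = COMPLEMENT[a]   # back position gets the complement of the front byte
        i += 1
        j -= 1
    end
    return seq
end

println(String(revcomp!(Vector{UInt8}("AACGT"))))  # prints ACGTT
```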

2 Likes

Update: over the past few months, it looks like a number of people did some amazing work submitting benchmarks, and Julia now has a very respectable showing, ahead of Swift and Go:

I also think that there’s still quite a bit of additional performance that can be squeezed out. But I think this is a great showing that will help steer more people towards the language.

9 Likes

Kind of happy to see Pascal. One of my first languages

3 Likes

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/fasta-julia-4.html

The best time I get on my decade-old laptop (i.e. NOT with 2 or 4 threads) is:

time ~/julia-1.1.0/bin/julia  -O3  -- fasta.jl 25000000 >/dev/null
real 0m6,066s
user 0m4,928s
sys 0m0,232s

With this setup I regularly get under 5 sec (for “user”), but I’m unable to get below 5.1 sec on julia-1.3.0-alpha, whatever the optimization level or number of threads.

At least I get slower times with export JULIA_NUM_THREADS=4 (set for both runs below):

time ~/julia-1.1.0/bin/julia  -O2  -- fasta.jl 25000000 >/dev/null
real 0m6,385s
user 0m4,984s
sys 0m0,212s
time ~/julia-1.1.0/bin/julia  -O3  -- fasta.jl 25000000 >/dev/null
real 0m6,479s
user 0m5,008s
sys 0m0,196s

My best on julia-1.3.0-alpha (best combination, i.e. -O3 is NOT faster, nor are more threads):

export JULIA_NUM_THREADS=1
time julia -O2  -- fasta.jl 25000000 >/dev/null
real 0m6,372s
user 0m5,212s
sys 0m0,164s

Please consider these settings on your machines and when submitting benchmarks: check whether LOWER optimization levels (or older Julia versions) are faster, and whether disabling threading, or which number of threads, is fastest (this may say more about my Core Duo laptop, or maybe threads have startup overhead?). Best-case startup for me is “real 0m0,313s” on julia-1.3.0-alpha (it can easily be “real 0m0,390s”), and 1.1 is just slightly slower at best, “real 0m0,330s”, but I’ve seen “real 0m0,958s”.

Could it be that -O0 disables threads? On MY machine in the test below, 1 thread is better for -O3 (and -O0).

I found -O3 to be 46% SLOWER than -O0 on “real” time (when both were CONFIGURED for 4 threads), and 42% slower with the best settings for both (2,572s vs. 1,813s “real”). It is even worse on “user” time: 61% slower (1,992s vs. 1,236s). This was all from when I was first testing (for the test below, not the one above) with the shorter test file, not the longer one actually used in the benchmark to make it long-running. It’s still useful to know how this affects speed:

export JULIA_NUM_THREADS=1
time ~/julia-1.3.0-alpha/bin/julia -O0  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt 
real 0m1,813s
user 0m1,296s
sys 0m0,204s

also got:

real 0m2,047s
user 0m1,236s
sys 0m0,272s
export JULIA_NUM_THREADS=4
time ~/julia-1.3.0-alpha/bin/julia -O3  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt
real 0m2,887s
user 0m2,144s
sys 0m0,188s
export JULIA_NUM_THREADS=4
time ~/julia-1.3.0-alpha/bin/julia -O0  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt 
real 0m2,010s
user 0m1,392s
sys 0m0,184s
export JULIA_NUM_THREADS=1
time ~/julia-1.1.0/bin/julia -O3  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt
real 0m2,572s
user 0m2,048s
sys 0m0,180s

[As here, I’m always going for the lowest “user” time; I have seen lower “real” times, but then with higher “user”.]

time ~/julia-1.3.0-alpha/bin/julia -O3  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt 
real 0m2,829s
user 0m1,992s
sys 0m0,196s
time julia --compile=min  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt 
real 0m6,511s
user 0m4,820s
sys 0m0,232s
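For anyone who wants to redo the percentage comparisons from `time` output like the above, here is a small sketch. The helpers `parse_time` and `percent_slower` are hypothetical names for this example; the only assumption is that times look like `0m2,572s`, with either a comma or a dot as the decimal separator.

```julia
# Parse a `time` figure such as "0m2,572s" into seconds.
function parse_time(s::AbstractString)
    m = match(r"^(\d+)m(\d+)[,.](\d+)s$", s)
    m === nothing && error("unrecognized time format: $s")
    minutes = parse(Int, m.captures[1])
    seconds = parse(Float64, m.captures[2] * "." * m.captures[3])
    return 60 * minutes + seconds
end

# How much slower `slow` is than `fast`, in percent.
percent_slower(slow, fast) = 100 * (parse_time(slow) / parse_time(fast) - 1)

println(round(Int, percent_slower("0m2,572s", "0m1,813s")))  # "real": prints 42
println(round(Int, percent_slower("0m1,992s", "0m1,236s")))  # "user": prints 61
```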

Update: over the past few months, it looks like a number of people did some amazing work submitting benchmarks, and Julia now has a very respectable showing, ahead of Swift and Go:

Yes, a bunch of the work has been done, mostly by the amazing @non-Jedi (currently 4/10 top Julia programs).

I also think that there’s still quite a bit of additional performance that can be squeezed out.

Quite possibly, yes.

  • pidigits: all GMP calls anyways, so not too much hope here
  • revcomp: I’m currently trying to get a buffered version to be accepted. With 1.3 I’ll try to do some multithreading.
  • fasta: 1.3 multithreading will help for sure
  • nbody: … Adam is currently fighting with this one, maybe some SIMD wizards can help out. Not sure how Rust manages to be this much faster on pure number crunching.
  • knuc: maybe multithreading helps, maybe some more hacks regarding the hash function… not quite sure.
  • binarytrees: because Julia provides GC, there is not much that can be done here, I think
  • With the other ones I’m not sure because I haven’t tried them. Many of them are quite fast already though.
4 Likes

I do wonder why this one runs so much slower multi-threaded than it does with multiple processes. Is there something to do with heap-allocating and garbage collection that’s inherently not amenable to multi-threaded environments? If we could use multi-threading instead of multiple processes, that would take a rather large chunk off the run time.

To expand on the list off the top of my head:

  • regex-redux: could become faster once 1.3 lands with new threading runtime (and thread-safe regex) since execution time is dominated by a strictly non-parallelizable task that could be started on a single thread ahead of other work.
  • mandelbrot: isn’t currently using all cpu cores effectively compared to other implementations. I haven’t identified why yet.
  • revcomp: in addition to the buffered read @Karajan is working on, this could also benefit from the new threading runtime to start working on a specific sequence while still reading input. It may also be possible to speed up the reversal of each sequence using multi-threading if you divide it into “chunks” instead of just using a naive Threads.@threads loop over the array; this requires removing newlines from the array being reversed, which is slightly different from @Karajan’s current fastest implementation.
  • knuc: julia’s hashmap in general seems slower than some other implementations; not sure why. There’s an opportunity for better usage of cpu cores by implementing parallelism within the counting of each frame instead of around it.
  • fasta: obvious opportunity to parallelize, but I haven’t taken the time to grok what the benchmark is actually doing yet.
5 Likes
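A rough sketch of the “chunked” reversal idea from the revcomp item, assuming the newlines have already been stripped from the buffer. `chunked_reverse!` and its `nchunks` parameter are names invented for this illustration, it only reverses (no complement lookup), and whether it actually beats a single-threaded loop would need measuring.

```julia
# Reverse `v` in place. The front half is split into contiguous chunks and
# each chunk is swapped with its mirrored block at the back, so every
# iteration of the threaded loop touches its own two blocks of memory.
function chunked_reverse!(v::Vector, nchunks::Integer = Threads.nthreads())
    n = length(v)
    half = div(n, 2)
    half == 0 && return v
    chunk = cld(half, nchunks)          # elements per chunk, rounded up
    Threads.@threads for c in 1:nchunks
        lo = (c - 1) * chunk + 1
        hi = min(c * chunk, half)       # empty range if this chunk is past the middle
        for i in lo:hi
            j = n - i + 1
            @inbounds v[i], v[j] = v[j], v[i]
        end
    end
    return v
end

v = collect(1:10)
println(chunked_reverse!(v, 3) == collect(10:-1:1))  # prints true
```

Start Julia with JULIA_NUM_THREADS set to more than 1 to see any effect; with a single thread it degrades to an ordinary swap loop.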

The one I don’t understand is nbody: it takes 3.2 sec to run on my computer, and the website says that the benchmark takes 22 sec. I don’t think my computer should be that much faster…

EDIT: The code here: https://github.com/KristofferC/BenchmarksGame.jl/blob/master/nbody/nbody-fast.jl is even faster (and simpler to understand) at 2.8sec.

Which benchmark? You probably made the same mistake I did, running one with a shorter test file. As I did for the second benchmark in my comment above.

Here: n-body Julia #3 program (Benchmarks Game)

I get the same output so I think I’m running the right thing.

It seems the summary page they have hasn’t been updated. On my decade-old Core Duo laptop (with e.g. a browser running), I get slower times than you, but way faster than on the web page (for Julia 1.4-DEV; I also tried 1.1.0):

real 0m13,142s

See the discussion at:

And also in the linked issue. The big obvious difference between the benchmark CPU and modern CPUs is AVX instructions, but it evidently doesn’t end there.

@Palli since you happen to have a core2 cpu, would you mind running the julia-4 benchmark as well to compare with julia-3? And possibly even send me the @code_native it’s generating for both?

Why are there a bunch of explicit VecElement’s there? Tuple of VecElements are so that things are passed to LLVM as LLVM-vectors instead of LLVM-arrays and then you can write llvmcall code on them, but they have almost no purpose on their own.

On my machine at least, the nbody-fast.jl code in your repo is faster, and as a bonus it is simpler and cleaner. Not sure if that would be true on whatever machine these benchmarks are being run on, as noted by @non-Jedi.

Yes, I think now might be a good time to try all that multithreading stuff if for nothing else than to put it to the test before it’s released!

this could also benefit from new threading runtime to start working on a specific sequence while still reading input

This was the version I was thinking of, but you might be right that the other option might be faster. We’ll need to test.

fasta: obvious opportunity to parallelize, but I haven’t taken the time to grok what the benchmark is actually doing yet.

Well, maybe I can help with that, and if anyone else wants to play along and try their best, even better. Here we go, my current implementation:

# Just FYI, I completely restructured the code compared to the
# version on the website, not sure if it runs like this.

const OUT = stdout
const LINE_LENGTH = 60

# First task: just repeat this string over and over with \n
# in the right places
const ALU = codeunits(
    "GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGG" *
    "GAGGCCGAGGCGGGCGGATCACCTGAGGTCAGGAGTTCGAGA" *
    "CCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAAT" *
    "ACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTGTAATCCCA" *
    "GCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGG" *
    "AGGCGGAGGTTGCAGTGAGCCGAGATCGCGCCACTGCACTCC" *
    "AGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAA")

# I want to always be able to take the next 60 chars of that
# string (without going over the edge) so I repeat it at the end.
function repeat_fasta(str, n)
    # This is a reaaally ugly way of repeating a string, but
    # it was consistently faster than nicer alternatives, so ... :shrug:
    len = length(str)
    src = Vector{UInt8}(undef, len + LINE_LENGTH)
    for i in 1:len
        @inbounds src[i] = str[i]
    end
    for i in 1:LINE_LENGTH
        @inbounds src[i+len] = str[i]
    end

    # Well, write the required amount of chars of that string,
    # skip to the beginning of the string if you went too far.
    i = 1
    lines, rest = divrem(n, LINE_LENGTH)
    for _ in 1:lines
        write(OUT, @inbounds @view src[i:i+LINE_LENGTH-1])
        write(OUT, '\n')
        i += LINE_LENGTH
        i > len && (i -= len)
    end
    # Only emit the partial last line if there is one; otherwise an
    # empty line would be written when n is a multiple of LINE_LENGTH.
    if rest > 0
        write(OUT, @inbounds @view src[i:i+rest-1])
        write(OUT, '\n')
    end
end

# That was easy, now the more interesting part.
# We have an alphabet of chars with associated probabilities and
# we have to pick n chars from that alphabet according to a LCG
# random number generator.
# This inherently means we can't really parallelize the RNG because
# the numbers need to be in the right order. Of course there are
# opportunities for playing with the sweet new threads elsewhere.

# This is the RNG as defined on the website. Not sure there is
# much opportunity here because it needs to be pretty much exactly
# like this. It has to be defined before the alphabets, because
# building them below already uses `IM`.
const IM = Int32(139968)
const IA = Int32(3877)
const IC = Int32(29573)
const last_rnd = Ref(Int32(42))
gen_random() = (last_rnd[] = (last_rnd[] * IA + IC) % IM)

# The RNG works with `Int32`s and the probabilities are given as
# `Float64`s so I scale the [0, 1) range of accumulated probabilities
# up to the [0, IM) range of the RNG and store that with the
# corresponding char. Store the Aminoacids as a const `Tuple`.
struct Aminoacids
    c::UInt8
    p::Int32
end
function make_Aminoacids(cs, ps)
    cum_p = 0.0
    tmp = Aminoacids[]
    for (c, p) in zip(cs, ps)
        cum_p += p * IM
        # the comparison is with Int32, so use it here as well
        push!(tmp, Aminoacids(c, floor(Int32, cum_p)))
    end
    return (tmp...,)
end

# create Aminoacids with accumulated probabilities and make
# the result a constant
const IUB = let
    iub_c = b"acgtBDHKMNRSVWY"
    iub_p = [0.27, 0.12, 0.12, 0.27, 0.02, 0.02, 0.02,
             0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
    make_Aminoacids(iub_c, iub_p)
end
const HOMOSAPIENS = let
    homosapiens_c = b"acgt"
    homosapiens_p = [0.3029549426680, 0.1979883004921,
                     0.1975473066391, 0.3015094502008]
    make_Aminoacids(homosapiens_c, homosapiens_p)
end

# After we generated a new number we need to pick out the
# corresponding Aminoacid. Some implementations use a binary
# search but I found that simply going through the Tuple seems
# to be faster.
function random_char(genelist)
    r = gen_random()
    for aminoacid in genelist
        aminoacid.p >= r && return aminoacid.c
    end
    return genelist[end].c
end

# A little helper method. I need to fill a vector with chars
# to print out but that can be shorter than the line length
# (and I must not generate more chars than needed because
# that would leave the RNG in the wrong state for the next
# run).
function fillrand!(line, genelist, n)
    for i in 1:n
        @inbounds line[i] = random_char(genelist)
    end
end

# Not much to see here, just fill lines until we got the required
# amount of chars printed out.
function random_fasta(genelist, n)
    line = Vector{UInt8}(undef, LINE_LENGTH+1)
    line[end] = UInt8('\n')
    while n > LINE_LENGTH
        fillrand!(line, genelist, LINE_LENGTH)
        write(OUT, line)
        n -= LINE_LENGTH
    end
    fillrand!(line, genelist, n)
    line[n+1] = UInt8('\n')
    write(OUT, @view line[1:n+1])
end

# Simply calling everything. Do two random ones with different
# alphabets.
function main(n)
    write(OUT, ">ONE Homo sapiens alu\n")
    repeat_fasta(ALU, 2n)
    write(OUT, ">TWO IUB ambiguity codes\n")
    random_fasta(IUB, 3n)
    write(OUT, ">THREE Homo sapiens frequency\n")
    random_fasta(HOMOSAPIENS, 5n)
end
main(parse(Int, ARGS[1]))
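For comparison, here is a sketch of the binary-search lookup that the comment on `random_char` mentions, next to the linear scan. It is a standalone illustration, not a drop-in replacement: `thresholds`, `pick_linear` and `pick_binary` are names invented here, and it uses a plain `Vector{Int32}` of thresholds rather than the `Tuple` of structs from the code above. With only 4 to 15 entries, the linear scan may well stay faster, as the post found.

```julia
const IM = Int32(139968)

# Cumulative thresholds scaled to the RNG range, built as in the post.
function thresholds(ps)
    cum_p = 0.0
    out = Int32[]
    for p in ps
        cum_p += p * IM
        push!(out, floor(Int32, cum_p))
    end
    return out
end

# Linear scan: index of the first threshold that is >= r.
function pick_linear(ts, r)
    for (i, t) in enumerate(ts)
        t >= r && return i
    end
    return length(ts)
end

# Binary search for the same index, clamped to the last entry.
pick_binary(ts, r) = min(searchsortedfirst(ts, r), length(ts))

iub_p = [0.27, 0.12, 0.12, 0.27, 0.02, 0.02, 0.02,
         0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
ts = thresholds(iub_p)

# Both lookups agree for every value the RNG can produce.
println(all(pick_linear(ts, r) == pick_binary(ts, r) for r in Int32(0):(IM - Int32(1))))
```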

Now, the main opportunity for parallelization would be to let the RNG generate numbers in the background (into a Channel, I think? I haven’t worked with those yet), let a second thread convert these numbers into the corresponding chars, and let a final thread print everything out. Of course, with the thread overhead this might be too heavy, but at least that would be my starting point.
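A minimal sketch of the first stage of such a pipeline, assuming Julia 1.3 or later (for the `spawn = true` keyword of the `Channel` constructor). The names `rng_channel` and `collect_random`, the batch size and the buffer of 4 batches are all choices made for this example. The RNG stays strictly sequential on the producer side and hands out batches, so the order of the numbers is preserved; sending single numbers through the channel would almost certainly be dominated by synchronization overhead.

```julia
const IM = Int32(139968)
const IA = Int32(3877)
const IC = Int32(29573)

# Producer: run the LCG on its own task and put batches of numbers
# into a buffered channel. The channel is closed when the task ends.
function rng_channel(n::Integer, batch::Integer; seed::Int32 = Int32(42))
    return Channel{Vector{Int32}}(4; spawn = true) do ch
        state = seed
        remaining = n
        while remaining > 0
            m = min(batch, remaining)
            buf = Vector{Int32}(undef, m)
            for i in 1:m
                state = (state * IA + IC) % IM
                buf[i] = state
            end
            put!(ch, buf)
            remaining -= m
        end
    end
end

# Consumer: drain the channel in order. A real pipeline would map each
# batch to chars and pass it on to the writer instead of collecting.
function collect_random(n, batch)
    out = Int32[]
    for buf in rng_channel(n, batch)
        append!(out, buf)
    end
    return out
end

println(first(collect_random(1000, 64)))  # prints 52439
```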

It’s slower with export JULIA_NUM_THREADS=4, for both version 4 and version 3 (my computer has 2 cores).

-O3 seems to always be slightly slower than -O1 for me:

time ~/julia-1.4.0-DEV-8ebe5643ca/bin/julia -O1 -- nbody.julia-4.julia 50000000

real 0m19,348s
user 0m19,144s
sys 0m0,212s

time ~/julia-1.4.0-DEV-8ebe5643ca/bin/julia -O3 -- nbody.julia-4.julia 50000000

real 0m21,855s
user 0m21,656s
sys 0m0,212s

vs.:

I seem to get for version 3:

time ~/julia-1.4.0-DEV-8ebe5643ca/bin/julia -O3 -- nbody.julia-3.julia 50000000

real 0m13,319s
user 0m13,028s
sys 0m0,236s

export JULIA_NUM_THREADS=1
time ~/julia-1.4.0-DEV-8ebe5643ca/bin/julia -O3 -- nbody.julia-3.julia 50000000

real 0m13,158s
user 0m13,008s
sys 0m0,208s

--cpu-target=core2 doesn’t seem to change much, as I guess it’s the default.

For version 4 with -O3:

julia> @code_native main(stdout, (50000000), 0.01)
	.text
; ┌ @ REPL[13]:2 within `main'
	pushq	%rbp
	movq	%rsp, %rbp
; │ @ REPL[13]:32 within `main'
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	subq	$472, %rsp              # imm = 0x1D8
	xorpd	%xmm1, %xmm1
	movapd	%xmm1, -80(%rbp)
	movsd	%xmm0, -56(%rbp)
	movq	%rsi, %rbx
	movq	%rdi, %r13
	movapd	%xmm1, -96(%rbp)
	movq	%fs:0, %rax
	movq	$4, -96(%rbp)
	movq	-15712(%rax), %rcx
	movq	%rcx, -88(%rbp)
	movabsq	$140581406747936, %r12  # imm = 0x7FDBA8CFAD20
	leaq	-96(%rbp), %rcx
	movq	%rcx, -15712(%rax)
	movapd	%xmm1, -496(%rbp)
	movapd	%xmm1, -512(%rbp)
	movabsq	$140581334880848, %rcx  # imm = 0x7FDBA4871250
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -480(%rbp)
	movabsq	$140581334880864, %rcx  # imm = 0x7FDBA4871260
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -464(%rbp)
	movabsq	$140581334880880, %rcx  # imm = 0x7FDBA4871270
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -448(%rbp)
	movabsq	$140581334880896, %rcx  # imm = 0x7FDBA4871280
	movapd	(%rcx), %xmm0
	movapd	%xmm0, -432(%rbp)
	movabsq	$140581334881056, %rcx  # imm = 0x7FDBA4871320
	xorpd	%xmm0, %xmm0
	movhpd	(%rcx), %xmm0           # xmm0 = xmm0[0],mem[0]
	movapd	%xmm0, -416(%rbp)
	movabsq	$140581334880912, %rcx  # imm = 0x7FDBA4871290
	movapd	(%rcx), %xmm0
	movapd	%xmm0, -400(%rbp)
	movabsq	$140581334881064, %rcx  # imm = 0x7FDBA4871328
	xorpd	%xmm0, %xmm0
	movhpd	(%rcx), %xmm0           # xmm0 = xmm0[0],mem[0]
	movapd	%xmm0, -384(%rbp)
	movabsq	$140581334880928, %rcx  # imm = 0x7FDBA48712A0
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -368(%rbp)
	movabsq	$140581334881072, %rcx  # imm = 0x7FDBA4871330
	movsd	(%rcx), %xmm0           # xmm0 = mem[0],zero
	movaps	%xmm0, -352(%rbp)
	movabsq	$140581334880944, %rcx  # imm = 0x7FDBA48712B0
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -336(%rbp)
	movabsq	$140581334881080, %rcx  # imm = 0x7FDBA4871338
	movsd	(%rcx), %xmm0           # xmm0 = mem[0],zero
	movaps	%xmm0, -320(%rbp)
	movabsq	$140581334880960, %rcx  # imm = 0x7FDBA48712C0
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -304(%rbp)
	movabsq	$140581334880976, %rcx  # imm = 0x7FDBA48712D0
	movapd	(%rcx), %xmm0
	movapd	%xmm0, -288(%rbp)
	movabsq	$140581334881088, %rcx  # imm = 0x7FDBA4871340
	xorpd	%xmm0, %xmm0
	movhpd	(%rcx), %xmm0           # xmm0 = xmm0[0],mem[0]
	leaq	-15712(%rax), %r14
	movapd	%xmm0, -272(%rbp)
	movabsq	$140581334880992, %rax  # imm = 0x7FDBA48712E0
	movaps	(%rax), %xmm0
	movaps	%xmm0, -256(%rbp)
	movabsq	$140581334881096, %rax  # imm = 0x7FDBA4871348
	movhpd	(%rax), %xmm1           # xmm1 = xmm1[0],mem[0]
	movapd	%xmm1, -240(%rbp)
	movabsq	$140581334881008, %rax  # imm = 0x7FDBA48712F0
	movaps	(%rax), %xmm0
	movaps	%xmm0, -224(%rbp)
	movabsq	$140581334881104, %rax  # imm = 0x7FDBA4871350
	movsd	(%rax), %xmm0           # xmm0 = mem[0],zero
	movaps	%xmm0, -208(%rbp)
	movabsq	$140581334881024, %rax  # imm = 0x7FDBA4871300
	movaps	(%rax), %xmm0
	movaps	%xmm0, -192(%rbp)
	movabsq	$140581334881112, %rax  # imm = 0x7FDBA4871358
	movsd	(%rax), %xmm0           # xmm0 = mem[0],zero
	movaps	%xmm0, -176(%rbp)
	movabsq	$4566835785178257836, %rax # imm = 0x3F60A8F3531799AC
	movq	%rax, -160(%rbp)
; │┌ @ array.jl:130 within `vect'
; ││┌ @ array.jl:612 within `_array_for'
; │││┌ @ abstractarray.jl:671 within `similar' @ abstractarray.jl:672
; ││││┌ @ boot.jl:413 within `Array' @ boot.jl:404
	leaq	277347728(%r12), %rax
	movabsq	$140581406097168, %rdi  # imm = 0x7FDBA8C5BF10
	movl	$5, %esi
	callq	*%rax
	movq	%rax, %r15
; ││└└└
; ││ @ array.jl:780 within `vect'
	movq	(%r15), %rax
	movq	$-288, %rcx             # imm = 0xFEE0
	xorl	%edx, %edx
	nopl	(%rax)
L608:
	movups	-224(%rbp,%rcx), %xmm0
	movupd	-208(%rbp,%rcx), %xmm1
	movups	-192(%rbp,%rcx), %xmm2
	movups	-176(%rbp,%rcx), %xmm3
	movq	-160(%rbp,%rcx), %rsi
	movups	%xmm0, 288(%rax,%rcx)
	movupd	%xmm1, 304(%rax,%rcx)
	movups	%xmm2, 320(%rax,%rcx)
	movups	%xmm3, 336(%rax,%rcx)
	movq	%rsi, 352(%rax,%rcx)
; ││ @ array.jl:130 within `vect'
; ││┌ @ range.jl:597 within `iterate'
; │││┌ @ promotion.jl:399 within `=='
	testq	%rcx, %rcx
; ││└└
	je	L735
; ││┌ @ tuple.jl:24 within `getindex'
	addq	$72, %rcx
	incq	%rdx
	cmpq	$5, %rdx
	jb	L608
	movabsq	$jl_bounds_error_unboxed_int, %rax
	leaq	-512(%rbp), %rdi
	movl	$6, %edx
	movq	%r12, %rsi
	callq	*%rax
; └└└
; ┌ @ tuple.jl within `main'
L735:
	movq	%r14, -64(%rbp)
	movq	%r15, -72(%rbp)
; └
; ┌ @ REPL[13]:34 within `main'
	movabsq	$julia_energy_16666, %rax
	movq	%r15, %rdi
	callq	*%rax
	movsd	%xmm0, -48(%rbp)
	movabsq	$getbuf, %rax
	callq	*%rax
	movsd	-48(%rbp), %xmm0        # xmm0 = mem[0],zero
	movq	%rax, %r12
; │┌ @ float.jl:553 within `isfinite'
; ││┌ @ float.jl:403 within `-'
	movapd	%xmm0, %xmm1
	subsd	%xmm1, %xmm1
; ││└
; ││┌ @ float.jl:488 within `==' @ float.jl:454
	xorps	%xmm2, %xmm2
; │└└
	ucomisd	%xmm2, %xmm1
	jne	L802
	jnp	L879
; │┌ @ printf.jl:150 within `macro expansion'
; ││┌ @ float.jl:503 within `<' @ float.jl:458
L802:
	ucomisd	%xmm0, %xmm2
; ││└
	movabsq	$140581402197232, %rax  # imm = 0x7FDBA88A3CF0
	movabsq	$jl_system_image_data, %rcx
	cmovbeq	%rax, %rcx
; ││┌ @ float.jl:535 within `isnan'
; │││┌ @ float.jl:456 within `!='
	ucomisd	%xmm0, %xmm0
; ││└└
	movabsq	$jl_system_image_data, %rax
	cmovnpq	%rcx, %rax
; │└
; │┌ @ io.jl:179 within `print'
; ││┌ @ io.jl:177 within `write'
; │││┌ @ gcutils.jl:91 within `macro expansion'
; ││││┌ @ string.jl:85 within `sizeof'
	movq	(%rax), %rdx
	movq	%rax, -80(%rbp)
; ││││└
; ││││┌ @ string.jl:81 within `pointer'
; │││││┌ @ pointer.jl:59 within `unsafe_convert'
; ││││││┌ @ pointer.jl:159 within `+'
	leaq	8(%rax), %rsi
; ││││└└└
	movabsq	$unsafe_write, %rax
	movq	%r13, %rdi
	callq	*%rax
	jmp	L1051
; └└└└
; ┌ @ gcutils.jl within `main'
L879:
	movq	%r13, %r14
; └
; ┌ @ REPL[13]:34 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ array.jl:214 within `length'
	movq	8(%r12), %rax
; ││└
; ││┌ @ int.jl:52 within `-'
	decq	%rax
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ int.jl:49
	cmpq	$10, %rax
	movl	$9, %edx
; └└
; ┌ @ printf.jl:992 within `main'
	cmovlq	%rax, %rdx
	movq	%r12, -80(%rbp)
; │ @ printf.jl:993 within `main'
	movabsq	$grisu, %rax
	leaq	-144(%rbp), %rdi
	movl	$2, %esi
	movq	%r12, %rcx
	callq	*%rax
; └
; ┌ @ REPL[13]:34 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:994
; ││┌ @ promotion.jl:399 within `=='
	movq	-144(%rbp), %r13
	testq	%r13, %r13
; ││└
	je	L1423
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r13d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r13
	jne	L1514
; │││││ @ boot.jl:580 within `checked_trunc_sint'
	movq	-136(%rbp), %rdx
; │││││ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%edx, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %rdx
	jne	L1551
; ││└└└
	movb	-128(%rbp), %al
; │└
	testb	%al, %al
	je	L1016
; │┌ @ char.jl:229 within `print'
; ││┌ @ io.jl:647 within `write'
L988:
	movabsq	$write, %rax
	movl	$45, %esi
	movq	%r14, %rdi
	movq	%rdx, -48(%rbp)
	callq	*%rax
	movq	-48(%rbp), %rdx
; │└└
L1016:
	movabsq	$print_fixed, %rax
	movl	$9, %esi
	movl	$1, %r8d
	movq	%r14, %rdi
	movl	%r13d, %ecx
	movq	%r12, %r9
	callq	*%rax
	movq	%r14, %r13
; │┌ @ char.jl:229 within `print'
; ││┌ @ io.jl:647 within `write'
L1051:
	movabsq	$write, %r12
	movl	$10, %esi
	movq	%r13, %rdi
	callq	*%r12
; │└└
; │ @ REPL[13]:35 within `main'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:277 within `UnitRange'
; │││┌ @ range.jl:282 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:424 within `<='
	testq	%rbx, %rbx
; │└└└└└
	jle	L1104
	movabsq	$"julia_next!_16667", %r14
	nop
; │ @ REPL[13]:36 within `main'
L1088:
	movq	%r15, %rdi
	movsd	-56(%rbp), %xmm0        # xmm0 = mem[0],zero
	callq	*%r14
; │┌ @ range.jl:597 within `iterate'
; ││┌ @ promotion.jl:399 within `=='
	decq	%rbx
; │└└
	jne	L1088
; │ @ REPL[13]:38 within `main'
L1104:
	movq	%r15, %rdi
	movabsq	$julia_energy_16666, %rax
	callq	*%rax
	movsd	%xmm0, -56(%rbp)
	movabsq	$getbuf, %rax
	callq	*%rax
	movsd	-56(%rbp), %xmm0        # xmm0 = mem[0],zero
	movq	%rax, %rbx
; │┌ @ float.jl:553 within `isfinite'
; ││┌ @ float.jl:403 within `-'
	movapd	%xmm0, %xmm1
	subsd	%xmm1, %xmm1
; ││└
; ││┌ @ float.jl:488 within `==' @ float.jl:454
	xorps	%xmm2, %xmm2
; │└└
	ucomisd	%xmm2, %xmm1
	jne	L1163
	jnp	L1244
; │┌ @ printf.jl:150 within `macro expansion'
; ││┌ @ float.jl:503 within `<' @ float.jl:458
L1163:
	ucomisd	%xmm0, %xmm2
; ││└
	movabsq	$140581402197232, %rax  # imm = 0x7FDBA88A3CF0
	movabsq	$jl_system_image_data, %rcx
	cmovbeq	%rax, %rcx
; ││┌ @ float.jl:535 within `isnan'
; │││┌ @ float.jl:456 within `!='
	ucomisd	%xmm0, %xmm0
; ││└└
	movabsq	$jl_system_image_data, %rax
	cmovnpq	%rcx, %rax
; │└
; │┌ @ io.jl:179 within `print'
; ││┌ @ io.jl:177 within `write'
; │││┌ @ gcutils.jl:91 within `macro expansion'
; ││││┌ @ string.jl:85 within `sizeof'
	movq	(%rax), %rdx
	movq	%rax, -80(%rbp)
; ││││└
; ││││┌ @ string.jl:81 within `pointer'
; │││││┌ @ pointer.jl:59 within `unsafe_convert'
; ││││││┌ @ pointer.jl:159 within `+'
	leaq	8(%rax), %rsi
; ││││└└└
	movabsq	$unsafe_write, %rax
	movq	%r13, %rdi
	callq	*%rax
	movq	-64(%rbp), %rbx
	jmp	L1390
; │└└└
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ array.jl:214 within `length'
L1244:
	movq	8(%rbx), %rax
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ int.jl:52
	decq	%rax
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ operators.jl:294 within `>'
; │││┌ @ int.jl:49 within `<'
	cmpq	$10, %rax
	movl	$9, %edx
; ││└└
	cmovlq	%rax, %rdx
	movq	%rbx, -80(%rbp)
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:993
	movabsq	$grisu, %rax
	leaq	-120(%rbp), %rdi
	movl	$2, %esi
	movq	%rbx, %rcx
	callq	*%rax
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:994
; ││┌ @ promotion.jl:399 within `=='
	movq	-120(%rbp), %r15
	testq	%r15, %r15
; ││└
	je	L1469
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r15d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r15
	jne	L1585
; │││││ @ boot.jl:580 within `checked_trunc_sint'
	movq	-112(%rbp), %r14
; │││││ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r14d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r14
	jne	L1622
; ││└└└
	movb	-104(%rbp), %al
; │└
	testb	%al, %al
	je	L1351
; │┌ @ char.jl:229 within `print'
; ││┌ @ io.jl:647 within `write'
L1340:
	movl	$45, %esi
	movq	%r13, %rdi
	callq	*%r12
; │└└
L1351:
	movabsq	$print_fixed, %rax
	movl	$9, %esi
	movl	$1, %r8d
	movq	%r13, %rdi
	movl	%r14d, %edx
	movl	%r15d, %ecx
	movq	%rbx, %r9
	callq	*%rax
	movq	-64(%rbp), %rbx
; │┌ @ char.jl:229 within `print'
; ││┌ @ io.jl:647 within `write'
L1390:
	movl	$10, %esi
	movq	%r13, %rdi
	callq	*%r12
	movq	-88(%rbp), %rax
	movq	%rax, (%rbx)
; │└└
	leaq	-40(%rbp), %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
; │ @ REPL[13]:34 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L1423:
	cmpq	$0, 8(%r12)
	je	L1659
	movq	(%r12), %rax
	movb	$48, (%rax)
	movl	$1, %r13d
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:996
	movb	-128(%rbp), %al
	movl	$1, %edx
; │└
	testb	%al, %al
	jne	L988
	jmp	L1016
; │ @ REPL[13]:38 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L1469:
	cmpq	$0, 8(%rbx)
	je	L1697
	movq	(%rbx), %rax
	movb	$48, (%rax)
	movl	$1, %r15d
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:996
	movb	-104(%rbp), %al
	movl	$1, %r14d
; │└
	testb	%al, %al
	jne	L1340
	jmp	L1351
; │ @ REPL[13]:34 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:582 within `checked_trunc_sint'
L1514:
	movabsq	$throw_inexacterror, %rax
	movabsq	$140581378234496, %rdi  # imm = 0x7FDBA71C9880
	movabsq	$jl_system_image_data, %rsi
	movq	%r13, %rdx
	callq	*%rax
	ud2
L1551:
	movabsq	$throw_inexacterror, %rax
	movabsq	$140581378234496, %rdi  # imm = 0x7FDBA71C9880
	movabsq	$jl_system_image_data, %rsi
	callq	*%rax
	ud2
; │└└└└
; │ @ REPL[13]:38 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:582 within `checked_trunc_sint'
L1585:
	movabsq	$throw_inexacterror, %rax
	movabsq	$140581378234496, %rdi  # imm = 0x7FDBA71C9880
	movabsq	$jl_system_image_data, %rsi
	movq	%r15, %rdx
	callq	*%rax
	ud2
L1622:
	movabsq	$throw_inexacterror, %rax
	movabsq	$140581378234496, %rdi  # imm = 0x7FDBA71C9880
	movabsq	$jl_system_image_data, %rsi
	movq	%r14, %rdx
	callq	*%rax
	ud2
; │└└└└
; │ @ REPL[13]:34 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L1659:
	movq	%rsp, %rax
	leaq	-16(%rax), %rsi
	movq	%rsi, %rsp
	movq	$1, -16(%rax)
	movabsq	$jl_bounds_error_ints, %rax
	movl	$1, %edx
	movq	%r12, %rdi
	callq	*%rax
; │└└
; │ @ REPL[13]:38 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L1697:
	movq	%rsp, %rax
	leaq	-16(%rax), %rsi
	movq	%rsi, %rsp
	movq	$1, -16(%rax)
	movabsq	$jl_bounds_error_ints, %rax
	movl	$1, %edx
	movq	%rbx, %rdi
	callq	*%rax
	nopw	(%rax,%rax)
; └└└
1 Like

I hit the 32000-character limit (not 32768; I was 100 letters over, so I’m posting separately).

For version 3 with -O3:

julia> @code_native NBody.perf_nbody(50000000)
	.text
; ┌ @ REPL[1]:132 within `perf_nbody'
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	subq	$200, %rsp
	movq	%rdi, %r14
	xorps	%xmm0, %xmm0
	movaps	%xmm0, -144(%rbp)
	movaps	%xmm0, -160(%rbp)
	movaps	%xmm0, -176(%rbp)
	movq	$0, -128(%rbp)
	movq	%fs:0, %rax
; │┌ @ REPL[1]:119 within `initbody'
; ││┌ @ REPL[1]:17 within `Body'
	movq	$10, -176(%rbp)
	movq	-15712(%rax), %rcx
	movq	%rcx, -168(%rbp)
	leaq	-176(%rbp), %rcx
	movq	%rcx, -15712(%rax)
	leaq	-15712(%rax), %r13
	movabsq	$jl_gc_pool_alloc, %rbx
	movl	$1520, %esi             # imm = 0x5F0
	movl	$96, %edx
	movq	%r13, %rdi
	callq	*%rbx
	movq	%rax, %r12
	movabsq	$139695149897424, %r15  # imm = 0x7F0D4FC956D0
	movq	%r15, -8(%r12)
	movabsq	$-4631240860977730576, %rax # imm = 0xBFBA86F96C25EBF0
	movq	%rax, 16(%r12)
	movabsq	$139695052337072, %rax  # imm = 0x7F0D49F8AFB0
	movaps	(%rax), %xmm0
	movaps	%xmm0, (%r12)
	movabsq	$-4640446117579192555, %rax # imm = 0xBF99D2D79A5A0715
	movq	%rax, 48(%r12)
	movabsq	$139695052337088, %rax  # imm = 0x7F0D49F8AFC0
	movaps	(%rax), %xmm0
	movaps	%xmm0, 32(%r12)
	movabsq	$4585593052079010776, %rax # imm = 0x3FA34C95D9AB33D8
	movq	%rax, 64(%r12)
	movq	%r12, -160(%rbp)
; │└└
; │ @ REPL[1]:140 within `perf_nbody'
; │┌ @ REPL[1]:119 within `initbody'
; ││┌ @ REPL[1]:17 within `Body'
	movl	$1520, %esi             # imm = 0x5F0
	movl	$96, %edx
	movq	%r13, %rdi
	callq	*%rbx
	movq	%rbx, %rcx
	movq	%rax, %rbx
	movq	%r15, -8(%rbx)
	movabsq	$-4622431185293064580, %rax # imm = 0xBFD9D353E1EB467C
	movq	%rax, 16(%rbx)
	movabsq	$139695052337104, %rax  # imm = 0x7F0D49F8AFD0
	movaps	(%rax), %xmm0
	movaps	%xmm0, (%rbx)
	movabsq	$4576004977915405236, %rax # imm = 0x3F813C485F1123B4
	movq	%rax, 48(%rbx)
	movabsq	$139695052337120, %rax  # imm = 0x7F0D49F8AFE0
	movaps	(%rax), %xmm0
	movaps	%xmm0, 32(%rbx)
	movabsq	$4577659745833829943, %rax # imm = 0x3F871D490D07C637
	movq	%rax, 64(%rbx)
	movq	%rbx, -152(%rbp)
; │└└
; │ @ REPL[1]:148 within `perf_nbody'
; │┌ @ REPL[1]:119 within `initbody'
; ││┌ @ REPL[1]:17 within `Body'
	movl	$1520, %esi             # imm = 0x5F0
	movl	$96, %edx
	movq	%r13, %rdi
	callq	*%rcx
	movq	%r15, -8(%rax)
	movabsq	$-4626158513131520608, %rcx # imm = 0xBFCC9557BE257DA0
	movq	%rcx, 16(%rax)
	movabsq	$139695052337136, %rcx  # imm = 0x7F0D49F8AFF0
	movaps	(%rcx), %xmm0
	movaps	%xmm0, (%rax)
	movabsq	$-4645973824767902084, %rcx # imm = 0xBF862F6BFAF23E7C
	movq	%rcx, 48(%rax)
	movabsq	$139695052337152, %rcx  # imm = 0x7F0D49F8B000
	movaps	(%rcx), %xmm0
	movaps	%xmm0, 32(%rax)
	movabsq	$4565592097032511155, %rcx # imm = 0x3F5C3DD29CF41EB3
	movq	%rcx, 64(%rax)
	movq	%rax, -104(%rbp)
	movq	%rax, -144(%rbp)
; │└└
; │ @ REPL[1]:156 within `perf_nbody'
; │┌ @ REPL[1]:119 within `initbody'
; ││┌ @ REPL[1]:17 within `Body'
	movl	$1520, %esi             # imm = 0x5F0
	movl	$96, %edx
	movq	%r13, %rdi
	movabsq	$jl_gc_pool_alloc, %rax
	callq	*%rax
	movq	%r15, -8(%rax)
	movabsq	$4595626498235032896, %rcx # imm = 0x3FC6F1F393ABE540
	movq	%rcx, 16(%rax)
	movabsq	$139695052337168, %rcx  # imm = 0x7F0D49F8B010
	movaps	(%rcx), %xmm0
	movaps	%xmm0, (%rax)
	movabsq	$-4638202354754755082, %rcx # imm = 0xBFA1CB88587665F6
	movq	%rcx, 48(%rax)
	movabsq	$139695052337184, %rcx  # imm = 0x7F0D49F8B020
	movaps	(%rcx), %xmm0
	movaps	%xmm0, 32(%rax)
	movabsq	$4566835785178257836, %rcx # imm = 0x3F60A8F3531799AC
	movq	%rcx, 64(%rax)
	movq	%rax, -48(%rbp)
	movq	%rax, -136(%rbp)
; │└└
; │ @ REPL[1]:164 within `perf_nbody'
; │┌ @ REPL[1]:119 within `initbody'
; ││┌ @ REPL[1]:17 within `Body'
	movl	$1520, %esi             # imm = 0x5F0
	movl	$96, %edx
	movq	%r13, -184(%rbp)
	movq	%r13, %rdi
	movabsq	$jl_gc_pool_alloc, %rax
	callq	*%rax
	movq	%rax, %r13
	movq	%r15, -8(%r13)
	xorps	%xmm0, %xmm0
	movaps	%xmm0, (%r13)
	movq	$0, 16(%r13)
	movaps	%xmm0, 32(%r13)
	movq	$0, 48(%r13)
	movabsq	$4630752910647379422, %rax # imm = 0x4043BD3CC9BE45DE
	movq	%rax, 64(%r13)
	movq	%r13, -128(%rbp)
; │└└
; │ @ REPL[1]:166 within `perf_nbody'
; │┌ @ array.jl:130 within `vect'
; ││┌ @ array.jl:612 within `_array_for'
; │││┌ @ abstractarray.jl:671 within `similar' @ abstractarray.jl:672
; ││││┌ @ boot.jl:413 within `Array' @ boot.jl:404
	movabsq	$jl_system_image_data, %rax
	leaq	214180000(%rax), %rax
	movabsq	$139695149907168, %rdi  # imm = 0x7F0D4FC97CE0
	movl	$5, %esi
	callq	*%rax
	movq	%rax, %r15
	movzwl	16(%r15), %eax
	andl	$3, %eax
	cmpl	$3, %eax
; │└└└└
; │┌ @ tuple.jl:24 within `vect'
	jne	L892
; │└
; │┌ @ array.jl:130 within `vect'
; ││┌ @ array.jl:780 within `setindex!'
	movq	(%r15), %rcx
	movq	40(%r15), %rdi
	movq	-8(%rdi), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	jne	L740
	testb	$1, -8(%r13)
	je	L2237
L740:
	movq	%r13, (%rcx)
	movq	40(%r15), %rdi
	movq	-8(%rdi), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	jne	L772
	testb	$1, -8(%r12)
	je	L2262
L772:
	movq	%r12, 8(%rcx)
	movq	40(%r15), %rdi
	movq	-8(%rdi), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movabsq	$jl_system_image_data, %r12
	jne	L813
	testb	$1, -8(%rbx)
	je	L2285
L813:
	movq	%rbx, 16(%rcx)
	movq	40(%r15), %rdi
	movq	-8(%rdi), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movq	-104(%rbp), %rbx
	jne	L848
	testb	$1, -8(%rbx)
	je	L2308
L848:
	movq	%rbx, 24(%rcx)
	movq	40(%r15), %rdi
	movq	-8(%rdi), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movq	-48(%rbp), %rbx
	jne	L883
	testb	$1, -8(%rbx)
	je	L2331
L883:
	movq	%rbx, 32(%rcx)
; │└└
; │ @ REPL[1]:168 within `perf_nbody'
	jmp	L1050
; │ @ REPL[1]:166 within `perf_nbody'
; │┌ @ array.jl:130 within `vect'
; ││┌ @ array.jl:780 within `setindex!'
L892:
	movq	-8(%r15), %rax
	movq	(%r15), %rcx
	andl	$3, %eax
	cmpq	$3, %rax
	jne	L919
	testb	$1, -8(%r13)
	je	L2354
L919:
	movq	%r13, (%rcx)
	movq	-8(%r15), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	jne	L947
	testb	$1, -8(%r12)
	je	L2382
L947:
	movq	%r12, 8(%rcx)
	movq	-8(%r15), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movabsq	$jl_system_image_data, %r12
	jne	L984
	testb	$1, -8(%rbx)
	je	L2408
L984:
	movq	%rbx, 16(%rcx)
	movq	-8(%r15), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movq	-104(%rbp), %rbx
	jne	L1015
	testb	$1, -8(%rbx)
	je	L2434
L1015:
	movq	%rbx, 24(%rcx)
	movq	-8(%r15), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movq	-48(%rbp), %rbx
	jne	L1046
	testb	$1, -8(%rbx)
	je	L2460
L1046:
	movq	%rbx, 32(%rcx)
; └└└
; ┌ @ array.jl within `perf_nbody'
L1050:
	movq	%r15, -160(%rbp)
; └
; ┌ @ REPL[1]:168 within `perf_nbody'
	movabsq	$julia_init_sun_16583, %rax
	movq	%r15, %rdi
	callq	*%rax
	fstp	%st(0)
; │ @ REPL[1]:170 within `perf_nbody'
	movq	(%r12), %rbx
	movq	%rbx, -136(%rbp)
	movabsq	$julia_energy_16584, %rax
	movq	%r15, %rdi
	callq	*%rax
	movsd	%xmm0, -48(%rbp)
	movabsq	$getbuf, %rax
	callq	*%rax
	movsd	-48(%rbp), %xmm0        # xmm0 = mem[0],zero
	movq	%rax, %r13
; │┌ @ float.jl:553 within `isfinite'
; ││┌ @ float.jl:403 within `-'
	movapd	%xmm0, %xmm1
	subsd	%xmm1, %xmm1
; ││└
; ││┌ @ float.jl:488 within `==' @ float.jl:454
	xorps	%xmm2, %xmm2
; │└└
	ucomisd	%xmm2, %xmm1
	jne	L1144
	jnp	L1234
; │┌ @ printf.jl:150 within `macro expansion'
; ││┌ @ float.jl:503 within `<' @ float.jl:458
L1144:
	ucomisd	%xmm0, %xmm2
; ││└
	movabsq	$139695111274064, %rax  # imm = 0x7F0D4D7BFE50
	movabsq	$jl_system_image_data, %rcx
	cmovbeq	%rax, %rcx
; ││┌ @ float.jl:535 within `isnan'
; │││┌ @ float.jl:456 within `!='
	ucomisd	%xmm0, %xmm0
; ││└└
	movabsq	$jl_system_image_data, %rax
	cmovnpq	%rcx, %rax
; │└
	movq	%rbx, -96(%rbp)
	movq	%rax, -88(%rbp)
	movabsq	$jl_apply_generic, %rax
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	callq	*%rax
	jmp	L1530
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ array.jl:214 within `length'
L1234:
	movq	8(%r13), %rax
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ int.jl:52
	decq	%rax
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ operators.jl:294 within `>'
; │││┌ @ int.jl:49 within `<'
	cmpq	$10, %rax
	movl	$9, %edx
; ││└└
	cmovlq	%rax, %rdx
	movq	%r13, -128(%rbp)
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:993
	movabsq	$grisu, %rax
	leaq	-232(%rbp), %rdi
	movl	$2, %esi
	movq	%r13, %rcx
	callq	*%rax
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:994
; ││┌ @ promotion.jl:399 within `=='
	movq	-232(%rbp), %r12
	testq	%r12, %r12
; ││└
	movq	%r13, -104(%rbp)
	je	L2141
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r12d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r12
	jne	L2486
; │││││ @ boot.jl:580 within `checked_trunc_sint'
	movq	-224(%rbp), %rcx
; │││││ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%ecx, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %rcx
	jne	L2523
; ││└└└
	movb	-216(%rbp), %al
; │└
	testb	%al, %al
	je	L1401
L1346:
	movq	%rbx, -96(%rbp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, -88(%rbp)
	movabsq	$jl_apply_generic, %rax
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
L1401:
	movabsq	$jl_box_int32, %r13
	movl	%ecx, %edi
	callq	*%r13
	movq	%r13, %rcx
	movq	%rax, %r13
	movq	%r13, -144(%rbp)
	movl	%r12d, %edi
	callq	*%rcx
	movq	%rax, -152(%rbp)
	movq	%rbx, -96(%rbp)
	movabsq	$139695095480928, %rcx  # imm = 0x7F0D4C8B0260
	movq	%rcx, -88(%rbp)
	movq	%r13, -80(%rbp)
	movq	%rax, -72(%rbp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, -64(%rbp)
	movq	-104(%rbp), %rax
	movq	%rax, -56(%rbp)
	movabsq	$jl_apply_generic, %rax
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$6, %edx
	callq	*%rax
	movabsq	$jl_system_image_data, %r12
L1530:
	movq	%rbx, -96(%rbp)
	movabsq	$139695095517360, %rax  # imm = 0x7F0D4C8B90B0
	movq	%rax, -88(%rbp)
	movabsq	$jl_apply_generic, %rax
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	callq	*%rax
; │ @ REPL[1]:172 within `perf_nbody'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:277 within `UnitRange'
; │││┌ @ range.jl:282 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:424 within `<='
	testq	%r14, %r14
; │└└└└└
	jle	L1631
	movabsq	$julia_advance_16585, %rbx
	movabsq	$139695052337248, %rax  # imm = 0x7F0D49F8B060
	movsd	(%rax), %xmm0           # xmm0 = mem[0],zero
	movsd	%xmm0, -48(%rbp)
	nopl	(%rax)
; │ @ REPL[1]:173 within `perf_nbody'
L1616:
	movq	%r15, %rdi
	movsd	-48(%rbp), %xmm0        # xmm0 = mem[0],zero
	callq	*%rbx
; │┌ @ range.jl:597 within `iterate'
; ││┌ @ promotion.jl:399 within `=='
	decq	%r14
; │└└
	jne	L1616
; │ @ REPL[1]:175 within `perf_nbody'
L1631:
	movq	(%r12), %r13
	movq	%r13, -144(%rbp)
	movq	%r15, %rdi
	movabsq	$julia_energy_16584, %rax
	callq	*%rax
	movsd	%xmm0, -48(%rbp)
	movabsq	$getbuf, %rax
	callq	*%rax
	movsd	-48(%rbp), %xmm0        # xmm0 = mem[0],zero
	movq	%rax, %r14
; │┌ @ float.jl:553 within `isfinite'
; ││┌ @ float.jl:403 within `-'
	movapd	%xmm0, %xmm1
	subsd	%xmm1, %xmm1
; ││└
; ││ @ float.jl:454 within `isfinite'
	xorps	%xmm2, %xmm2
; │└
	ucomisd	%xmm2, %xmm1
	jne	L1701
	jnp	L1791
; │┌ @ printf.jl:150 within `macro expansion'
; ││┌ @ float.jl:503 within `<' @ float.jl:458
L1701:
	ucomisd	%xmm0, %xmm2
; ││└
	movabsq	$139695111274064, %rax  # imm = 0x7F0D4D7BFE50
	movabsq	$jl_system_image_data, %rcx
	cmovbeq	%rax, %rcx
; ││┌ @ float.jl:535 within `isnan'
; │││┌ @ float.jl:456 within `!='
	ucomisd	%xmm0, %xmm0
; ││└└
	movabsq	$jl_system_image_data, %rax
	cmovnpq	%rcx, %rax
; │└
	movq	%r13, -96(%rbp)
	movq	%rax, -88(%rbp)
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	movabsq	$jl_apply_generic, %rbx
	callq	*%rbx
	jmp	L2070
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ array.jl:214 within `length'
L1791:
	movq	8(%r14), %rax
; ││└
; ││┌ @ int.jl:52 within `-'
	decq	%rax
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ int.jl:49
	cmpq	$10, %rax
	movl	$9, %edx
; └└
; ┌ @ printf.jl:992 within `perf_nbody'
	cmovlq	%rax, %rdx
	movq	%r14, -136(%rbp)
; │ @ printf.jl:993 within `perf_nbody'
	movabsq	$grisu, %rax
	leaq	-208(%rbp), %rdi
	movl	$2, %esi
	movq	%r14, %rcx
	callq	*%rax
; └
; ┌ @ REPL[1]:175 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:994
; ││┌ @ promotion.jl:399 within `=='
	movq	-208(%rbp), %r12
	testq	%r12, %r12
; ││└
	je	L2189
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r12d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r12
	jne	L2560
; │││││ @ boot.jl:580 within `checked_trunc_sint'
	movq	-200(%rbp), %r15
; │││││ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r15d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r15
	jne	L2597
; ││└└└
	movb	-192(%rbp), %al
; │└
	testb	%al, %al
	je	L1951
L1902:
	movq	%r13, -96(%rbp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, -88(%rbp)
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	movabsq	$jl_apply_generic, %rax
	callq	*%rax
L1951:
	movabsq	$jl_box_int32, %rbx
	movl	%r15d, %edi
	callq	*%rbx
	movq	%rbx, %rcx
	movabsq	$jl_apply_generic, %r15
	movq	%rax, %rbx
	movq	%rbx, -152(%rbp)
	movl	%r12d, %edi
	callq	*%rcx
	movq	%rax, -160(%rbp)
	movq	%r13, -96(%rbp)
	movabsq	$139695095480928, %rcx  # imm = 0x7F0D4C8B0260
	movq	%rcx, -88(%rbp)
	movq	%rbx, -80(%rbp)
	movq	%rax, -72(%rbp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, -64(%rbp)
	movq	%r14, -56(%rbp)
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$6, %edx
	callq	*%r15
	movq	%r15, %rbx
L2070:
	movq	%r13, -96(%rbp)
	movabsq	$139695095517360, %rax  # imm = 0x7F0D4C8B90B0
	movq	%rax, -88(%rbp)
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	callq	*%rbx
	movq	-168(%rbp), %rax
	movq	-184(%rbp), %rcx
	movq	%rax, (%rcx)
	leaq	-40(%rbp), %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
; │ @ REPL[1]:170 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L2141:
	cmpq	$0, 8(%r13)
	je	L2634
	movq	(%r13), %rax
	movb	$48, (%rax)
	movl	$1, %r12d
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:996
	movb	-216(%rbp), %al
	movl	$1, %ecx
; │└
	testb	%al, %al
	jne	L1346
	jmp	L1401
; │ @ REPL[1]:175 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L2189:
	cmpq	$0, 8(%r14)
	je	L2672
	movq	(%r14), %rax
	movb	$48, (%rax)
	movl	$1, %r12d
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:996
	movb	-192(%rbp), %al
	movl	$1, %r15d
; │└
	testb	%al, %al
	jne	L1902
	jmp	L1951
; │ @ REPL[1]:166 within `perf_nbody'
; │┌ @ array.jl:130 within `vect'
; ││┌ @ array.jl:780 within `setindex!'
L2237:
	movabsq	$jl_gc_queue_root, %rax
	movq	%rcx, -112(%rbp)
	callq	*%rax
	movq	-112(%rbp), %rcx
	jmp	L740
L2262:
	movabsq	$jl_gc_queue_root, %rax
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L772
L2285:
	movabsq	$jl_gc_queue_root, %rax
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L813
L2308:
	movabsq	$jl_gc_queue_root, %rax
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L848
L2331:
	movabsq	$jl_gc_queue_root, %rax
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L883
L2354:
	movabsq	$jl_gc_queue_root, %rax
	movq	%r15, %rdi
	movq	%rcx, -112(%rbp)
	callq	*%rax
	movq	-112(%rbp), %rcx
	jmp	L919
L2382:
	movabsq	$jl_gc_queue_root, %rax
	movq	%r15, %rdi
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L947
L2408:
	movabsq	$jl_gc_queue_root, %rax
	movq	%r15, %rdi
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L984
L2434:
	movabsq	$jl_gc_queue_root, %rax
	movq	%r15, %rdi
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L1015
L2460:
	movabsq	$jl_gc_queue_root, %rax
	movq	%r15, %rdi
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L1046
; │└└
; │ @ REPL[1]:170 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:582 within `checked_trunc_sint'
L2486:
	movabsq	$throw_inexacterror, %rax
	movabsq	$139695095699584, %rdi  # imm = 0x7F0D4C8E5880
	movabsq	$jl_system_image_data, %rsi
	movq	%r12, %rdx
	callq	*%rax
	ud2
L2523:
	movabsq	$throw_inexacterror, %rax
	movabsq	$139695095699584, %rdi  # imm = 0x7F0D4C8E5880
	movabsq	$jl_system_image_data, %rsi
	movq	%rcx, %rdx
	callq	*%rax
	ud2
; │└└└└
; │ @ REPL[1]:175 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:582 within `checked_trunc_sint'
L2560:
	movabsq	$throw_inexacterror, %rax
	movabsq	$139695095699584, %rdi  # imm = 0x7F0D4C8E5880
	movabsq	$jl_system_image_data, %rsi
	movq	%r12, %rdx
	callq	*%rax
	ud2
L2597:
	movabsq	$throw_inexacterror, %rax
	movabsq	$139695095699584, %rdi  # imm = 0x7F0D4C8E5880
	movabsq	$jl_system_image_data, %rsi
	movq	%r15, %rdx
	callq	*%rax
	ud2
; │└└└└
; │ @ REPL[1]:170 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L2634:
	movq	%rsp, %rax
	leaq	-16(%rax), %rsi
	movq	%rsi, %rsp
	movq	$1, -16(%rax)
	movabsq	$jl_bounds_error_ints, %rax
	movl	$1, %edx
	movq	%r13, %rdi
	callq	*%rax
; │└└
; │ @ REPL[1]:175 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L2672:
	movq	%rsp, %rax
	leaq	-16(%rax), %rsi
	movq	%rsi, %rsp
	movq	$1, -16(%rax)
	movabsq	$jl_bounds_error_ints, %rax
	movl	$1, %edx
	movq	%r14, %rdi
	callq	*%rax
	nopw	%cs:(%rax,%rax)
; └└└
3 Likes

For fasta, I was looking into whether a faster RNG would help (yes, that is probably disallowed by the rules, but along the way I discovered a change that is likely legal). Strangely, the program hangs with the RNG I chose, whatever datatype I tried (even with a cast to 32-bit for type stability).

I noticed that the Julia code uses a signed integer, while the fastest code (currently C++) uses unsigned. I also noticed that the RNG does not return a full 32 bits: the result is reduced modulo 139968, so it fits in 18 bits, with the rest zero-padded.

#const last_rnd = Ref(Int32(42))  # I tried changing this (and the lines above) to UInt32; that works
#gen_random() = (last_rnd[] = (last_rnd[] * IA + IC) % IM)

using RandomNumbers.Xorshifts
r = Xoroshiro128Plus(0x1234567890abcdef)  # with a certain seed. Note that the seed must be non-zero.
gen_random() = UInt32(rand(r, UInt8))

The C++ version, for comparison:

static auto get_random = [] {
        static unsigned last = 42;
        return (last = (last * Config::ia + Config::ic) % Config::im);
    };
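
To make the signed/unsigned comparison concrete, here is a minimal sketch of the same linear congruential step written generically over the state type. The constants are the ones visible in the disassembly in this post (3877, 29573, 139968); the helper names `lcg_step` and `lcg_sequence` are made up for illustration and are not part of the benchmark program.

```julia
# Constants as they appear in the disassembly; the names IM/IA/IC follow
# the commented-out gen_random() above.
const IM = 139968
const IA = 3877
const IC = 29573

# One LCG step, generic over the integer type of the state (hypothetical helper).
lcg_step(last::T) where {T<:Integer} = (last * T(IA) + T(IC)) % T(IM)

# Collect the first n values starting from the seed 42, widened to Int
# so that sequences from different state types can be compared directly.
function lcg_sequence(::Type{T}, n) where {T<:Integer}
    last = T(42)
    out = Vector{Int}(undef, n)
    for i in 1:n
        last = lcg_step(last)
        out[i] = Int(last)
    end
    return out
end

signed_seq = lcg_sequence(Int32, 1000)
unsigned_seq = lcg_sequence(UInt32, 1000)

# The intermediate product stays below 2^31, so both types should agree.
println("same sequence: ", signed_seq == unsigned_seq)

# Number of bits needed to hold the largest possible result, IM - 1.
bits_needed = 32 - leading_zeros(UInt32(IM - 1))
println("bits needed:   ", bits_needed)
```

Since every value is non-negative and smaller than IM, switching the state to UInt32 should change only the generated machine code, not the output.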

Could [any of] you check the timing for the UInt32 change (or look into another RNG)? My change to unsigned alone should have been faster, since the assembly code is shorter, but on my old laptop it was slightly slower (though so was -O3):

Original with -O3

real 0m5,080s
user 0m4,944s
sys 0m0,192s

Original with -O2

real 0m5,076s
user 0m4,936s
sys 0m0,216s

My modified version with UInt32 and -O3

real 0m5,224s
user 0m5,096s
sys 0m0,212s

My modified version with UInt32 and -O2

real 0m5,205s
user 0m5,064s
sys 0m0,212s
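
The whole-program timings above include startup, compilation and I/O, so a difference of a few percent is hard to attribute to the RNG. A rough way to time the generator in isolation is sketched below. It assumes the same constants as in the disassembly; `drain` and the two `gen_random_*` names are hypothetical helpers, and the numbers will vary by machine and Julia version.

```julia
const IM = 139968
const IA = 3877
const IC = 29573

# Separate state for each variant so the two do not interfere.
const last_signed = Ref(Int32(42))
const last_unsigned = Ref(UInt32(42))

gen_random_signed() =
    (last_signed[] = (last_signed[] * Int32(IA) + Int32(IC)) % Int32(IM))
gen_random_unsigned() =
    (last_unsigned[] = (last_unsigned[] * UInt32(IA) + UInt32(IC)) % UInt32(IM))

# Call f() n times and sum the results so the calls cannot be optimized away.
function drain(f, n)
    acc = 0
    for _ in 1:n
        acc += Int(f())
    end
    return acc
end

# Run once with a tiny n so compilation is not included in the timings.
drain(gen_random_signed, 10)
drain(gen_random_unsigned, 10)

n = 10_000_000
t_signed = @elapsed drain(gen_random_signed, n)
t_unsigned = @elapsed drain(gen_random_unsigned, n)

println("Int32:  ", t_signed, " s")
println("UInt32: ", t_unsigned, " s")
```

This only measures the generator loop. It says nothing about whether the change matters end to end, where output formatting may dominate.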

@code_native gen_random() # For UInt32 (slightly shorter than the Int32 version, which follows it):

	.text
; ┌ @ REPL[3]:2 within `gen_random'
	movabsq	$139625163228512, %rcx  # imm = 0x7EFD04418560
; │┌ @ int.jl:54 within `*'
	imull	$3877, (%rcx), %eax     # imm = 0xF25
; │└
; │┌ @ int.jl:53 within `+'
	addl	$29573, %eax            # imm = 0x7385
; │└
; │┌ @ int.jl:231 within `rem'
	imulq	$502748801, %rax, %rdx  # imm = 0x1DF75681
	shrq	$46, %rdx
	imull	$139968, %edx, %edx     # imm = 0x222C0
	subl	%edx, %eax
; │└
; │┌ @ refvalue.jl:33 within `setindex!'
; ││┌ @ Base.jl:21 within `setproperty!'
	movl	%eax, (%rcx)
; │└└
	retq
	nopl	(%rax,%rax)
; └
julia> @code_native gen_random() # For Int32
	.text
; ┌ @ REPL[8]:2 within `gen_random'
	movabsq	$139793992281584, %rcx  # imm = 0x7F2453406DF0
; │┌ @ int.jl:54 within `*'
	imull	$3877, (%rcx), %eax     # imm = 0xF25
; │└
; │┌ @ int.jl:53 within `+'
	addl	$29573, %eax            # imm = 0x7385
; │└
; │┌ @ int.jl:229 within `rem'
	cltq
	imulq	$502748801, %rax, %rdx  # imm = 0x1DF75681
	movq	%rdx, %rsi
	shrq	$63, %rsi
	sarq	$46, %rdx
	addl	%esi, %edx
	imull	$139968, %edx, %edx     # imm = 0x222C0
	subl	%edx, %eax
; │└
; │┌ @ refvalue.jl:33 within `setindex!'
; ││┌ @ Base.jl:21 within `setproperty!'
	movl	%eax, (%rcx)
; │└└
	retq
	nopw	%cs:(%rax,%rax)
; └

For xoroshiro there’s no multiply:

julia> @code_native rand(r, UInt64)
	.text
; ┌ @ xoroshiro128.jl:68 within `rand'
; │┌ @ xoroshiro128.jl:35 within `xorshift_next'
; ││┌ @ xoroshiro128.jl:68 within `getproperty'
	movq	(%rdi), %rcx
	movq	8(%rdi), %rax
; ││└
; ││ @ xoroshiro128.jl:37 within `xorshift_next'
; ││┌ @ int.jl:317 within `xor'
	movq	%rcx, %rdx
	xorq	%rax, %rdx
; │└└
; │┌ @ int.jl:53 within `xorshift_next'
	addq	%rcx, %rax
; │└
; │┌ @ xoroshiro128.jl:38 within `xorshift_next'
; ││┌ @ common.jl:1 within `xorshift_rotl'
; │││┌ @ int.jl:316 within `|'
	rolq	$24, %rcx
; ││└└
; ││┌ @ int.jl:317 within `xor'
	xorq	%rdx, %rcx
; ││└
; ││┌ @ int.jl:446 within `<<' @ int.jl:439
	movq	%rdx, %rsi
	shlq	$16, %rsi
; │└└
; │┌ @ int.jl:317 within `xorshift_next'
	xorq	%rcx, %rsi
; │└
; │┌ @ xoroshiro128.jl:38 within `xorshift_next'
; ││┌ @ Base.jl:21 within `setproperty!'
	movq	%rsi, (%rdi)
; │└└
; │┌ @ int.jl:316 within `xorshift_next'
	rolq	$37, %rdx
; │└
; │┌ @ xoroshiro128.jl:39 within `xorshift_next'
; ││┌ @ Base.jl:21 within `setproperty!'
	movq	%rdx, 8(%rdi)
; │└└
	retq
	nopl	(%rax)
; └
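
For readers who prefer source to assembly, the instructions above can be read back as roughly the following plain-Julia step. This is a sketch reconstructed from the disassembly (rotate by 24, shift by 16, rotate by 37), not the RandomNumbers.jl source. `Xoro128State`, `rotl` and `next!` are hypothetical names, and the seeding shown is arbitrary rather than what `Xoroshiro128Plus(seed)` does.

```julia
# Two 64-bit words of state, matching the loads from (%rdi) and 8(%rdi).
mutable struct Xoro128State
    s0::UInt64
    s1::UInt64
end

# Rotate left by k bits (0 < k < 64); corresponds to the `rolq` instructions.
rotl(x::UInt64, k::Int) = (x << k) | (x >> (64 - k))

# One generator step: returns s0 + s1 (wrapping) and updates the state
# using only xor, shift, rotate and add. There is no multiply or division.
function next!(g::Xoro128State)
    s0, s1 = g.s0, g.s1
    result = s0 + s1                      # addq
    s1 = s1 ⊻ s0                          # xorq
    g.s0 = rotl(s0, 24) ⊻ s1 ⊻ (s1 << 16) # rolq $24, xorq, shlq $16, xorq
    g.s1 = rotl(s1, 37)                   # rolq $37
    return result
end

# Arbitrary non-zero starting state, chosen only for illustration.
g = Xoro128State(UInt64(1), UInt64(2))
for _ in 1:3
    println(repr(next!(g)))
end
```

Compared with the LCG, the step avoids both the multiply and the remainder, which is the point being made about the instruction mix above.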

The list of comparisons on the website just changed. It now compares Julia to C and to SBCL (Common Lisp) instead of Chapel.

2 Likes

I’ve created the JuliaPerf organization and moved the BenchmarksGame.jl repo there: https://github.com/juliaperf/BenchmarksGame.jl. I’ve also invited @non-Jedi as an owner of that organization.

The BenchmarksGame.jl repo supports correctness checking and performance checking, so I feel it would be useful to collect the community’s efforts to improve the benchmarks in that repo. Feel free to maintain the repo as you wish, or ignore it if you feel it isn’t useful.

9 Likes