Julia programs now shown on benchmarks game website

Palli · August 20, 2019, 5:30pm

It seems their summary page hasn’t been updated. On my decade-old Core Duo laptop (with e.g. a browser running) I get a time that is slower than yours, but way faster than the one on the web page (for Julia 1.4-DEV; I also tried 1.1.0):

real	0m13,142s

non-Jedi · August 20, 2019, 5:33pm

See the discussion at:

And also in the linked issue. The big obvious difference between the benchmark CPU and modern CPUs is AVX instructions, but it evidently doesn’t end there.

@Palli since you happen to have a core2 cpu, would you mind running the julia-4 benchmark as well to compare with julia-3? And possibly even send me the @code_native it’s generating for both?

kristoffer.carlsson · August 20, 2019, 6:04pm

Why are there a bunch of explicit VecElements there? Tuples of VecElements exist so that values are passed to LLVM as LLVM vectors instead of LLVM arrays, which lets you write llvmcall code on them, but they have almost no purpose on their own.
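To illustrate the point about VecElement: a homogeneous tuple of VecElements is lowered to an LLVM vector type, and that is what makes it usable from llvmcall. Here is a minimal sketch. The names `Vec4` and `vadd` are hypothetical and are not taken from the benchmark code.

```julia
# Four Float64 lanes. A homogeneous tuple of VecElement maps to an
# LLVM <4 x double> rather than an LLVM array.
const Vec4 = NTuple{4, VecElement{Float64}}

# Hypothetical helper: lane-wise addition written directly in LLVM IR.
# %0 and %1 are the two arguments passed after the type signature.
@inline function vadd(a::Vec4, b::Vec4)
    Base.llvmcall("""
        %res = fadd <4 x double> %0, %1
        ret <4 x double> %res
        """, Vec4, Tuple{Vec4, Vec4}, a, b)
end

a = VecElement.((1.0, 2.0, 3.0, 4.0))
b = VecElement.((10.0, 20.0, 30.0, 40.0))
c = vadd(a, b)

# Unwrap the lanes again to look at plain numbers.
println(map(x -> x.value, c))
```

Without an llvmcall (or a package built on one) consuming them, the VecElement wrappers only add noise, which is the point made above.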

jebej · August 20, 2019, 7:02pm

On my machine at least, the nbody-fast.jl code in your repo is faster, and as a bonus it is simpler and cleaner. Not sure if that would be true on whatever machine these benchmarks are being run on, as noted by @non-Jedi.

Karajan · August 20, 2019, 7:37pm

Yes, I think now might be a good time to try all that multithreading stuff if for nothing else than to put it to the test before it’s released!

this could also benefit from new threading runtime to start working on a specific sequence while still reading input

This was the version I was thinking of, but you might be right that the other option might be faster. We’ll need to test.

fasta: obvious opportunity to parallelize, but I haven’t taken the time to grok what the benchmark is actually doing yet.

Well, maybe I can help with that, and if anyone else wants to play along and try their best, even better. Here we go, my current implementation:

# Just FYI, I completely restructured the code compared to the
# version on the website, not sure if it runs like this.

const OUT = stdout
const LINE_LENGTH = 60

# First task: just repeat this string over and over with \n
# in the right places
const ALU = codeunits(
    "GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGG" *
    "GAGGCCGAGGCGGGCGGATCACCTGAGGTCAGGAGTTCGAGA" *
    "CCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAAT" *
    "ACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTGTAATCCCA" *
    "GCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGG" *
    "AGGCGGAGGTTGCAGTGAGCCGAGATCGCGCCACTGCACTCC" *
    "AGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAA")

# I want to always be able to take the next 60 chars of that
# string (without going over the edge) so I repeat it at the end.
function repeat_fasta(str, n)
    # This is a reaaally ugly way of repeating a string, but
    # it was consistently faster than nicer alternatives, so ... :shrug:
    len = length(str)
    src = Vector{UInt8}(undef, len + LINE_LENGTH)
    for i in 1:len
        @inbounds src[i] = str[i]
    end
    for i in 1:LINE_LENGTH
        @inbounds src[i+len] = str[i]
    end

    # Well, write the required number of chars of that string,
    # skip to the beginning of the string if you went too far.
    i = 1
    lines, rest = divrem(n, LINE_LENGTH)
    for _ in 1:lines
        write(OUT, @inbounds @view src[i:i+LINE_LENGTH-1])
        write(OUT, '\n')
        i += LINE_LENGTH
        i > len && (i -= len)
    end
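    # Only emit the trailing partial line if there is one, otherwise
    # an empty line would be printed when n is a multiple of 60.
    if rest > 0
        write(OUT, @inbounds @view src[i:i+rest-1])
        write(OUT, '\n')
    end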
end

# That was easy, now the more interesting part.
# We have an alphabet of chars with associated probabilities and
# we have to pick n chars from that alphabet according to a LCG
# random number generator.
# This inherently means we can't really parallelize the RNG because
# the numbers need to be in the right order. Of course there are
# opportunities for playing with the sweet new threads elsewhere.

# This is the RNG as defined on the website. Not sure there is
# much opportunity here because it needs to be pretty much exactly
# like this. The constants come first because `IM` is already
# needed when the alphabets below are constructed.
const IM = Int32(139968)
const IA = Int32(3877)
const IC = Int32(29573)
const last_rnd = Ref(Int32(42))
gen_random() = (last_rnd[] = (last_rnd[] * IA + IC) % IM)

# The RNG works with `Int32`s and the probabilities are given as
# floats, so I scale the [0, 1) range of accumulated probabilities
# up to the [0, IM) range of the RNG and store that with the
# corresponding char. Store the Aminoacids as a const `Tuple`.
struct Aminoacids
    c::UInt8
    p::Int32
end
function make_Aminoacids(cs, ps)
    cum_p = 0.0
    tmp = Aminoacids[]
    for (c, p) in zip(cs, ps)
        cum_p += p * IM
        # the comparison is with Int32, so use it here as well
        push!(tmp, Aminoacids(c, floor(Int32, cum_p)))
    end
    return (tmp...,)
end

# create Aminoacids with accumulated probabilities and make
# the result a constant
const IUB = let
    iub_c = b"acgtBDHKMNRSVWY"
    iub_p = [0.27, 0.12, 0.12, 0.27, 0.02, 0.02, 0.02,
             0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
    make_Aminoacids(iub_c, iub_p)
end
const HOMOSAPIENS = let
    homosapiens_c = b"acgt"
    homosapiens_p = [0.3029549426680, 0.1979883004921,
                     0.1975473066391, 0.3015094502008]
    make_Aminoacids(homosapiens_c, homosapiens_p)
end

# After we have generated a new number we need to pick out the
# corresponding Aminoacid. Some implementations use a binary
# search but I found that simply going through the Tuple seems
# to be faster.
function random_char(genelist)
    r = gen_random()
    for aminoacid in genelist
        aminoacid.p >= r && return aminoacid.c
    end
    return genelist[end].c
end

# A little helper method. I need to fill a vector with chars
# to print out, but that can be shorter than the line length
# (and I must not generate more chars than needed because
# that would leave the RNG in the wrong state for the next
# run).
function fillrand!(line, genelist, n)
    for i in 1:n
        @inbounds line[i] = random_char(genelist)
    end
end

# Not much to see here, just fill lines until we got the required
# amount of chars printed out.
function random_fasta(genelist, n)
    line = Vector{UInt8}(undef, LINE_LENGTH+1)
    line[end] = UInt8('\n')
    while n > LINE_LENGTH
        fillrand!(line, genelist, LINE_LENGTH)
        write(OUT, line)
        n -= LINE_LENGTH
    end
    fillrand!(line, genelist, n)
    line[n+1] = UInt8('\n')
    write(OUT, @view line[1:n+1])
end

# Simply calling everything. Do two random ones with different
# alphabets.
function main(n)
    write(OUT, ">ONE Homo sapiens alu\n")
    repeat_fasta(ALU, 2n)
    write(OUT, ">TWO IUB ambiguity codes\n")
    random_fasta(IUB, 3n)
    write(OUT, ">THREE Homo sapiens frequency\n")
    random_fasta(HOMOSAPIENS, 5n)
end
main(parse(Int, ARGS[1]))
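The integer scaling described in the comments above can be checked in isolation. The sketch below rebuilds the thresholds for the four-letter alphabet and compares the integer rule from `random_char` with a floating-point rule over every value the RNG can produce. The float rule (`r / IM < cumulative probability`) is my assumption of what the unscaled formulation looks like, and `pick_int` and `pick_float` are hypothetical helper names.

```julia
const IM = Int32(139968)

# Cumulative probabilities of the four-letter alphabet from the listing.
probs = [0.3029549426680, 0.1979883004921, 0.1975473066391, 0.3015094502008]
cum = cumsum(probs)

# Integer thresholds in the spirit of make_Aminoacids: floor of the
# scaled running sum, stored as Int32.
thresholds = floor.(Int32, cum .* IM)

# Integer rule used by random_char: first entry whose threshold is >= r,
# falling back to the last entry.
pick_int(r) = something(findfirst(>=(r), thresholds), length(thresholds))

# Float rule (assumed formulation): first entry with r / IM < cum[i],
# falling back to the last entry.
pick_float(r) = something(findfirst(c -> r / IM < c, cum), length(cum))

# The RNG only ever yields values in 0:IM-1, so this check is exhaustive.
agree = all(pick_int(r) == pick_float(r) for r in Int32(0):(IM - Int32(1)))
println("integer and float lookups agree: ", agree)
```

Comparing `Int32`s directly avoids one division per generated character, which is the reason for scaling the probabilities once up front.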

Now, the main opportunity for parallelization would be to let the RNG generate numbers in the background (into a Channel, I think? I haven’t worked with those yet), let a second thread convert these numbers into the corresponding chars, and let a final thread print everything out. Of course, with the thread overhead this might be too heavy, but at least that would be my starting point.
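A rough sketch of that three-stage idea, assuming Julia 1.3 or later (for the `spawn` keyword of `Channel`). All function names here (`rng_blocks`, `char_blocks`, `pick_char`, `pipelined_fasta`) are hypothetical, the block size is an arbitrary guess, and I have not measured whether the channel overhead pays off.

```julia
# Constants copied from the listing above; everything else is hypothetical.
const IM = Int32(139968)
const IA = Int32(3877)
const IC = Int32(29573)
const LINE_LENGTH = 60

# Stage 1: the LCG has to stay sequential, so one task produces whole
# blocks of raw numbers and hands them over through a buffered Channel.
function rng_blocks(n, blocksize; seed = Int32(42))
    Channel{Vector{Int32}}(4; spawn = true) do ch
        state = seed
        remaining = n
        while remaining > 0
            block = Vector{Int32}(undef, min(blocksize, remaining))
            for i in eachindex(block)
                state = (state * IA + IC) % IM
                block[i] = state
            end
            put!(ch, block)
            remaining -= length(block)
        end
    end
end

# Same linear scan as random_char, but on plain vectors.
function pick_char(r, thresholds, chars)
    for i in eachindex(thresholds)
        thresholds[i] >= r && return chars[i]
    end
    return chars[end]
end

# Stage 2: convert each block of numbers into a block of chars.
function char_blocks(numbers, thresholds, chars)
    Channel{Vector{UInt8}}(4; spawn = true) do ch
        for block in numbers
            put!(ch, UInt8[pick_char(r, thresholds, chars) for r in block])
        end
    end
end

# Stage 3: the calling task writes the lines. blocksize must be a
# multiple of LINE_LENGTH so that only the last block has a short line.
function pipelined_fasta(io, n, thresholds, chars; blocksize = 1024 * LINE_LENGTH)
    for bytes in char_blocks(rng_blocks(n, blocksize), thresholds, chars)
        for start in 1:LINE_LENGTH:length(bytes)
            stop = min(start + LINE_LENGTH - 1, length(bytes))
            write(io, view(bytes, start:stop))
            write(io, UInt8('\n'))
        end
    end
end

probs = [0.3029549426680, 0.1979883004921, 0.1975473066391, 0.3015094502008]
thresholds = floor.(Int32, cumsum(probs) .* IM)
buf = IOBuffer()
pipelined_fasta(buf, 130, thresholds, b"acgt")
print(String(take!(buf)))
```

Passing whole blocks rather than single numbers through the channels is deliberate: a `put!`/`take!` pair per character would almost certainly cost more than generating the character itself.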

Palli · August 20, 2019, 7:58pm

It’s slower, both with version 4 and version 3, with export JULIA_NUM_THREADS=4 (my computer has 2 cores).

-O3 seems to always be slightly slower than -O1 for me:

time ~/julia-1.4.0-DEV-8ebe5643ca/bin/julia -O1 -- nbody.julia-4.julia 50000000

real	0m19,348s
user	0m19,144s
sys	0m0,212s

time ~/julia-1.4.0-DEV-8ebe5643ca/bin/julia -O3 -- nbody.julia-4.julia 50000000

real 0m21,855s
user 0m21,656s
sys 0m0,212s

vs. version 3, for which I seem to get:

time ~/julia-1.4.0-DEV-8ebe5643ca/bin/julia -O3 -- nbody.julia-3.julia 50000000

real	0m13,319s
user	0m13,028s
sys	0m0,236s

export JULIA_NUM_THREADS=1
time ~/julia-1.4.0-DEV-8ebe5643ca/bin/julia -O3 -- nbody.julia-3.julia 50000000

real	0m13,158s
user	0m13,008s
sys	0m0,208s

--cpu-target=core2 doesn’t seem to change much, as I guess it’s the default.

For version 4 with -O3:

julia> @code_native main(stdout, (50000000), 0.01)
	.text
; ┌ @ REPL[13]:2 within `main'
	pushq	%rbp
	movq	%rsp, %rbp
; │ @ REPL[13]:32 within `main'
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	subq	$472, %rsp              # imm = 0x1D8
	xorpd	%xmm1, %xmm1
	movapd	%xmm1, -80(%rbp)
	movsd	%xmm0, -56(%rbp)
	movq	%rsi, %rbx
	movq	%rdi, %r13
	movapd	%xmm1, -96(%rbp)
	movq	%fs:0, %rax
	movq	$4, -96(%rbp)
	movq	-15712(%rax), %rcx
	movq	%rcx, -88(%rbp)
	movabsq	$140581406747936, %r12  # imm = 0x7FDBA8CFAD20
	leaq	-96(%rbp), %rcx
	movq	%rcx, -15712(%rax)
	movapd	%xmm1, -496(%rbp)
	movapd	%xmm1, -512(%rbp)
	movabsq	$140581334880848, %rcx  # imm = 0x7FDBA4871250
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -480(%rbp)
	movabsq	$140581334880864, %rcx  # imm = 0x7FDBA4871260
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -464(%rbp)
	movabsq	$140581334880880, %rcx  # imm = 0x7FDBA4871270
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -448(%rbp)
	movabsq	$140581334880896, %rcx  # imm = 0x7FDBA4871280
	movapd	(%rcx), %xmm0
	movapd	%xmm0, -432(%rbp)
	movabsq	$140581334881056, %rcx  # imm = 0x7FDBA4871320
	xorpd	%xmm0, %xmm0
	movhpd	(%rcx), %xmm0           # xmm0 = xmm0[0],mem[0]
	movapd	%xmm0, -416(%rbp)
	movabsq	$140581334880912, %rcx  # imm = 0x7FDBA4871290
	movapd	(%rcx), %xmm0
	movapd	%xmm0, -400(%rbp)
	movabsq	$140581334881064, %rcx  # imm = 0x7FDBA4871328
	xorpd	%xmm0, %xmm0
	movhpd	(%rcx), %xmm0           # xmm0 = xmm0[0],mem[0]
	movapd	%xmm0, -384(%rbp)
	movabsq	$140581334880928, %rcx  # imm = 0x7FDBA48712A0
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -368(%rbp)
	movabsq	$140581334881072, %rcx  # imm = 0x7FDBA4871330
	movsd	(%rcx), %xmm0           # xmm0 = mem[0],zero
	movaps	%xmm0, -352(%rbp)
	movabsq	$140581334880944, %rcx  # imm = 0x7FDBA48712B0
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -336(%rbp)
	movabsq	$140581334881080, %rcx  # imm = 0x7FDBA4871338
	movsd	(%rcx), %xmm0           # xmm0 = mem[0],zero
	movaps	%xmm0, -320(%rbp)
	movabsq	$140581334880960, %rcx  # imm = 0x7FDBA48712C0
	movaps	(%rcx), %xmm0
	movaps	%xmm0, -304(%rbp)
	movabsq	$140581334880976, %rcx  # imm = 0x7FDBA48712D0
	movapd	(%rcx), %xmm0
	movapd	%xmm0, -288(%rbp)
	movabsq	$140581334881088, %rcx  # imm = 0x7FDBA4871340
	xorpd	%xmm0, %xmm0
	movhpd	(%rcx), %xmm0           # xmm0 = xmm0[0],mem[0]
	leaq	-15712(%rax), %r14
	movapd	%xmm0, -272(%rbp)
	movabsq	$140581334880992, %rax  # imm = 0x7FDBA48712E0
	movaps	(%rax), %xmm0
	movaps	%xmm0, -256(%rbp)
	movabsq	$140581334881096, %rax  # imm = 0x7FDBA4871348
	movhpd	(%rax), %xmm1           # xmm1 = xmm1[0],mem[0]
	movapd	%xmm1, -240(%rbp)
	movabsq	$140581334881008, %rax  # imm = 0x7FDBA48712F0
	movaps	(%rax), %xmm0
	movaps	%xmm0, -224(%rbp)
	movabsq	$140581334881104, %rax  # imm = 0x7FDBA4871350
	movsd	(%rax), %xmm0           # xmm0 = mem[0],zero
	movaps	%xmm0, -208(%rbp)
	movabsq	$140581334881024, %rax  # imm = 0x7FDBA4871300
	movaps	(%rax), %xmm0
	movaps	%xmm0, -192(%rbp)
	movabsq	$140581334881112, %rax  # imm = 0x7FDBA4871358
	movsd	(%rax), %xmm0           # xmm0 = mem[0],zero
	movaps	%xmm0, -176(%rbp)
	movabsq	$4566835785178257836, %rax # imm = 0x3F60A8F3531799AC
	movq	%rax, -160(%rbp)
; │┌ @ array.jl:130 within `vect'
; ││┌ @ array.jl:612 within `_array_for'
; │││┌ @ abstractarray.jl:671 within `similar' @ abstractarray.jl:672
; ││││┌ @ boot.jl:413 within `Array' @ boot.jl:404
	leaq	277347728(%r12), %rax
	movabsq	$140581406097168, %rdi  # imm = 0x7FDBA8C5BF10
	movl	$5, %esi
	callq	*%rax
	movq	%rax, %r15
; ││└└└
; ││ @ array.jl:780 within `vect'
	movq	(%r15), %rax
	movq	$-288, %rcx             # imm = 0xFEE0
	xorl	%edx, %edx
	nopl	(%rax)
L608:
	movups	-224(%rbp,%rcx), %xmm0
	movupd	-208(%rbp,%rcx), %xmm1
	movups	-192(%rbp,%rcx), %xmm2
	movups	-176(%rbp,%rcx), %xmm3
	movq	-160(%rbp,%rcx), %rsi
	movups	%xmm0, 288(%rax,%rcx)
	movupd	%xmm1, 304(%rax,%rcx)
	movups	%xmm2, 320(%rax,%rcx)
	movups	%xmm3, 336(%rax,%rcx)
	movq	%rsi, 352(%rax,%rcx)
; ││ @ array.jl:130 within `vect'
; ││┌ @ range.jl:597 within `iterate'
; │││┌ @ promotion.jl:399 within `=='
	testq	%rcx, %rcx
; ││└└
	je	L735
; ││┌ @ tuple.jl:24 within `getindex'
	addq	$72, %rcx
	incq	%rdx
	cmpq	$5, %rdx
	jb	L608
	movabsq	$jl_bounds_error_unboxed_int, %rax
	leaq	-512(%rbp), %rdi
	movl	$6, %edx
	movq	%r12, %rsi
	callq	*%rax
; └└└
; ┌ @ tuple.jl within `main'
L735:
	movq	%r14, -64(%rbp)
	movq	%r15, -72(%rbp)
; └
; ┌ @ REPL[13]:34 within `main'
	movabsq	$julia_energy_16666, %rax
	movq	%r15, %rdi
	callq	*%rax
	movsd	%xmm0, -48(%rbp)
	movabsq	$getbuf, %rax
	callq	*%rax
	movsd	-48(%rbp), %xmm0        # xmm0 = mem[0],zero
	movq	%rax, %r12
; │┌ @ float.jl:553 within `isfinite'
; ││┌ @ float.jl:403 within `-'
	movapd	%xmm0, %xmm1
	subsd	%xmm1, %xmm1
; ││└
; ││┌ @ float.jl:488 within `==' @ float.jl:454
	xorps	%xmm2, %xmm2
; │└└
	ucomisd	%xmm2, %xmm1
	jne	L802
	jnp	L879
; │┌ @ printf.jl:150 within `macro expansion'
; ││┌ @ float.jl:503 within `<' @ float.jl:458
L802:
	ucomisd	%xmm0, %xmm2
; ││└
	movabsq	$140581402197232, %rax  # imm = 0x7FDBA88A3CF0
	movabsq	$jl_system_image_data, %rcx
	cmovbeq	%rax, %rcx
; ││┌ @ float.jl:535 within `isnan'
; │││┌ @ float.jl:456 within `!='
	ucomisd	%xmm0, %xmm0
; ││└└
	movabsq	$jl_system_image_data, %rax
	cmovnpq	%rcx, %rax
; │└
; │┌ @ io.jl:179 within `print'
; ││┌ @ io.jl:177 within `write'
; │││┌ @ gcutils.jl:91 within `macro expansion'
; ││││┌ @ string.jl:85 within `sizeof'
	movq	(%rax), %rdx
	movq	%rax, -80(%rbp)
; ││││└
; ││││┌ @ string.jl:81 within `pointer'
; │││││┌ @ pointer.jl:59 within `unsafe_convert'
; ││││││┌ @ pointer.jl:159 within `+'
	leaq	8(%rax), %rsi
; ││││└└└
	movabsq	$unsafe_write, %rax
	movq	%r13, %rdi
	callq	*%rax
	jmp	L1051
; └└└└
; ┌ @ gcutils.jl within `main'
L879:
	movq	%r13, %r14
; └
; ┌ @ REPL[13]:34 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ array.jl:214 within `length'
	movq	8(%r12), %rax
; ││└
; ││┌ @ int.jl:52 within `-'
	decq	%rax
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ int.jl:49
	cmpq	$10, %rax
	movl	$9, %edx
; └└
; ┌ @ printf.jl:992 within `main'
	cmovlq	%rax, %rdx
	movq	%r12, -80(%rbp)
; │ @ printf.jl:993 within `main'
	movabsq	$grisu, %rax
	leaq	-144(%rbp), %rdi
	movl	$2, %esi
	movq	%r12, %rcx
	callq	*%rax
; └
; ┌ @ REPL[13]:34 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:994
; ││┌ @ promotion.jl:399 within `=='
	movq	-144(%rbp), %r13
	testq	%r13, %r13
; ││└
	je	L1423
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r13d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r13
	jne	L1514
; │││││ @ boot.jl:580 within `checked_trunc_sint'
	movq	-136(%rbp), %rdx
; │││││ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%edx, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %rdx
	jne	L1551
; ││└└└
	movb	-128(%rbp), %al
; │└
	testb	%al, %al
	je	L1016
; │┌ @ char.jl:229 within `print'
; ││┌ @ io.jl:647 within `write'
L988:
	movabsq	$write, %rax
	movl	$45, %esi
	movq	%r14, %rdi
	movq	%rdx, -48(%rbp)
	callq	*%rax
	movq	-48(%rbp), %rdx
; │└└
L1016:
	movabsq	$print_fixed, %rax
	movl	$9, %esi
	movl	$1, %r8d
	movq	%r14, %rdi
	movl	%r13d, %ecx
	movq	%r12, %r9
	callq	*%rax
	movq	%r14, %r13
; │┌ @ char.jl:229 within `print'
; ││┌ @ io.jl:647 within `write'
L1051:
	movabsq	$write, %r12
	movl	$10, %esi
	movq	%r13, %rdi
	callq	*%r12
; │└└
; │ @ REPL[13]:35 within `main'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:277 within `UnitRange'
; │││┌ @ range.jl:282 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:424 within `<='
	testq	%rbx, %rbx
; │└└└└└
	jle	L1104
	movabsq	$"julia_next!_16667", %r14
	nop
; │ @ REPL[13]:36 within `main'
L1088:
	movq	%r15, %rdi
	movsd	-56(%rbp), %xmm0        # xmm0 = mem[0],zero
	callq	*%r14
; │┌ @ range.jl:597 within `iterate'
; ││┌ @ promotion.jl:399 within `=='
	decq	%rbx
; │└└
	jne	L1088
; │ @ REPL[13]:38 within `main'
L1104:
	movq	%r15, %rdi
	movabsq	$julia_energy_16666, %rax
	callq	*%rax
	movsd	%xmm0, -56(%rbp)
	movabsq	$getbuf, %rax
	callq	*%rax
	movsd	-56(%rbp), %xmm0        # xmm0 = mem[0],zero
	movq	%rax, %rbx
; │┌ @ float.jl:553 within `isfinite'
; ││┌ @ float.jl:403 within `-'
	movapd	%xmm0, %xmm1
	subsd	%xmm1, %xmm1
; ││└
; ││┌ @ float.jl:488 within `==' @ float.jl:454
	xorps	%xmm2, %xmm2
; │└└
	ucomisd	%xmm2, %xmm1
	jne	L1163
	jnp	L1244
; │┌ @ printf.jl:150 within `macro expansion'
; ││┌ @ float.jl:503 within `<' @ float.jl:458
L1163:
	ucomisd	%xmm0, %xmm2
; ││└
	movabsq	$140581402197232, %rax  # imm = 0x7FDBA88A3CF0
	movabsq	$jl_system_image_data, %rcx
	cmovbeq	%rax, %rcx
; ││┌ @ float.jl:535 within `isnan'
; │││┌ @ float.jl:456 within `!='
	ucomisd	%xmm0, %xmm0
; ││└└
	movabsq	$jl_system_image_data, %rax
	cmovnpq	%rcx, %rax
; │└
; │┌ @ io.jl:179 within `print'
; ││┌ @ io.jl:177 within `write'
; │││┌ @ gcutils.jl:91 within `macro expansion'
; ││││┌ @ string.jl:85 within `sizeof'
	movq	(%rax), %rdx
	movq	%rax, -80(%rbp)
; ││││└
; ││││┌ @ string.jl:81 within `pointer'
; │││││┌ @ pointer.jl:59 within `unsafe_convert'
; ││││││┌ @ pointer.jl:159 within `+'
	leaq	8(%rax), %rsi
; ││││└└└
	movabsq	$unsafe_write, %rax
	movq	%r13, %rdi
	callq	*%rax
	movq	-64(%rbp), %rbx
	jmp	L1390
; │└└└
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ array.jl:214 within `length'
L1244:
	movq	8(%rbx), %rax
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ int.jl:52
	decq	%rax
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ operators.jl:294 within `>'
; │││┌ @ int.jl:49 within `<'
	cmpq	$10, %rax
	movl	$9, %edx
; ││└└
	cmovlq	%rax, %rdx
	movq	%rbx, -80(%rbp)
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:993
	movabsq	$grisu, %rax
	leaq	-120(%rbp), %rdi
	movl	$2, %esi
	movq	%rbx, %rcx
	callq	*%rax
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:994
; ││┌ @ promotion.jl:399 within `=='
	movq	-120(%rbp), %r15
	testq	%r15, %r15
; ││└
	je	L1469
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r15d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r15
	jne	L1585
; │││││ @ boot.jl:580 within `checked_trunc_sint'
	movq	-112(%rbp), %r14
; │││││ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r14d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r14
	jne	L1622
; ││└└└
	movb	-104(%rbp), %al
; │└
	testb	%al, %al
	je	L1351
; │┌ @ char.jl:229 within `print'
; ││┌ @ io.jl:647 within `write'
L1340:
	movl	$45, %esi
	movq	%r13, %rdi
	callq	*%r12
; │└└
L1351:
	movabsq	$print_fixed, %rax
	movl	$9, %esi
	movl	$1, %r8d
	movq	%r13, %rdi
	movl	%r14d, %edx
	movl	%r15d, %ecx
	movq	%rbx, %r9
	callq	*%rax
	movq	-64(%rbp), %rbx
; │┌ @ char.jl:229 within `print'
; ││┌ @ io.jl:647 within `write'
L1390:
	movl	$10, %esi
	movq	%r13, %rdi
	callq	*%r12
	movq	-88(%rbp), %rax
	movq	%rax, (%rbx)
; │└└
	leaq	-40(%rbp), %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
; │ @ REPL[13]:34 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L1423:
	cmpq	$0, 8(%r12)
	je	L1659
	movq	(%r12), %rax
	movb	$48, (%rax)
	movl	$1, %r13d
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:996
	movb	-128(%rbp), %al
	movl	$1, %edx
; │└
	testb	%al, %al
	jne	L988
	jmp	L1016
; │ @ REPL[13]:38 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L1469:
	cmpq	$0, 8(%rbx)
	je	L1697
	movq	(%rbx), %rax
	movb	$48, (%rax)
	movl	$1, %r15d
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:996
	movb	-104(%rbp), %al
	movl	$1, %r14d
; │└
	testb	%al, %al
	jne	L1340
	jmp	L1351
; │ @ REPL[13]:34 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:582 within `checked_trunc_sint'
L1514:
	movabsq	$throw_inexacterror, %rax
	movabsq	$140581378234496, %rdi  # imm = 0x7FDBA71C9880
	movabsq	$jl_system_image_data, %rsi
	movq	%r13, %rdx
	callq	*%rax
	ud2
L1551:
	movabsq	$throw_inexacterror, %rax
	movabsq	$140581378234496, %rdi  # imm = 0x7FDBA71C9880
	movabsq	$jl_system_image_data, %rsi
	callq	*%rax
	ud2
; │└└└└
; │ @ REPL[13]:38 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:582 within `checked_trunc_sint'
L1585:
	movabsq	$throw_inexacterror, %rax
	movabsq	$140581378234496, %rdi  # imm = 0x7FDBA71C9880
	movabsq	$jl_system_image_data, %rsi
	movq	%r15, %rdx
	callq	*%rax
	ud2
L1622:
	movabsq	$throw_inexacterror, %rax
	movabsq	$140581378234496, %rdi  # imm = 0x7FDBA71C9880
	movabsq	$jl_system_image_data, %rsi
	movq	%r14, %rdx
	callq	*%rax
	ud2
; │└└└└
; │ @ REPL[13]:34 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L1659:
	movq	%rsp, %rax
	leaq	-16(%rax), %rsi
	movq	%rsi, %rsp
	movq	$1, -16(%rax)
	movabsq	$jl_bounds_error_ints, %rax
	movl	$1, %edx
	movq	%r12, %rdi
	callq	*%rax
; │└└
; │ @ REPL[13]:38 within `main'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L1697:
	movq	%rsp, %rax
	leaq	-16(%rax), %rsi
	movq	%rsi, %rsp
	movq	$1, -16(%rax)
	movabsq	$jl_bounds_error_ints, %rax
	movl	$1, %edx
	movq	%rbx, %rdi
	callq	*%rax
	nopw	(%rax,%rax)
; └└└

Palli · August 20, 2019, 8:08pm

I hit the character limit of 32000 (not 32768); I was 100 letters over, so I’m posting this separately.

For version 3 with -O3:

julia> @code_native NBody.perf_nbody(50000000)
	.text
; ┌ @ REPL[1]:132 within `perf_nbody'
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	subq	$200, %rsp
	movq	%rdi, %r14
	xorps	%xmm0, %xmm0
	movaps	%xmm0, -144(%rbp)
	movaps	%xmm0, -160(%rbp)
	movaps	%xmm0, -176(%rbp)
	movq	$0, -128(%rbp)
	movq	%fs:0, %rax
; │┌ @ REPL[1]:119 within `initbody'
; ││┌ @ REPL[1]:17 within `Body'
	movq	$10, -176(%rbp)
	movq	-15712(%rax), %rcx
	movq	%rcx, -168(%rbp)
	leaq	-176(%rbp), %rcx
	movq	%rcx, -15712(%rax)
	leaq	-15712(%rax), %r13
	movabsq	$jl_gc_pool_alloc, %rbx
	movl	$1520, %esi             # imm = 0x5F0
	movl	$96, %edx
	movq	%r13, %rdi
	callq	*%rbx
	movq	%rax, %r12
	movabsq	$139695149897424, %r15  # imm = 0x7F0D4FC956D0
	movq	%r15, -8(%r12)
	movabsq	$-4631240860977730576, %rax # imm = 0xBFBA86F96C25EBF0
	movq	%rax, 16(%r12)
	movabsq	$139695052337072, %rax  # imm = 0x7F0D49F8AFB0
	movaps	(%rax), %xmm0
	movaps	%xmm0, (%r12)
	movabsq	$-4640446117579192555, %rax # imm = 0xBF99D2D79A5A0715
	movq	%rax, 48(%r12)
	movabsq	$139695052337088, %rax  # imm = 0x7F0D49F8AFC0
	movaps	(%rax), %xmm0
	movaps	%xmm0, 32(%r12)
	movabsq	$4585593052079010776, %rax # imm = 0x3FA34C95D9AB33D8
	movq	%rax, 64(%r12)
	movq	%r12, -160(%rbp)
; │└└
; │ @ REPL[1]:140 within `perf_nbody'
; │┌ @ REPL[1]:119 within `initbody'
; ││┌ @ REPL[1]:17 within `Body'
	movl	$1520, %esi             # imm = 0x5F0
	movl	$96, %edx
	movq	%r13, %rdi
	callq	*%rbx
	movq	%rbx, %rcx
	movq	%rax, %rbx
	movq	%r15, -8(%rbx)
	movabsq	$-4622431185293064580, %rax # imm = 0xBFD9D353E1EB467C
	movq	%rax, 16(%rbx)
	movabsq	$139695052337104, %rax  # imm = 0x7F0D49F8AFD0
	movaps	(%rax), %xmm0
	movaps	%xmm0, (%rbx)
	movabsq	$4576004977915405236, %rax # imm = 0x3F813C485F1123B4
	movq	%rax, 48(%rbx)
	movabsq	$139695052337120, %rax  # imm = 0x7F0D49F8AFE0
	movaps	(%rax), %xmm0
	movaps	%xmm0, 32(%rbx)
	movabsq	$4577659745833829943, %rax # imm = 0x3F871D490D07C637
	movq	%rax, 64(%rbx)
	movq	%rbx, -152(%rbp)
; │└└
; │ @ REPL[1]:148 within `perf_nbody'
; │┌ @ REPL[1]:119 within `initbody'
; ││┌ @ REPL[1]:17 within `Body'
	movl	$1520, %esi             # imm = 0x5F0
	movl	$96, %edx
	movq	%r13, %rdi
	callq	*%rcx
	movq	%r15, -8(%rax)
	movabsq	$-4626158513131520608, %rcx # imm = 0xBFCC9557BE257DA0
	movq	%rcx, 16(%rax)
	movabsq	$139695052337136, %rcx  # imm = 0x7F0D49F8AFF0
	movaps	(%rcx), %xmm0
	movaps	%xmm0, (%rax)
	movabsq	$-4645973824767902084, %rcx # imm = 0xBF862F6BFAF23E7C
	movq	%rcx, 48(%rax)
	movabsq	$139695052337152, %rcx  # imm = 0x7F0D49F8B000
	movaps	(%rcx), %xmm0
	movaps	%xmm0, 32(%rax)
	movabsq	$4565592097032511155, %rcx # imm = 0x3F5C3DD29CF41EB3
	movq	%rcx, 64(%rax)
	movq	%rax, -104(%rbp)
	movq	%rax, -144(%rbp)
; │└└
; │ @ REPL[1]:156 within `perf_nbody'
; │┌ @ REPL[1]:119 within `initbody'
; ││┌ @ REPL[1]:17 within `Body'
	movl	$1520, %esi             # imm = 0x5F0
	movl	$96, %edx
	movq	%r13, %rdi
	movabsq	$jl_gc_pool_alloc, %rax
	callq	*%rax
	movq	%r15, -8(%rax)
	movabsq	$4595626498235032896, %rcx # imm = 0x3FC6F1F393ABE540
	movq	%rcx, 16(%rax)
	movabsq	$139695052337168, %rcx  # imm = 0x7F0D49F8B010
	movaps	(%rcx), %xmm0
	movaps	%xmm0, (%rax)
	movabsq	$-4638202354754755082, %rcx # imm = 0xBFA1CB88587665F6
	movq	%rcx, 48(%rax)
	movabsq	$139695052337184, %rcx  # imm = 0x7F0D49F8B020
	movaps	(%rcx), %xmm0
	movaps	%xmm0, 32(%rax)
	movabsq	$4566835785178257836, %rcx # imm = 0x3F60A8F3531799AC
	movq	%rcx, 64(%rax)
	movq	%rax, -48(%rbp)
	movq	%rax, -136(%rbp)
; │└└
; │ @ REPL[1]:164 within `perf_nbody'
; │┌ @ REPL[1]:119 within `initbody'
; ││┌ @ REPL[1]:17 within `Body'
	movl	$1520, %esi             # imm = 0x5F0
	movl	$96, %edx
	movq	%r13, -184(%rbp)
	movq	%r13, %rdi
	movabsq	$jl_gc_pool_alloc, %rax
	callq	*%rax
	movq	%rax, %r13
	movq	%r15, -8(%r13)
	xorps	%xmm0, %xmm0
	movaps	%xmm0, (%r13)
	movq	$0, 16(%r13)
	movaps	%xmm0, 32(%r13)
	movq	$0, 48(%r13)
	movabsq	$4630752910647379422, %rax # imm = 0x4043BD3CC9BE45DE
	movq	%rax, 64(%r13)
	movq	%r13, -128(%rbp)
; │└└
; │ @ REPL[1]:166 within `perf_nbody'
; │┌ @ array.jl:130 within `vect'
; ││┌ @ array.jl:612 within `_array_for'
; │││┌ @ abstractarray.jl:671 within `similar' @ abstractarray.jl:672
; ││││┌ @ boot.jl:413 within `Array' @ boot.jl:404
	movabsq	$jl_system_image_data, %rax
	leaq	214180000(%rax), %rax
	movabsq	$139695149907168, %rdi  # imm = 0x7F0D4FC97CE0
	movl	$5, %esi
	callq	*%rax
	movq	%rax, %r15
	movzwl	16(%r15), %eax
	andl	$3, %eax
	cmpl	$3, %eax
; │└└└└
; │┌ @ tuple.jl:24 within `vect'
	jne	L892
; │└
; │┌ @ array.jl:130 within `vect'
; ││┌ @ array.jl:780 within `setindex!'
	movq	(%r15), %rcx
	movq	40(%r15), %rdi
	movq	-8(%rdi), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	jne	L740
	testb	$1, -8(%r13)
	je	L2237
L740:
	movq	%r13, (%rcx)
	movq	40(%r15), %rdi
	movq	-8(%rdi), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	jne	L772
	testb	$1, -8(%r12)
	je	L2262
L772:
	movq	%r12, 8(%rcx)
	movq	40(%r15), %rdi
	movq	-8(%rdi), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movabsq	$jl_system_image_data, %r12
	jne	L813
	testb	$1, -8(%rbx)
	je	L2285
L813:
	movq	%rbx, 16(%rcx)
	movq	40(%r15), %rdi
	movq	-8(%rdi), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movq	-104(%rbp), %rbx
	jne	L848
	testb	$1, -8(%rbx)
	je	L2308
L848:
	movq	%rbx, 24(%rcx)
	movq	40(%r15), %rdi
	movq	-8(%rdi), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movq	-48(%rbp), %rbx
	jne	L883
	testb	$1, -8(%rbx)
	je	L2331
L883:
	movq	%rbx, 32(%rcx)
; │└└
; │ @ REPL[1]:168 within `perf_nbody'
	jmp	L1050
; │ @ REPL[1]:166 within `perf_nbody'
; │┌ @ array.jl:130 within `vect'
; ││┌ @ array.jl:780 within `setindex!'
L892:
	movq	-8(%r15), %rax
	movq	(%r15), %rcx
	andl	$3, %eax
	cmpq	$3, %rax
	jne	L919
	testb	$1, -8(%r13)
	je	L2354
L919:
	movq	%r13, (%rcx)
	movq	-8(%r15), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	jne	L947
	testb	$1, -8(%r12)
	je	L2382
L947:
	movq	%r12, 8(%rcx)
	movq	-8(%r15), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movabsq	$jl_system_image_data, %r12
	jne	L984
	testb	$1, -8(%rbx)
	je	L2408
L984:
	movq	%rbx, 16(%rcx)
	movq	-8(%r15), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movq	-104(%rbp), %rbx
	jne	L1015
	testb	$1, -8(%rbx)
	je	L2434
L1015:
	movq	%rbx, 24(%rcx)
	movq	-8(%r15), %rax
	andl	$3, %eax
	cmpq	$3, %rax
	movq	-48(%rbp), %rbx
	jne	L1046
	testb	$1, -8(%rbx)
	je	L2460
L1046:
	movq	%rbx, 32(%rcx)
; └└└
; ┌ @ array.jl within `perf_nbody'
L1050:
	movq	%r15, -160(%rbp)
; └
; ┌ @ REPL[1]:168 within `perf_nbody'
	movabsq	$julia_init_sun_16583, %rax
	movq	%r15, %rdi
	callq	*%rax
	fstp	%st(0)
; │ @ REPL[1]:170 within `perf_nbody'
	movq	(%r12), %rbx
	movq	%rbx, -136(%rbp)
	movabsq	$julia_energy_16584, %rax
	movq	%r15, %rdi
	callq	*%rax
	movsd	%xmm0, -48(%rbp)
	movabsq	$getbuf, %rax
	callq	*%rax
	movsd	-48(%rbp), %xmm0        # xmm0 = mem[0],zero
	movq	%rax, %r13
; │┌ @ float.jl:553 within `isfinite'
; ││┌ @ float.jl:403 within `-'
	movapd	%xmm0, %xmm1
	subsd	%xmm1, %xmm1
; ││└
; ││┌ @ float.jl:488 within `==' @ float.jl:454
	xorps	%xmm2, %xmm2
; │└└
	ucomisd	%xmm2, %xmm1
	jne	L1144
	jnp	L1234
; │┌ @ printf.jl:150 within `macro expansion'
; ││┌ @ float.jl:503 within `<' @ float.jl:458
L1144:
	ucomisd	%xmm0, %xmm2
; ││└
	movabsq	$139695111274064, %rax  # imm = 0x7F0D4D7BFE50
	movabsq	$jl_system_image_data, %rcx
	cmovbeq	%rax, %rcx
; ││┌ @ float.jl:535 within `isnan'
; │││┌ @ float.jl:456 within `!='
	ucomisd	%xmm0, %xmm0
; ││└└
	movabsq	$jl_system_image_data, %rax
	cmovnpq	%rcx, %rax
; │└
	movq	%rbx, -96(%rbp)
	movq	%rax, -88(%rbp)
	movabsq	$jl_apply_generic, %rax
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	callq	*%rax
	jmp	L1530
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ array.jl:214 within `length'
L1234:
	movq	8(%r13), %rax
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ int.jl:52
	decq	%rax
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ operators.jl:294 within `>'
; │││┌ @ int.jl:49 within `<'
	cmpq	$10, %rax
	movl	$9, %edx
; ││└└
	cmovlq	%rax, %rdx
	movq	%r13, -128(%rbp)
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:993
	movabsq	$grisu, %rax
	leaq	-232(%rbp), %rdi
	movl	$2, %esi
	movq	%r13, %rcx
	callq	*%rax
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:994
; ││┌ @ promotion.jl:399 within `=='
	movq	-232(%rbp), %r12
	testq	%r12, %r12
; ││└
	movq	%r13, -104(%rbp)
	je	L2141
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r12d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r12
	jne	L2486
; │││││ @ boot.jl:580 within `checked_trunc_sint'
	movq	-224(%rbp), %rcx
; │││││ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%ecx, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %rcx
	jne	L2523
; ││└└└
	movb	-216(%rbp), %al
; │└
	testb	%al, %al
	je	L1401
L1346:
	movq	%rbx, -96(%rbp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, -88(%rbp)
	movabsq	$jl_apply_generic, %rax
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
L1401:
	movabsq	$jl_box_int32, %r13
	movl	%ecx, %edi
	callq	*%r13
	movq	%r13, %rcx
	movq	%rax, %r13
	movq	%r13, -144(%rbp)
	movl	%r12d, %edi
	callq	*%rcx
	movq	%rax, -152(%rbp)
	movq	%rbx, -96(%rbp)
	movabsq	$139695095480928, %rcx  # imm = 0x7F0D4C8B0260
	movq	%rcx, -88(%rbp)
	movq	%r13, -80(%rbp)
	movq	%rax, -72(%rbp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, -64(%rbp)
	movq	-104(%rbp), %rax
	movq	%rax, -56(%rbp)
	movabsq	$jl_apply_generic, %rax
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$6, %edx
	callq	*%rax
	movabsq	$jl_system_image_data, %r12
L1530:
	movq	%rbx, -96(%rbp)
	movabsq	$139695095517360, %rax  # imm = 0x7F0D4C8B90B0
	movq	%rax, -88(%rbp)
	movabsq	$jl_apply_generic, %rax
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	callq	*%rax
; │ @ REPL[1]:172 within `perf_nbody'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:277 within `UnitRange'
; │││┌ @ range.jl:282 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:424 within `<='
	testq	%r14, %r14
; │└└└└└
	jle	L1631
	movabsq	$julia_advance_16585, %rbx
	movabsq	$139695052337248, %rax  # imm = 0x7F0D49F8B060
	movsd	(%rax), %xmm0           # xmm0 = mem[0],zero
	movsd	%xmm0, -48(%rbp)
	nopl	(%rax)
; │ @ REPL[1]:173 within `perf_nbody'
L1616:
	movq	%r15, %rdi
	movsd	-48(%rbp), %xmm0        # xmm0 = mem[0],zero
	callq	*%rbx
; │┌ @ range.jl:597 within `iterate'
; ││┌ @ promotion.jl:399 within `=='
	decq	%r14
; │└└
	jne	L1616
; │ @ REPL[1]:175 within `perf_nbody'
L1631:
	movq	(%r12), %r13
	movq	%r13, -144(%rbp)
	movq	%r15, %rdi
	movabsq	$julia_energy_16584, %rax
	callq	*%rax
	movsd	%xmm0, -48(%rbp)
	movabsq	$getbuf, %rax
	callq	*%rax
	movsd	-48(%rbp), %xmm0        # xmm0 = mem[0],zero
	movq	%rax, %r14
; │┌ @ float.jl:553 within `isfinite'
; ││┌ @ float.jl:403 within `-'
	movapd	%xmm0, %xmm1
	subsd	%xmm1, %xmm1
; ││└
; ││ @ float.jl:454 within `isfinite'
	xorps	%xmm2, %xmm2
; │└
	ucomisd	%xmm2, %xmm1
	jne	L1701
	jnp	L1791
; │┌ @ printf.jl:150 within `macro expansion'
; ││┌ @ float.jl:503 within `<' @ float.jl:458
L1701:
	ucomisd	%xmm0, %xmm2
; ││└
	movabsq	$139695111274064, %rax  # imm = 0x7F0D4D7BFE50
	movabsq	$jl_system_image_data, %rcx
	cmovbeq	%rax, %rcx
; ││┌ @ float.jl:535 within `isnan'
; │││┌ @ float.jl:456 within `!='
	ucomisd	%xmm0, %xmm0
; ││└└
	movabsq	$jl_system_image_data, %rax
	cmovnpq	%rcx, %rax
; │└
	movq	%r13, -96(%rbp)
	movq	%rax, -88(%rbp)
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	movabsq	$jl_apply_generic, %rbx
	callq	*%rbx
	jmp	L2070
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:992
; ││┌ @ array.jl:214 within `length'
L1791:
	movq	8(%r14), %rax
; ││└
; ││┌ @ int.jl:52 within `-'
	decq	%rax
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ int.jl:49
	cmpq	$10, %rax
	movl	$9, %edx
; └└
; ┌ @ printf.jl:992 within `perf_nbody'
	cmovlq	%rax, %rdx
	movq	%r14, -136(%rbp)
; │ @ printf.jl:993 within `perf_nbody'
	movabsq	$grisu, %rax
	leaq	-208(%rbp), %rdi
	movl	$2, %esi
	movq	%r14, %rcx
	callq	*%rax
; └
; ┌ @ REPL[1]:175 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:994
; ││┌ @ promotion.jl:399 within `=='
	movq	-208(%rbp), %r12
	testq	%r12, %r12
; ││└
	je	L2189
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r12d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r12
	jne	L2560
; │││││ @ boot.jl:580 within `checked_trunc_sint'
	movq	-200(%rbp), %r15
; │││││ @ boot.jl:581 within `checked_trunc_sint'
	movslq	%r15d, %rax
; │││││ @ boot.jl:582 within `checked_trunc_sint'
	cmpq	%rax, %r15
	jne	L2597
; ││└└└
	movb	-192(%rbp), %al
; │└
	testb	%al, %al
	je	L1951
L1902:
	movq	%r13, -96(%rbp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, -88(%rbp)
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	movabsq	$jl_apply_generic, %rax
	callq	*%rax
L1951:
	movabsq	$jl_box_int32, %rbx
	movl	%r15d, %edi
	callq	*%rbx
	movq	%rbx, %rcx
	movabsq	$jl_apply_generic, %r15
	movq	%rax, %rbx
	movq	%rbx, -152(%rbp)
	movl	%r12d, %edi
	callq	*%rcx
	movq	%rax, -160(%rbp)
	movq	%r13, -96(%rbp)
	movabsq	$139695095480928, %rcx  # imm = 0x7F0D4C8B0260
	movq	%rcx, -88(%rbp)
	movq	%rbx, -80(%rbp)
	movq	%rax, -72(%rbp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, -64(%rbp)
	movq	%r14, -56(%rbp)
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$6, %edx
	callq	*%r15
	movq	%r15, %rbx
L2070:
	movq	%r13, -96(%rbp)
	movabsq	$139695095517360, %rax  # imm = 0x7F0D4C8B90B0
	movq	%rax, -88(%rbp)
	movabsq	$jl_system_image_data, %rdi
	leaq	-96(%rbp), %rsi
	movl	$2, %edx
	callq	*%rbx
	movq	-168(%rbp), %rax
	movq	-184(%rbp), %rcx
	movq	%rax, (%rcx)
	leaq	-40(%rbp), %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
; │ @ REPL[1]:170 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L2141:
	cmpq	$0, 8(%r13)
	je	L2634
	movq	(%r13), %rax
	movb	$48, (%rax)
	movl	$1, %r12d
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:996
	movb	-216(%rbp), %al
	movl	$1, %ecx
; │└
	testb	%al, %al
	jne	L1346
	jmp	L1401
; │ @ REPL[1]:175 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L2189:
	cmpq	$0, 8(%r14)
	je	L2672
	movq	(%r14), %rax
	movb	$48, (%rax)
	movl	$1, %r12d
; ││└
; ││ @ printf.jl:841 within `fix_dec' @ printf.jl:996
	movb	-192(%rbp), %al
	movl	$1, %r15d
; │└
	testb	%al, %al
	jne	L1902
	jmp	L1951
; │ @ REPL[1]:166 within `perf_nbody'
; │┌ @ array.jl:130 within `vect'
; ││┌ @ array.jl:780 within `setindex!'
L2237:
	movabsq	$jl_gc_queue_root, %rax
	movq	%rcx, -112(%rbp)
	callq	*%rax
	movq	-112(%rbp), %rcx
	jmp	L740
L2262:
	movabsq	$jl_gc_queue_root, %rax
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L772
L2285:
	movabsq	$jl_gc_queue_root, %rax
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L813
L2308:
	movabsq	$jl_gc_queue_root, %rax
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L848
L2331:
	movabsq	$jl_gc_queue_root, %rax
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L883
L2354:
	movabsq	$jl_gc_queue_root, %rax
	movq	%r15, %rdi
	movq	%rcx, -112(%rbp)
	callq	*%rax
	movq	-112(%rbp), %rcx
	jmp	L919
L2382:
	movabsq	$jl_gc_queue_root, %rax
	movq	%r15, %rdi
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L947
L2408:
	movabsq	$jl_gc_queue_root, %rax
	movq	%r15, %rdi
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L984
L2434:
	movabsq	$jl_gc_queue_root, %rax
	movq	%r15, %rdi
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L1015
L2460:
	movabsq	$jl_gc_queue_root, %rax
	movq	%r15, %rdi
	movq	%rcx, %r13
	callq	*%rax
	movq	%r13, %rcx
	jmp	L1046
; │└└
; │ @ REPL[1]:170 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:582 within `checked_trunc_sint'
L2486:
	movabsq	$throw_inexacterror, %rax
	movabsq	$139695095699584, %rdi  # imm = 0x7F0D4C8E5880
	movabsq	$jl_system_image_data, %rsi
	movq	%r12, %rdx
	callq	*%rax
	ud2
L2523:
	movabsq	$throw_inexacterror, %rax
	movabsq	$139695095699584, %rdi  # imm = 0x7F0D4C8E5880
	movabsq	$jl_system_image_data, %rsi
	movq	%rcx, %rdx
	callq	*%rax
	ud2
; │└└└└
; │ @ REPL[1]:175 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:998
; ││┌ @ boot.jl:709 within `Int32'
; │││┌ @ boot.jl:619 within `toInt32'
; ││││┌ @ boot.jl:582 within `checked_trunc_sint'
L2560:
	movabsq	$throw_inexacterror, %rax
	movabsq	$139695095699584, %rdi  # imm = 0x7F0D4C8E5880
	movabsq	$jl_system_image_data, %rsi
	movq	%r12, %rdx
	callq	*%rax
	ud2
L2597:
	movabsq	$throw_inexacterror, %rax
	movabsq	$139695095699584, %rdi  # imm = 0x7F0D4C8E5880
	movabsq	$jl_system_image_data, %rsi
	movq	%r15, %rdx
	callq	*%rax
	ud2
; │└└└└
; │ @ REPL[1]:170 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L2634:
	movq	%rsp, %rax
	leaq	-16(%rax), %rsi
	movq	%rsi, %rsp
	movq	$1, -16(%rax)
	movabsq	$jl_bounds_error_ints, %rax
	movl	$1, %edx
	movq	%r13, %rdi
	callq	*%rax
; │└└
; │ @ REPL[1]:175 within `perf_nbody'
; │┌ @ printf.jl:841 within `fix_dec' @ printf.jl:995
; ││┌ @ array.jl:780 within `setindex!'
L2672:
	movq	%rsp, %rax
	leaq	-16(%rax), %rsi
	movq	%rsi, %rsp
	movq	$1, -16(%rax)
	movabsq	$jl_bounds_error_ints, %rax
	movl	$1, %edx
	movq	%r14, %rdi
	callq	*%rax
	nopw	%cs:(%rax,%rax)
; └└└

Palli · August 21, 2019, 10:38am

For fasta, I was looking into whether a faster RNG would help (yes, probably disallowed by the rules, but I discovered a likely legal change). Strangely, it hangs with my choice of RNG, whatever datatype I tried (even with a cast to 32-bit for type stability).

I noticed the Julia code uses signed integers, while the fastest (currently C++) code uses unsigned. I also noticed that the RNG only returns 16 bits, I think, not 32 bits, with the rest zero-padded.

#const last_rnd = Ref(Int32(42))  # I tried changing this (and the lines above) to UInt32; that works
#gen_random() = (last_rnd[] = (last_rnd[] * IA + IC) % IM)

using RandomNumbers.Xorshifts
r = Xoroshiro128Plus(0x1234567890abcdef)  # with a certain seed. Note that the seed must be non-zero.
gen_random() = UInt32(rand(r, UInt8))

static auto get_random = [] {
        static unsigned last = 42;
        return (last = (last * Config::ia + Config::ic) % Config::im);
    };

Could [any of] you check the timing for the UInt32 change (or look into another RNG)? My change to unsigned alone should have been faster, since the assembly code is shorter, but on my old laptop it was slightly slower (but so was -O3):

Original with -O3

real	0m5,080s
user	0m4,944s
sys	0m0,192s

Original with -O2

real	0m5,076s
user	0m4,936s
sys	0m0,216s

My modified with UInt32 and -O3

real	0m5,224s
user	0m5,096s
sys	0m0,212s

My modified with -O2

real	0m5,205s
user	0m5,064s
sys	0m0,212s
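
One way to sanity-check the signed versus unsigned variant is to confirm that both produce the same stream. The sketch below assumes the constants IM = 139968, IA = 3877 and IC = 29573 (they match the immediates in the @code_native output in this post) and the seed 42 from the commented-out line above; `lcg_step` and `lcg_stream` are illustrative names, not the benchmark's actual code.

```julia
# Constants assumed from the immediates visible in the @code_native output.
const IM = 139968
const IA = 3877
const IC = 29573

# One LCG step, generic over the integer type used for the state.
lcg_step(last::T) where {T<:Integer} = (last * T(IA) + T(IC)) % T(IM)

# First `n` values of the stream, starting from seed 42, with state type `T`.
function lcg_stream(::Type{T}, n::Integer) where {T<:Integer}
    state = T(42)
    out = Vector{T}(undef, n)
    for i in 1:n
        state = lcg_step(state)
        out[i] = state
    end
    return out
end

signed_vals   = lcg_stream(Int32, 1000)
unsigned_vals = lcg_stream(UInt32, 1000)

println("streams identical: ", signed_vals == unsigned_vals)
println("first value: ", signed_vals[1])
```

If the two streams agree, any timing difference between the variants comes from the generated code rather than from the numbers being produced.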

@code_native gen_random() # For UInt32 (slightly shorter than the Int32 version, which follows):

	.text
; ┌ @ REPL[3]:2 within `gen_random'
	movabsq	$139625163228512, %rcx  # imm = 0x7EFD04418560
; │┌ @ int.jl:54 within `*'
	imull	$3877, (%rcx), %eax     # imm = 0xF25
; │└
; │┌ @ int.jl:53 within `+'
	addl	$29573, %eax            # imm = 0x7385
; │└
; │┌ @ int.jl:231 within `rem'
	imulq	$502748801, %rax, %rdx  # imm = 0x1DF75681
	shrq	$46, %rdx
	imull	$139968, %edx, %edx     # imm = 0x222C0
	subl	%edx, %eax
; │└
; │┌ @ refvalue.jl:33 within `setindex!'
; ││┌ @ Base.jl:21 within `setproperty!'
	movl	%eax, (%rcx)
; │└└
	retq
	nopl	(%rax,%rax)
; └

julia> @code_native gen_random()
	.text
; ┌ @ REPL[8]:2 within `gen_random'
	movabsq	$139793992281584, %rcx  # imm = 0x7F2453406DF0
; │┌ @ int.jl:54 within `*'
	imull	$3877, (%rcx), %eax     # imm = 0xF25
; │└
; │┌ @ int.jl:53 within `+'
	addl	$29573, %eax            # imm = 0x7385
; │└
; │┌ @ int.jl:229 within `rem'
	cltq
	imulq	$502748801, %rax, %rdx  # imm = 0x1DF75681
	movq	%rdx, %rsi
	shrq	$63, %rsi
	sarq	$46, %rdx
	addl	%esi, %edx
	imull	$139968, %edx, %edx     # imm = 0x222C0
	subl	%edx, %eax
; │└
; │┌ @ refvalue.jl:33 within `setindex!'
; ││┌ @ Base.jl:21 within `setproperty!'
	movl	%eax, (%rcx)
; │└└
	retq
	nopw	%cs:(%rax,%rax)
; └
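
In both listings the `% IM` is compiled to a multiply by a precomputed constant followed by a shift, not to a division instruction. The signed version additionally needs a sign extension (`cltq`) and a sign-bit correction (`shrq $63` followed by `addl`), which accounts for the extra instructions. A minimal sketch of the unsigned variant, using the immediates from the listing (`rem_magic` and the constant names are illustrative only):

```julia
const DIVISOR = UInt32(139968)     # IM, from the imull $139968 immediate
const MAGIC   = UInt64(502748801)  # from the imulq $502748801 immediate
const SHIFT   = 46                 # from the shrq $46 immediate

# Remainder via multiply-and-shift, mirroring the UInt32 listing:
# the quotient is (x * MAGIC) >> SHIFT, then rem = x - quotient * DIVISOR.
function rem_magic(x::UInt32)
    q = ((UInt64(x) * MAGIC) >> SHIFT) % UInt32
    return x - q * DIVISOR
end

# Spot-check against the built-in remainder, including edge cases.
samples = UInt32[0, 1, 139967, 139968, 139969, 542681632, typemax(UInt32)]
for x in samples
    println(x, " % IM = ", rem_magic(x), "  (built-in: ", x % DIVISOR, ")")
end
```

The trick works here because MAGIC * DIVISOR is only slightly larger than 2^46, so the rounding error stays small enough over the whole 32-bit input range.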

For xoroshiro there’s no multiply:

julia> @code_native rand(r, UInt64)
	.text
; ┌ @ xoroshiro128.jl:68 within `rand'
; │┌ @ xoroshiro128.jl:35 within `xorshift_next'
; ││┌ @ xoroshiro128.jl:68 within `getproperty'
	movq	(%rdi), %rcx
	movq	8(%rdi), %rax
; ││└
; ││ @ xoroshiro128.jl:37 within `xorshift_next'
; ││┌ @ int.jl:317 within `xor'
	movq	%rcx, %rdx
	xorq	%rax, %rdx
; │└└
; │┌ @ int.jl:53 within `xorshift_next'
	addq	%rcx, %rax
; │└
; │┌ @ xoroshiro128.jl:38 within `xorshift_next'
; ││┌ @ common.jl:1 within `xorshift_rotl'
; │││┌ @ int.jl:316 within `|'
	rolq	$24, %rcx
; ││└└
; ││┌ @ int.jl:317 within `xor'
	xorq	%rdx, %rcx
; ││└
; ││┌ @ int.jl:446 within `<<' @ int.jl:439
	movq	%rdx, %rsi
	shlq	$16, %rsi
; │└└
; │┌ @ int.jl:317 within `xorshift_next'
	xorq	%rcx, %rsi
; │└
; │┌ @ xoroshiro128.jl:38 within `xorshift_next'
; ││┌ @ Base.jl:21 within `setproperty!'
	movq	%rsi, (%rdi)
; │└└
; │┌ @ int.jl:316 within `xorshift_next'
	rolq	$37, %rdx
; │└
; │┌ @ xoroshiro128.jl:39 within `xorshift_next'
; ││┌ @ Base.jl:21 within `setproperty!'
	movq	%rdx, 8(%rdi)
; │└└
	retq
	nopl	(%rax)
; └
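
Read back into Julia, the listing corresponds roughly to the step below. This is a hand-written sketch of a xoroshiro128+ update using the rotate and shift constants 24, 16 and 37 visible in the assembly; `XoroSketch`, `rotl64` and `next!` are hypothetical names, and this is not the RandomNumbers.jl source.

```julia
# Two 64-bit words of state, matching the loads from (%rdi) and 8(%rdi).
mutable struct XoroSketch
    s0::UInt64
    s1::UInt64
end

# 64-bit rotate left; valid for 0 < k < 64.
rotl64(x::UInt64, k::Int) = (x << k) | (x >> (64 - k))

function next!(g::XoroSketch)
    s0, s1 = g.s0, g.s1
    result = s0 + s1                              # addq: returned value, wraps mod 2^64
    t = xor(s0, s1)                               # xorq
    g.s0 = xor(xor(rotl64(s0, 24), t), t << 16)   # rolq $24, xorq, shlq $16, xorq
    g.s1 = rotl64(t, 37)                          # rolq $37
    return result
end

g = XoroSketch(1, 2)
println(next!(g))
```

The whole step is adds, xors, shifts and rotates, which is consistent with the observation that there is no multiply (and no division) in the generated code.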

Olof_Salberger · August 21, 2019, 1:29pm

List of comparisons just got changed on the website. It now has a comparison to C and to SB Common Lisp instead of Chapel.

kristoffer.carlsson · August 21, 2019, 2:48pm

I’ve created the JuliaPerf organization and moved the BenchmarksGame.jl repo there https://github.com/juliaperf/BenchmarksGame.jl. I’ve also invited @non-Jedi as an owner to that organization.

The BenchmarksGame.jl repo supports correctness checking and performance checking so I feel it would be useful if we could collect the community efforts in improving the benchmarks to that repo. Feel free to maintain the repo as you wish or ignore it if you feel it isn’t useful.

StefanKarpinski · August 21, 2019, 4:01pm

How about calling it JuliaPerf since caps is pretty idiomatic for Julia orgs?

Karajan · August 21, 2019, 4:09pm

The RNG can’t be changed as all implementations need to have the same output (that’s how correctness is checked), therefore the same stream of random numbers, therefore the same RNG.

Using unsigned ints would definitely be valid though, not sure if I forgot to test that or if it was slower…

I also noticed that the RNG only returns 16 bits, I think, not 32 bits, with the rest zero-padded.

Hm… IM > typemax(Int16) so I think Int32/UInt32 is correct here.
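
A quick check of the ranges involved, assuming IM = 139968, IA = 3877 and IC = 29573 (taken from the immediates in the assembly listings earlier in the thread); the variable names are just for illustration:

```julia
modulus, multiplier, increment = 139968, 3877, 29573

# The generator returns values in 0:modulus-1.
max_output = modulus - 1
output_bits = ndigits(max_output, base=2)

# Largest value computed before the remainder is taken.
max_intermediate = (modulus - 1) * multiplier + increment
intermediate_bits = ndigits(max_intermediate, base=2)

println("max output = ", max_output, " (", output_bits, " bits)")
println("max intermediate = ", max_intermediate, " (", intermediate_bits, " bits)")
println("output fits Int16?  ", max_output <= typemax(Int16))
println("output fits UInt16? ", max_output <= typemax(UInt16))
println("intermediate fits Int32? ", max_intermediate <= typemax(Int32))
```

Under these assumptions the outputs need 18 bits, so 16-bit types are too small, while the intermediate product needs 30 bits and therefore fits in either Int32 or UInt32.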

I’ve created the JuliaPerf organization and moved the BenchmarksGame.jl repo there

That’s a nice idea, I’ll see if I can cook up a threaded fasta version there.

kristoffer.carlsson · August 21, 2019, 5:02pm

I wrote that, github changed it to lower case, I deleted the org and tried again, github changed it again… No idea why.

StefanKarpinski · August 21, 2019, 5:17pm

Maybe write a support message to GitHub? Not being able to use capitals in org names is a significant regression.

StefanKarpinski · August 22, 2019, 2:38pm

GitHub support agrees that this is a bug not an official policy change. You can change the name of the org after it’s created to have caps, so maybe try that.

kristoffer.carlsson · August 22, 2019, 3:32pm

Renamed!

Karajan · August 23, 2019, 5:29pm

Inspired by the Mary McGrath talk at JuliaCon I played around with the benchmark data because I wanted to see the correlation between code length and performance. The data is averaged over the best entry of each language for all benchmarks:

I think Julia comes away from that quite nicely.

I’m not quite sure if there is a mistake in my calculations because the plot on the benchmark site has a different ordering for, e.g., Chapel and Haskell. I haven’t been able to pin down why.

Ada is not shown because its gzipped code size of ~2700 made the plot rather unreadable.

StefanKarpinski · August 23, 2019, 5:47pm

I knew that Ada was verbose, but dang. I would have thought that compression would reduce some of the bloat. What about non-compressed code size? While I tend to agree that the difference between fn as a keyword and function as a keyword shouldn’t matter, I do think that general verbosity matters. For example, if you always have to write String s = String(...) that’s pretty compressible but it’s still really annoying.

non-Jedi · August 23, 2019, 6:00pm

Did you remove comments and consecutive whitespace and use only gzip --fast? Afaict, that’s the procedure on the site, and it’d make a big difference for some cases. He might also be including all benchmark implementations rather than just the fastest ones.

As a side note, I’ve been very impressed by the Chapel code I’ve seen in the benchmarks. Its conciseness and clarity combined with speed have been shocking at times. Of course, other times its non-explicit handling of state and concurrency has been puzzling.

Karajan · August 23, 2019, 6:11pm

Oh, I just scraped all the info from the pages (e.g. this one) so the data should be identical to whatever he did. So for now I don’t have access to the raw source files, but I might do that later.

He might also be including all benchmark implementations rather than just the fastest ones.

That might be possible but unlikely, I think – the data overall is just too fitting. For example, for the mandelbrot benchmark Julia has two implementations: one fast one (factor 3.0) and one slow one (factor 29). If both were considered equally that would really screw our results.
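
To make the effect concrete, here is a small sketch using the two mandelbrot factors quoted above. How the site would combine multiple entries is an assumption here, so both an arithmetic and a geometric mean are shown next to the best-entry value.

```julia
factors = [3.0, 29.0]   # slowdown factors of the two Julia mandelbrot entries

best      = minimum(factors)                      # best entry only
arith     = sum(factors) / length(factors)        # both entries, arithmetic mean
geometric = prod(factors)^(1 / length(factors))   # both entries, geometric mean

println("best only:       ", best)
println("arithmetic mean: ", arith)
println("geometric mean:  ", round(geometric, digits=2))
```

Either way of averaging pushes the mandelbrot figure well above the best-entry factor of 3.0, which would shift Julia's position noticeably if all entries were counted.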
