Julia programs now shown on benchmarks game website

Hopefully, now that Julia 1.1.0 has landed, some more programs from BenchmarksGame.jl will make it to the benchmarks game website.

Doesn’t look like it.

I expect the program authors are busy doing other stuff, and will eventually contribute their programs to the benchmarks game.

2 Likes

Benchmarks game website added comparisons to Fortran and Chapel.

I wrote my own benchmark for reverse complement since it is currently the slowest up there, and managed to bring down memory use to competitive with the best from other languages. Haven’t submitted anything though.

Still not fully optimized and mostly written for readability (hopefully). Feel free to suggest improvements.

4 Likes

Unfortunately, the benchmarks game site doesn’t update the results.

Made some further changes. Just submitted my benchmark as a PR on their repo. I don’t expect too much from it.

I think that having a decent showing on the benchmarks game is a relatively important thing to do from a PR perspective if we want to tell people that the language is fast. The language has been post-1.0 for almost a year now.

2 Likes

Nice!
Your code looks, however, like you are not doing a “read line-by-line”: body = readuntil(instream, UInt8('>'))? This is the part where I broke my neck trying to reach Kristoffer’s speeds, because you can’t really allocate memory for the whole string until you know how long it’s going to be.
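For readers following along, the delimiter-based approach described above can be sketched roughly like this; the function and variable names (`read_records`, `records`) are mine for illustration, not from the actual submission:

```julia
# Rough sketch of record-at-a-time reading with readuntil: instead of
# reading line by line, pull everything up to the next '>' header marker
# into one byte vector in a single call.
function read_records(io::IO)
    records = Pair{String,Vector{UInt8}}[]
    readuntil(io, UInt8('>'))             # skip anything before the first header
    while !eof(io)
        header = readline(io)             # rest of the '>' line
        body = readuntil(io, UInt8('>'))  # sequence bytes up to the next record (or EOF)
        push!(records, header => body)
    end
    return records
end

io = IOBuffer(">seq1\nACGT\nTTGG\n>seq2\nGATTACA\n")
recs = read_records(io)  # two records: "seq1" and "seq2"
```

Each `body` here still contains the embedded newlines; the point is only that the whole sequence arrives as one allocation rather than line by line.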

I got down to maybe 40% of the current time, while Kristoffer & Crew are somewhere around 15% on my computer :stuck_out_tongue:

Yeah.

Ugh. That’s an absolutely stupid way to read a big block of text into memory, or even to parse a formatted stream.

Okay, that can still be made fast with BufferedStreams.jl at the cost of extra memory use. I’ll write a version based on that, and rewrite a basic implementation of InputStreams myself if I get complaints about dependencies. The Go solution gets to use bufio. The big question is whether or not you are allowed to use anchors.

I think your code is fine, and the point is simply that it must work interactively: whenever you receive enough input that there is a newline, you need to process it and flush the output; you are not allowed to wait (block) for more input. It is not permissible to postpone processing until EOF.
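A minimal sketch of that interactive constraint, with a placeholder `transform` standing in for whatever per-line processing the benchmark actually requires:

```julia
# Process each complete line as soon as it arrives and flush the output,
# rather than buffering until EOF. `transform` is a placeholder for the
# real per-line work; `stream_lines` is an illustrative name.
function stream_lines(input::IO, output::IO, transform)
    for line in eachline(input)
        println(output, transform(line))
        flush(output)   # emit immediately; never block waiting for more input
    end
end

buf = IOBuffer()
stream_lines(IOBuffer("acgt\nggcc\n"), buf, uppercase)
String(take!(buf))  # "ACGT\nGGCC\n"
```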

FASTA is one line per gene, if I remember correctly, so each line here could be, say, 5–40 KB and the whole file could be a couple of gigabytes. You can’t just slurp the whole file into RAM; well, these days you can, but 20 or more years ago, when the format was invented, you couldn’t.

In any case, it’s not reading 80 chars at a time.

You are right, it’s 60 chars at a time (from a 1 GB file) :wink:

And at least for this benchmark you pretty much have to read it all into memory (look at the memory use), because you have to reverse the whole thing and you can’t really do that before you have read to the end.
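For illustration, the in-memory reversal step could look something like this over a newline-free byte vector; the lookup table and the `revcomp!` function are my sketch, not the submitted program:

```julia
# Complement lookup table: 256 entries, indexed by byte value + 1
# (Julia arrays are 1-based). Only ACGT/acgt are filled in here.
const COMP = zeros(UInt8, 256)
for (a, b) in zip("ACGTacgt", "TGCATGCA")
    COMP[UInt8(a) + 1] = UInt8(b)
end

# Reverse-complement the whole sequence in place: swap mirrored positions
# while complementing both. Writing position 1 requires knowing the last
# byte, which is why the full sequence must already be in memory.
function revcomp!(seq::Vector{UInt8})
    i, j = 1, length(seq)
    while i <= j
        seq[i], seq[j] = COMP[seq[j] + 1], COMP[seq[i] + 1]
        i += 1; j -= 1
    end
    return seq
end

String(revcomp!(Vector{UInt8}("AACG")))  # "CGTT"
```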

2 Likes

Update: over the past few months, it looks like a number of people did some amazing work submitting benchmarks, and Julia now has a very respectable showing, ahead of Swift and Go:

I also think that there’s still quite a bit of additional performance that can be squeezed out. But I think this is a great showing that will help steer more people towards the language.

9 Likes

Kind of happy to see Pascal. One of my first languages

3 Likes

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/fasta-julia-4.html

The best time I get on MY decade-old laptop is (i.e. NOT with 4 or 2 threads):

time ~/julia-1.1.0/bin/julia  -O3  -- fasta.jl 25000000 >/dev/null
real 0m6,066s
user 0m4,928s
sys 0m0,232s

While I regularly get under 5 sec (for “user”) on 1.1.0, I’m unable to get less than 5.1 sec on julia-1.3.0-alpha, with any optimization level or number of threads.

At least I get slower results (for both runs below) with export JULIA_NUM_THREADS=4:

time ~/julia-1.1.0/bin/julia  -O2  -- fasta.jl 25000000 >/dev/null
real 0m6,385s
user 0m4,984s
sys 0m0,212s

time ~/julia-1.1.0/bin/julia  -O3  -- fasta.jl 25000000 >/dev/null
real 0m6,479s
user 0m5,008s
sys 0m0,196s

MY best on julia-1.3.0-alpha (best combination, i.e. NOT -O3, and more threads were not faster):

export JULIA_NUM_THREADS=1
time julia -O2  -- fasta.jl 25000000 >/dev/null
real 0m6,372s
user 0m5,212s
sys 0m0,164s

Please check these settings on your machines when submitting benchmarks: whether LOWER optimization levels (or older Julia versions) are faster, whether disabling threading helps, and what number of threads is fastest (this may say more about my Core Duo laptop, or perhaps threads have startup overhead?). Best-case startup for me is “real 0m0,313s” on julia-1.3.0-alpha (it can easily be “real 0m0,390s”), and 1.1 is just slightly slower at best, “real 0m0,330s”, but I’ve seen “real 0m0,958s”.

Could it be that -O0 disables threads? On MY machine in the test below, 1 thread is better for -O3 (and -O0).

I found -O3 to be 46% SLOWER than -O0 on “real” time (when both were CONFIGURED for 4 threads), and 42% slower with the best settings for both (2,572s vs. 1,813s “real” time; even worse on “user” time, 61% slower, 1,992s vs. 1,236s). This was all when I was first testing (for the runs below, not the ones above) using the shorter test file, not the longer one actually used in the benchmark to make it long-running; still useful to know how it affects speed:

export JULIA_NUM_THREADS=1
time ~/julia-1.3.0-alpha/bin/julia -O0  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt 
real 0m1,813s
user 0m1,296s
sys 0m0,204s

also got:

real 0m2,047s
user 0m1,236s
sys 0m0,272s

export JULIA_NUM_THREADS=4
time ~/julia-1.3.0-alpha/bin/julia -O3  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt
real 0m2,887s
user 0m2,144s
sys 0m0,188s

export JULIA_NUM_THREADS=4
time ~/julia-1.3.0-alpha/bin/julia -O0  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt
real 0m2,010s
user 0m1,392s
sys 0m0,184s

export JULIA_NUM_THREADS=1
time ~/julia-1.1.0/bin/julia -O3  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt
real 0m2,572s
user 0m2,048s
sys 0m0,180s

[As elsewhere here, I’m always going for the lowest “user”; I have seen lower “real”, but then “user” was higher.]

time ~/julia-1.3.0-alpha/bin/julia -O3  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt 
real 0m2,829s
user 0m1,992s
sys 0m0,196s
time julia --compile=min  -- kn.jl 0 < ~/Downloads/knucleotide-input.txt 
real 0m6,511s
user 0m4,820s
sys 0m0,232s

Update: over the past few months, it looks like a number of people did some amazing work submitting benchmarks, and Julia now has a very respectable showing, ahead of Swift and Go:

Yes, a bunch of the work has been done, mostly by the amazing @non-Jedi (currently 4/10 top Julia programs).

I also think that there’s still quite a bit of additional performance that can be squeezed out.

Quite possibly, yes.

  • pidigits: all GMP calls anyways, so not too much hope here
  • revcomp: I’m currently trying to get a buffered version to be accepted. With 1.3 I’ll try to do some multithreading.
  • fasta: 1.3 multithreading will help for sure
  • nbody: … Adam is currently fighting with this one, maybe some SIMD wizards can help out. Not sure how Rust manages to be this much faster on pure number crunching.
  • knuc: maybe multithreading helps, maybe some more hacks regarding the hash function… not quite sure.
  • binarytrees: because Julia provides GC, there is not much that can be done here, I think
  • With the other ones I’m not sure because I haven’t tried them. Many of them are quite fast already, though.
4 Likes

I do wonder why this one runs so much slower multi-threaded than it does with multiple processes. Is there something to do with heap-allocating and garbage collection that’s inherently not amenable to multi-threaded environments? If we could use multi-threading instead of multiple processes, that would take a rather large chunk off the run time.

To expand on the list off the top of my head:

  • regex-redux: could become faster once 1.3 lands with new threading runtime (and thread-safe regex) since execution time is dominated by a strictly non-parallelizable task that could be started on a single thread ahead of other work.
  • mandelbrot: isn’t currently using all cpu cores effectively compared to other implementations. I haven’t identified why yet.
  • revcomp: in addition to the buffered read @Karajan is working on, this could also benefit from the new threading runtime, to start working on a specific sequence while still reading input. It may also be possible to speed up the reversal of each sequence using multi-threading if you divide it into “chunks” instead of just using a naive Threads.@threads loop over the array; this requires removing newlines from the array being reversed, which is slightly different from @Karajan’s current fastest implementation.
  • knuc: Julia’s hash map in general seems slower than some other implementations; not sure why. There’s an opportunity for better usage of cpu cores by implementing parallelism within the counting of each frame instead of around it.
  • fasta: obvious opportunity to parallelize, but I haven’t taken the time to grok what the benchmark is actually doing yet.
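The chunked reversal idea from the revcomp bullet above could be sketched like this, assuming newlines have already been removed from the byte vector, as that bullet notes; all names here are mine, not from any submitted program:

```julia
using Base.Threads  # Threads.@threads, nthreads

# Split the first half of the vector into contiguous chunks and let each
# task swap its chunk against the mirror-image region at the other end,
# instead of a naive element-by-element threaded loop.
function parallel_reverse!(v::Vector{UInt8}, nchunks::Int = nthreads())
    n = length(v)
    half = n ÷ 2                 # the middle element (if any) stays put
    chunk = cld(half, nchunks)   # ceiling division: chunk size per task
    @threads for c in 1:nchunks
        lo = (c - 1) * chunk + 1
        hi = min(c * chunk, half)
        for i in lo:hi           # empty range if this chunk has no work
            j = n - i + 1
            v[i], v[j] = v[j], v[i]
        end
    end
    return v
end

String(parallel_reverse!(Vector{UInt8}("abcdef")))  # "fedcba"
```

Each chunk touches a disjoint pair of regions, so no synchronization is needed between tasks; a real revcomp version would combine the swap with the complement lookup.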
5 Likes

The one I don’t understand is nbody: it takes 3.2 sec on my computer to run, and the website says that the benchmark takes 22 sec. I don’t think my computer should be that much faster…

EDIT: The code here: https://github.com/KristofferC/BenchmarksGame.jl/blob/master/nbody/nbody-fast.jl is even faster (and simpler to understand) at 2.8sec.

Which benchmark? You probably made the same mistake I did, running one with a shorter test file. As I did for the second benchmark in my comment above.

Here: n-body Julia #3 program (Benchmarks Game)

I get the same output so I think I’m running the right thing.