Keeping scientific objectivity and details in benchmark reports

There’s not that much you can do HPC-wise with 16 GB of RAM. If the M2 lets you go up to 64 GB (at a minimum), it will start to be able to solve some of the same problems.

1 Like

@ChunXi_Zhang ARM chips are definitely being used in High Performance Computing. Your comment about memory access is very relevant also.
The fastest supercomputer is Fugaku, which uses an ARM chip and High Bandwidth Memory.
In the United Kingdom we have the Isambard system which is being used to evaluate ARM for the next generation of large HPC clusters.

4 Likes

NumPy is quite optimized; for vectorized operations it has performance mostly similar to Julia’s (Julia has the edge in more complex calculations, e.g. when it can avoid allocations, but this may or may not be relevant for the concrete use case).
But for many real-world problems the hard work is expressing the problem in a “vectorized” way so that NumPy can compute it efficiently. In Julia you are more flexible, because you can write vectorized code or loops with similar performance.
But if you have already written your code vectorized with NumPy (and managed to avoid all the performance pitfalls), converting it to Julia will usually not give you a 5x speedup (unless, in the process, you discover and remove algorithmic issues, perhaps ones caused by the vectorization requirements).
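
A minimal sketch of that flexibility (my own toy example, not from this thread): in Julia, a plain loop matches, and often beats, the “vectorized” form, because no temporary array is needed.

      using BenchmarkTools

      x = rand(10^6)

      vectorized(x) = sum(x .^ 2)   # allocates a temporary array, NumPy-style

      function looped(x)            # explicit loop, no temporary
          s = zero(eltype(x))
          for xi in x
              s += xi^2
          end
          return s
      end

      @btime vectorized($x)
      @btime looped($x)             # typically as fast or faster, zero allocations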

3 Likes

I don’t know if you have used macOS for something memory-heavy; my personal experience is that macOS can handle very large virtual memory use without losing much performance, e.g. consuming 64 GB of memory on a 16 GB machine while the system still runs fast. The same program won’t run on Ubuntu, where everything starts to slow down, or the whole system freezes, once Linux starts to use swap.
But since Mac hardware, especially the GPU, is quite limited, nobody really uses it for computing anymore. To me this is kind of like black magic.

I also have experience with Spark’s standalone mode on a machine with 16 GB of memory: when dealing with very large data, above 100 GB, the software seems to handle it pretty well.

What I’m saying is that there may be many things we can do in software so that we don’t need that much memory to deal with data much larger than RAM.
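
One such software technique (a sketch with a hypothetical file name, not the poster’s Spark setup) is memory mapping, which lets you scan data larger than RAM by letting the OS page it in on demand:

      using Mmap

      # assume "data.bin" is a raw file of Float64s, possibly much larger than RAM
      open("data.bin", "r") do io
          n = filesize(io) ÷ sizeof(Float64)
          A = Mmap.mmap(io, Vector{Float64}, n)  # maps the file, no full read into RAM
          println(sum(A) / n)                    # the OS pages data in as it is scanned
      end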

The biggest bottleneck in my case is often graphics memory: the size of my model is always limited by VRAM, and that I find very hard to bypass. So maybe with an ARM-based system it will be easier to get access to a large chunk of unified memory for parallel computing.

1 Like

Do people who use these computers use some different programming scheme? Or do they just port their x86 code to them?

I recently wrote some CUDA kernels and found them a lot faster than the equivalent array programming, with more freedom to express logic.
Python people use vectorization for fast CPU code, but Julia people don’t need vectorization for speed; Julia’s approach gives you both more freedom and more performance.
These are examples of how, under different platforms/languages/hardware, the optimized programming scheme can be very different.
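
To illustrate that freedom, here is a rough CUDA.jl sketch (the kernel and its name are my invention, not the poster’s code): a per-element branch that is awkward to fuse into one array expression is trivial inside a kernel.

      using CUDA   # requires an NVIDIA GPU

      # fuse a per-element branch and update into a single kernel, no temporaries
      function saxpy_clamped!(y, x, a)
          i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
          if i <= length(y)
              v = a * x[i] + y[i]
              y[i] = v < 0f0 ? 0f0 : v
          end
          return nothing
      end

      x = CUDA.rand(10^6)
      y = CUDA.rand(10^6)
      @cuda threads=256 blocks=cld(length(y), 256) saxpy_clamped!(y, x, 2f0)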

We don’t have a very fast integrated GPU yet, but with a future release of the M2 chip, maybe it can be on par with or faster than my 2080 Ti, and also have much larger RAM. So we may need a new programming scheme for these devices; I’d imagine it being similar to CUDA but with no need to transfer between host and device.
I dream of something like this because it would be accessible to normal people, unlike those supercomputers.

I think the biggest impact of the M1 Mac will be the wider adoption of ARM CPUs in the desktop market, which might incentivize even more use of ARM on servers. Those A64FX chips are incredible, but they are only available in supercomputers. And while there are other options, like the Neoverse chips, you still can’t buy them as easily as a Threadripper or the HEDT Intel parts. Maybe one day, I hope.

1 Like

I’m not sure the point of the article is making ‘fair’ comparisons though. You titled your post “keeping scientific objectivity and details in benchmark reports”, but the part of the article you took issue with wasn’t a benchmark report. It was simply an anecdote about the result of a research group changing the implementation of their code.

The fact that they reported being willing to take a 3x performance loss should be a clue that performance wasn’t their number-one priority, and that they were instead more interested in things like the person-hours needed to write and maintain the code, as well as targeting a language with a larger package ecosystem and a larger pool of users.

I know that in my field, young graduate students are increasingly hesitant to invest their time in learning Fortran in order to maintain their supervisors’ old code, and supervisors are increasingly hesitant to demand that their students learn Fortran because they’re concerned about the future of the language and they want to make sure their students leave their PhD with relevant skills that are in demand.


Of course, benchmarks are very important, and when one is actually benchmarking something, they should make sure they do a good job and actually give fair comparisons. But given the wealth of existing benchmark work to draw on, I personally feel pretty convinced that if one is choosing between C, C++, Fortran, and Julia, a true comparison on equivalent algorithms, optimized by experts, won’t show very big differences.

The fact that these languages are able to squeeze out similar levels of performance once they’re fully optimized should make one start asking other questions, like

  • Was it hard to make that high performance implementation? Did it feel natural?
  • If I wanted to change from Float32 to Float64 or ComplexFloat128 or whatever, how much code will I need to rewrite?
  • How is the language tooling for my use-cases?
  • Do many of my colleagues use this language?
  • Will many of my colleagues be using this language in the future?
  • Is there a good package ecosystem in this language?
  • Did I enjoy using this language?

These are the sorts of questions that don’t really have objective, scientific answers.

12 Likes

I think we all recognize that getting a 5x improvement is plausible for rewriting a large system without changing the language – algorithmic improvements, redesign, whatever.

Since the language change per se isn’t going to help performance much, I find it annoying that performance improvement stories are promoted so heavily in Julia’s community. I think this bugs other people too, like in Keno’s comment here. Performance isn’t where Julia’s advantage is (compared to other fast languages).

People like reporting numbers in case studies, but I would rather they report numbers that are more reflective of the changes that switching to Julia actually brings – rather than those it doesn’t bring (but which would occur during a redesign without changing language, namely performance).

In fact, I would rather not mention performance at all, rather than reporting silly claims like “faster than Fortran” that (1) irritate other communities, (2) reduce trust of the Julia community, and (3) confuse our own users.

@Mason mentions that

I’m not so pessimistic about the quantifiability of these properties. For example: “we reduced the size of our codebase by 2/3 because most of it is in published packages”, or “a new team member takes half as much time to become productive”, or “changing from Float32 to Float64 took one line in Julia vs. n lines in C++”. Those are real advantages of Julia. We should be able to quantify and report them objectively.
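
As a hedged sketch of the Float32/Float64 point (my example, not anyone’s real code): generic Julia code leaves the element type as a parameter, so the precision switch happens at the call site.

      # the function body never names a concrete float type
      rms(x::AbstractVector{T}) where {T} = sqrt(sum(abs2, x) / T(length(x)))

      rms(rand(Float32, 100))   # Float32 throughout
      rms(rand(Float64, 100))   # Float64 throughout; nothing rewritten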

I would like to see more restraint in reporting performance results, and more enthusiasm in reporting design and development-experience results.

3 Likes

Sure, you can at least claim to quantify these things, but a lot is lost when you do so, and it ends up biasing your presentation, because the aspects that are more amenable to quantification end up being weighted more heavily, regardless of how important they actually are.

IMO that’s why language comparisons inevitably turn into benchmark pissing contests to the detriment of all else. Benchmarks give you a number, and numbers are ‘easy’ to think about and can be presented as being ‘objective’.

To take your codebase-size example: yes, it’s nice to say “we reduced the size of our codebase by 60%” or whatever, but the actual number of characters in a codebase isn’t really as important as how digestible the codebase is. E.g. if you reduce 20 lines of code to a one-liner, but that one-liner is inscrutable nonsense, then you’ve probably increased your maintenance burden rather than decreased it. And how digestible a codebase is will depend greatly on the person reading it. E.g. APL experts may tell you that this

      life ← {⊃1 ⍵ ∨.∧ 3 4 = +/ +⌿ ¯1 0 1 ∘.⊖ ¯1 0 1 ⌽¨ ⊂⍵}

is incredibly clear, but to me it’s just nonsense.
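
For contrast, here is roughly what that same Game-of-Life step might look like in Julia (my sketch, assuming a Bool matrix with periodic boundaries); whether this is more digestible is exactly the kind of judgment that resists quantification.

      # one Game-of-Life step on a Bool matrix `w`, wrapping at the edges
      function life(w::AbstractMatrix{Bool})
          # count each cell plus its 8 neighbours via shifted copies of the grid
          n = sum(circshift(w, (i, j)) for i in -1:1, j in -1:1)
          # alive next step if the count is 3, or is 4 with the cell itself alive
          return @. (n == 3) | (w & (n == 4))
      end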

4 Likes

Everything in science is quantifiable. If something cannot be quantified, then it is not science.

I know that in my field, young graduate students are increasingly hesitant to invest their time in learning Fortran in order to maintain their supervisors’ old code, and supervisors are increasingly hesitant to demand that their students learn Fortran because they’re concerned about the future of the language and they want to make sure their students leave their Ph.D. with relevant skills that are in demand.

As part of a larger project, I recently presented a research study at the American Physical Society March Meeting, based on a survey of 76,000 industry jobs in the US. I attach a summary of one result from this project to this comment.

Was it hard to make that high-performance implementation? Did it feel natural?
If I wanted to change from Float32 to Float64 or ComplexFloat128 or whatever, how much code will I need to rewrite?
How is the language tooling for my use-cases?
Do many of my colleagues use this language?
Will many of my colleagues be using this language in the future?
Is there a good package ecosystem in this language?
Did I enjoy using this language?

Again, I agree with @jzr that all of these are quantifiable. A scientific report must be quantified and reproducible to be reliable; otherwise, it is fiction and unreliable.

2 Likes

Exactly. We shouldn’t be presenting numbers that make Julia look faster than Fortran, because it isn’t.

If we’re going to compare two languages, it’s fair to assume the users of each language are competent in that language. An APL user understands from the visual form of the symbols that prepending a matrix with ⊖ means “flip it across the horizontal axis”, ⌽ means “flip it across the vertical axis”, and ⍉ means “flip it across the diagonal (transpose it)” (awesome; honestly this makes me smile). Switching from C++ to APL does mean fewer characters are required, and often means less logic (e.g. because of its nice broadcasting semantics, its composability, and how it makes the semantic relationships between operations clear). (See also Notation as a Tool of Thought, Iverson’s Turing Award lecture.)

That said, we know

When a measure becomes a target, it ceases to be a good measure.

In that sense, I agree with your concern that people will start to game whatever metrics we pick. But it really is realistic that Julia’s ecosystem has already written a lot of logic that I don’t need to write myself (e.g. broadcasting is built in, uncertainty propagation is easy, etc.). Maybe the metrics we proposed offhand are too easy to game or not reflective of real benefits. But I am confident it’s possible to measure and report objective, useful information about real-world programming, and that is something we should try to do, while avoiding misleading, headline-grabbing benchmark numbers.
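
On the uncertainty-propagation point, for example, a few lines with Measurements.jl (assuming that package; this is my illustration, not from the post) already do real work:

      using Measurements

      a = 5.2 ± 0.3    # a value with an uncertainty
      b = 1.9 ± 0.1
      a * b + sin(a)   # the uncertainty propagates automatically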

Yep. When I show colleagues (R and Python users) bits of Julia code (to show off), it’s all about the ease of broadcasting, the advantages of multiple dispatch, the use of Unicode math notation, and simple parallelization. Speed is cool. But then they also have great packages with all the hard work done in C and Fortran (once they’ve figured out how to vectorize everything and avoid function calls).
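
A small sketch of the kind of snippet I mean (my own toy example): Unicode names, broadcasting with a dot, and a parallel loop in a handful of lines.

      σ(x) = 1 / (1 + exp(-x))    # Unicode math notation for a sigmoid

      xs = randn(10^6)
      ys = σ.(xs)                 # broadcasting: apply σ element-wise with one dot

      zs = similar(xs)
      Threads.@threads for i in eachindex(xs)   # simple parallelization
          zs[i] = σ(xs[i])
      end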

1 Like

That depends on the purpose of the comparison. Everybody can honestly describe a personal experience with a language (or with anything else). Personal experiences may or may not be useful to other people, but claiming that whoever presents them is dishonest is simply beside the point. My personal experience is that I had 20 years of Fortran programming experience, and within one year of using Julia I was able to write faster code in Julia than I ever had in Fortran.

Why? The simplest part is having a REPL and a practical benchmarking macro, which allowed me to dissect my code much better; I improved the algorithms that way. Harder to explain is how everything I learned in these forums made me a better programmer overall. I would not write my Fortran codes the way I did if I were starting now. And this forum and the learning experience are what they are also because Julia is what it is: one is in contact with high-level coding experiences and with very low-level concepts at the same time.
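
For anyone unfamiliar, the benchmarking macro meant here is presumably @btime from BenchmarkTools; a minimal example of that dissect-at-the-REPL workflow:

      using BenchmarkTools

      x = rand(1000)

      @btime sum($x)          # time and allocations of one candidate
      @btime sum(abs2, $x)    # compare a variant without writing a test harness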

I see absolutely no problem in saying that a tool (a language, a profiler, a debugger, a text editor) allowed one to improve one’s code in ways that make it faster. The fact is that Julia is actually delivering what it proposed to deliver, and a lot of people who have not dug into it are skeptical about that.

12 Likes

I like your points, @lmiq. I’ll clarify:

This is not something I intend to claim. I agree with you that the purpose of the comparison is critical, but I think it often gets lost in the contests Mason described.

The intent of my “If we’re going to compare two languages, it’s fair to assume the users of each language are competent in that language.” was to respond to the claim that APL looks like nonsense – that is only true for someone who hasn’t yet learned APL.

2 Likes

At SC19, the International Conference for Supercomputing in 2019, Julia co-creator Alan Edelman recounted how a group at the Massachusetts Institute of Technology (MIT) rewrote part of their Fortran climate model in Julia. They had determined ahead of time that they would tolerate a 3x slowdown of their code; in their view, that was an acceptable tradeoff for access to a high-level language with higher productivity. Instead, they got a 3x speed boost by going over to Julia.

I remember I wrote a question about this topic on HackerNews (here).
Is there possibly some online video (of the talk) or more detailed info about this project?

It’s here

1 Like

I see this extreme line of thinking as a rather big issue of our time. It seems to be extremely positivistic. IMO, the quantifiable part of the world is not the world. It’s just the quantifiable part of the world. But, anyways, it’s perhaps better to not go down this road in this thread. (Just couldn’t leave this entirely uncommented.)

29 Likes

The problem with this approach is that a proper experiment (e.g. train two groups of people, each in one of the languages and some scientific field, give them a hard problem, log their work hours, and see how fast the solution is) would be prohibitively expensive or practically infeasible, so people settle for silly proxies like the runtime of code with an unquantified amount of micro-optimization.

The outcome is basically random or a tie, as with enough effort you can optimize code to within an inch of its life in both Fortran and Julia.

So you are right, currently there is little science for comparing programming languages. The best approach for most people is spending some time programming in a language, and getting a feel for it. Anecdotes are fine for motivating exploration like that, but ultimately people have to form their own opinion.

20 Likes

I worked on performance measurement, estimation, and prediction for what was called, back then, massively parallel computing (~16…32 nodes) in the mid-90s: Fortran 77, commercial compilers, the i860, etc.

One thing I still remember is the quote: “benchmarks do lie, and liars do benchmarks”.

You can measure system performance or application performance on a dedicated system, but it is hard, close to impossible, to track a given performance gain down to a language or a certain compiler and to reproduce that gain on a second system. It is obscure to a lot of users and developers how big the impact of memory management and memory hierarchies actually is.
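
A toy illustration of that memory-hierarchy impact (my example, not from the 90s work): Julia arrays are column-major, so traversal order alone changes the timing noticeably, with identical arithmetic.

      using BenchmarkTools

      A = rand(2000, 2000)

      function sum_cols(A)    # first index innermost: contiguous access
          s = 0.0
          for j in axes(A, 2), i in axes(A, 1)
              s += A[i, j]
          end
          return s
      end

      function sum_rows(A)    # strided access: far more cache misses
          s = 0.0
          for i in axes(A, 1), j in axes(A, 2)
              s += A[i, j]
          end
          return s
      end

      @btime sum_cols($A)
      @btime sum_rows($A)     # same arithmetic, noticeably slower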

From my perspective, Julia can enable the developer to write fast code with less effort, but this requires an experienced developer and some insight into efficient algorithms (read as: use a library/package).

And as some commenters above mentioned: there is a blurry line between benchmarking and benchmarketing…

8 Likes

We should probably have a new thread here.
Agreeing with @Tamas_Papp and @lobingera: what is important is time to solution (getting your results in a timely fashion) and also code reusability and maintenance. I think Julia has much to offer here.

Regarding time to solution, I usually give the example of Formula 1. In normal seasons there is a race every two weeks; parts are made and flown out to the track in the week before the race. You have one or two days to perform a CFD run or the results are useless (not completely true).
The same holds for many fields of science: the time to get the results back must fit within your workflow.
I/O performance is also critically important. You can have blindingly fast training on those AI models and super-duper GPUs, but if you cannot pull the data in fast enough, you slow down.

1 Like