Keeping scientific objectivity and details in benchmark reports

The problem with this approach is that a proper experiment (e.g. train two groups of people, one in each language, in the same scientific field, give them a hard problem, log their work hours, and see how fast they reach a solution) would be prohibitively expensive or practically infeasible, so people settle for silly proxies like the runtime of code with an unquantified amount of micro-optimization.

The outcome is basically random or a tie, as with enough effort you can optimize code within an inch of its life in both Fortran and Julia.

So you are right, currently there is little science for comparing programming languages. The best approach for most people is spending some time programming in a language, and getting a feel for it. Anecdotes are fine for motivating exploration like that, but ultimately people have to form their own opinion.

20 Likes

I worked on performance measurement, estimation and prediction for what was then called massively parallel computing (~16…32 nodes) in the mid-90s: Fortran 77, commercial compilers, the i860, etc.

One thing I still remember is the quote: “benchmarks do lie and liars do benchmarks”.

You can measure system performance or application performance on a dedicated system, but it is hard, and close to impossible, to attribute specific performance gains to a language or a particular compiler and then carry that gain over to a second system. It is also obscure to many users and developers how big the impact of memory management and memory hierarchies actually is.
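
To make the memory-hierarchy point concrete, here is a small sketch of my own (not from the 90s work above, and assuming BenchmarkTools.jl is installed): the same reduction over a matrix, traversed along Julia's native column order versus along rows, differs only in access pattern, yet the timings usually diverge noticeably.

```julia
using BenchmarkTools

A = rand(4_000, 4_000)

function sum_colmajor(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)   # inner loop follows the storage order
        s += A[i, j]
    end
    return s
end

function sum_rowmajor(A)
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)   # inner loop strides across whole columns
        s += A[i, j]
    end
    return s
end

@btime sum_colmajor($A)
@btime sum_rowmajor($A)
```

How large the gap is depends entirely on cache sizes and the particular machine, which is exactly the point about how poorly performance gains transfer between systems.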

From my perspective, Julia can enable the developer to write fast code with less effort, but this requires an experienced developer and some insight into efficient algorithms (read as: use a library/package).

And as some commenters above mentioned: there is a blurry line between benchmarking and benchmarketing …

8 Likes

We should probably have a new thread here.
Agreeing with @Tamas_Papp and @lobingera: what is important is time to solution, i.e. getting your results in a timely fashion, as well as code reusability and maintenance. I think Julia has so much to offer here.

Regarding time to solution, I usually give the example of Formula 1. In a normal season there is a race every two weeks. Parts are made and flown out to the track in the week before the race. You have one or two days to perform a CFD run or the results are useless (not completely true).
The same holds for many fields of science - the time to get the results back must fit within your workflow.
I/O performance is also critically important: you can have blindingly fast training of those AI models on super-duper GPUs, but if you cannot pull the data in fast enough, everything slows down.

1 Like

You are conflating two separate issues here. We are not concerned with the world in general in this post. We are talking about science and scientific methodology correctly applied to a topic in computer science. There are certainly concepts in the world that are not quantifiable. God is an example. But that is not science, because it is neither quantifiable nor reproducible.

Not all scientific experiments are feasible. But the least that can be done is to report all the circumstances around a scientific report. A survey of user feelings and experience could also be quantified; that is what psychologists have done for decades. But once emotions and personal feelings are involved in a scientific assessment, one has to be extremely careful about the hidden cognitive biases that could be involved and take them properly into account.

I cannot believe there are at least 16 people (presumably scientists) in this Julia forum who believe science is not necessarily quantifiable or reproducible. This mentality is a danger to science and the scientific community. NIH and NSF are currently spending billions of dollars to make existing pseudo-scientific reports reproducible.

On the contrary, I think that attitude is the real danger. By overly relying on what can be quantified and dismissing everything else, you are biasing your view of reality towards the easily quantifiable. Not everything that is true can be quantified, and questions that can be answered in an easily quantifiable way are not the only relevant or interesting ones. And just because something can’t be quantified doesn’t mean it cannot be reasoned about or approached in a disciplined manner.

13 Likes

This, also, does not automatically imply “perfect” science (whatever that may mean). Armies and tobacco companies are known to fund a lot of quantifiable and reproducible work.

4 Likes

We are veering off target here. I’m sure most of us (aim to) produce quantifiable and reproducible research. We probably also throw in some theorising and interpretation without saying “this bit isn’t science”.

Regarding benchmarks, yes things need to be quantifiable and reproducible as the OP says.

But as others have said, these easily quantifiable things may not be the most important criteria for choosing a programming language. Would you choose a movie based on box-office success rather than the advice of a friend with similar taste, just because the former comes with a number attached?

5 Likes

I missed this before.
I think it’s more likely that there is a general misunderstanding about the points different parties are trying to make than that members of a forum for a modern programming language don’t believe in reproducible research.

2 Likes

I asked a simple question here, requesting more details on a report. Except for a few, most of the responses were off-topic, ranging from the future of ARM chips, to positivism and philosophy, to comments questioning the basic tenets of science. I appreciate all of your contributions, but these were not the kind of responses I was hoping for. So, to avoid further off-topic discussion here, I will stop following this post and responding to comments. Thank you all.

1 Like

Let me preface this by saying that I agree with @jzr’s point that the community should engage in as little sensationalism as possible when drawing comparisons between languages. This is something long-time community members are well aware of, but I’d say there’s enough misguided evangelism from inexperienced or new members that we see an inherent resentment of the language in certain internet circles. I’m not sure what can be done to avoid the Rust scenario of this group poisoning the proverbial well of what is otherwise a pretty positive and supportive community, but that’s a topic for another thread.

With that out of the way, let’s talk about computing, reproducibility and software engineering research. To those in other fields, CS and SE seem like ideal candidates for replication and easy reproduction. After all, computers and software are more visibly deterministic than most of the biological/physical/chemical/geological processes we observe, right?

Well, kind of. As an example, think of some common control variables when trying to benchmark program performance, e.g. using the same algorithm, software/OS versions, processor architecture, cache access/prewarming, noise from other programs, etc. Now see whether any of the following made your list (a small sketch probing the first one appears right after it):

  • The size and number of environment variables in the current execution context (can have a 3x difference!)
  • One additional stack allocation (e.g. an extra local variable)
  • Swapping the order of two otherwise independent heap allocations
  • Where the compiler decides to load certain code or data segments

(Courtesy of the excellent paper https://dl.acm.org/doi/abs/10.1145/2490301.2451141, discussed in Emery Berger’s talk “Performance Matters” on YouTube.)
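
For the curious, here is a rough sketch of how one might probe the first item on your own machine. This is my own illustration, not the methodology of the paper: it launches a fresh Julia process with an increasingly padded environment (the `BENCH_PADDING` variable name and the toy workload are made up for the example) and times the same trivial loop in each.

```julia
using Printf

# Toy workload, executed in a fresh process so the padded environment
# is already in place at startup.
workload_code = """
x = rand(10_000)
t = @elapsed for _ in 1:1_000
    sum(abs2, x)
end
print(t)
"""

for padding in (0, 1_000, 100_000)
    env = Dict(ENV)
    env["BENCH_PADDING"] = "x"^padding     # inflate the size of the environment block
    cmd = setenv(`$(Base.julia_cmd()) --startup-file=no -e $workload_code`, env)
    t = parse(Float64, read(cmd, String))
    @printf("env padding %7d bytes: %.4f s\n", padding, t)
end
```

On many machines the numbers will barely move; the point of the paper is that sometimes they move a lot, for reasons entirely unrelated to the code being measured.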

Note that all of the above are just layout-related factors. For a more holistic overview of what must be accounted for in benchmarking and how to do so effectively (i.e. without generating misleading results), see https://dl.acm.org/doi/abs/10.1145/2491894.2464160.

Now one might say this is an incredibly high bar. I agree! The issue at hand is one of perception. We perceive our machines to be relatively consistent execution environments in theory, but the complexity and variety of modern hardware and software mean this is wholly untrue in practice. I won’t even get into areas that incorporate inherent stochasticity (like my home domain of machine learning). The question then follows: why isn’t there a replication crisis in CS and SE? Well, there is, and although it is less publicized than, say, the one in the social sciences, it has similarly deleterious effects on the trustworthiness of scholarship in both fields.

To return to the main topic of this thread, what should a team that wants to migrate a project to a different language (or a different library, different hardware, etc.) do? They could try to control for as many factors as possible, but how many people are going to put in that time for a real-world project? The only group I see doing so would be those looking to write a software engineering paper, because the end result of such a benchmarking run would be a software engineering paper (and a better-than-average one at that). They could avoid talking about anything performance-related at all, but then we get into the territory of what is acceptable speech and how much one can control what others talk about in a formal/semi-formal/informal setting. Perhaps the better compromise would be to emphasize the non-rigour of the results and the anecdotal nature of the experiment, but that seems to be covered by the video linked above (which answers the question in the OP but, judging by the most recent responses, unfortunately appears to have been buried).

6 Likes

Luckily enough, the project in question is a FOSS project: the Climate Modeling Alliance (on GitHub).

Thus, if one’s interest is to know what in that project resulted in faster implementations of their models, the code is freely available for research and meta-analysis.

I am not sure the developers will be much interested in analyzing in detail why their implementations are faster than the previous ones, as that was not the initial goal, nor can it likely be reduced to a few reasons. It may well be the result of a multitude of small optimizations.

What may be easier to check is whether the claim itself is true, and how the performance of their code compares to other freely available open codes.

I am much more frustrated by performance claims that are not accompanied by readily available and well-documented code.

3 Likes

To be fair, it is you who insists on labeling a Medium post about Julia “science” and asking why it isn’t scientific enough. It is not a scientific article published in an academic journal, but an intro post about Julia’s performance, and as such it is fairly balanced and recounts various pitfalls.

As you found, asking that something that isn’t science conform to “scientific objectivity” generates fairly unfocused conversations. But please don’t blame others for this.

11 Likes

@shahmoradi I thought the experiment was done with Alan Edelman. He might be a good person to consult about the circumstances.

What is the actual complaint here? That an anecdote that was quoted in a blog post wasn’t sufficiently scientifically rigorous?

13 Likes

I think this topic is in two parts. One part is a specific question about the actual facts of the situation reported in Edelman’s story, in order to understand it. My sense is that part of the OP’s professional work includes quantitatively assessing the state of the scientific computing ecosystem.

The other part is a complaint about “benchmarketing”, which I tried to unpack in my comment:

Except that some of Julia’s main competitors are other dynamic languages, not statically typed (“fast”) languages. People are used to assuming that dynamic languages are slow, and seem to need continual reassurance that this does not apply to Julia.

4 Likes

it literally is

Exactly. I’d say the main alternatives are Python, R, and Matlab. Julia beats them hands down in basically everything speed-wise. The only hope those languages have is to call C code.

When it comes to comparisons with Fortran, C, C++ or similar, the main advantage Julia has is that it can express the computation in a more advantageous way, which may lead to better algorithms, or to benefits from libraries, autodiff, and so on.

Any lightly optimized simple loops are likely to be very close in speed between Julia and Fortran et al. In the end, it is all machine code.
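
To put a hedged, concrete face on that last claim (my own toy kernel, not a measured comparison): a plain accumulation loop like the one below is the kind of code where Julia, Fortran and C usually land within noise of each other once compiled.

```julia
# A lightly optimized simple loop; the Fortran or C equivalent compiles to
# essentially the same machine code.
function axpy_sum(a, x, y)
    s = 0.0
    @inbounds @simd for i in eachindex(x, y)
        s += a * x[i] + y[i]
    end
    return s
end

x, y = rand(10^6), rand(10^6)
axpy_sum(2.0, x, y)            # first call compiles the kernel

# To check for yourself (assuming BenchmarkTools.jl is installed):
# using BenchmarkTools
# @btime axpy_sum(2.0, $x, $y)
# @code_native debuginfo=:none axpy_sum(2.0, x, y)   # inspect the generated machine code
```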

7 Likes