Let me preface this by saying that I agree with @jzr’s point that the community should engage in as little sensationalism as possible when drawing comparisons between languages. Long-time community members are well aware of this, but I’d say there’s enough misguided evangelism from inexperienced or new members that a reflexive resentment of the language has taken hold in certain internet circles. I’m not sure what can be done to avoid the Rust scenario, where that group poisoned the proverbial well of what is otherwise a pretty positive and supportive community, but that’s a topic for another thread.
With that out of the way, let’s talk about computing, reproducibility, and software engineering research. To those in other fields, CS and SE seem like ideal candidates for replication and easy reproduction. After all, computers and software are more visibly deterministic than most of the biological/physical/chemical/geological processes we observe, right?
Well, kind of. As an example, think of the control variables you would normally account for when benchmarking program performance: using the same algorithm, software/OS versions, processor architecture, cache access/prewarming, noise from other programs, etc. Now see if any of these made your list:
- The size and number of environment variables in the current execution context (can have a 3x difference!)
- One additional stack allocation (e.g. an extra local variable)
- Swapping the order of two otherwise independent heap allocations
- Where the linker/loader happens to place certain code or data segments
(Courtesy of the excellent https://dl.acm.org/doi/abs/10.1145/2490301.2451141, discussed in Emery Berger’s YouTube talk “Performance Matters”.)
Note that all of the above are just layout-related factors; a rough sketch of how one might probe the first of them is below. For a more holistic overview of what must be accounted for in benchmarking and how to do so effectively (i.e. without generating misleading results), see https://dl.acm.org/doi/abs/10.1145/2491894.2464160.
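To make that first bullet concrete, here is a minimal sketch (in Python, unrelated to the paper’s actual SPEC-based setup) of how one might probe environment-size sensitivity: run an identical workload in a child process while padding the child’s environment with a dummy variable of increasing size. The workload, padding sizes, and run count are arbitrary choices of mine, and an interpreted loop is far less layout-sensitive than the compiled benchmarks in the study, so treat this as an illustration of the methodology rather than a reproduction of the result.

```python
# Sketch: does the size of the child's environment shift a fixed workload's timing?
# All numbers here (loop length, padding sizes, run count) are arbitrary.
import statistics
import subprocess
import sys

# The child times only its own inner loop, so interpreter startup is excluded.
WORKLOAD = (
    "import time\n"
    "start = time.perf_counter()\n"
    "s = 0\n"
    "for i in range(2_000_000):\n"
    "    s += i * i\n"
    "print(time.perf_counter() - start)\n"
)

def median_time(env, runs=20):
    samples = []
    for _ in range(runs):
        out = subprocess.run([sys.executable, "-c", WORKLOAD],
                             env=env, check=True, capture_output=True, text=True)
        samples.append(float(out.stdout))
    return statistics.median(samples)

for pad_bytes in (0, 1_000, 10_000, 100_000):
    env = {"PADDING": "x" * pad_bytes}
    print(f"env padding {pad_bytes:>7} bytes -> median {median_time(env):.4f} s")
```

On a quiet machine you may or may not see a consistent shift with something this small; the point is only that “same code, same flags, same machine” still leaves knobs like this completely unaccounted for.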
Now one might say this is an incredibly high bar. I agree! The issue at hand is one of perception: we perceive our machines to be relatively consistent execution environments in theory, but the complexity and variety of modern hardware and software mean this is wholly untrue in practice. I won’t even get into areas that incorporate inherent stochasticity (like my home domain of machine learning). The question then follows: why isn’t there a replication crisis in CS and SE? Well, there is, and though it is less publicized than in, say, the social sciences, it has similarly deleterious effects on the trustworthiness of scholarship in both fields.
To return to the main topic of this thread, what should a team that wants to migrate a project to a different language (or a different library, different hardware, etc.) do? They could try to control for as many factors as possible, but how many people are going to put in that time for a real-world project? The only group I can see doing so would be those looking to write a software engineering paper, because the end result of such a benchmarking effort would be a software engineering paper (and a better-than-average one at that). They could simply not talk about anything performance-related at all, but then we get into the territory of what counts as acceptable speech and how much one can control what others say in a formal/semi-formal/informal setting. Perhaps the better compromise is to emphasize the non-rigour of the results and the anecdotal nature of the experiment, but that seems to be covered by the video linked above (which answers the question in the OP but, judging by the most recent responses, unfortunately appears to have been buried).
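For what it’s worth, the “emphasize the anecdotal nature” compromise doesn’t have to mean zero rigour. A cheap middle ground is to publish the raw samples and an interval rather than a single speedup number. A rough sketch of that kind of reporting, with made-up timing samples purely for illustration:

```python
# Sketch: report medians and a bootstrap interval for the speedup instead of one number.
# The sample timings below are invented for illustration only.
import random
import statistics

old_samples = [1.92, 1.88, 2.05, 1.90, 1.97, 2.31, 1.89, 1.94]  # seconds, hypothetical
new_samples = [1.61, 1.77, 1.59, 1.66, 1.72, 1.60, 1.95, 1.63]  # seconds, hypothetical

def bootstrap_speedup_ci(old, new, iters=10_000, alpha=0.05):
    """Percentile bootstrap interval for median(old) / median(new)."""
    ratios = []
    for _ in range(iters):
        o = statistics.median(random.choices(old, k=len(old)))
        n = statistics.median(random.choices(new, k=len(new)))
        ratios.append(o / n)
    ratios.sort()
    lo = ratios[int(alpha / 2 * iters)]
    hi = ratios[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

lo, hi = bootstrap_speedup_ci(old_samples, new_samples)
print(f"median old: {statistics.median(old_samples):.2f}s, "
      f"median new: {statistics.median(new_samples):.2f}s")
print(f"speedup: roughly {lo:.2f}x to {hi:.2f}x (95% bootstrap CI, one machine, one workload)")
```

It’s still anecdotal (one machine, one workload, no layout randomization), but an interval plus the raw samples is much harder to over-read than “1.2x faster”.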