For a Monte Carlo simulation with the same code and the same algorithm, how fast is Julia compared with Fortran?

That is a very cool paper, @ranocha, for (1) the development of such a powerful, general PDE code in just one year, (2) the demonstration that Julia is competitive with Fortran for parallel PDE simulation, and (3) the honest assessment of pros, cons, and unknowns in developing such codes in Julia vs Fortran and C. Where have you submitted it?



Thanks a lot! We submitted it to the proceedings of JuliaCon 2021.


Trixi.jl is astonishingly good!


Not if both a and b are const globals, which was the context being discussed. You use const to guarantee type stability, and Ref to allow swapping entire mutable objects.
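For readers following along, a minimal sketch of what that `const` + `Ref` pattern looks like (the names `a` and `b` are illustrative):

```julia
# `const` locks the *type* of the global binding, which keeps functions
# using it type-stable; `Ref` adds one level of indirection so the value
# behind the binding can still be replaced.
const a = Ref(1.0)            # a[] is always a Float64
const b = Ref([1.0, 2.0])     # b[] is always a Vector{Float64}

a[] = 3.0                     # fine: same type
b[] = [4.0, 5.0, 6.0]         # fine: swap in an entirely new vector
# a[] = "text"                # would throw: the type is locked to Float64
```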


So, it depends on what you mean by “same code, same algorithm.” If you’re comparing identical, line-for-line translations, then Julia and Fortran will be of similar speed, but Fortran will generally be a bit faster. A factor of 3 sounds unusually large, and probably means you could get some improvements to the Julia code. But Julia taking 1.5 times as long as Fortran isn’t all that unusual. In large part, this will improve as Julia matures and gets better optimized.

The main thing is that you will never have a situation of “Same code, same algorithm.” Fortran is lower-level than Julia, so it can get down into the nitty-gritty, doing stuff like playing with pointer arithmetic or memory. This can give you some extra speedups, so that perfectly-optimized Fortran can be made somewhat faster than perfectly-optimized Julia in some situations. However, this isn’t usually what matters most for performance.

A good way to see this is to compare Turing.jl (written in Julia) and Stan (written in C++). Both libraries use Monte Carlo algorithms for Bayesian statistics. If you run Stan and Turing on exactly the same inputs, with exactly the same algorithm, and write the model in exactly the best way for performance, Stan will be slightly faster. But Turing.jl has way more inference algorithms than Stan. Stan only implements HMC plus a bit of variational inference, while Turing.jl has HMC, SMC, variational inference, importance sampling, Gibbs sampling, elliptical slice sampling, parallel tempering, and definitely some other things I’m forgetting.

The advantages of Julia are:

  1. You can implement two algorithms in the time it takes a C++ or Fortran coder to write one.
  2. You can either hire a programmer who knows how to write hyperoptimized Fortran code, or someone who knows the best cutting-edge Monte Carlo algorithm for your use case. Pick one; these are very specialized jobs. (If you ever find someone who can do both, this person is too valuable to be fiddling with low-level memory management for 20% speed boosts.) A researcher can learn Julia much more easily than C++ or Fortran, and using the right algorithm generally matters more than implementation details.

(Not to mention, if you ever really need that memory management, you can just call C or Fortran from Julia.)


I’ve followed your various threads with interest, and having done some learning and porting myself, I have a few suggestions.

  1. The first line-by-line translation should be for testing, not performance. You’ve got your Monte Carlo sim working, which is great. Think of it as ground truth to test an eventual port, but not as the basis for truly useful and performant code.
  2. The eventual port should probably not be a direct translation from Fortran. I understand why you set up your globals, which is fine for getting things to work, but now you can sit back and consider how you should really take advantage of Julia, which probably means organizing the data differently. In the big picture, there are lots of options like DataFrames, AxisArrays, StructArrays, NamedTuples, etc. Almost certainly there is something far better than the current globals.
  3. People usually recommend profiling. You may have a lot of unnecessary allocations going on, and could get astonishing gains just by stamping those out. There may be only a few low-level functions in a hot loop. Since you have the first port working, try profiling to find out where time and memory are being spent.
  4. Once I have my initial port going, I also like to build bottom-up. The low-level functions are often just math, and relatively straightforward to get big gains, say by broadcasting and avoiding allocations. If you’re lucky, you can progress on these and benchmark them, without having too much effect on the top-down organization. Unit-testing is easy because you have a working version to compare against. You can go bottom-up in parallel with thinking top-down, although I think bottom-up will only get you on par with Fortran, not usually better.
  5. I’m tempted to think it’s not necessary to reproduce all aspects of your former Fortran code other than the computations. It takes a lot of effort to get the formatting exactly right; is it critical? I used to spend a lot of time in printf land, and I’m gradually learning that people don’t do that anymore. Nowadays there are PrettyTables.jl, CSV, JSON, etc. Complicated output is more likely to be plotted, or read by machine, than by a human.
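To make item 3 concrete, here is a minimal profiling session using only the standard library; `kernel` is a hypothetical stand-in for one of your hot low-level routines:

```julia
using Profile

# Hypothetical hot function: the kind of low-level math kernel
# that usually dominates a Monte Carlo run.
function kernel(x)
    s = 0.0
    for v in x
        s += v * v
    end
    return s
end

data = rand(10^6)
kernel(data)                 # warm up so compilation isn't measured
@time kernel(data)           # quick check of time and allocations

@profile for _ in 1:100      # sample the call stack during repeated calls
    kernel(data)
end
Profile.print(maxdepth = 6)  # text report: hottest call paths first
```

ProfileView.jl can render the same samples as a flame graph if you prefer a visual view.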

To address your overall question, I fully expect that a line-by-line translation will get you Fortran-like performance, but that should not be the goal. Rather, aim to get something that’s easier to maintain, more readable by people, uses modern array/data storage and output, and is backed by good unit-tests.

I fully support this trick as a way to get started, but I’d think seriously about moving on to better approaches than this.

Best of luck, I look forward to your further posts!


Actually, I think that:

A line-by-line translation of a Fortran code to Julia will almost certainly run faster in Fortran: for the obvious reason of globals, but also for the less obvious reason that Fortran can stack-allocate arrays more easily than Julia.
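The globals point is easy to demonstrate yourself. In a sketch like this, the argument-passing version is typically many times faster, because the compiler can specialize it on the concrete array type:

```julia
x = rand(10^6)           # non-const global: its type could change at any time

function sum_global()
    s = 0.0
    for v in x           # every access to `x` goes through dynamic lookup
        s += v
    end
    return s
end

function sum_arg(x)      # identical loop, but `x` is now a typed argument
    s = 0.0
    for v in x
        s += v
    end
    return s
end

# Compare `@time sum_global()` with `@time sum_arg(x)`:
# the global version allocates heavily and runs far slower.
```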

As mentioned, a line-by-line translation is nice for testing that the code works, but it should not be taken as a reference for performance. Each language has its own way.

Finally, I don’t think it is true that Fortran allows lower-level control of the generated code. I think it is pretty much the contrary, and IMO control of lower-level behavior in Fortran relies a lot on the tuning of compiler flags and the choice of compiler.

From the Julia side this has been discussed some time ago: Julia's applicable context is getting narrower over time? - #57 by tim.holy. That thread is a good read.


Thank you very much indeed for your comments!
I am using Julia 1.6.1.
I use the default RNG. I have a module called Ran.jl in which I set the random seed; then I just use the default rand or randn functions.

module Ran

using Random

function setseed(irn::Int64)
    Random.seed!(irn)  # seed the default global RNG
    return nothing
end

end # module Ran

I will use the ProfileView.jl you recommended to profile my code a little bit. I suspect the Julia RNG can be slower than the RNG I wrote in my Fortran code, which is basically a Box-Muller method. Other than the RNG, my Julia code is actually a line-by-line translation of my Fortran code.
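For reference, the Box-Muller transform can be written in a few lines of Julia. This is the generic textbook version, not necessarily your Fortran routine:

```julia
using Random

# Box-Muller: map two independent uniform samples to two independent
# standard-normal samples.
function box_muller(rng::AbstractRNG)
    u1 = rand(rng)
    u2 = rand(rng)
    r  = sqrt(-2 * log(1 - u1))   # 1 - u1 ∈ (0, 1], so the log is safe
    return r * cos(2π * u2), r * sin(2π * u2)
end

rng = MersenneTwister(1234)
z1, z2 = box_muller(rng)
```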

For example, a piece of my code is like this,

  module Step
  using LinearAlgebra
  using DelimitedFiles
  using Printf
  using .Ran
  using .Constants
  #using .Mixture
  using .Analysis
  export step_init,prep,steps,read_Yji,push_Yji,get_musigma,get_musigma_maxLL,musigma,mean_covar_init
  mutable struct Mean_covar
      mu::Array{Float64,2}
      sigma::Array{Float64,2}
      w::Float64
  end
  const musigma = Array{Mean_covar,1}() # const is global
  function mean_covar_init(kmix::Int64,dim_p::Int64,weight::Array{Float64,1},sigma::Array{Float64,3},mu::Array{Float64,2})
      @assert length(weight) == kmix
      @assert size(sigma) == (kmix,dim_p,dim_p)
      @assert size(mu) == (kmix,dim_p)
      resize!(musigma, kmix) # musigma = Vector{Mean_covar}(undef, kmix)
      for k in 1:kmix
          musigma[k] = Mean_covar(zeros(dim_p,1),zeros(dim_p,dim_p),0.0)
          musigma[k].mu[1,1] = mu[k,1]
          musigma[k].mu[2,1] = mu[k,2]
          musigma[k].sigma[1,1] = sigma[k,1,1]
          musigma[k].sigma[2,2] = sigma[k,2,2]
          musigma[k].w = weight[k]
      end
      return nothing
  end
  const imode = Ref{Int64}() # const is global, make it a Ref to lock the type
  const itermax = Ref{Int64}()
  const msample = Ref{Int64}()
  const mgauss = Ref{Int64}()
  const mi = Ref{Int64}()
  const nsub = Ref{Int64}()
  const dim_p = Ref{Int64}()
  const kmix = Ref{Int64}()
  const sig = Ref{Float64}()
  const sig_inv = Ref{Float64}()
  const SigV = Ref{Float64}()
  const mu1 = Ref{Float64}()
  const sig1 = Ref{Float64}()
  const mu2 = Ref{Float64}()
  const sig2 = Ref{Float64}()
  const muV = Ref{Float64}()
  const w1 = Ref{Float64}()
  const w2 = Ref{Float64}()
  const mitot = Ref{Int64}()
  const mgauss_tot = Ref{Int64}()
  const mgauss_max = Ref{Int64}()
  const mgauss_min = Ref{Int64}()
  const iter = Ref{Int64}()
  const sigma_factor = Ref{Float64}()
  const D = Ref{Float64}()
  const normpdf_factor_pYq_i = Ref{Float64}()
  const log_normpdf_factor_pYq_i = Ref{Float64}()
  const h_diag = Ref(Array{Float64,3}(undef,0,0,0))
  const ymh = Ref(Array{Float64,3}(undef,0,0,0))
  const thetas = Ref(Array{Float64,3}(undef,0,0,0))
  const muk_iter = Ref(Array{Float64,2}(undef,0,0))
  const sigk_iter = Ref(Array{Float64,2}(undef,0,0))
  const muV_iter = Ref(Array{Float64,2}(undef,0,0))
  const sigV_iter = Ref(Array{Float64,2}(undef,0,0))
  const wk_iter = Ref(Array{Float64,2}(undef,0,0))
  const mukerror_iter = Ref(Array{Float64,2}(undef,0,0))
  const sigkerror_iter = Ref(Array{Float64,2}(undef,0,0))
  const ar_gik_k = Ref(Array{Float64,2}(undef,0,0))
  const muVerror_iter = Ref(Array{Float64,2}(undef,0,0))
  const sigVerror_iter = Ref(Array{Float64,2}(undef,0,0))
  const log_pYq_km = Ref(Array{Float64,2}(undef,0,0))
  const nik = Ref(Array{Float64,2}(undef,0,0))
  const nik_sig = Ref(Array{Float64,2}(undef,0,0))
  const wknik = Ref(Array{Float64,2}(undef,0,0))
  const wknik_sig = Ref(Array{Float64,2}(undef,0,0))
  const tauik = Ref(Array{Float64,2}(undef,0,0))
  const Yji = Ref(Array{Float64,2}(undef,0,0))
  const mgauss_ik = Ref(Array{Int64,2}(undef,0,0))
  const isub_Metroplis_gik_all = Ref(Array{Int64,2}(undef,0,0))
  const LL_iter = Ref(Array{Float64,1}(undef,0))
  const sigma_iter = Ref(Array{Float64,1}(undef,0))
  const ar_gik = Ref(Array{Float64,1}(undef,0))
  const sigmaerror_iter = Ref(Array{Float64,1}(undef,0))
  const norm = Ref(Array{Float64,1}(undef,0))
  const norm_sig = Ref(Array{Float64,1}(undef,0))
  const normerr = Ref(Array{Float64,1}(undef,0))
  const log_norm = Ref(Array{Float64,1}(undef,0))
  const wnow = Ref(Array{Float64,1}(undef,0))
  const log_wnow = Ref(Array{Float64,1}(undef,0))
  const t = Ref(Array{Float64,1}(undef,0))
  const fmt_k_iter = Ref{Any}()
  const fmt_float_iter = Ref{Any}()
  const fmt_float_err_iter = Ref{Any}()
  function step_init(imodein::Int64,itermaxin::Int64,min::Int64,mgaussin::Int64,miin::Int64,nsubin::Int64,pin::Int64,kmixin::Int64,tin::Array{Float64,1})
      @assert length(tin) == miin
      normpdf_factor_pYq_i[] = Constants.NORMALPDF_FACTOR^mi[]
      log_normpdf_factor_pYq_i[] = log(normpdf_factor_pYq_i[])
      mgauss_tot[] = mgauss[]*kmix[]
      mgauss_max[] = ceil(mgauss_tot[]*0.95)-1
      mgauss_min[] = mgauss_tot[] - mgauss_max[]
      isub_Metroplis_gik_all[] = Array{Int64,2}(undef,msample[],kmix[])
      for i in 1:kmix[]
          wnow[][i] = musigma[i].w
      end
      mgauss_ik[] .= mgauss[]
      fmt_k_iter[] = Printf.Format("%20s"*"%15i               |"^kmix[])
      fmt_float_iter[] = Printf.Format("%20s"*" %12.5g                 |"^kmix[])
      fmt_float_err_iter[] = Printf.Format(" %12.5g +- %-12.5g |")
      return nothing
  end

  end # module Step

I defined many variables as const with Ref in order to lock their type and ensure type stability, as many people suggested, since for now I am writing my Julia code in a Fortran style.
I treat these const Ref variables as ‘global’ variables, which in principle should not decrease performance by too much. The other advantage of using Ref is that everywhere I need one of these ‘global’ variables, say sig, I have to write sig[]; the [] helps me recognize that it is a ‘global’ variable.
The function step_init simply initializes the ‘global’ variables in the module.

May I ask, in your Trixi.jl (which is a very nice and advanced Julia code), do you also have some place where you need to define constants or parameters that are used throughout the code? Do you have a type or struct that serves as a parameter container, so that you do not have to pass a lot of arguments to your functions every time?

I currently just run it in serial; I did not add MPI in Julia. I first want to benchmark the single-core performance in Julia and Fortran. Once I can get Julia’s performance close to Fortran’s, implementing MPI should be trivial. Due to the nature of Monte Carlo, I use MPI. I do not use things like OpenMP, which is shared-memory stuff and may have thread-safety issues.

By the way, I saw that in your Trixi.jl code you used many fancy things beginning with @, such as @ inline, @ inbounds, @ unpack, etc. I wonder: if you do not add those commands, will the Julia code slow down by a lot?

Thank you very much!
I will profile the things you mentioned a little bit. Besides the RNG, the const Ref globals could be where the slowdown occurs.

In fact, in Intel Fortran, I always set heap-arrays to 0, which means all arrays are allocated on the heap. That does not seem to influence performance. If I do not heap the arrays, my Fortran code will not work properly because the arrays easily overflow the stack.
I use arrays, big arrays, a lot. In my Monte Carlo code, whenever possible, I try my best not to calculate the same thing twice, and I try to prevent overhead, so I save many values. But of course, I test and make sure that saving these values and accessing them is faster than calculating them again and again in the code.
For one example, if I know I need 10000 random numbers (say, to propose 10000 Metropolis moves) in a subroutine, I generate them first and save them sequentially in an array. Then in the subroutine, I just read these random numbers from the array sequentially.
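That pre-drawing pattern carries straight over to Julia; for example (the acceptance test here is a trivial stand-in for a real Metropolis criterion):

```julia
using Random

rng    = MersenneTwister(2021)
nmoves = 10_000
rns    = rand(rng, nmoves)    # draw all uniforms up front, in one allocation

# Consume them sequentially inside the hot routine.
function do_moves(rns)
    accepted = 0
    for r in rns
        if r < 0.5            # hypothetical acceptance criterion
            accepted += 1
        end
    end
    return accepted
end

accepted = do_moves(rns)
```

Note that in Julia, calling `rand(rng)` inside the loop is also cheap, so pre-drawing is mostly about reproducibility and convenience rather than a large speedup.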

But I think, as you said, if by doing the const Ref trick all these ‘global’ variables, even the scalar numbers (not just the arrays), end up on the heap and out of cache, then the code could slow down by a lot, I guess. In Fortran, I believe these ‘global’ numbers are kept in cache.

You should really pass an explicit RNG object to all places where you generate random numbers. It’s better for both performance and reproducibility, the reason being always the same: globals are bad for performance, and accessing them takes time.

For example, instead of rand() you should use rand(rng), where rng has type AbstractRNG. This also helps reproducibility, because the default RNG in Julia can (and does!) generate different streams of numbers in different versions of Julia. If you want cross-version stability (for example, for some tests), using an explicit RNG object lets you easily swap the default RNG for something else, e.g. GitHub - JuliaRandom/StableRNGs.jl: A Julia RNG with stable streams
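A minimal sketch of that style, using the stdlib MersenneTwister here (a StableRNG from StableRNGs.jl would drop in the same way; the function name is hypothetical):

```julia
using Random

rng = MersenneTwister(1234)      # any AbstractRNG works; seed it once

x = rand(rng)                    # uniform scalar in [0, 1)
v = randn(rng, 100)              # 100 standard normals

# Functions take the rng explicitly as an argument:
propose_subsystem(rng::AbstractRNG, nsub::Int) = rand(rng, 1:nsub)

isubn = propose_subsystem(rng, 8)   # uniform integer in 1:nsub
```

`rand(rng, 1:nsub)` draws a uniform integer directly, which is a cleaner alternative to `ceil(rand() * nsub)`.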


Thank you very much!

In my code I simply do things like

    rn = rand(1)[1]
    isubn = Int64(ceil(rand(1)[1]*nsub[]))

I just use rand(n) where n is the number of the elements of the rand array.
Are you saying things like

 rn = rand(1)[1] 

is slower than rand(rng)? Could you give a tiny example of how you would use the rand function? Thanks a lot! I will read the StableRNGs.jl package you mentioned; that is great stuff!

Note that rand(1)[1] allocates a one-element vector and then references its only element, which is a waste of memory and time. Maybe you wanted to use rand() instead?

Regarding the example, you can read The need for rand speed | Blog by Bogumił Kamiński
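You can also check the allocation difference yourself with `@allocated` (exact byte counts vary by Julia version):

```julia
f_alloc()   = rand(1)[1]   # builds a 1-element Vector, then indexes it
f_noalloc() = rand()       # returns a Float64 directly

f_alloc(); f_noalloc()     # call once first so compilation isn't measured

@show @allocated f_alloc()    # a heap-allocated Vector: tens of bytes
@show @allocated f_noalloc()  # typically 0 bytes
```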


Thank you very much! What you said are all correct and I learned from your suggestions!

By the way, just a comment: yes, I did spend some effort to get the format correct with the help of many great guys :sweat_smile:, because I need to display all K components of the mixture model horizontally; otherwise, if displayed vertically, it is too long and difficult to read. So I did make my Julia code display exactly what my Fortran code displays.


Oh boy! Crucial suggestion! Thank you so much! :stuck_out_tongue_winking_eye:

I don’t think anyone here has seen your full output yet, and since it seems like you’re calculating with uncertainties, here’s a link to a package that may make your life with uncertainty easier:

This package defines a number type with built-in uncertainty/error propagation. It works with standard operations like +, -, *, etc. Might be useful to you.


The reason why Julia users who don’t usually use const tell newcomers to “Make global variables const” is to recover performance with minimal changes to the code.

In fact, fixing types with const kills a lot of benefits of using Julia.

For example, if you use const to fix the type to Float64, you will not be able to use Measurements.jl, MonteCarloMeasurements.jl and Intervals.jl.

Not only that, but it also closes the way to introduce automatic differentiation for better performance.

Moreover, fixing the type to Array with const also makes it difficult to migrate to GPU-based computations using CuArray in CUDA.jl for higher performance.

Do not use const to fix types to ensure type stability.

A good and simple way to ensure type stability is to pass all of the information needed by a function (including global variables that contain the information shared with other functions) as arguments to the function each time.

In my personal opinion, higher education around the world teaches the opposite of this point in a harmful way. It keeps producing large numbers of people who are misled by it and don’t want to do the above.

If you actually write the code to pass all the information as arguments to a function every time, you will find that the effort is not significant and the benefits you get, independent of Julia, are huge. Don’t be afraid to try it.

The point is to create constructors for data shared by multiple functions. (cf. Function depending on the global variable inside module - #10 by genkuroki)
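For instance, a sketch of that constructor idea (all names hypothetical):

```julia
# Bundle the shared parameters in one immutable struct, built once by a
# keyword constructor, then pass it to every function that needs it.
struct SimParams{T<:Real}
    kmix::Int
    nsub::Int
    sigma::T
end
SimParams(; kmix = 2, nsub = 10, sigma = 1.0) = SimParams(kmix, nsub, sigma)

# No globals: each function states exactly what it depends on.
component_weight(p::SimParams) = 1 / p.kmix

p = SimParams(kmix = 4)
component_weight(p)
```

Because `sigma` is a type parameter rather than a hard-coded Float64, a Measurements.jl number or a dual number for automatic differentiation can later be dropped in without touching this code.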




You might try out the Julia v1.7.0-beta3 - I’m using this version and am pretty happy with it. The default RNG changed and should improve the performance of your RNG-heavy code quite a bit.

Some additional aspect that could improve the performance of your code: I saw that you included some explicit bounds checks. By default, Julia will also check bounds on every indexing operation into arrays, such as musigma[k].mu[1,1] = mu[k,1]. To avoid that, you can use @inbounds (or run Julia with command line argument --check-bounds=no to disable bounds checks globally). As you know, Fortran will not perform bounds checks automatically.
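A small sketch of how `@inbounds` is typically used, only where you can guarantee the indices are valid (iterating via `eachindex` makes that guarantee easy):

```julia
function sumsq(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)  # safe: eachindex stays in range
        s += x[i] * x[i]                   # no bounds check; SIMD possible
    end
    return s
end
```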

The code that you posted is probably too long, such that most people will not want to look at it. If you want to increase the chance of getting help, you could consider posting smaller code snippets that people can review easily on their phone. That’s basically also true for me - it’s too long to motivate me to go through it thoroughly. After performing some profiling, you will easily find the most expensive parts of your code - extract them and post them again, and I’m sure people will help you optimize them.

As others have written in this thread, I would suggest to avoid the use of global variables. Instead, I would pack them in structs or NamedTuples and pass them around in your code. This makes it explicit which variables you will use and allows you to get some library-like behavior instead of a linear program. It will also allow you to exchange types later if that might help (to increase performance or get additional features for free such as automatic differentiation, uncertainty propagation, or simply different real types).

When Trixi.jl was initiated, I wasn’t part of the team (I joined a few months later). At first, it was written mimicking an existing Fortran code (with which we compared the performance in our paper mentioned above). Thus, it was a linear code written to perform a single simulation. To change something, you need to re-compile the Fortran code with different compile time options and/or set some flags in a parameter file which was parsed at first. Then, the code created some global variables, scratchspaces, constants, settings etc.
However, we decided to rewrite Trixi.jl in a completely different style (see also this announcement and our paper linked above). The gist is: Trixi.jl is now designed as a library. This allows us to instantiate multiple different parts of it at the same time, giving us much more flexibility. Additionally, we got rid of parameter files - a simulation setup is defined in pure Julia code. Everything that needs to be set up (such as different internal caches etc.) is created when starting a simulation and passed around. We don’t use any global constants anymore (except in some legacy experimental parts that will be re-designed in the future). Every parameter gets bundled in some struct that we pass around. For example, the physical equations, the geometric mesh, and the numerical solver are Julia structs bundling information; they are passed around in our code. Moreover, we create a cache as a NamedTuple and pass it around in basically every function that needs to access it. This is very convenient for us right now. Additionally, it allowed us to enable automatic differentiation through semidiscretizations or even complete simulations with Trixi.jl (and OrdinaryDiffEq.jl in time).
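As an illustration of that cache pattern (a toy sketch, not actual Trixi.jl code):

```julia
# Create all scratch storage once, up front…
function create_cache(n)
    tmp = zeros(n)                # preallocated work array
    return (tmp = tmp,)           # NamedTuple holding the scratch space
end

# …then pass the cache into every function that needs it.
function step!(u, cache)
    tmp = cache.tmp               # plain field access: no allocation
    @. tmp = 2 * u                # reuse the scratch array in place
    @. u = u + tmp
    return u
end

u = ones(5)
cache = create_cache(5)
step!(u, cache)                   # u is now [3.0, 3.0, 3.0, 3.0, 3.0]
```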

These macros are quite helpful. One of the reasons why I like Julia as much as I do is exactly that it’s possible to have these sophisticated code manipulation tools. Let me briefly comment on the examples that you mentioned. Feel free to ask whether you want to know more about any of them specifically.

  • @unpack: This is basically a convenience macro from UnPack.jl that makes it easier to unpack things from structs or NamedTuples. When passing those parameters around, you can access things using @unpack a, b, c = foo instead of a = foo.a; b = foo.b; c = foo.c. However, it doesn’t make a performance difference. In Julia v1.7, the same functionality will be available as (; a, b, c) = foo.
  • @inbounds: I mentioned that briefly above. By default, Julia will perform a bounds check before every indexing operation (unlike Fortran). If you want to avoid this and know that you will not have out-of-bounds access, use @inbounds (or disable bounds checking globally). Depending on your code, this can improve the performance significantly, in particular if it enables further optimizations such as SIMD.
  • @inline: The Julia compiler uses some heuristics to decide whether a “small” function should be inlined or not. You can nudge the compiler to do so using @inline. Depending on the use case, this can noticeably improve the performance.
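Putting two of those together in a toy example, along with the Julia 1.7 destructuring syntax that replaces @unpack:

```julia
params = (a = 2.0, b = 3.0, c = 4.0)

# Julia ≥ 1.7 equivalent of `@unpack a, b, c = params`:
(; a, b, c) = params

# A small function is a good inlining candidate; `@inline` nudges the
# compiler, and `@inbounds` skips the bounds check in the trusted loop.
@inline sq(x) = x * x

function total(xs)
    s = 0.0
    @inbounds for i in eachindex(xs)   # eachindex guarantees valid indices
        s += sq(xs[i])
    end
    return s
end

total([a, b, c])   # a² + b² + c² = 29.0
```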

It’s also very much worth noting that your code looks like a statistical simulation. You might want to check out Turing.jl and try using its algorithms instead of coding a whole simulation from scratch. You should be able to get non-Bayesian output from it by just setting flat priors on everything, in which case you’ll be drawing samples from the likelihood function instead of the posterior.


If you are concerned about using names that start with ‘@’ on Discourse, then here’s a tip: if you wrap them in single backticks, they will not be interpreted as user handles, but as code, and you can use them freely: @inline, @inbounds, @unpack. No need to put spaces next to the ‘@’.

For example, I can write @CRquantum, and you will not be pinged because of that.