For Monte Carlo simulations with the same code and the same algorithm, how fast is Julia compared with Fortran?

I have just translated my own Monte Carlo parametric expectation maximization code (doing Monte Carlo in each iteration) from modern Fortran to Julia. This is basically a one-to-one translation, and the algorithm and code are the same. Thank you for all of your kind help!
Although I did have some ‘global’ variables in the module, I used the Ref trick as you guys taught me, like

const muk_iter = Ref(Array{Float64,2}(undef,0,0))
const sigk_iter = Ref(Array{Float64,2}(undef,0,0))

so that I can use these arrays in the functions of the module, which I believe should not hurt performance. I know it is a bad habit in Julia, but if it does not influence performance, I may leave it as is for now.

However, I found my Julia version is 3 times slower than my Fortran version.
May I ask, do any of you have experience writing the same Monte Carlo simulation code in both Fortran and Julia?
In your opinion, can I say that, if the Julia code is well written, it should run at almost the same speed as Fortran?
Or do you generally feel that Julia is indeed a little slower than Fortran? It seems to me that the Fortran compiler spends about 10 seconds on compilation and optimization, while Julia spends 1.5 seconds, so the Fortran compiler might have done more optimization (I am using -O3 + -xHost with the Intel compiler)?

By the way, is the default random number generator in Julia fast enough?

Thank you very much in advance!

Disclaimer: I am not an expert on Monte Carlo simulations, random number generators, etc., so my comments below will not touch these aspects. My main field of expertise is deterministic discretizations of (ordinary and partial) differential equations.

In my experience: Yes. We are working on discretizations of hyperbolic PDEs in Trixi.jl. Many members of our group are also the core developers of FLUXO, an HPC Fortran code for hyperbolic-parabolic PDEs (mainly Navier-Stokes and magnetohydrodynamics). Both Trixi.jl (Julia) and FLUXO (Fortran) implement a common subset of algorithms. We compared their serial performance on this common subset for 3D simulations of compressible flows on curvilinear meshes. In our benchmarks published in a preprint, our Julia code was never slower than the Fortran code - in fact, it was sometimes even more than 2x faster.

However, you need to be careful when writing your Julia code to get this performance. Of course, your starting point should be the performance tips in the docs.

If you have carefully considered the performance tips in the docs, there are still some caveats when comparing Julia to Fortran. First, Julia uses LLVM, which will not generate fused multiply-add (FMA) instructions by default, unlike optimizing Fortran/C/C++ compilers such as GCC or the Intel compilers; see also my comment in another thread of yours. TL;DR: Use @muladd from MuladdMacro.jl.
Another aspect of differing compiler optimizations might be that Fortran compilers inline more aggressively - you can nudge the Julia compiler to do the same via @inline.
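
For illustration, a minimal sketch of both macros (the function names here are made up for this example):

using MuladdMacro  # provides @muladd

# @muladd rewrites expressions like a*b + c into muladd(a, b, c),
# which lets the compiler emit FMA instructions.
@muladd function horner(x, coeffs)
    acc = zero(x)
    for c in Iterators.reverse(coeffs)
        acc = acc * x + c   # becomes muladd(acc, x, c)
    end
    return acc
end

# @inline (from Base) nudges the compiler to inline small hot functions.
@inline sqnorm(v) = sum(abs2, v)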

Finally, I can only recommend benchmarking and profiling your code. For example, you can run some simulations and get a nice visualization of profiling data via @profview from ProfileView.jl. If you discover that the culprit is a function that looks basically the same in Julia and Fortran, you can compare their assembly code to see whether there is any substantial difference. I described that process and the resulting optimizations & performance improvements in a blog post on optimizing our hyperbolic PDE framework Trixi.jl.

If you have reduced the performance differences to some minimal examples, you can usually get great feedback here on Discourse, with many tips on how to improve the performance of your Julia code (paste minimal working examples in both Julia and Fortran).

22 Likes

Concerning random number generators: Which version of Julia do you use? Do you pass an RNG explicitly, or do you use the global default one? Do you run your simulations in serial or in parallel (threads or something else)? It might be worth looking into Julia v1.7 (a beta version is currently available), since a lot of improvements to Random were introduced there; see https://github.com/JuliaLang/julia/blob/release-1.7/NEWS.md.

4 Likes

That will affect performance, particularly if you are using it to reference small data structures. By doing that you are heap-allocating everything, instead of allowing the variables to be immutable and thus stack-allocated or cached.

8 Likes

What is the purpose of wrapping mutable arrays in a Ref? I thought that trick was for immutable values.

7 Likes

MersenneTwister() is the default random number generator used in Julia v1.6. Note, however, that it is not MT19937, but dSFMT. The dSFMT is two to three times faster than MT19937, and the quality of the generated random numbers is higher.

Julia v1.7 and later use another default random number generator (Xoshiro256++), which is cheap to initialize and has throughput as fast as dSFMT.

If you want to compare Julia and Fortran on simple Monte Carlo methods, you can write a Monte Carlo code for π in Fortran and compare it with the Julia version at the following links.
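
For reference, a minimal Julia sketch of such a π estimator might look like this (purely illustrative, not the code from the links):

using Random

# Estimate π by sampling uniform points in the unit square and counting
# how many land inside the quarter disk of radius 1.
function mc_pi(rng::AbstractRNG, n::Int)
    hits = 0
    for _ in 1:n
        x, y = rand(rng), rand(rng)
        hits += (x^2 + y^2 <= 1.0)
    end
    return 4 * hits / n
end

mc_pi(MersenneTwister(42), 10^8)  # ≈ 3.1416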

I think it is a waste of time to argue about something that may turn into a fruitless dispute, and we should give priority to repeated verification using concrete code.

8 Likes

The trick of using Ref(...) is meant to wrap something immutable in a mutable container. A Matrix itself is mutable, so wrapping it in a Ref doesn’t help much.
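
A minimal sketch of the distinction (the variable names are illustrative):

# Ref is useful for rebinding an immutable value behind a const global:
const tol = Ref(1e-8)
tol[] = 1e-10        # replaces the stored value; the binding stays type-stable

# A Matrix is already mutable, so its contents can be changed without Ref:
const buf = zeros(3, 3)
buf .= 1.0           # in-place mutation, no Ref needed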

3 Likes

Wrapping a mutable object in Ref has its uses. Without it you cannot swap an entire object with another. One use, for example, is having a global double buffer, in which you read from a buffer and write in the other, and then, for the next iteration, you swap which one is for reading and which one is for writing at basically zero cost. However, I am not sure if this extra indirection layer is useful here.
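
A sketch of that double-buffer pattern (sizes and names are made up):

# Two const global buffers; swapping them only moves references, not data.
const readbuf  = Ref(zeros(1024))
const writebuf = Ref(zeros(1024))

function swap_buffers!()
    readbuf[], writebuf[] = writebuf[], readbuf[]
    return nothing
end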

1 Like

Isn’t the same accomplished easily with a, b = b, a? After all a and b are just names.

1 Like

That is a very cool paper, @ranocha, for (1) the development of such a powerful, general PDE code in just one year, (2) the demonstration that Julia is competitive with Fortran for parallel PDE simulation, and (3) the honest assessment of pros, cons, and unknowns in developing such codes in Julia vs Fortran and C. Where have you submitted it?

👏

10 Likes

Thanks a lot! We submitted it to the proceedings of JuliaCon 2021.

5 Likes

Trixi.jl is astonishingly good!

5 Likes

Not if both a and b are const globals, which was the context being discussed. You use const to guarantee type stability, and Ref to allow swapping entire mutable objects.

2 Likes

So, it depends on what you mean by “same code, same algorithm.” If you’re using identical, line-for-line translations, then Julia and Fortran will be of similar speed, but Fortran will generally be a bit faster. A 3x gap sounds unusually large and probably means you could make some improvements to the Julia code, but Julia taking 1.5 times as long as Fortran isn’t all that unusual. In large part, this will improve as Julia gets older and better optimized.

The main thing is that you will never have a situation of “Same code, same algorithm.” Fortran is lower-level than Julia, so it can get down into the nitty-gritty, doing stuff like playing with pointer arithmetic or memory. This can give you some extra speedups, so that perfectly-optimized Fortran can be made somewhat faster than perfectly-optimized Julia in some situations. However, this isn’t usually what matters most for performance.

A good way to see this is to compare Turing.jl (written in Julia) and Stan (written in C++). Both libraries use Monte Carlo algorithms for Bayesian statistics. If you run Stan and Turing on exactly the same inputs, with exactly the same algorithm, and write the model in exactly the best way for performance, Stan will be slightly faster. But Turing.jl has way more inference algorithms than Stan: Stan only implements HMC plus a bit of variational inference, while Turing.jl has HMC, SMC, variational inference, importance sampling, Gibbs sampling, elliptical slice sampling, parallel tempering, and definitely some other things I’m forgetting.

The advantages of Julia are:

  1. You can implement two algorithms in the time it takes a C++ or Fortran coder to write one.
  2. You can either hire a programmer who knows how to write hyperoptimized Fortran code, or someone who knows the best cutting-edge Monte Carlo algorithm for your use case. Pick one; these are very specialized jobs. (If you ever find someone who can do both, this person is too valuable to be fiddling with low-level memory management for 20% speed boosts.) A researcher can learn Julia much more easily than C++ or Fortran, and using the right algorithm generally matters more than implementation details.

(Not to mention, if you ever really need that memory management, you can just call C or Fortran from Julia.)
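
For instance, calling a C function needs no wrapper code at all (here, cos from the C math library):

# Call cos(1.0) from libm via ccall:
x = ccall(:cos, Cdouble, (Cdouble,), 1.0)   # ≈ 0.5403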

3 Likes

I’ve followed your various threads with interest, and having done some learning and porting myself, I have a few suggestions.

  1. The first line-by-line translation should be for testing, not performance. You’ve got your Monte Carlo sim working, which is great. Think of it as ground truth to test an eventual port against, but not as the basis for truly useful and performant code.
  2. The eventual port should probably not be a direct translation from Fortran. I understand why you set up your globals, which is fine for getting things to work, but now you can sit back and consider how you should really take advantage of Julia, which probably means organizing the data differently. In the big picture, there are lots of options like DataFrames, AxisArrays, StructArrays, NamedTuples, etc. Almost certainly there is something far better than the current globals.
  3. People usually recommend profiling. You may have a lot of unnecessary allocations going on and could get astonishing gains just by stamping those out. There may be only a few low-level functions in a hot loop. Since you have the first port working, try profiling to find out where time and memory are being spent (see the sketch after this list).
  4. Once I have my initial port going, I also like to build bottom-up. The low-level functions are often just math, and relatively straightforward to get big gains, say by broadcasting and avoiding allocations. If you’re lucky, you can progress on these and benchmark them, without having too much effect on the top-down organization. Unit-testing is easy because you have a working version to compare against. You can go bottom-up in parallel with thinking top-down, although I think bottom-up will only get you on par with Fortran, not usually better.
  5. I’m tempted to think it’s not necessary to reproduce all aspects of your former Fortran code other than the computations. It takes a lot of effort to get the formatting exactly right; is it critical? I used to spend a lot of time in printf land, and I’m gradually learning that people don’t do that anymore. Nowadays there are PrettyTables.jl, CSV, JSON, etc. Complicated output is more likely to be plotted, or read by machine, than by a human.
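
Regarding item 3, a minimal profiling/benchmarking sketch (the function names are placeholders for your own hot code):

using BenchmarkTools, Profile

# Microbenchmark one hot function, interpolating its arguments with $:
@btime my_hot_function!($y, $x)

# Profile a full run and print a flat report sorted by sample count:
@profile run_simulation()
Profile.print(format = :flat, sortedby = :count)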

To address your overall question, I fully expect that a line-by-line translation will get you Fortran-like performance, but that should not be the goal. Rather, aim to get something that’s easier to maintain, more readable by people, uses modern array/data storage and output, and is backed by good unit-tests.

I fully support this trick as a way to get started, but I’d think seriously about moving on to better approaches than the const Ref globals quoted above.

Best of luck, I look forward to your further posts!

10 Likes

Actually, I think that:

A line-to-line translation of a Fortran code to Julia will almost certainly be faster in Fortran - for the obvious reason of the globals, but also for the less obvious reason that Fortran can stack-allocate arrays more easily than Julia.
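
One named workaround on the Julia side is StaticArrays.jl, which provides small fixed-size arrays that are stack-allocated much like small Fortran arrays (a minimal sketch):

using StaticArrays

# SVector is a fixed-size, stack-allocated vector; operations on it
# typically incur no heap allocations.
v = SVector(1.0, 2.0, 3.0)
w = 2 .* v .+ 1   # still an SVector, no allocation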

As mentioned, a line-to-line translation is nice for testing that the code works, but it should not be taken as a reference for performance. Each language has its own way.

Finally, I don’t think it is true that Fortran allows lower-level control of the generated code. I think it is pretty much the contrary, and IMO control of low-level behavior in Fortran relies a lot on the tuning of compiler flags and compiler choices.

From the Julia side this has been discussed some time ago: Julia's applicable context is getting narrower over time? - #57 by tim.holy. That thread is a good read.

7 Likes

Thank you very much indeed for your comments!
I am using Julia 1.6.1.
I use the default RNG. I have a module called Ran.jl in which I set the random seed; then I just use the default rand or randn function.

module Ran

using Random

function setseed(irn::Int64)
    Random.seed!(irn)
    return nothing
end

end # module

I will use the ProfileView.jl you recommended to profile my code a little. I suspect the Julia RNG may be slower than the RNG in my Fortran code, which generates normal variates with a Box-Muller method. Other than the RNG, my Julia code is actually a line-by-line translation of my Fortran code.
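
A quick way to test that suspicion is to microbenchmark the draws in isolation, e.g. with BenchmarkTools.jl (a sketch):

using BenchmarkTools, Random

rng = MersenneTwister(1234)
@btime randn($rng)   # one normal draw
@btime rand($rng)    # one uniform draw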

For example, a piece of my code is like this,

  module Step
  using LinearAlgebra
  using DelimitedFiles
  using Printf
  include("Ran.jl")
  using .Ran
  include("Constants.jl")
  using .Constants
  #include("Mixture.jl")
  #using .Mixture
  include("Analysis.jl")
  using .Analysis
  
  export step_init,prep,steps,read_Yji,push_Yji,get_musigma,get_musigma_maxLL,musigma,mean_covar_init
  
  mutable struct Mean_covar
      mu::Array{Float64,2}
      sigma::Array{Float64,2}
      w::Float64
  end
  
  const musigma = Array{Mean_covar,1}() # const is global
  
  function mean_covar_init(kmix::Int64,dim_p::Int64,weight::Array{Float64,1},sigma::Array{Float64,3},mu::Array{Float64,2})
      @assert length(weight) == kmix
      @assert size(sigma) == (kmix,dim_p,dim_p)
      @assert size(mu) == (kmix,dim_p)
      resize!(musigma, kmix) # musigma = Vector{Mean_covar}(undef, kmix)
      for k in 1:kmix
          musigma[k] = Mean_covar(zeros(dim_p,1),zeros(dim_p,dim_p),0.0)
          musigma[k].mu[1,1] = mu[k,1]
          musigma[k].mu[2,1] = mu[k,2]
          musigma[k].sigma[1,1] = sigma[k,1,1]
          musigma[k].sigma[2,2] = sigma[k,2,2]
          musigma[k].w = weight[k]
      end
      return nothing
  end
  
  const imode = Ref{Int64}() # const is global, make it a Ref to lock the type
  const itermax = Ref{Int64}()
  const msample = Ref{Int64}()
  const mgauss = Ref{Int64}()
  const mi = Ref{Int64}()
  const nsub = Ref{Int64}()
  const dim_p = Ref{Int64}()
  const kmix = Ref{Int64}()
  
  const sig = Ref{Float64}()
  const sig_inv = Ref{Float64}()
  const SigV = Ref{Float64}()
  const mu1 = Ref{Float64}()
  const sig1 = Ref{Float64}()
  const mu2 = Ref{Float64}()
  const sig2 = Ref{Float64}()
  const muV = Ref{Float64}()
  const w1 = Ref{Float64}()
  const w2 = Ref{Float64}()
  
  const mitot = Ref{Int64}()
  const mgauss_tot = Ref{Int64}()
  const mgauss_max = Ref{Int64}()
  const mgauss_min = Ref{Int64}()
  
  const iter = Ref{Int64}()
  
  const sigma_factor = Ref{Float64}()
  const D = Ref{Float64}()
  const normpdf_factor_pYq_i = Ref{Float64}()
  const log_normpdf_factor_pYq_i = Ref{Float64}()
  
  
  const h_diag = Ref(Array{Float64,3}(undef,0,0,0))
  const ymh = Ref(Array{Float64,3}(undef,0,0,0))
  const thetas = Ref(Array{Float64,3}(undef,0,0,0))
  
  
  const muk_iter = Ref(Array{Float64,2}(undef,0,0))
  const sigk_iter = Ref(Array{Float64,2}(undef,0,0))
  const muV_iter = Ref(Array{Float64,2}(undef,0,0))
  const sigV_iter = Ref(Array{Float64,2}(undef,0,0))
  const wk_iter = Ref(Array{Float64,2}(undef,0,0))
  const mukerror_iter = Ref(Array{Float64,2}(undef,0,0))
  const sigkerror_iter = Ref(Array{Float64,2}(undef,0,0))
  const ar_gik_k = Ref(Array{Float64,2}(undef,0,0))
  const muVerror_iter = Ref(Array{Float64,2}(undef,0,0))
  const sigVerror_iter = Ref(Array{Float64,2}(undef,0,0))
  const log_pYq_km = Ref(Array{Float64,2}(undef,0,0))
  const nik = Ref(Array{Float64,2}(undef,0,0))
  const nik_sig = Ref(Array{Float64,2}(undef,0,0))
  const wknik = Ref(Array{Float64,2}(undef,0,0))
  const wknik_sig = Ref(Array{Float64,2}(undef,0,0))
  const tauik = Ref(Array{Float64,2}(undef,0,0))
  const Yji = Ref(Array{Float64,2}(undef,0,0))
  
  const mgauss_ik = Ref(Array{Int64,2}(undef,0,0))
  const isub_Metroplis_gik_all = Ref(Array{Int64,2}(undef,0,0))
  
  const LL_iter = Ref(Array{Float64,1}(undef,0))
  const sigma_iter = Ref(Array{Float64,1}(undef,0))
  const ar_gik = Ref(Array{Float64,1}(undef,0))
  const sigmaerror_iter = Ref(Array{Float64,1}(undef,0))
  const norm = Ref(Array{Float64,1}(undef,0))
  const norm_sig = Ref(Array{Float64,1}(undef,0))
  const normerr = Ref(Array{Float64,1}(undef,0))
  const log_norm = Ref(Array{Float64,1}(undef,0))
  const wnow = Ref(Array{Float64,1}(undef,0))
  const log_wnow = Ref(Array{Float64,1}(undef,0))
  const t = Ref(Array{Float64,1}(undef,0))
  
  const fmt_k_iter = Ref{Any}()
  const fmt_float_iter = Ref{Any}()
  const fmt_float_err_iter = Ref{Any}()
  
  function step_init(imodein::Int64,itermaxin::Int64,min::Int64,mgaussin::Int64,miin::Int64,nsubin::Int64,pin::Int64,kmixin::Int64
                  ,mu1in::Float64,sig1in::Float64,mu2in::Float64,sig2in::Float64
                  ,muVin::Float64,sigVin::Float64 
                  ,w1in::Float64,w2in::Float64,sigin::Float64
                  ,Din::Float64,tin::Array{Float64,1})
      @assert length(tin) == miin
      imode[]=imodein
      itermax[]=itermaxin
      msample[]=min
      mgauss[]=mgaussin
      mi[]=miin
      nsub[]=nsubin
      dim_p[]=pin
      kmix[]=kmixin
      sig[]=sigin
      sig_inv[]=Constants.ONE/sig[]
      SigV[]=sigVin
      mu1[]=mu1in
      sig1[]=sig1in
      mu2[]=mu2in
      sig2[]=sig2in
      muV[]=muVin
      w1[]=w1in
      w2[]=w2in
      mitot[]=mi[]*nsub[]
      sigma_factor[]=nsub[]/mitot[]
      D[]=Din
      normpdf_factor_pYq_i[] = Constants.NORMALPDF_FACTOR^mi[]
      log_normpdf_factor_pYq_i[] = log(normpdf_factor_pYq_i[])
      mgauss_tot[] = mgauss[]*kmix[]
      mgauss_max[] = ceil(mgauss_tot[]*0.95)-1
      mgauss_min[] = mgauss_tot[] - mgauss_max[]
  
      #resize!(isub_Metroplis_gik_all,msample[]*kmix[])
      #reshape(isub_Metroplis_gik_all,(msample[],kmix[]))
  
      isub_Metroplis_gik_all[] = Array{Int64,2}(undef,msample[],kmix[])
      log_pYq_km[]=Array{Float64,2}(undef,msample[],kmix[])
  
      LL_iter[]=Array{Float64,1}(undef,itermax[])
      sigma_iter[]=Array{Float64,1}(undef,itermax[])
      ar_gik[]=Array{Float64,1}(undef,itermax[])
      sigmaerror_iter[]=Array{Float64,1}(undef,itermax[])
      norm[]=Array{Float64,1}(undef,nsub[])
      norm_sig[]=Array{Float64,1}(undef,nsub[])
      normerr[]=Array{Float64,1}(undef,nsub[])
      log_norm[]=Array{Float64,1}(undef,nsub[])
      wnow[]=Array{Float64,1}(undef,kmix[])
      log_wnow[]=Array{Float64,1}(undef,kmix[])
      t[]=Array{Float64,1}(undef,mi[])
      
      muk_iter[]=Array{Float64,2}(undef,kmix[],itermax[])
      sigk_iter[]=Array{Float64,2}(undef,kmix[],itermax[])
      muV_iter[]=Array{Float64,2}(undef,kmix[],itermax[])
      sigV_iter[]=Array{Float64,2}(undef,kmix[],itermax[])
      wk_iter[]=Array{Float64,2}(undef,kmix[],itermax[])
      mukerror_iter[]=Array{Float64,2}(undef,kmix[],itermax[])
      sigkerror_iter[]=Array{Float64,2}(undef,kmix[],itermax[])
      ar_gik_k[]=Array{Float64,2}(undef,kmix[],itermax[])
      muVerror_iter[]=Array{Float64,2}(undef,kmix[],itermax[])
      sigVerror_iter[]=Array{Float64,2}(undef,kmix[],itermax[])
  
      nik[]=Array{Float64,2}(undef,nsub[],kmix[])
      nik_sig[]=Array{Float64,2}(undef,nsub[],kmix[])
      wknik[]=Array{Float64,2}(undef,nsub[],kmix[])
      wknik_sig[]=Array{Float64,2}(undef,nsub[],kmix[])
      tauik[]=Array{Float64,2}(undef,nsub[],kmix[])
      Yji[]=Array{Float64,2}(undef,mi[],nsub[])
  
      mgauss_ik[]=Array{Int64,2}(undef,nsub[],kmix[])
  
      h_diag[]=Array{Float64,3}(undef,mi[],kmix[],msample[])
      ymh[]=Array{Float64,3}(undef,mi[],kmix[],msample[])
      thetas[]=Array{Float64,3}(undef,dim_p[],msample[],kmix[])
      
      t[]=tin
      for i in 1:kmix[]
          wnow[][i] = musigma[i].w
      end
      mgauss_ik[] .= mgauss[]
  
  
      fmt_k_iter[] = Printf.Format("%20s"*"%15i               |"^kmix[])
      fmt_float_iter[] = Printf.Format("%20s"*" %12.5g                 |"^kmix[])
      fmt_float_err_iter[] = Printf.Format(" %12.5g +- %-12.5g |")
  
  
      return nothing
  end
  ....

I defined many variables as const with Ref in order to lock their type and ensure type stability, as many of you suggested, since for now I have to write my Julia code in a Fortran way.
I treat these const Ref variables as ‘global’ variables, which in principle should not decrease performance by much. The other advantage of using Ref is that everywhere I use one of these ‘global’ variables, say sig, I have to write sig[], and the [] helps me recognize that it is a ‘global’ variable.
The function step_init simply initializes the ‘global’ variables in the module.

May I ask: in your Trixi.jl (which is a very nice and advanced Julia code), do you also have some place where you need to define constants or parameters used throughout the whole code? Do you have a type or struct that serves as a parameter container, so that you do not have to pass a lot of arguments to your functions every time?
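
(For illustration, one common shape for such a parameter container; the field names here are made up:)

# Bundle parameters in an immutable struct and pass it explicitly,
# instead of keeping them in module-level Refs.
Base.@kwdef struct StepParams
    imode::Int
    itermax::Int
    kmix::Int
    sig::Float64
end

params = StepParams(imode = 1, itermax = 100, kmix = 2, sig = 0.5)

step_iteration(p::StepParams) = p.sig   # fields are type-stable, no global lookup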

I currently just run it in serial; I did not add MPI in Julia yet. I first want to benchmark the single-core performance of Julia and Fortran. Once I can get Julia’s performance close to Fortran’s, implementing MPI should be trivial. Due to the nature of Monte Carlo, I use MPI; I do not use things like OpenMP, which is shared-memory and may have thread-safety issues.

By the way, I saw that in your Trixi.jl code you use many fancy things beginning with @, such as @inline, @inbounds, @unpack, etc. I wonder: if you did not add those macros, would the Julia code slow down by a lot?

Thank you very much!
I will profile the things you mentioned a little. Besides the RNG, the const Ref variables could be where the slowdown occurs.

In fact, in Intel Fortran I always set heap-arrays to 0, which means all arrays are allocated on the heap. That does not seem to influence performance. If I do not heap-allocate the arrays, my Fortran code will not work properly, because the arrays easily overflow the stack.
I use big arrays a lot, because in my Monte Carlo code, whenever possible, I try my best not to calculate the same thing twice and to avoid overhead, so I save many values. But of course, I test and make sure that saving these values and accessing them is faster than recalculating them again and again.
For example, if I know I need 10000 random numbers in a subroutine (say, to propose 10000 Metropolis moves), I generate them first and save them sequentially in an array. Then in the subroutine I just read these random numbers from the array sequentially.
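
In Julia, the same batching idea might look like this (a sketch; the count is illustrative):

using Random

rng = MersenneTwister(1234)
nprop = 10_000                     # number of proposals, illustrative
us = Vector{Float64}(undef, nprop)
rand!(rng, us)                     # fill the whole buffer in one batch

for i in 1:nprop
    u = us[i]
    # ... use u for one Metropolis accept/reject step ...
end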

But I think, as you said, that with the const Ref trick all these ‘global’ variables, even the scalars (not just the arrays), live on the heap rather than in cache, so I guess the code could slow down a lot. In Fortran, these ‘global’ scalars are, I believe, kept in cache.

You should really pass an explicit RNG object to all places where you generate random numbers. It’s better for both performance and reproducibility, and the reason is always the same: globals are bad for performance, because accessing them takes time.

For example, instead of rand() you should use rand(rng), where rng has type AbstractRNG. This also helps reproducibility, because the default RNG in Julia can (and does!) generate different streams of numbers in different versions of Julia. If you want cross-version stability (for example for some tests), using an explicit RNG object lets you easily swap the default RNG for something else, e.g. StableRNGs.jl (https://github.com/JuliaRandom/StableRNGs.jl), a Julia RNG with stable streams.
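
A minimal sketch of passing the RNG explicitly (the simulation body is made up):

using Random, StableRNGs

rng = StableRNG(2021)          # stream is stable across Julia versions

function simulate(rng::AbstractRNG, n)
    s = 0.0
    for _ in 1:n
        s += randn(rng)        # always draw through the passed-in RNG
    end
    return s / n
end

simulate(rng, 10^6)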

4 Likes

Thank you very much!

In my code I simply do things like

    rn = rand(1)[1]
    isubn = Int64(ceil(rand(1)[1]*nsub[]))

I just use rand(n), where n is the number of elements of the random array.
Are you saying that things like

 rn = rand(1)[1] 

are slower than rand(rng)? Could you give a tiny example of how you would use the rand function? Thanks a lot! I will read the StableRNGs.jl package you mentioned; that is great stuff!