Achieving C++ Speeds for a CFD Algorithm: Julia spends too much time on pre-allocation?

I have a CFD algorithm that I implemented first in MATLAB (slow), then in C++ (blazing fast), and am now recreating in Julia. I have already achieved much faster performance than MATLAB, but there seem to be some final hurdles to clear before I can reach ~C++ speeds in Julia.

Currently, my code works like this: a script pre-allocates empty arrays (of size e.g. 33x33) and then runs a function called mainLoop() which does all the calculations. I test my code by wrapping the call to mainLoop in the @time macro.

Scenario 1:

# include some header files which are later used by mainLoop etc.
# other code, not relevant here

# pre-allocate about 10-15 arrays. not all are shown here.
u = zeros(nc);		v = zeros(nc);
oldu = zeros(nc); 	oldv = zeros(nc);
deltau = zeros(nc); 	deltav = zeros(nc);

# call the main function which does the calculations.
@time mainLoop(deltaVL2Norm);

Doing the above returns a time for the mainLoop function of about 4.5 seconds.

However, I noticed that once I have run this script once, my 10-15 arrays such as u, v, etc. have all already been defined (the initial guesses don't really matter much), so I can just erase the part of my script which does the pre-allocation and simply run mainLoop() in the script.

Scenario 2:

# include some header files which are later used by mainLoop etc.
# other code, not relevant here

# don't do any pre-allocation since this script was already called once, so 
# the arrays already exist.
# u = zeros(nc);		v = zeros(nc);
# oldu = zeros(nc); 	oldv = zeros(nc);
# deltau = zeros(nc); 	deltav = zeros(nc);

# call the main function which does the calculations.
@time mainLoop(deltaVL2Norm);

When I do this, the @time macro gives me 0.5 seconds for mainLoop!

I realize I may be making some basic mistakes in how I think about functions, scripts, and so on, but I still think there should be a way for me to get this ~0.5 second performance without the antics I have described here.

The behavior is identical if I put all my array declarations into a function and call that function in this script.

What's particularly puzzling is that I am not timing the entire script, only the call to mainLoop, which I didn't change at all!


I think you're timing the compilation of mainLoop, not the allocation of the arrays.
The function is compiled the first time you run it, so you're measuring compile time + run time. That'd be like running $ time g++ ... && ./a.out.
After the first run, the function is compiled, so all subsequent calls/timings will only measure its run time.
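
You can check this with any function; here f is just a stand-in for an expensive function like your mainLoop:

f(x) = sum(abs2, x)    # stand-in for mainLoop

@time f(rand(1000))    # first call: compile time + run time
@time f(rand(1000))    # second call: run time only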


Try the BenchmarkTools package to get more accurate timing.

using BenchmarkTools
@btime mainLoop($deltaVL2Norm) # or @benchmark for more detailed statistics

In particular, this avoids measuring compilation time, since the function is run many times and the minimum time is reported. The $ tells @btime to interpolate deltaVL2Norm into a local scope before timing, so that the measurement doesn't include looking up the name in the global namespace and dispatching on its type; this is usually a better estimate of the call's performance when it appears inside another function.

Also, if mainLoop uses those arrays, you should either pass them as arguments or declare them as const to get good performance (see the performance tips in the manual).
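
A minimal sketch of the difference, using toy names:

u_glob = zeros(1000)          # non-const global: the compiler can't assume its type
const u_const = zeros(1000)   # const global: its type is fixed

slow_sum()  = sum(u_glob)     # slow: u_glob's type must be checked at run time
fast_sum()  = sum(u_const)    # fast: compiles to tight, type-specific code
best_sum(u) = sum(u)          # best: pass the array as an argument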


Using @btime gives me ~300 microseconds, but @time gives 0.5 seconds. Does this mean that most of the time spent calling mainLoop goes into looking up variable names, etc.?

That would make sense since, as you pointed out, I am using a bunch of arrays that I’m not passing as arguments.

Does this mean that if I pass in everything as arguments, I should be able to get run time of mainLoop down from 0.5 to 0.3 seconds?

No. If your function uses global variables, the @btime measurement includes their lookup (and more importantly, the performance cost of the compiler not knowing their type in advance). It just avoids measuring the lookup of deltaVL2Norm which you pass in, provided you prefix it with $.
But passing the variables to the function or declaring them const should speed up the function anyway.
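
To see what the $ changes, compare (a toy example):

using BenchmarkTools
x = rand(1000)    # non-const global
@btime sum(x)     # includes the overhead of x being an untyped global
@btime sum($x)    # interpolates x, timing the call as if x were a local variable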

Okay, I see what you're saying. I'll pass in everything as arguments and declare types, and that should make things better.

Getting back to my question of how to get the ~0.5 second performance instead of the ~5 seconds taken by mainLoop - how should I set things up so that, at least when I run my script a second time, it takes 0.5 seconds and not 5 seconds?

I cannot do away with the pre-allocation, so I made a function initializeArrays(delx, dely) (declared in a 'header' file alongside the rest of my functions, and include-d at the top of this script) which returns all the arrays I need. Now my code looks like this.

#-- Pre-declaration--#
u, v, oldu, oldv, deltau, deltav = initializeArrays(delx, dely);

@btime mainLoop(deltaVL2Norm);

However, this didn't help.
The above configuration gives ~350 microseconds from @btime, but with @time (or just waiting for the program to run and measuring the real-world wall time, since mainLoop is the only time-consuming part of my code) the time taken is still closer to ~5 seconds. Running this script a second time takes the same amount of time.

Do you really need it to run fast as a script, as opposed to a Julia function? If you call your mainLoop function repeatedly from another Julia function/script, you will only wait for compilation on the first call (assuming you always call with the same argument types).

Declaring types in the function signature should not affect performance. A version of the function is compiled for each combination of argument types it is invoked with anyway.
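
For example, these two definitions compile to exactly the same specialized code when called with a Vector{Float64}; the annotation only restricts which argument types the method accepts:

step_untyped(u) = u .+ 0.1
step_typed(u::Vector{Float64}) = u .+ 0.1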

I don't need it to run fast as a script; what I'm currently running as a script could easily be placed inside a function. However, at the moment, if I crudely wrap everything in my script into a function (keeping the @time macro next to the call to mainLoop), the run time increases by 3x! I realize that eventually I should wrap everything in functions so that nothing is ever in global scope, but I'll do that once I figure out how to get mainLoop to run in ~0.5 seconds instead of ~5 seconds.

I guess my issue is more an algorithmic one than a ‘Julia’ one. Essentially, I want to know how I can structure my code so that every time I run it, mainLoop doesn’t have to be compiled. All of the recursive and time-consuming parts of my code are inside mainLoop, and I understand that compiling it takes time. However, I think I’m missing some very simple way to write a program which does not have to compile mainLoop every time it is run.

And, thanks for the note about type declaration in functions. I guess I misunderstood what I’ve read about the need for type-stable code in Julia. If my variables do not change type, ever, then I guess I should be good on the type-stability front without having to declare types in my function declarations.

Very cool. What are the flow conditions and the algorithm? Is it a two-dimensional flow?

That’s why I asked if you need it fast as a script. If you do

mainLoop(some_arguments...)
mainLoop(different_arguments...)

and different_arguments are of the same types as some_arguments, the second call will not compile mainLoop (and the same applies if instead of mainLoop you have a function that includes the initialization). But if you rerun Julia, compilation will happen again on the first call.
If you really need to run Julia multiple times, and have your function compiled only on the first run, you probably need to make a precompiled module, which is a bit more work. There’s a section in the manual about how to do that.
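
A rough sketch of such a module (the name MyCFD and the Float64 argument type are assumptions, and how much compiled code actually gets cached depends on your Julia version):

module MyCFD

export mainLoop

function mainLoop(deltaVL2Norm)
    # ... solver code goes here ...
end

# Request compilation for a concrete argument type when the module is precompiled:
precompile(mainLoop, (Float64,))

end # module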

It’s a two-dimensional incompressible flow, and I’m using a transient solver which marches forward in time with a simple Euler time integration for the convection and diffusion terms, and then solves an iterative equation for pressure (Poisson equation) so as to satisfy continuity. Right now I am simulating a lid-driven cavity flow because I have benchmark results to compare with.

Yes, I understand that I can simply call mainLoop twice; the second call won't need to compile, so it should be much faster.

However, with an algorithm that needs to scale drastically, I hope you can see where the problem now lies. I have so far been testing on a 32x32 grid size, for which I don’t mind just calling mainLoop twice. But if I’m using a significantly larger grid, it will be far too difficult to perform any benchmarking because the first of those two runs will take hours, at which point it doesn’t really help me that the second one was much faster.

When I did this in C++, I simply compiled into an executable (using make, etc.), so compilation was naturally separated from running.

I realize that Julia is a Just-in-Time compiled language which is (maybe?) the reason why I am in this situation…

One intermediate solution I can think of is to call mainLoop on an extremely small domain (e.g. 4x4) the first time, and then call it a second time with the grid size I actually want to simulate. However, the flow physics may not be well represented by such a small domain, and then it won't converge… Perhaps there is a way for me to 'dummy-call' mainLoop so that it gets compiled but doesn't perform all the thousands of iterations it normally would. I just don't think that's very good coding practice though…
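
A minimal sketch of that idea, assuming deltaVL2Norm is a Float64: precompile compiles a method for the given argument types without calling it, so no dummy iterations would be needed:

precompile(mainLoop, (Float64,))   # compile mainLoop for a Float64 argument, without running it
@time mainLoop(deltaVL2Norm)       # this should now measure mostly run time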

I don’t understand why you expect a big difference in runtime for large models. Presumably, the types in your function are not affected by grid size, and so neither is the compilation time. So compilation time would still be about 5 seconds for your large model, and would be negligible relative to the running time.
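
For example:

g(A) = sum(A)               # stand-in for a solver
@time g(zeros(4, 4))        # first call: compilation + run time
@time g(zeros(2048, 2048))  # same Matrix{Float64} argument type, so no recompilation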

Yes, that is the behavior I should be able to get. I think my code is very half-baked at this point and I need to put things in functions, stop using the global scope, and so on. I’ll make another topic for further help!

Thank you! 🙂

For your kind of problem, I really recommend using a Jupyter notebook during development (just use Revise and include your source files). That way, you can simply evaluate a cell twice to get the timings right, and play around until you get a feeling for what triggers compilation and how expensive it is. Afterwards, you can decide how to package things into functions with a nice API. For smaller experiments, the REPL is also very nice.
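
For example, a notebook cell might look like this (solver.jl is a placeholder for your source file; includet is Revise's tracked include):

using Revise
includet("solver.jl")          # load and track the file; edits are picked up automatically

@time mainLoop(deltaVL2Norm)   # evaluate this cell twice: the second run excludes compilation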

Warm-up is only relevant for accurate benchmarking; you don't have to warm up in "real life", only if you want to measure execution time accurately. For large problems, the compilation time is a negligible portion of the total time. It's the same as timing the compilation of C++ code: it gives you completely useless timing results for small problems that take less time than the compilation does; for large, realistic problems, the compilation time is negligible.


Hmm. That's a good idea. I hadn't realized that I could just include my 'header' files in a Jupyter notebook and then do this sort of quick trial and error in the notebook interface. Are there any performance losses from using an interface other than the Julia command line? I have used the Juno IDE for some quick coding and found it a bit slow, significantly slower than my C++ workflow, so I sort of gave up on anything other than text files and the include command typed at the REPL.

What kind of spatial discretization?

Finite Volume

No, your Julia code should run at the same speed whether it’s executed at the command line, in Juno, in a Jupyter notebook, in VSCode, or in any other environment. I believe there were some interesting cases where code run in Juno was slower than it should have been, but I think those were resolved (hopefully others can fill in the details).

However, there is a caveat to the above statement, which is that there are a few command-line flags which can affect Julia’s performance. For example, running Julia with julia -O3 can enable additional SIMD optimizations (source), and running julia --check-bounds=no will disable array bounds checking globally (which can be completely unsafe but can also improve performance). All of those options can also be used by Julia when run through Juno, Jupyter, VSCode, or the command line; it’s just a matter of finding the right setting.
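
For example, from a shell (myscript.jl being a placeholder):

$ julia -O3 --check-bounds=no myscript.jl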

But I wouldn’t worry about modifying the command-line flags of Julia until you’ve already explored all of the other general performance tips, and you can do that no matter how you’re running Julia.
