My Julia workflow

Hi guys!

I have just published a post on my website describing my Julia workflow. This was necessary because I mainly have two kinds of students:

  1. Ones that code everything without worrying about performance tips at all, end up with very slow (but very “readable”) code, and complain that MATLAB is faster.
  2. Ones that code from the ground up trying to optimize every bit, taking a huge amount of time to obtain results. In this case, the code is usually not very friendly to debug. This is a problem if the code is only supposed to be run once or twice…

Here is the text, hope it can help someone :slight_smile: I would appreciate any comments to improve my workflow!

https://www.ronanarraes.com/2019/04/my-julia-workflow-readability-vs-performance/

27 Likes

I am a staunch advocate of splitting code into lots of separate functions, perhaps to a fault. One of the reasons for that is to not repeat yourself, particularly in the same package. If you decide to change something, you don’t want to have to change it a dozen times in a dozen different variations. I personally tend to find code split into lots of functions much easier to follow, though opinions on this vary. I like to say that no function should be longer than a couple of dozen lines, though admittedly this is overzealous in some applications. Fortunately, the Julia compiler is currently very good at inlining, so usually I find you can split functions with reckless abandon without even worrying about the @inline macro.
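For instance, here is a minimal sketch of what I mean (made-up function names, not code from the post): the loop body is split into tiny functions, and the compiler typically inlines them, so the split costs nothing at runtime.

propagate(x, dt) = x + dt * sin(x)        # one small, testable piece
correct(x, y, gain) = x + gain * (y - x)  # another small, testable piece

function filter_series(ys; dt = 0.1, gain = 0.5)
    x  = 0.0
    xs = similar(ys)
    for (i, y) in pairs(ys)
        x     = propagate(x, dt)
        x     = correct(x, y, gain)
        xs[i] = x
    end
    return xs
end

filter_series(randn(100))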

I haven’t analyzed your code thoroughly enough to have good suggestions as to where I would split it up, but rest assured, if I were tasked with re-writing it I’d probably give you several hundred separate functions :laughing:

(At the very least I’d have lots of separate stuff for initialization.)

5 Likes

Some may argue that it would be faster if I used -march=native. Indeed, it is! However, for a fair comparison, I would also have to recompile the Julia system image with the same flag, which, unfortunately, I do not have time for now. This is an easy procedure if you use the package PackageCompiler.jl.

I see a problem here. Adding -march=native is really(!) easy and does not take much time. On the other hand, the so-called(!) easy procedure with PackageCompiler requires so much time that you were not able to find it. Right?

1 Like

Actually, it is quite easy. I just really did not have the time to test :slight_smile:

However, now I am thinking about something. If Eigen does not use BLAS to perform the matrix operations (those are somewhat small matrices), then the C++ version built with -march=native will be faster than Julia even with a native system image. In this case, I would have to compile OpenBLAS and all of the Julia dependencies with -march=native.

Very cool, thank you for sharing this. Completely irrelevant to the main subject of this post, but it looks like you know your Kalman filtering… Ever considered joining forces with @mschauer on his Kalman.jl?

As a biologist who needs to track various animals with a Kalman filter, it would be great if that package saw even more love :slight_smile:

3 Likes

Oh nice! Actually, I have had a project since the 0.4 era to create a Julia package with many filters, like the Kalman filter, extended Kalman filter, unscented Kalman filter, particle filter, etc., using a user-friendly API. I already have the code for some of the filters, but I was never able to come up with a good API.

I do not have time right now, but I will eventually contact @mschauer to join efforts :slight_smile:

2 Likes

Perfect.

Also irrelevant to the OP, but I’ll also mention StateSpace.jl, which I did a bunch of work on a few years ago but never got around to fully testing/documenting/releasing. It includes basic, extended, unscented, and ensemble Kalman filters, plus the beginnings of a particle filter. I don’t really have time to work on it at this point, but on the off-chance this thread revives interest in a unified Kalman filtering package, that code is available to be revived/updated/stripped for parts.

5 Likes

I really like this discussion thread as I find myself facing similar issues. Writing high-performance code takes time and energy when a lot of research only requires simple code and proofs of concept. So just because one can write highly modular, fast, fully in-place and SIMD-able code doesn’t mean that one should. I think this is a mindset adjustment. Since many of us were drawn to Julia for its incredible speed and for it being open source compared to Python or MATLAB, for example, some may tend to focus on these two features a bit too much, when in fact neither speed nor opening the source is a critical goal at the moment.

At the end of the day, I think it is a tradeoff with risks on both sides. If you just get the job done using simple, non-modular, non-extensible code, you are less likely to get collaborators/users of your code in the future, and it becomes harder to extend your research code to explore additional directions. On the other hand, if you focus too much energy on the software development side rather than the research side, the program may be amazingly fast and user-friendly, but research results will be delayed. Finding a happy medium is not the easiest task.

4 Likes

I disagree with this. Always go for basic high-performance code; it’s just too easy to do, and it will become habitual over time. It lets us ask much better questions and have faster feedback. Just don’t worry about the last 30% difference; that’s the mistake newbs need to be warned against.

I just don’t see the time and energy you are talking about in getting anything but the last dregs of performance. For example: once you understand array preallocation, you just define the arrays you need first, instead of in the loop. It’s no harder. This even applies to what we would previously think of as the dark arts of performance: once you understand the limitations of functions that will run with GPUArrays, you can just write code for that and it just works, with very little effort.
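As a rough illustration (a made-up example, not anyone’s actual code), preallocation is just a matter of where the array is created:

using LinearAlgebra

function sum_products!(acc, A, B, n)
    buf = similar(acc)          # allocated once, before the loop
    for _ in 1:n
        mul!(buf, A, B)         # in-place product, reuses buf every iteration
        acc .+= buf             # in-place accumulation, no new array
    end
    return acc
end

A, B = randn(3, 3), randn(3, 3)
sum_products!(zeros(3, 3), A, B, 1_000)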

Similarly, always write relatively modular code (except for pure glue scripts). You will also learn to do it habitually, and everything you write will be more useful to yourself and other people.

3 Likes

Unlike Python, Julia is very well suited for this because of multiple dispatch, which solves many function-naming headaches.
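For example (a toy sketch with made-up types), you can keep one generic name instead of inventing names like update_kf and update_pf:

abstract type AbstractFilter end
struct KalmanFilter   <: AbstractFilter end
struct ParticleFilter <: AbstractFilter end

# One generic function name; dispatch picks the right method for each filter type.
update(f::KalmanFilter, y)   = println("linear update with measurement $y")
update(f::ParticleFilter, y) = println("resampling step with measurement $y")

update(KalmanFilter(), 1.0)
update(ParticleFilter(), 1.0)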

4 Likes

@Ronis_BR as a side note, you don’t need to type ; at the end of matrix rows :slight_smile:

A = [1 2 3
     4 5 6
     7 8 9]

produces a 3x3 matrix.

2 Likes

Some may argue that it would be faster if I used -march=native. Indeed, it is! However, for a fair comparison, I would also have to recompile the Julia system image with the same flag, which, unfortunately, I do not have time for now. This is an easy procedure if you use the package PackageCompiler.jl.

I think that for a fair comparison -march=native is necessary: Julia already compiles your own functions natively for your machine by default. I believe the Julia system image uses some form of function multi-versioning for library functions (libraries used by Julia also do that), and I would expect that compiling the Julia system image with the same flag would not make a significant difference. The same might go for the C++ version, as the Eigen library may have the same feature. I may be wrong, though.

1 Like

I am really not sure here. Julia uses OpenBLAS (on my system) to multiply the matrices, and it was not built with -march=native. I think you are right when you say that building the system image with this option will have little effect. The true test would be to build Julia entirely (including deps.) with -march=native. Maybe someone using Gentoo can help us :slight_smile: It should be quite easy since they have a Julia ebuild.

Thank you for the ping, @yakir12. @Ronis_BR are you on slack? Don’t hesitate to ping me.

2 Likes

You may get a slight increase in performance (both in time and memory) by replacing line 77 in the third version

mul!(K, aux3, pinv(R + aux4))

with

mul!(K, aux3, pinv(aux4 .+= R))
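The point (a minimal standalone illustration with made-up sizes, not the code from the post) is that R + aux4 allocates a new temporary matrix, while aux4 .+= R writes the sum back into aux4:

using LinearAlgebra

R    = randn(3, 3)
aux4 = randn(3, 3)

K1 = pinv(R + aux4)       # `R + aux4` allocates a temporary matrix
K2 = pinv(aux4 .+= R)     # `.+=` stores the sum in `aux4` itself, no temporary

Both compute the same result; the second just reuses memory that was already allocated for aux4.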

Actually, you implemented a third approach which I like very much: first you made a “quick to write” yet “readable” solution, and then you optimized it. I believe this is the best approach for newcomers, especially those who are not firm in both the programming language and the problem they are working on. It is well known that it is very difficult to change heavily optimized code. So I believe that the second category of your students spends their time pretty ineffectively.

It is very interesting: you used basic vectorization (in the first, unoptimized solution), so why does it perform slower than its equivalent written in MATLAB (Octave)?

Nice catch!

Before modification:

julia> @btime kalman_filter();
  874.418 ms (1680288 allocations: 901.44 MiB)

After modification:

julia> @btime kalman_filter();
  864.590 ms (1620289 allocations: 880.38 MiB)

Indeed, you are right. When you are a newcomer to Julia, it is not that easy to optimize your code. In that case, sometimes it is better to have something with no optimizations at all, for example, code that performs an analysis that is only supposed to be run a couple of times.

Well, Julia is very different from MATLAB. I think that vectorized code in MATLAB is fast because MATLAB can translate it to machine code (compile it) before execution. However, I am no expert; can anyone help here?

In Julia, all your code is compiled. You just need to “help” the compiler a little so that it does the right thing. In the case of the example, I avoided a lot of allocations, which are slow operations, leading to a huge improvement in execution time.
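To give an idea of what “avoiding allocations” looks like (a generic sketch with made-up matrices, not the exact code from the post):

using LinearAlgebra

n  = 18
F  = randn(n, n); Q = Matrix(0.01I, n, n)
Pu = Matrix(1.0I, n, n)

# Allocating version: every `*` and `+` creates a new temporary matrix.
Pp = F * Pu * F' + Q

# Non-allocating version: preallocate buffers once and reuse them.
aux = similar(Pp)
mul!(aux, Pu, F')      # aux = Pu * F'
mul!(Pp, F, aux)       # Pp  = F * (Pu * F')
Pp .+= Q               # add Q in place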

Are you actually interested in such small-dimensional dynamical models, or did you choose this only as an example? In the former case, I think nothing will be faster than using the plain update formulas

Pp = Fk_1*Pu*Fk_1' + Q
K  = Pp*Hk'*pinv(R + Hk*Pp*Hk')
Pu = (I - K*Hk)*Pp*(I - K*Hk)' + K*R*K'

combined with static arrays (that is what I often do).
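Something along these lines (a sketch assuming a made-up 2-state / 1-measurement model, just to show the StaticArrays usage):

using StaticArrays, LinearAlgebra

Fk_1 = @SMatrix [1.0 0.1; 0.0 1.0]
Hk   = @SMatrix [1.0 0.0]
Q    = @SMatrix [1e-4 0.0; 0.0 1e-4]
R    = SMatrix{1,1}(1e-2)
Pu   = @SMatrix [1.0 0.0; 0.0 1.0]

Pp = Fk_1 * Pu * Fk_1' + Q
K  = Pp * Hk' * inv(R + Hk * Pp * Hk')   # inv is enough here; the formulas above use pinv
Pu = (I - K * Hk) * Pp * (I - K * Hk)' + K * R * K'

With static matrices of this size, the temporaries are typically stack-allocated, so the update avoids most of the heap allocations of the plain-Array version.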

2 Likes

In fact, this was chosen only as an example. But thanks for the tip!

IIRC, I tried a version with StaticArrays, but it gave me problems due to the 18x18 matrix. Julia crashed because of a memory problem whose details I cannot remember. It would be a good thing to do now, to compare the speed with C++.