I am working on a physical-process simulation, set on a large spatial grid. Each cell of the grid contains a set of variables representing the status of the model. Let's say the variables are a, b, c, d, and e, stored as arrays A, B, C, D, and E (all with the same size). Now, I need to perform a bunch of element-by-element arithmetic with these variables. For example, a = b + d; c = a * d; e = 2 * b + 20 * e / a; etc.
In terms of performance, is it better to:
(i) use a single loop to perform all the arithmetic (involving all the different arrays) at each array index; OR
(ii) use multiple loops to run through the array index, with each loop containing one (or at least a subset) of the arithmetic operations; OR
(iii) use broadcasting (e.g. the dot syntax) all the way; OR
(iv) some other way(s) that I have not considered here?
Let's assume that all loops are sequenced in the memory-access-friendly column-major order. (Actually, this is the essence of my question: does going through in column-major order help in any way when the loop needs to access multiple arrays?)
I am new to Julia and not a computer scientist, and have yet to know/understand the details of how the compiler goes about things. Any ideas and suggestions would be very welcome. Thank you!
In general, writing a single loop will be the fastest and most flexible way to do things, and is probably the easiest to code as well if you are doing lots of operations like this. (In a function — don’t write nontrivial code in global scope!)
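For example, a single-loop version of the hypothetical update rules from the question might look like this (the function name `update!` and the specific arithmetic are just illustrative):

```julia
# One pass over all five arrays, doing all the arithmetic per index.
# eachindex(A, B, C, D, E) also checks that the arrays have matching axes.
function update!(A, B, C, D, E)
    @inbounds for i in eachindex(A, B, C, D, E)
        A[i] = B[i] + D[i]
        C[i] = A[i] * D[i]
        E[i] = 2B[i] + 20E[i] / A[i]
    end
    return nothing
end
```

Because everything happens inside one function and one loop, each element of each array is loaded (ideally) once per pass, and the compiler can keep intermediate values like `A[i]` in registers between statements.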
(The problem with trying to do array operations like A .= B .+ D followed by C .= A .* D is that each array operation will be a separate pass over the arrays, even though Julia fuses the loops within a single operation like A .= B .+ D. Doing separate loops, either manually or via broadcasting, is suboptimal both because of poor cache utilization and because you repeat the loop-index computations multiple times.)
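To make the contrast concrete, here is what the multi-statement broadcast version looks like (same hypothetical update rules as above). Each statement is internally fused, but the statements themselves are separate sweeps over memory:

```julia
# Three separate passes over the arrays — each .= statement is its own loop.
function update_broadcast!(A, B, C, D, E)
    A .= B .+ D                     # pass 1
    C .= A .* D                     # pass 2
    E .= 2 .* B .+ 20 .* E ./ A    # pass 3
    return nothing
end
```

This computes the same result as the single-loop version, but for arrays too large to fit in cache, each pass has to re-fetch the data from main memory.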
Even for multiple arrays, you should try to loop through data sequentially in memory (i.e. column-major for Julia), because sequential memory access is better for your computer's cache. (This has nothing to do with compilers.) For the same reason, you may get even better performance if you use an array of data structures that contain (a,b,c,d,e) rather than 5 separate arrays, since in general if you compute on (a,b,c,d,e) together you want to store them nearby in memory.
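A sketch of that array-of-structs layout, using a hypothetical immutable `Cell` struct for the five variables:

```julia
# All five variables for one grid cell live contiguously in memory.
struct Cell
    a::Float64
    b::Float64
    c::Float64
    d::Float64
    e::Float64
end

function update_cells!(cells::AbstractVector{Cell})
    @inbounds for i in eachindex(cells)
        p = cells[i]
        a = p.b + p.d
        c = a * p.d
        e = 2p.b + 20p.e / a
        # Cell is immutable, so write back a whole new struct:
        cells[i] = Cell(a, p.b, c, p.d, e)
    end
    return nothing
end
```

With this layout, each loop iteration touches one contiguous 40-byte chunk instead of five widely separated memory locations.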
On the other hand, five separate arrays (and more generally, struct-of-arrays layouts) are likely to be more SIMD-friendly: if the separate-arrays version can vectorize, it will probably be faster.
I wouldn’t worry about that yet if you’re just starting to learn about optimization.