To factor or not to factor? Functions with small differences within loops

I have several functions that each update a number of large arrays within the same loop. Previously, I learned that doing all the updates in a single loop is often a good idea for performance. Now, I am facing another issue: I need a few variants of these functions that do the updates only slightly differently. Here is a generic example:

module SomeModule

# Some declarations and preallocations of variable arrays, 
# let's say the following arrays came out of this:
#   A, B, C, D, and E. 
# All have the same size.

function modify_arrays1!(A::T, B::T, C::S, D::T, E::T) where {
    T <: Array{Float64,2},
    S <: BitArray{2}
    }
    @inbounds @simd for i in eachindex(A)
        # Update A[i], B[i], C[i], D[i], and E[i] with some long (but
        #   simple) arithmetic and boolean checks between them.
    end
end

function modify_arrays2!(A::T, B::T, C::S, D::T, E::T) where {
    T <: Array{Float64,2},
    S <: BitArray{2}
    }
    @inbounds @simd for i in eachindex(A)
        # Same updates as in `modify_arrays1`, 
        #   but with a two-line difference, 
        #   out of, say, around 10 or 20.
    end
end

function modify_arrays3!(A::T, B::T, C::S, D::T) where {
    T <: Array{Float64,2},
    S <: BitArray{2}
    }
    @inbounds @simd for i in eachindex(A)
        # Still the same updates as in `modify_arrays1`, 
        #   but now E is absent from the process, leading
        #   to another 1 or 2 lines of difference in code, 
        #   out of, again, around 10 or 20.
    end
end

end # module SomeModule

These functions have to be run many thousands of times as part of a bigger iteration process. Is it better to factor out the few differing lines (which, by the way, still involve 2 or 3 of the input arrays) into their own functions (e.g. function few_lines(A, B, D) ... end) and call them from a unified modify_arrays! function (perhaps with E becoming an optional argument), as sketched below; OR to leave the functions as they are, with lots of duplicated lines between them?
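To make the first option concrete, I imagine something like the following (the update lines here are placeholders, since the real arithmetic is long):

# The 1 or 2 lines specific to each variant, factored out (hypothetical):
few_lines1!(A, B, D, i) = (D[i] = A[i] - B[i])
few_lines2!(A, B, D, i) = (B[i] = A[i] + D[i])

function modify_arrays!(few_lines!, A::T, B::T, C::S, D::T,
                        E::Union{T,Nothing} = nothing) where {
    T <: Array{Float64,2},
    S <: BitArray{2}
    }
    @inbounds @simd for i in eachindex(A)
        # ... the 10 or 20 common lines ...
        few_lines!(A, B, D, i)
    end
end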

Or, is the answer: “It depends”?

I have also tried declaring the differing lines as expressions (Expr) and passing them as arguments to a unified modify_arrays! function, but this occasionally (after a few hundred successful iterations) gave me an UndefVarError on some variable name I used in the expression, and I couldn’t figure out why. I was also not able to determine at that point whether it gave any performance advantage or disadvantage.

(Side question: Is it better or worse, in terms of performance, to “combine” modify_arrays2! and modify_arrays3! to make use of multiple dispatch instead?)
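For concreteness, by “making use of multiple dispatch” I mean turning the two variants into methods of one generic function, along these lines (loop bodies elided as before):

# variant 2: E participates in the updates
function modify_arrays!(A::T, B::T, C::S, D::T, E::T) where {
    T <: Array{Float64,2},
    S <: BitArray{2}
    }
    # ... loop as in modify_arrays2! ...
end

# variant 3: E absent; the argument count picks this method
function modify_arrays!(A::T, B::T, C::S, D::T) where {
    T <: Array{Float64,2},
    S <: BitArray{2}
    }
    # ... loop as in modify_arrays3! ...
end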

Please kindly advise. Any additional side comments about other aspects of the generic example above will also be very welcome! Thank you!

I would go with “It depends”, or, more specifically, “Did you profile it?”.

I think you should only settle for an option that brings code repetition if the performance gain justifies it. Alternatively, have you tried creating a function that contains the lines common to all loops, and calling it from each distinct loop, followed by the lines that differ in each case? There is a good chance of such a function being inlined, so the common code is not repeated in the source (you only need to change one place, which is less bug-prone) and there is no call overhead either. A sketch of what I mean follows below.
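Something along these lines (the update lines are placeholders, since the actual arithmetic wasn’t shown):

# Common part of every loop body; a function this small is very likely
# to be inlined, and @inline makes the intent explicit.
@inline function common_updates!(A, B, C, D, i)
    # ... the 10 or 20 lines shared by all variants, e.g.:
    A[i] += B[i] * D[i]
    C[i] = A[i] > D[i]
end

function modify_arrays1!(A, B, C, D, E)
    @inbounds @simd for i in eachindex(A)
        common_updates!(A, B, C, D, i)
        E[i] = A[i] - D[i]   # the 1 or 2 lines specific to variant 1
    end
end

function modify_arrays3!(A, B, C, D)
    @inbounds @simd for i in eachindex(A)
        common_updates!(A, B, C, D, i)
        D[i] *= 0.5          # the 1 or 2 lines specific to variant 3
    end
end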

I think Expr is really not the tool you want to use here: running an Expr requires eval, which evaluates in global scope (a likely source of your UndefVarError) and is slow.


Higher-order functions (HOFs) should help you abstract away the loop (implemented by the higher-order function itself) and separate it from the loop body (implemented in the function passed as an argument to the HOF). Combining this with some slurping & splatting, you can even write an HOF that is independent of the arity of its arguments.

A minimal example could be along the lines of:

function my_foreach(f, args...)
    # eachindex(args...) iterates over indices valid for all arrays at
    # once (and errors if their shapes are incompatible); @inbounds and
    # @simd could be added here just as in your original loops.
    for i in eachindex(args...)
        f(i, args...)
    end
end

This is very versatile and can be used in a wide variety of situations. If you want to double an array element-wise, just write the corresponding loop body. do blocks are especially useful on such occasions:

# a .*= 2
my_double!(a) = my_foreach(a) do i, a
    a[i] *= 2
end
julia> a = collect(1:10);
julia> my_double!(a); a
10-element Array{Int64,1}:
  2
  4
  6
  8
 10
 12
 14
 16
 18
 20

It works the same way for as many arguments as you want:

# res .= a .* b
my_prod!(res, a, b) = my_foreach(res, a, b) do i, res, a, b
    res[i] = a[i] * b[i]
end
julia> b = collect(101:110);
julia> c = similar(a);
julia> my_prod!(c, a, b); c
10-element Array{Int64,1}:
  202
  408
  618
  832
 1050
 1272
 1498
 1728
 1962
 2200

You can also define regular functions for your loop bodies; this makes it easier to use a particular loop body as the basis for another one:

# Reusable helper function
function my_sum_(i, res, a, b)
    res[i] = a[i] + b[i]
end

# res .= a .+ b
my_sum!(res, a, b) = my_foreach(my_sum_, res, a, b)

# res .= 2 .* a .+ b  (note: a is also doubled in place)
my_2apb!(res, a, b) = my_foreach(res, a, b) do i, res, a, b
    a[i] *= 2
    my_sum_(i, res, a, b)
end
julia> my_2apb!(c, a, b); c
10-element Array{Int64,1}:
 105
 110
 115
 120
 125
 130
 135
 140
 145
 150

You should be able to verify fairly easily that the performance obtained this way is essentially the same as if you had written the loop bodies directly inside the loop of the higher-order function (because that is what the compiler should actually be able to do).
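For instance, with BenchmarkTools and a hand-written loop as the baseline (a sketch; exact timings will of course vary):

using BenchmarkTools

# Hand-written version of the doubling loop, for comparison
function double_loop!(a)
    @inbounds for i in eachindex(a)
        a[i] *= 2
    end
end

a = rand(10^6);
@btime double_loop!($a)   # baseline
@btime my_double!($a)     # HOF version: should be essentially the same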


I assume you have done some benchmarking with BenchmarkTools; maybe you are looking for more insight. Sometimes having less code can be faster because it fits in a cache, which depends on how big your caches are. It may also depend on whether the use pattern switches from one variant to another frequently. You can usually find this out by benchmarking. This is standard software-engineering advice.
Using higher-order functions, as mentioned in this thread, is a good choice. But (semantically, at least) this creates a separate specialization for each function passed (see the brief illustration below), and you may end up swapping rapidly between them.
Finally, internal changes to Julia (the same applies to some other languages) can sometimes change the relative performance of different strategies.
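As a quick illustration of the specialization point: every Julia function (and closure) has its own singleton type, so the compiler generates separate code for my_foreach for each loop body it receives:

julia> typeof(sin) === typeof(cos)
false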


Thank you for the swift and insightful advice!

Thank you also for the examples illustrating the HOF!

I have done some testing, but due to the (somewhat) randomised nature of some of the variables being fed in, it is not as straightforward as reading numbers off @btime; I will look for ways to make things more “test-able”. The tests I have managed so far have been inconclusive, which might indicate that the performance gain is not going to be very large, if any. It would still be good to reduce the amount of copy-and-pasted code, though, so I will experiment with the HOF approach and benchmark carefully (perhaps with some creativity).

I learned a lot! Thank you! :blush: