Compiler optimizations with broadcast or map

I come to Julia from R and have never thought much about compilation or compiler optimizations. As far as I am aware, a simple optimization that a compiler can perform is to hoist a loop-invariant operation out of a loop, e.g.

data = [1, 2, 3]
x = 1:100
for i in 1:100
    N = length(data)
    sum(data - x[i]) / N
end

becomes

data = [1, 2, 3]
x = 1:100
N = length(data)
for i in 1:100
    sum(data - x[i]) / N
end

Now suppose we define a function to do the same task:

foo(data, x) = sum(data - x) / length(data)
for i in 1:100
    foo(data, x[i])
end

or we broadcast foo() over x, instead:

foo.((data,), x)
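
As an aside, I have also seen Ref used instead of a 1-tuple to shield an argument from broadcasting; as far as I can tell it behaves the same here:

foo.(Ref(data), x)  # equivalent to foo.((data,), x)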

I think the following questions are all related.

  • Would the compiler compute length(data) outside foo() when foo is called within a for-loop? What about outside the call to broadcast?
  • Does the answer depend on whether foo is inlined?
  • Does the compiler always respect the boundaries of non-inlined functions? I.e., would the compiler ever inline only part of the body of a function?

Generally, how would the compiler know that length is constant for data? I could define

struct SillyData end

Base.length(::SillyData) = rand()

Of course this would violate a lot of interface conventions, but it is not the job of the compiler to be aware of those.
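
For example, two consecutive calls need not even agree:

d = SillyData()
length(d) == length(d)  # almost surely false: each call draws a fresh random number

so the compiler cannot hoist length(d) out of a loop without changing the program’s behavior.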


Thanks, Tamas.

Would changing the signature of foo to

foo(data::Vector{Float64}, x::Float64)

do the trick?

Let’s suppose there are no type stability issues in my code. Imagine each line can be translated into optimized machine instructions. Now I am wondering how far the compiler will go to avoid unnecessary repetition of computations and allocations, assuming we have told the compiler everything it needs to know, namely that length(data) is the same each time it is allocated in these blocks.

I am not sure how you plan to do that. Also, I don’t understand why you think that length(data) is “allocated”; there shouldn’t be any allocation.

Generally, the compiler will elide repeated evaluations of functions it considers “pure” (see eg this discussion). You can provide a hint for this using Base.@pure, but that is not recommended unless you really know what you are doing. Given that your code has mistakes that would prevent it from working correctly as is (eg data - x instead of data .- x), I am not sure I would suggest pursuing that.

In your example, I don’t think that repeated calls to length are elided. But the length of a vector is a very, very cheap operation, so I would not worry about this.
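
If you do want to guarantee that length is computed only once, the simplest approach is to hoist it yourself and pass it as an argument, eg (a sketch; foo_hoisted is just a made-up name):

N = length(data)
foo_hoisted(data, x, N) = sum(data .- x) / N  # N passed in, nothing left to elide

for i in 1:100
    foo_hoisted(data, x[i], N)
end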

As far as I know, the answers are:

  • Maybe
  • Probably
  • Probably not

respectively. I realize that’s pretty unhelpful, and perhaps an LLVM expert can weigh in with more precision. The language itself doesn’t guarantee that those optimizations will or will not happen, so you are at the mercy of the optimization choices and heuristics both in Julia’s compiler and in LLVM itself.

You can, however, look at @code_lowered, and even @code_llvm or @code_native, if you want to dig really deeply into the low-level details of how a particular function has been compiled.
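
For example, with a fixed-up version of your foo (note the .- broadcast):

using InteractiveUtils  # provides @code_llvm outside the REPL

foo(data, x) = sum(data .- x) / length(data)

@code_llvm foo([1.0, 2.0, 3.0], 1.0)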


Thanks for your response and suggestion. Using @code_llvm, I think I confirmed part of what you said: that the answer depends on whether foo is inlined.

I defined

function foo(data::Vector{Float64}, x::Float64)
    N = length(data)
    return (data[1] - x) / N
end

function optforloop(data::Vector{Float64}, x::Vector{Float64})
    N = length(data)
    for y in x
        (data[1] - y) / N
    end
end

function forloopfoo(data::Vector{Float64}, x::Vector{Float64})
    for y in x
        foo(data, y)
    end
end

Both loops result in identical output from code_llvm(). On the other hand, if I use @noinline to define foo, it appears that there are multiple calls to length(). The second appearance of length() is accompanied by a shorter list of instructions, however. Maybe LLVM is caching the result of length(data)?
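
In case it helps, the @noinline variant I compared was along these lines (a reconstruction; foo2 and forloopfoo2 are just renamed to avoid clashing with the inlined versions):

using InteractiveUtils

@noinline function foo2(data::Vector{Float64}, x::Float64)
    N = length(data)
    return (data[1] - x) / N
end

function forloopfoo2(data::Vector{Float64}, x::Vector{Float64})
    for y in x
        foo2(data, y)
    end
end

code_llvm(forloopfoo2, (Vector{Float64}, Vector{Float64}))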

In any case, the general lesson I have drawn from this exercise is that a function that can be written in the form

function foo(data::Vector{T}, x::T) where {T}
    y = bar(data)
    ret = fun(data, y, x)
    return ret
end

should probably be split into the constituent functions bar(data) and fun(data, y, x), so that a vectorized method for foo can be computed efficiently:

function foo(data::Vector{T}, x::Vector{T}) where {T}
    y = bar(data)
    return fun.((data,), (y,), x)
end
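
For concreteness, here is a toy instantiation of the pattern (bar and fun are placeholders; any functions with these shapes would do):

bar(data) = length(data)             # the part that depends only on data
fun(data, y, x) = (data[1] - x) / y  # the per-element part

function foo(data::Vector{T}, x::Vector{T}) where {T}
    y = bar(data)                    # computed once, outside the broadcast
    return fun.((data,), (y,), x)    # broadcast only over x
end

foo([1.0, 2.0, 3.0], rand(100))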

Perhaps this was obvious, but I was wondering how well the compiler could make up for my laziness. Or maybe there are better ways.