AoT / JIT and runtime constant folding

I have no real understanding of Julia’s internals, and I was wondering how much constant folding the current compiler is able to do at runtime.

Basically, I am facing a case which I believe is pretty common and I was trying to figure out the most elegant way to optimize it. In short, I am optimizing a non-linear function with JuMP:

@NLobjective(mle, Max, sum(log_dis_pdf(...) for z in x))

I have an enormous dataset which I have discretized into a few dozen values. Hence, that big sum is actually composed of many identical calls to log_dis_pdf.

Obviously, I can collapse my dataset into a frequency table and turn the above expression into a weighted sum, where weights are frequencies. But I was wondering if there’s a more Julian way to do this, perhaps by unrolling the loop at runtime and then folding all constant calls.
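For concreteness, the frequency-table collapse could look roughly like this (a sketch; `log_dis_pdf` below is a hypothetical stand-in, since the real function isn’t shown):

```julia
# Hypothetical stand-in for the real discretized log-pdf.
log_dis_pdf(z) = -abs2(z) / 2

x = [1.0, 2.0, 1.0, 3.0, 2.0, 1.0]  # discretized data with many repeats

# Build the frequency table: unique value => count.
freqs = Dict{eltype(x), Int}()
for z in x
    freqs[z] = get(freqs, z, 0) + 1
end

# Weighted sum: each unique value's pdf is evaluated only once.
weighted = sum(n * log_dis_pdf(z) for (z, n) in freqs)

# Identical to the naive sum over all observations.
naive = sum(log_dis_pdf(z) for z in x)
```

With a few dozen unique values, the weighted sum calls `log_dis_pdf` a few dozen times instead of once per observation.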

Julia’s optimizer works more like an AoT compiler in this situation: if you have a large vector x, it can use the type of x (for example, Vector{Float64}) but not the values inside x when generating code.

If you encoded knowledge of your discretization into Julia’s type system, so that eltype(x) represents a small set of values, it would be technically possible for the compiler to emit more efficient code for log_dis_pdf(x[i]) than recomputing it every time. However, convincing the compiler to do this is likely difficult (possibly unless x[i] has a very small number of possible values, and even then I’m not sure).
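As a toy illustration of what “encoding the values into the type system” means (a sketch only; `log_dis_pdf` is a hypothetical stand-in, and whether this pays off inside a JuMP objective is a separate question): lifting a value into a `Val` type parameter lets constant propagation fold the call at compile time.

```julia
# Hypothetical stand-in for the real discretized log-pdf.
log_dis_pdf(z) = -abs2(z) / 2

# Lifting the value into the type domain: each distinct Val{z} gets its own
# method specialization, so the compiler can constant-fold the body.
log_dis_pdf_val(::Val{z}) where {z} = log_dis_pdf(z)

y = log_dis_pdf_val(Val(2.0))  # specialized (and foldable) per distinct value
```

The cost is one compilation per distinct value, which is why this only makes sense when the set of values is genuinely small.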

Personally, I think the frequency-table approach is an elegant solution and I’d take it. It may also expose further optimization opportunities and lead to a much faster algorithm that the compiler would have no hope of discovering itself. For example, could you summarize your dataset outside the optimization loop and use only the summary when computing the objective function? That would yield enormous speedups.


This is a case where converting the vector to a PooledVector and doing map(log_dis_pdf, x; pure=true) should do exactly what you request: call the function only once for each unique value.
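A sketch of that approach (assuming the PooledArrays.jl package is available; `log_dis_pdf` is again a hypothetical stand-in):

```julia
using PooledArrays  # provides PooledArray / PooledVector

# Hypothetical stand-in for the real discretized log-pdf.
log_dis_pdf(z) = -abs2(z) / 2

x = PooledArray([1.0, 2.0, 1.0, 3.0, 2.0, 1.0])

# With pure=true, map applies the function to the pool of unique values
# only, then reuses each pooled result for every repeated element.
y = map(log_dis_pdf, x; pure=true)

total = sum(y)
```

Note that pure=true is only safe when the function is deterministic and side-effect free, since it is called once per pool entry rather than once per element.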


Nice! This is a cool feature of PooledVector.

Related to the original question about compiler optimizations: the reason this works is not the compiler, but the fact that PooledArrays ships a specialized, efficient implementation of map for PooledVector.

I guess PooledArrays would also need an implementation of mapreduce, reduce, or sum (or some combination involving Generator) for the sum over the generator in the original question to be as fast as possible (not counting the algorithmic speedup of moving the histogram generation outside the optimization loop).

Yes, you’re right: ideally you wouldn’t have to write sum(map(...)), which allocates an intermediate array. Feel free to file an issue about implementing mapreduce in PooledArrays; I don’t think this has been discussed before.
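Until such a method exists, the same effect can be hand-rolled by evaluating the function once per pool entry and weighting by reference counts. This sketch peeks at PooledArray’s internal `pool`/`refs` fields, so treat it as an illustration of what a pool-aware mapreduce could do, not as a stable API (`log_dis_pdf` is again a hypothetical stand-in):

```julia
using PooledArrays

# Hypothetical stand-in for the real discretized log-pdf.
log_dis_pdf(z) = -abs2(z) / 2

x = PooledArray([1.0, 2.0, 1.0, 3.0, 2.0, 1.0])

# Evaluate the function once per unique value in the pool.
vals = map(log_dis_pdf, x.pool)

# Tally how often each pool entry occurs via the internal references.
counts = zeros(Int, length(x.pool))
for r in x.refs
    counts[r] += 1
end

# Weighted accumulation: no intermediate array the length of x.
total = sum(counts[i] * vals[i] for i in eachindex(vals))
```

This is exactly the frequency-table idea from the original question, expressed through the pooled representation that PooledArray already maintains.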
