Numerical optimization function does not get cached

I have an application with a number of image analysis functions based on LeastSquaresOptim.jl. The application has to be restarted often, unfortunately (maybe a topic for another post). Every time the application is run there is a clear latency before these numerical functions reach their expected speed. The speed once warmed up is good; there is just an annoying start-up time. Very annoying, I might add. It seems to be the same as the “time to first plot” problem, or compilation latency problem, or whatever it goes by these days.

I understand one way to ameliorate this situation might be PackageCompiler.jl. I’m aware of it and hope to try it eventually. Another alternative, of course, is to simply pick up another language such as C++ or Python. That’s what every colleague and friend tells me to do, and it’s not the advice I’m looking for either. And it does make sense: if I need a separate package-compiling step, it’s not better than C++, and if I’m paying an extra interpreter-style overhead every time, it’s not better than Python… (whoa, that sounds like something I’ll regret saying!)

I’ve heard of many improvements in this area lately, and I imagine there could be some detail either in my code or in the underlying libraries that is causing method invalidations. I’m hopeful this can be solved, because despite the complexity, “it’s just” a numerical application. It could literally have been written as plain functions with numerical arguments and for-loops over numerical arrays. How am I preventing those sweet zero-cost abstractions from working for me?

Apart from LeastSquaresOptim, my application uses Images, ForwardDiff and CoordinateTransformations a lot.

The output from SnoopCompile running my test code is at the bottom. Do those type signatures indicate something bad in my code? Is this perhaps about closures? Should I just pray for 1.6 to solve it? My numerical functions are fine and type-stable as far as I know. How can I proceed with figuring out how to make my module cacheable?

I don’t think the time-to-first-plot problem is an existential threat for Julia. It might be for my own projects, though, and I’m quite interested in learning how to better deal with this.
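For reference, the timings below were gathered roughly like this (a minimal sketch, assuming SnoopCompile’s `@snoopi` macro, which reports the inclusive inference time per MethodInstance; how exactly the test entry point gets called is my own simplification):

```julia
using SnoopCompile
using CameraRegression

# Record inference times for everything compiled while the test runs for the
# first time in a fresh session; keep only entries above 50 ms.
inf_timing = @snoopi tmin=0.05 CameraRegression.testfitmymodel()
foreach(println, inf_timing)
```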

 (0.09813404083251953, MethodInstance for inv(::StaticArrays.SArray{Tuple{3,3},ForwardDiff.Dual{ForwardDiff.Tag{CameraRegression.var"#residues!#80"{Array{Float32,2},Array{Bool,2},Tuple{Int64,Int64}},Float64},Float64,9},2,9}))
 (0.10171914100646973, MethodInstance for (::CoordinateTransformations.ComposedTransformation{CameraRegression.Homog{Float64},CoordinateTransformations.AffineMap{LinearAlgebra.Diagonal{Float64,StaticArrays.SArray{Tuple{2},Float64,1,2}},StaticArrays.SArray{Tuple{2},Float64,1,2}}})(::Array{Int64,1}))
 (0.12486886978149414, MethodInstance for (::Reactive.var"#30#31")())
 (0.1467750072479248, MethodInstance for mysquircle(::StaticArrays.SArray{Tuple{2},ForwardDiff.Dual{ForwardDiff.Tag{CameraRegression.var"#residues!#80"{Array{Float32,2},Array{Bool,2},Tuple{Int64,Int64}},Float64},Float64,9},1,2}, ::ForwardDiff.Dual{ForwardDiff.Tag{CameraRegression.var"#residues!#80"{Array{Float32,2},Array{Bool,2},Tuple{Int64,Int64}},Float64},Float64,9}))
 (0.15564608573913574, MethodInstance for (::CameraRegression.HarrisDistortion{Float64})(::StaticArrays.SArray{Tuple{2},Float64,1,2}))
 (0.16643190383911133, MethodInstance for #s37#6(::Any, ::Any, ::Any, ::Any, ::Any, ::Type{T} where T, ::Type{T} where T, ::Type{T} where T, ::Any))
 (0.1896228790283203, MethodInstance for ForwardDiff.JacobianConfig(::CameraRegression.var"#residues!#80"{Array{Float32,2},Array{Bool,2},Tuple{Int64,Int64}}, ::Array{Float64,1}, ::Array{Float64,1}, ::ForwardDiff.Chunk{9}))
 (0.20583295822143555, MethodInstance for #LeastSquaresProblem#2(::Array{Float64,1}, ::Nothing, ::Function, ::Nothing, ::Nothing, ::Int64, ::Symbol, ::Type{LeastSquaresOptim.LeastSquaresProblem}))
 (0.34380102157592773, MethodInstance for permutedims(::Array{Gray{Normed{UInt8,8}},2}, ::Tuple{Int64,Int64}))
 (3.0026931762695312, MethodInstance for optimize!(::LeastSquaresOptim.LeastSquaresProblem{Array{Float64,1},Array{Float64,1},CameraRegression.var"#residues!#80"{Array{Float32,2},Array{Bool,2},Tuple{Int64,Int64}},Array{Float64,2},LeastSquaresOptim.var"#4#6"{Array{Float64,1},CameraRegression.var"#residues!#80"{Array{Float32,2},Array{Bool,2},Tuple{Int64,Int64}},ForwardDiff.JacobianConfig{ForwardDiff.Tag{CameraRegression.var"#residues!#80"{Array{Float32,2},Array{Bool,2},Tuple{Int64,Int64}},Float64},Float64,9,Tuple{Array{ForwardDiff.Dual{ForwardDiff.Tag{CameraRegression.var"#residues!#80"{Array{Float32,2},Array{Bool,2},Tuple{Int64,Int64}},Float64},Float64,9},1},Array{ForwardDiff.Dual{ForwardDiff.Tag{CameraRegression.var"#residues!#80"{Array{Float32,2},Array{Bool,2},Tuple{Int64,Int64}},Float64},Float64,9},1}}}}}, ::LeastSquaresOptim.LevenbergMarquardt{Nothing}))
 (10.107409000396729, MethodInstance for testfitmymodel())

Yes, it shows these methods have a very high inference time. It’s hard to give concrete suggestions without access to the code, but perhaps you don’t need to specialize the optimize function on the input function, or you could use a lower chunk size for ForwardDiff, etc.
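For the chunk size part, a rough sketch of what that looks like with ForwardDiff directly (placeholder residual function, and I’m not sure whether LeastSquaresOptim exposes a way to override the chunk it picks internally):

```julia
using ForwardDiff

# Placeholder in-place residual function, standing in for residues!.
f!(out, x) = (out .= x .^ 2 .- 1; out)

x0  = zeros(9)
out = similar(x0)
J   = zeros(9, 9)

# Chunk{3} instead of the automatically chosen Chunk{9}: fewer Dual partials
# per evaluation, so smaller types for the compiler to specialize on.
cfg = ForwardDiff.JacobianConfig(f!, out, x0, ForwardDiff.Chunk{3}())
ForwardDiff.jacobian!(J, f!, out, x0, cfg)
```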


I’m curious what your usage model is. I mean, if the program runs for 10 minutes and you have a 20-second startup time, trimming that time down doesn’t do much for you. If, however, the program runs for 30 seconds with a 20-second startup time, that’s a major issue.


Thanks for the reply, @kristoffer.carlsson. I created an MWE that seems to show the same behavior; I hope that if we can optimize this, I’ll be able to do the same for my other project.

https://github.com/nlw0/MyTest.jl

The idea is that we have two sets of points related by a 2D transform, an AffineMap, and we use least squares to find the 6 parameters of that transform. I hope it sounds like a compelling use case for these libraries, and that I’m not the only one hoping to improve it.
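In broad strokes the setup is something like this (a minimal sketch of the same idea rather than the actual repo code; the point generation and names like `residues!` are just for illustration):

```julia
using StaticArrays, CoordinateTransformations, LeastSquaresOptim

# Ground-truth transform: 2x2 linear part plus translation, 6 parameters total.
const A_true = AffineMap(SMatrix{2,2}(1.1, 0.2, -0.1, 0.9), SVector(0.5, -0.3))
const pts = [SVector(rand(), rand()) for _ in 1:50]   # source points
const obs = [A_true(p) for p in pts]                  # observed (transformed) points

# Residuals for candidate parameters θ = [a11, a21, a12, a22, t1, t2].
function residues!(out, θ)
    A = AffineMap(SMatrix{2,2}(θ[1], θ[2], θ[3], θ[4]), SVector(θ[5], θ[6]))
    for (i, (p, q)) in enumerate(zip(pts, obs))
        r = A(p) - q
        out[2i-1] = r[1]
        out[2i]   = r[2]
    end
    return out
end

θ0 = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
prob = LeastSquaresProblem(x = θ0, f! = residues!, output_length = 2length(pts),
                           autodiff = :forward)
optimize!(prob, LevenbergMarquardt())
```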

In my tests, this project also shows a pretty high inference time for the optimize! function, and generating the precompile script with SnoopCompile did not seem to help.

I still need to experiment with the chunk size, as you suggest. Can you tell me more about what you mean by specializing on the function? I’m just declaring my residue function and passing it to optimize, using ForwardDiff for the Jacobian. That’s just how it’s supposed to work; I’m not trying to do anything special, as far as I understand.

Just to add a small piece of information: LeastSquaresOptim creates a configuration object using the input vector. I still haven’t tried to tune that, though.
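If I read the LeastSquaresOptim README correctly, one can also pass a hand-written Jacobian through the `g!` keyword and sidestep the ForwardDiff configuration entirely. A toy sketch of that (assuming `g!` works the way I think it does; the residual and Jacobian functions below are stand-ins, not my actual model):

```julia
using LeastSquaresOptim

# Toy residuals: r_i(x) = x_i^2 - 1, so the Jacobian is diagonal with 2*x_i.
residues!(out, x) = (out .= x .^ 2 .- 1; out)

function jac!(J, x)
    J .= 0
    for i in eachindex(x)
        J[i, i] = 2 * x[i]
    end
    return J
end

x0 = fill(0.5, 6)
prob = LeastSquaresProblem(x = x0, f! = residues!, g! = jac!,
                           output_length = length(x0))
optimize!(prob, LevenbergMarquardt())
```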