I am sorry in advance if my reply sounds harsh, but let me say that I can’t agree with your results. They are impossible and can mean only one thing: something is broken, either in your time measurements or in your implementation.
The thing is that, despite the various tips and techniques, Julia code optimization is in a nutshell rather straightforward: if your code performs fewer operations, it runs faster. This is a somewhat different situation compared to languages like Python, where parts of your code can be implemented in “black magic highly optimized C code”, so suboptimal code that takes these C detours can run faster than a direct but slow pure-Python implementation.
This means that if your code makes fewer passes over an array, it runs faster. If your code calculates the square root of the distance only at the end of the calculation, it runs faster than code that calculates a square root on every iteration. If your code calculates something on the fly, it is faster than the same code that stores intermediate values in an array. Fewer operations, faster code, period. Which is why, when you see regressions with better code, it can mean only one thing: something is broken.
Now, it’s hard to tell what exactly went wrong, but I have suspicions. If you plug your new code into some big multithreaded implementation, then it is possible that the measurement noise from multithreading shadowed the performance gain. As an example, if timing due to multithreading can vary by plus-minus 10 seconds (due to scheduling, or because your OS starts some other process in the background), and the performance gain is 5 seconds, you can see all sorts of random things as a result, including spurious regressions.
In order to avoid these difficulties, one can use the following two rules to properly optimize code:
- Optimize single threaded version only.
- Use tools like BenchmarkTools.jl to properly measure function runtime and exclude random noise.
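As a minimal sketch of the second rule (assuming BenchmarkTools.jl is installed in your environment), `@btime` runs the function many times and reports the minimum, which filters out scheduling noise; the toy `sumsq` function here is just an illustration:

```julia
using BenchmarkTools

# Toy function to measure: sum of squares, no per-element sqrt
sumsq(xs) = sum(x -> x^2, xs)

xs = rand(1000)

# The $ interpolation keeps the cost of looking up the global `xs`
# out of the measurement, so only `sumsq` itself is timed.
@btime sumsq($xs)
```

The reported minimum time is the most stable number to compare between versions, since random noise can only make a run slower, never faster.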
I’ve made some incremental optimizations, applying these two rules, which you can find at https://github.com/Arkoniak/Circles_example.jl
The structure of the repository is the following:
- benchmark.jl - script where I gathered all intermediate benchmark results.
- circles.jl - the original single-threaded script
- circles1.jl - the same script, reorganized so that one can benchmark the runtime of a single segment.
- circles2.jl - circles4.jl - incremental speedups of the original script.
I benchmarked segment number 5, since it was relatively small, but I also measured the time of segment 1 to check that the changes were properly reflected in the other segments as well.
In circles2.jl I removed the creation of intermediate circle indices and moved the mask update and mean color calculation inside the circle index generation. It gave a ~6% time decrease.
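The idea can be sketched as follows; I am not reproducing the repository's actual code here, so the names (`mask`, `img`, `center`, `r`) and the inside-circle test are assumptions. Instead of collecting indices into an intermediate array and looping over it again, the mask update and the color accumulation happen in the same loop that generates the indices:

```julia
# Before (sketch): allocate indices, then make extra passes over them
#   idxs = [I for I in CartesianIndices(img) if inside_circle(I, center, r)]
#   mask[idxs] .= true
#   color = mean(img[idxs])

# After (sketch): one fused pass, no intermediate index array
function circle_mean_color!(mask, img, center::CartesianIndex{2}, r)
    acc = zero(float(eltype(img)))
    n = 0
    for I in CartesianIndices(img)
        # inside-circle test computed on the fly
        if (I[1] - center[1])^2 + (I[2] - center[2])^2 <= r^2
            mask[I] = true
            acc += img[I]
            n += 1
        end
    end
    return acc / n   # mean color of the circle
end
```

In a real implementation one would also restrict the loop to the circle's bounding box instead of the whole image, which removes even more work.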
In circles3.jl unnecessary allocations for the radius calculation were removed, and the radius was calculated in a single pass. It gave another ~7% time decrease, for an overall time decrease of ~13%.
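Again only a hedged sketch with assumed names (`seg`, `center`, `target`), not the repository's code: instead of materializing an array of distances and then reducing it, track the minimum squared distance in a single loop and take the square root once at the end:

```julia
# Largest radius around `center` that stays inside the segment labeled
# `target`: track the minimum *squared* distance to any pixel outside
# the segment, and call sqrt exactly once, after the loop.
function max_radius(seg, center::CartesianIndex{2}, target)
    best = typemax(Float64)
    for I in CartesianIndices(seg)
        if seg[I] != target
            d2 = Float64((I[1] - center[1])^2 + (I[2] - center[2])^2)
            best = min(best, d2)
        end
    end
    return sqrt(best)
end
```

This allocates nothing and does one pass, versus building a distance array (one allocation, one pass) and then reducing it (a second pass).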
In circles4.jl the array preallocations from feature_transform were moved outside. It gave another ~5% time decrease, for an overall time decrease of ~18%.
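The pattern here is generic; the sketch below uses a hypothetical `compute!` as a stand-in for the allocating call (I am not reproducing the exact feature_transform signature). Buffers that a hot loop would otherwise allocate on every iteration are hoisted out and reused:

```julia
# Hoisting buffers out of a hot loop: allocate once, reuse each iteration.
# `compute!` stands in for any function with an in-place variant.
compute!(out, x) = (out .= x .^ 2; out)

function process_all(xs)
    buf = similar(first(xs))   # preallocated once, outside the loop
    results = Float64[]
    for x in xs
        compute!(buf, x)       # no fresh allocation per iteration
        push!(results, sum(buf))
    end
    return results
end
```

Besides the raw allocation cost, this also cuts GC pressure, which is exactly what the drop from gigabytes to megabytes of allocations below reflects.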
So, in the end I get the following:
```
➜ circles julia --project=. circles1.jl
No args supplied, using defaults: Dict{String, Any}("min_radius" => 2, "image" => "test.jpg")
Loading test.jpg ...
  1.429699 seconds (1.94 M allocations: 141.943 MiB, 2.51% gc time, 63.19% compilation time)
Resizing ...
  0.433025 seconds (1.31 M allocations: 90.066 MiB, 6.20% gc time, 94.54% compilation time)
Segmenting ...
  1.233071 seconds (2.68 M allocations: 247.583 MiB, 2.17% gc time, 35.45% compilation time)
Circling ...
 88.037141 seconds (3.50 M allocations: 41.992 GiB, 0.98% gc time, 1.50% compilation time)
Made 8517 circles
Drawing ...
  0.690908 seconds (174.99 k allocations: 12.977 MiB, 0.58% gc time, 38.12% compilation time)
```
and
```
➜ circles julia --project=. circles4.jl
No args supplied, using defaults: Dict{String, Any}("min_radius" => 2, "image" => "test.jpg")
Loading test.jpg ...
  1.109887 seconds (1.94 M allocations: 141.757 MiB, 4.24% gc time, 79.77% compilation time)
Resizing ...
  0.407106 seconds (1.31 M allocations: 90.177 MiB, 3.77% gc time, 93.59% compilation time)
Segmenting ...
  1.191467 seconds (2.67 M allocations: 247.250 MiB, 2.81% gc time, 35.59% compilation time)
Circling ...
 73.818242 seconds (3.21 M allocations: 288.362 MiB, 0.28% gc time, 1.41% compilation time)
Made 8517 circles
Drawing ...
  0.463433 seconds (174.99 k allocations: 12.977 MiB, 1.33% gc time, 56.77% compilation time)
```
With that, circling time went down from 88 seconds to 73 seconds, with a much lighter allocation footprint (288 MiB instead of 42 GiB).
Now that it is known that this implementation is more performant than the original, one can do one of the following: either implement a multithreaded version, or try to speed up computeft!, taking into account that only a small part of the mask should be updated.