My script performance lags Python

Context: @dpsanders encouraged me to try rewriting a slow Python script of mine in Julia. The code is intended largely as an art project – to do something to images similar to what the Percolator app does.

I implemented as close to a literal equivalent of my Python code in Julia as I could, and the result runs ~17 seconds slower than the Python script.

The results from profiling show that most of the time is spent in Images.feature_transform and Images.distance_transform – this isn’t too surprising to me.

My code is here: https://gist.github.com/jsundram/83685591e3e11a6ef2d2f7c4bc1aa6a6. I’ve implemented a threaded version here, which improves the runtime a bit, but unfortunately the biggest job takes quite a while, so even with some cleverness with job allocation, it isn’t even twice as fast.

Suggestions for improvement would be gratefully accepted, since this is my first Julia program.

4 Likes

Haven’t looked at the code, but the first thing is that the comments on that Gist suggest you’re @timeing those calls only once, the first time you run them in a given session, so you might be mostly timing JIT compilation time. Try the @btime macro from BenchmarkTools.jl to run them multiple times and throw out the timing of the first call. If the script-based workflow is necessary in your work, DaemonMode.jl could be interesting to look at too.
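For example, a minimal sketch of the @btime pattern (the mask below is a synthetic stand-in, not something from your script):

using BenchmarkTools
using Images

# Synthetic stand-in for one of the segment masks, just to illustrate the pattern.
mask = trues(256, 256)
mask[100:150, 100:150] .= false

# @btime runs the call many times and reports the minimum, so JIT compilation of the
# first call doesn't dominate the measurement the way a single @time does.
@btime Images.distance_transform(Images.feature_transform($mask))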

2 Likes

Thanks for pointing that out, @marius311! The @time calls were largely to give me a sense of where the code was spending its time (this was complementary to my use of the profiler). I wasn’t familiar with DaemonMode.jl – thanks for the pointer; it sounds like that could be quite useful, since I do have a script-based workflow; I had been considering using PackageCompiler.jl to get around the JIT and startup costs once the code itself was as performant as possible.

3 Likes

This looks cool! It will be a lot easier if you have tests to verify the optimizations are correct though.

… Can we nerdsnipe you into making it into a small julia package :D?

hehe nice try :slight_smile:

Seriously, though, I added the script output to the gist here, so you can compare the image hashes, or just squint to make sure they look roughly the same . . . Another thing to do is to dump the circles as data instead of rendering an image, and then compare the lists of circles (x, y, radius should be enough) that were generated.
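A tiny sketch of that circle-dump idea (circles here is a hypothetical vector of (x, y, radius) tuples, not a variable from the script): write them out in a deterministic order so two runs can be diffed directly.

using DelimitedFiles

# Hypothetical: `circles` is a vector of (x, y, radius) tuples collected during a run.
circles = [(10, 12, 3.0), (4, 7, 5.5)]

# Sort for a deterministic order, then dump as CSV so the Python and Julia outputs
# can be compared with a plain diff.
writedlm("circles.csv", sort(circles), ',')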

Cool script! Judging from the fact that both scipy and Julia use the same recursive implementation for the EDT (at least they cite the same article), and the scipy version is actually coded in C, I would say the Julia version behaved quite admirably.
I tried the Julia version and, on my slower PC, it took ~186 s the first time and ~173 s the second time, so you definitely pay the compilation price.
The 8-thread version takes roughly half the time, ~80 s.

A few thoughts.

First of all, to decrease compilation time, you can use ImageMorphology.jl directly, since Images.jl only reexports its functions. And here is a style tip: don’t write Images.feature_transform; it’s better to call feature_transform directly. If you want to keep track of where a given function comes from, you can add using Images: feature_transform at the beginning of the file.
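For example (just a sketch of the imports; the rest of the script would still need whatever other Images.jl functionality it uses):

# Load only what is needed for the hot path instead of the whole Images.jl metapackage.
using ImageMorphology: feature_transform, distance_transform

mask = trues(64, 64)             # stand-in segment mask
mask[20:40, 20:40] .= false

f = feature_transform(mask)      # no Images. prefix needed
edt = distance_transform(f)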

As for the performance upgrade, the culprits are the usual ones: unnecessary memory allocations.

Consider for example this piece of code:

circle_ix = circle_indices(ix, r, size(edt)...)
mask[circle_ix] .= 1

offset_ix = map(x -> x + CartesianIndex(top - 1, left - 1), circle_ix)
clr = to_clr(Statistics.mean(img[offset_ix]))

In this short snippet three intermediate arrays are generated: circle_ix, offset_ix, and img[offset_ix], all of which are discarded almost immediately after their creation. They exist only to do two operations: update the mask array and calculate the mean over a set of colors. It is much more efficient to do these operations in place, i.e. inside the definition of circle_indices, instead of push!(indices, ix), you can do something like:

cnt = 0
colour = RGB(0, 0, 0)                 # color accumulator (check the initial value/type)
for ix in (center - r):(center + r)
    row, col = ix.I
    if 1 <= row <= height && 1 <= col <= width
        if (row - y)^2 + (col - x)^2 <= radius^2
            cnt += 1
            mask[ix] = true           # update the mask in place
            offset_ix = ix + CartesianIndex(top - 1, left - 1)
            colour += img[offset_ix]  # accumulate colors instead of storing indices
        end
    end
end
return mask, to_clr(colour / cnt)

Of course, mask and img should be passed as arguments to the function, and one should double-check the initial value for colour, but I think the main idea is apparent: this implementation has zero allocations and calculates everything in one pass.

The same idea can be applied to other parts of the script. If you look at the distance_transform definition, it simply creates a new array of distances. But this is excessive, since the only information the script uses is the location of the pixel with the maximum distance from the edge and the value of that distance. So, instead of creating a huge throwaway array, one can do something like this:

R = CartesianIndices(axes(F))
dst = 0.0
ix = CartesianIndex(0, 0)
∅ = nullindex(F)                  # nullindex / wnorm2 come from the feature/distance transform implementation
@inbounds for i in R
    fi = F[i]
    dst1 = fi == ∅ ? Inf : wnorm2(fi - i, nothing)   # squared distance, no sqrt yet
    if dst1 > dst
        dst = dst1
        ix = i
    end
end
dst = sqrt(dst)                   # a single sqrt at the very end

Two things are at work here: no intermediate array is generated (the point location and the radius are calculated on the fly), and the square root (a rather costly operation) is applied only once per calculation.

The last thing which can be improved is feature_transform. I haven’t read its definition too deeply, so I may be wrong, but it seems to be a modification of the usual flood-fill algorithm. On one hand, feature_transform just wraps the recursive procedure computeft!, and before doing so it creates a temporary array which is used for the subsequent calculations. On the other hand, the current implementation of Making Circles recalculates all distances for the whole mask array, which seems rather excessive, because we know beforehand that the disjoint components will remain the same and there is no need to recalculate them. So an idea worth investigating is the following: for all subsequent feature transform calculations, reuse the same array, but use the edge of the new circle as a seed in the recursive procedure. This way, on each iteration only the part that has changed will be recalculated, and that should lead to an additional speedup. In the best-case scenario, one could recalculate distances and detect the center and radius of the new circle in the same pass, but that could be tricky.

7 Likes

Thanks @Skoffer for your detailed thoughts!

I was a bit disappointed to notice that replacing circle_indices above with code along the lines you suggest:

function update_mask_compute_color(center, radius, offset, mask, edt, img)
    height, width = size(mask)
    r = CartesianIndex(int_round(radius), int_round(radius))
    y, x = center.I

    count = 0
    color = Images.RGBX{Float64}(0, 0, 0) 
    for ix in center - r: center + r
        row, col = ix.I
        if 1 <= row <= height && 1 <= col <= width
            if (row - y)^2 + (col - x)^2 <= radius^2
                count += 1
                @inbounds mask[ix] = 1
                @inbounds edt[ix] = 0
                @inbounds color += img[offset + ix]
            end
        end
    end

    return color / count
end

was actually 1.2 seconds slower overall, despite saving about 1M allocations. The GC percentage actually went up, which I don’t have an explanation for:

68.863995 seconds (18.10 M allocations: 51.164 GiB, 6.88% gc time)

vs

67.658403 seconds (19.15 M allocations: 51.270 GiB, 6.69% gc time)

I had better luck with a small algorithm change to use the results of distance_transform more than once when possible (saved about 20% of the calls to both feature_transform and distance_transform).

Because of that algorithm change, just computing the max distance as you suggest, without allocating the whole distance array, was a regression. However, I built a version of the function which accepts an array to populate, to avoid reallocs, and computed the max while building the array, which shaved about 8 seconds off the threaded version.

The code (I simplified/specialized the library code while I was at it) is as follows:

function distance_transform_max(F::Array{CartesianIndex{2},2}, D::Array{Float64,2})
    ix, d  = CartesianIndex(0, 0), -1.0
    @inbounds for i in CartesianIndices(axes(F))
        fi = F[i] - i
        dst =  sqrt(fi[1] * fi[1] + fi[2] * fi[2])
        if dst > d
             d = dst
             ix = i
        end
        D[i] = dst
    end
    return (D, d, ix)
end

Interestingly, simply modifying distance_transform to accept a buffer to fill with distances was a regression of about 5 seconds vs a call to Images.distance_transform. I wonder if this is due to compilation time?

"""
    distance_transform_na(F, D)

Specialized Images.distance_transform for my purposes; allows passing in a return
array D to avoid memory churn with repeated calls (the suffix na stands for "no allocs").
"""
function distance_transform_na(F::Array{CartesianIndex{2},2}, D::Array{Float64,2})
    @inbounds for i in CartesianIndices(axes(F))
        fi = F[i] - i
        D[i] = sqrt(fi[1] * fi[1] + fi[2] * fi[2])
    end
    return D
end

I’m intrigued by your suggestion of customizing a feature_transform seeded by the previously detected circle, but I first want to figure out how to make the code more thread-friendly. Currently, the largest job (the background) takes most of the time, while the rest of the threads finish quite quickly:

Thread 8 worked for 0.4810616970062256 seconds to produce 325 circles [675.5890190854128 c/s]
Thread 7 worked for 2.7836406230926514 seconds to produce 610 circles [219.1374831002014 c/s]
Thread 6 worked for 2.9757633209228516 seconds to produce 680 circles [228.51279710952167 c/s]
Thread 5 worked for 7.285462141036987 seconds to produce 830 circles [113.92551137213935 c/s]
Thread 4 worked for 7.3920910358428955 seconds to produce 871 circles [117.82863546683618 c/s]
Thread 2 worked for 22.020059823989868 seconds to produce 1520 circles [69.02796868626253 c/s]
Thread 3 worked for 30.390032291412354 seconds to produce 1153 circles [37.94007156503798 c/s]
Thread 1 worked for 70.07796239852905 seconds to produce 1853 circles [26.44197885580781 c/s]

Splitting the jobs up a bit better seems like it will allow me to use the CPUs at my disposal more effectively. Of course once I’ve done that, I’ll want to continue improving my algorithmic efficiency some more …
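One way to attack part of that imbalance (a rough sketch; segments and circle_segment are hypothetical stand-ins, not names from the gist) is to spawn one task per segment, largest first, and let Julia's scheduler keep every thread busy instead of pre-assigning fixed chunks:

using Base.Threads

# Hypothetical: `segments` is a vector of per-segment index lists and `circle_segment`
# produces the circles for one segment. Starting the largest segments first avoids the
# big background job ending up last while other threads sit idle.
function circle_all(segments)
    order = sortperm(segments; by = length, rev = true)
    tasks = [Threads.@spawn circle_segment(segments[i]) for i in order]
    return reduce(vcat, fetch.(tasks))
end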

I have used the excellent TimerOutputs.jl to measure where the time goes:

@timeit "transform" begin
    @timeit "feature_transform" f = Images.feature_transform(mask)
    @timeit "distance_transform" edt = Images.distance_transform(f)
    @timeit "argmax" ix = argmax(edt)
    r = edt[ix]
end

I obtained the following (this is from the first run; the sections are called several times):

Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 transform              14.3k     137s  99.2%  9.59ms   49.6GiB  100%   3.54MiB
   feature_transform      747    8.83s  6.38%  11.8ms   2.38GiB  4.78%  3.26MiB
   argmax                 747    1.08s  0.78%  1.45ms     0.00B  0.00%    0.00B
   distance_transform     747    735ms  0.53%   984μs   1.17GiB  2.34%  1.60MiB
 assign                 7.45k    990ms  0.71%   133μs   93.8MiB  0.18%  12.9KiB
 circle_indices         7.45k   61.5ms  0.04%  8.26μs   47.2MiB  0.09%  6.48KiB
 make_mask                171   18.6ms  0.01%   109μs   38.7MiB  0.08%   232KiB
 ──────────────────────────────────────────────────────────────────────────────

It is clear that this part of the code is the bottleneck of the program, I think because it is allocating a lot of memory.

I hope this could help you.

3 Likes

I am sorry in advance if my reply looks offensive, but let me say that I can’t agree with your results. They are impossible and can mean only one thing: something is broken either in your time measurements or in your implementation.

The thing is that, despite all the various tips and techniques, in a nutshell Julia code optimization is rather straightforward: if you do fewer operations, your code is faster. This is a somewhat different situation from languages like Python, where parts of your code can be implemented in “black magic highly optimized C code”, so you can write suboptimal code that still runs faster, thanks to these C detours, than a direct but slow pure-Python implementation.

This means that if your code makes fewer passes over an array, it runs faster. If your code takes the square root of the distance only at the end of the calculation, it runs faster than code that takes a square root on every iteration. If your code calculates something on the fly, it is faster than the same code that stores intermediate values in an array. Fewer operations, faster code, period. Which is why, when you see regressions with better code, it can mean only one thing: something is broken.
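As a toy illustration of the square-root point (synthetic data, nothing to do with the script itself): compare squared distances inside the loop and take a single sqrt for the winner at the end.

# Both functions return the maximum distance; the second avoids a sqrt per element.
function max_dist_sqrt_each(sq_dists)
    m = -Inf
    for d2 in sq_dists
        m = max(m, sqrt(d2))      # sqrt on every iteration
    end
    return m
end

function max_dist_sqrt_once(sq_dists)
    m = -Inf
    for d2 in sq_dists
        m = max(m, d2)            # compare squared values
    end
    return sqrt(m)                # a single sqrt at the very end
end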

Now, it’s hard to tell what exactly went wrong, but I have suspicions. If you plug your new code into a big multithreaded implementation, it is possible that the measurement noise from multithreading shadowed the performance gain. As an example, if the timing can vary by plus or minus 10 seconds due to multithreading (because of scheduling, or because your OS starts some other process in the background), and the performance increase is 5 seconds, you can see all sorts of random results, including spurious regressions.

In order to avoid these difficulties, one can use following two rules to properly optimize code:

  1. Optimize the single-threaded version only.
  2. Use tools like BenchmarkTools.jl to properly measure function runtime and exclude all random noise (see the sketch below).
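A minimal sketch of both rules together (the mask is again a synthetic stand-in): start Julia with julia --threads=1 and benchmark one hot call in isolation.

using BenchmarkTools
using ImageMorphology: feature_transform, distance_transform

@assert Threads.nthreads() == 1           # rule 1: benchmark the single-threaded path

mask = trues(512, 512)                    # synthetic segment stand-in
mask[200:300, 200:300] .= false

# rule 2: @benchmark collects many samples and reports min/median/mean, so one-off
# noise (scheduling, background processes) can't masquerade as a speedup or regression.
@benchmark distance_transform(feature_transform($mask))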

I’ve made some incremental optimizations, applying these two rules, which you can find in https://github.com/Arkoniak/Circles_example.jl

The structure of the repository is the following:

  • benchmark.jl - a script where I gathered all intermediate benchmark results.
  • circles.jl - the original single-threaded script.
  • circles1.jl - the same script, but reorganized so that one can benchmark the runtime of a single segment.
  • circles2.jl - circles4.jl - incremental speedups of the original script.

I benchmarked segment number 5, since it was relatively small, but also measured the time of segment 1, to check that the changes were properly reflected in other segments as well.

In circles2.jl I removed the creation of intermediate circle indices and moved the mask update and mean color calculation inside the circle-index generation. It gave a ~6% time decrease.

In circles3.jl the unnecessary allocations for the radius calculation were removed and the radius was calculated in a single pass. It gave another ~7% time decrease, for an overall decrease of ~13%.

In circles4.jl the array preallocations from feature_transform were moved outside. It gave another ~5% time decrease, for an overall decrease of ~18%.

So, in the end I get the following:

➜  circles julia --project=. circles1.jl
 No args supplied, using defaults: Dict{String, Any}("min_radius" => 2, "image" => "test.jpg")
 Loading test.jpg ...
   1.429699 seconds (1.94 M allocations: 141.943 MiB, 2.51% gc time, 63.19% compilation time)
 Resizing ...
   0.433025 seconds (1.31 M allocations: 90.066 MiB, 6.20% gc time, 94.54% compilation time)
 Segmenting ...
   1.233071 seconds (2.68 M allocations: 247.583 MiB, 2.17% gc time, 35.45% compilation time)
 Circling ...
  88.037141 seconds (3.50 M allocations: 41.992 GiB, 0.98% gc time, 1.50% compilation time)
 Made 8517 circles
 Drawing ...
   0.690908 seconds (174.99 k allocations: 12.977 MiB, 0.58% gc time, 38.12% compilation time)

and

➜  circles julia --project=. circles4.jl
 No args supplied, using defaults: Dict{String, Any}("min_radius" => 2, "image" => "test.jpg")
 Loading test.jpg ...
   1.109887 seconds (1.94 M allocations: 141.757 MiB, 4.24% gc time, 79.77% compilation time)
 Resizing ...
   0.407106 seconds (1.31 M allocations: 90.177 MiB, 3.77% gc time, 93.59% compilation time)
 Segmenting ...
   1.191467 seconds (2.67 M allocations: 247.250 MiB, 2.81% gc time, 35.59% compilation time)
 Circling ...
  73.818242 seconds (3.21 M allocations: 288.362 MiB, 0.28% gc time, 1.41% compilation time)
 Made 8517 circles
 Drawing ...
   0.463433 seconds (174.99 k allocations: 12.977 MiB, 1.33% gc time, 56.77% compilation time)

With this, the circling time went down from 88 seconds to 73 seconds, with a much lighter memory footprint.

Now that it is known that this implementation is more performant than the original, one can do one of the following: either implement a multithreaded version, or try to speed up computeft!, taking into account that only a small part of the mask needs to be updated.

6 Likes

Thanks @dmolina! I haven’t used TimerOutputs.jl before, thanks for the pointer and the example code using it!

@Skoffer, thanks very much for your help!

I appreciate your persistence as well as the work you put into constructing working examples and showing the steps along the way.

I think your hypothesis that threading added too much noise to the timings I reported is likely correct; it’s a mistake I won’t make again.

2 Likes

To give a concrete example, one thing that could be “broken” in the implementation is arrays whose element types aren’t concrete. This can make code that accesses and processes elements of those arrays allocate a lot of memory and run much slower.
But I assume you’ve gone through the code to ensure type stability?
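For instance (a minimal illustration, not code from the script):

# An abstractly-typed container forces boxing and dynamic dispatch when its elements
# are used; a concretely-typed one does not.
abstract_ixs = CartesianIndex[]         # eltype CartesianIndex (abstract: dimension unknown)
concrete_ixs = CartesianIndex{2}[]      # eltype CartesianIndex{2} (concrete)

push!(abstract_ixs, CartesianIndex(1, 1))
push!(concrete_ixs, CartesianIndex(1, 1))

isconcretetype(eltype(abstract_ixs))    # false -> code using these elements may allocate
isconcretetype(eltype(concrete_ixs))    # true  -> fast, allocation-free element access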

FWIW, I didn’t comb over the code, but at a glance it didn’t look like this’d be a problem. E.g., CartesianIndex{2}[] makes it look like jsundram was aware of the problem and deliberately addressing it (making sure the array was concretely typed).

EDIT:
One thing I did see:

function make_mask(label_ix, imgh, imgw)
    # Create a mask array that's just large enough to contain
    # the segment we are working on. This will allow the distance transform
    # to operate on the smallest array possible, for speed.
    pad = Dict(true => 0, false => 1)
    (top, left), (bottom, right) = map(ix -> ix.I, extrema(label_ix))

    # Allocate a pixel-wide boundary strip if we aren't at the image border.
    top_pad, bottom_pad = pad[top == 1], pad[bottom == imgh]
    left_pad, right_pad = pad[left == 1], pad[right == imgw]

    mask = ones(Bool,
        bottom - top + 1 + top_pad + bottom_pad,
        right - left + 1 + left_pad + right_pad
    )

    new_ix = map(x -> x - CartesianIndex(top - top_pad - 1, left - left_pad - 1), label_ix)
    mask[new_ix] .= 0

    return (left - left_pad, top - top_pad, mask)
end

You could replace pad with a tuple:

function make_mask(label_ix, imgh, imgw)
    # Create a mask array that's just large enough to contain
    # the segment we are working on. This will allow the distance transform
    # to operate on the smallest array possible, for speed.
    pad = (1, 0)
    (top, left), (bottom, right) = map(ix -> ix.I, extrema(label_ix))

    # Allocate a pixel-wide boundary strip if we aren't at the image border.
    top_pad, bottom_pad = pad[1 + (top == 1)], pad[1 + (bottom == imgh)]
    left_pad, right_pad = pad[1 + (left == 1)], pad[1 + (right == imgw)]

    mask = ones(Bool,
        bottom - top + 1 + top_pad + bottom_pad,
        right - left + 1 + left_pad + right_pad
    )

    new_ix = map(x -> x - CartesianIndex(top - top_pad - 1, left - left_pad - 1), label_ix)
    mask[new_ix] .= 0

    return (left - left_pad, top - top_pad, mask)
end

Similarly, as in the earlier advice, replacing these multiple allocations and passes over the data with a single loop that creates mask would be faster, for example along these lines:
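A rough, untested sketch of that single-loop version (same inputs and return values as make_mask above):

function make_mask_oneloop(label_ix, imgh, imgw)
    (top, left), (bottom, right) = map(ix -> ix.I, extrema(label_ix))

    # One-pixel boundary strip unless the segment touches the image border.
    top_pad, bottom_pad = Int(top != 1), Int(bottom != imgh)
    left_pad, right_pad = Int(left != 1), Int(right != imgw)

    mask = ones(Bool,
        bottom - top + 1 + top_pad + bottom_pad,
        right - left + 1 + left_pad + right_pad
    )

    # Mark segment pixels directly instead of materializing new_ix and broadcasting.
    offset = CartesianIndex(top - top_pad - 1, left - left_pad - 1)
    for ix in label_ix
        mask[ix - offset] = false
    end

    return (left - left_pad, top - top_pad, mask)
end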

Oops, I must have done something silly and accidentally posted in the wrong thread. Sorry about that