# How to make a function 120 times faster

The purpose of the function is to label the groups of “interconnected” cells (two cells are connected if both contain a 1 and are side by side or one above the other).
My first crude algorithm did the job in around 200 µs.
As I refined and tweaked the code I reduced the time to a few µs (3 to 4 µs).

However, I find myself not understanding the result of using the @btime macro in the following cases.
When the variables are defined inside the arcipelago function, I get a time 120 times greater than when these variables are defined in Main.
Here is how.
I hope someone can explain to me why.

``````
julia> using Dictionaries

julia> A = rand([1, 0], 8, 10);

julia> a = copy(A);

julia> function arcipelago(a)
           cidx = findall(>(0), a)
           dcidx = Dictionary(cidx, fill(-1, length(cidx)))
           CI = CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
           i = 1
           c = findfirst(==(-1), dcidx)
           while !isnothing(c)
               cc = CartesianIndex{2}[]
               ci = [c]
               while !isempty(ci)
                   for j in ci
                       a[j] = i
                       dcidx[j] = 1
                       for n in CI
                           x = j + n
                           if haskey(dcidx, x) && dcidx[x] == -1  # only queue unvisited land cells
                               dcidx[x] = 0
                               push!(cc, x)
                           end
                       end
                   end
                   ci = cc
                   cc = CartesianIndex{2}[]
               end
               c = findfirst(==(-1), dcidx)
               i += 1
           end
           a
       end
arcipelago (generic function with 1 method)

julia> @btime arcipelago(a)
6.180 μs (72 allocations: 7.75 KiB)
8×10 Matrix{Int64}:
1  0  0  2  2  0  0  7  7  0
0  2  2  2  0  0  0  0  7  7
2  2  2  2  2  0  0  7  7  7
2  2  0  0  2  0  0  0  0  0
0  2  2  2  0  2  0  0  0  0
3  0  0  2  2  2  2  0  0  9
0  0  0  0  0  0  0  8  8  0
4  4  0  5  0  6  6  0  8  0
``````
``````
julia> cidx = findall(>(0), a);

julia> dcidx = Dictionary(cidx, fill(-1, length(cidx)));

julia> CI = CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)]);

julia> function arcipelago(a)
           # cidx = findall(>(0), a)
           # dcidx = Dictionary(cidx, fill(-1, length(cidx)))
           # CI = CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
           i = 1
           c = findfirst(==(-1), dcidx)
           while !isnothing(c)
               cc = CartesianIndex{2}[]
               ci = [c]
               while !isempty(ci)
                   for j in ci
                       a[j] = i
                       dcidx[j] = 1
                       for n in CI
                           x = j + n
                           if haskey(dcidx, x) && dcidx[x] == -1  # only queue unvisited land cells
                               dcidx[x] = 0
                               push!(cc, x)
                           end
                       end
                   end
                   ci = cc
                   cc = CartesianIndex{2}[]
               end
               c = findfirst(==(-1), dcidx)
               i += 1
           end
           a
       end
arcipelago (generic function with 1 method)

julia> @btime arcipelago(a)
53.179 ns (0 allocations: 0 bytes)
8×10 Matrix{Int64}:
1  0  0  2  2  0  0  7  7  0
0  2  2  2  0  0  0  0  7  7
2  2  2  2  2  0  0  7  7  7
2  2  0  0  2  0  0  0  0  0
0  2  2  2  0  2  0  0  0  0
3  0  0  2  2  2  2  0  0  9
0  0  0  0  0  0  0  8  8  0
4  4  0  5  0  6  6  0  8  0
``````

That’s because global variables cause slowdowns: the compiler has to assume their type could change at any time. See the performance tips section of the manual.
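A minimal sketch of the effect (the names `gx`, `sum_global`, and `sum_arg` are made up for illustration): the same loop allocates when it reads a non-`const` global, because the compiler cannot infer the element type, while passing the value as an argument makes it type-stable.

``````
gx = collect(1:1000)          # non-const global: its type could change at any time

function sum_global()         # reads the untyped global gx
    s = 0
    for x in gx
        s += x                # x is boxed, so each iteration allocates
    end
    s
end

function sum_arg(v)           # same work, but the type of v is known to the compiler
    s = 0
    for x in v
        s += x
    end
    s
end

sum_global(); sum_arg(gx)     # warm up (compile) both
@show @allocated sum_global() # allocates
@show @allocated sum_arg(gx)  # does not allocate (after warm-up)
``````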


There is probably something I am doing wrong in the measurements, but the puzzle is that I get the fastest result with the global variables.


Anything you can do with global variables, you can do faster without them, with a bit of rearranging.

Because you have moved some code outside the function, the time spent setting up `cidx`, `dcidx`, and `CI` is not counted by @btime in the second version of arcipelago. I think this explains the difference in speed: the second version simply does less work.

The main problem is that `@btime` makes several calls to the function while timing it, and the global variables are not reset between calls. In particular this goes for `dcidx`, causing the function to do no work at all in the calls after the first. Look into the setup functionality of BenchmarkTools and make sure to set `evals` to one.
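For reference, a sketch of what that looks like (`f!` is a made-up stand-in for any function that consumes `-1` sentinels from its input): with `setup` the state is rebuilt before each measurement, and `evals=1` guarantees each measurement runs the function exactly once on fresh state.

``````
using BenchmarkTools

# f! replaces every -1 sentinel with 1, so a second call on the same vector does nothing
f!(v) = (for i in eachindex(v); v[i] == -1 && (v[i] = 1); end; v)

# misleading: v is mutated on the first call, so later evaluations time a no-op
v = fill(-1, 100)
@btime f!($v)

# correct: rebuild the state before each evaluation and evaluate only once per sample
@btime f!(v) setup=(v = fill(-1, 100)) evals=1
``````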


I had already tested this hypothesis and it is not enough to justify the big difference. At least I think so.

``````@btime cidx=findall(>(0),a)  # 500 ns

@btime dcidx=Dictionary(cidx,fill(-1, length(cidx)))  # 300 ns

@btime CI=[CartesianIndex(0, 1),
           CartesianIndex(1, 0),
           CartesianIndex(0, -1),
           CartesianIndex(-1, 0)]  # 20 ns
``````

If you mean the following parameter, it seems to be set that way by default already:

``````julia> using Dictionaries, BenchmarkTools

julia> BenchmarkTools.DEFAULT_PARAMETERS.evals
1

``````

Leaving only the definition of `CI` outside (a read-only vector, in the sense that the function does not modify it), even stranger numbers come out:

``````
julia> function arcipelago(a)
           cidx = findall(>(0), a)
           dcidx = Dictionary(cidx, fill(-1, length(cidx)))
           # CI = CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
           i = 1
           c = findfirst(==(-1), dcidx)
           while !isnothing(c)
               cc = CartesianIndex{2}[]
               ci = [c]
               while !isempty(ci)
                   for j in ci
                       a[j] = i
                       dcidx[j] = 1
                       for n in CI
                           x = j + n
                           if haskey(dcidx, x) && dcidx[x] == -1  # only queue unvisited land cells
                               dcidx[x] = 0
                               push!(cc, x)
                           end
                       end
                   end
                   ci = cc
                   cc = CartesianIndex{2}[]
               end
               c = findfirst(==(-1), dcidx)
               i += 1
           end
           a
       end
arcipelago (generic function with 1 method)

julia> a
8×10 Matrix{Int64}:
1  1  1  0  1  0  1  1  1  0
1  1  1  1  0  1  0  0  1  0
1  1  1  1  0  1  1  0  1  0
0  0  0  0  0  0  0  0  1  0
1  1  0  1  1  1  1  1  0  1
1  0  0  1  1  1  1  1  1  0
0  0  0  1  0  1  0  0  1  1
1  1  1  0  0  1  1  1  1  1

julia> @btime arcipelago(a) #60ns
ERROR: UndefVarError: CI not defined
Stacktrace:
``````
``````
julia> CI = [CartesianIndex(0, 1),
             CartesianIndex(1, 0),
             CartesianIndex(0, -1),
             CartesianIndex(-1, 0)]
4-element Vector{CartesianIndex{2}}:
 CartesianIndex(0, 1)
 CartesianIndex(1, 0)
 CartesianIndex(0, -1)
 CartesianIndex(-1, 0)

julia> @btime arcipelago(a) #60ns
16.300 μs (465 allocations: 21.53 KiB)
8×10 Matrix{Int64}:
1  1  1  0  5  0  7  7  7  0
1  1  1  1  0  6  0  0  7  0
1  1  1  1  0  6  6  0  7  0
0  0  0  0  0  0  0  0  7  0
2  2  0  4  4  4  4  4  0  8
2  0  0  4  4  4  4  4  4  0
0  0  0  4  0  4  0  0  4  4
3  3  3  0  0  4  4  4  4  4
``````

I guess mutable globals don’t play well with BenchmarkTools. This is how I would do the benchmarking.

``````using BenchmarkTools, Dictionaries
const CI = CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
function arcipelago(a, cidx, dcidx)
    i = 1
    c = findfirst(==(-1), dcidx)
    while !isnothing(c)
        cc = CartesianIndex{2}[]
        ci = [c]
        while !isempty(ci)
            for j in ci
                a[j] = i
                dcidx[j] = 1
                for n in CI
                    x = j + n
                    if haskey(dcidx, x) && dcidx[x] == -1  # only queue unvisited land cells
                        dcidx[x] = 0
                        push!(cc, x)
                    end
                end
            end
            ci = cc
            cc = CartesianIndex{2}[]
        end
        c = findfirst(==(-1), dcidx)
        i += 1
    end
    a
end
A = rand([1, 0], 8, 10)
@btime arcipelago(a, cidx, dcidx) setup=(a=copy(A); cidx=findall(>(0),a); dcidx=Dictionary(cidx,fill(-1, length(cidx)))) evals=1
``````

And to reiterate, all the timing artifacts are due to repeated calls in which `dcidx` no longer contains `-1`: `findfirst(==(-1), dcidx)` returns `nothing` and the while loop is never entered. You can add some debug prints to see it in action while running `@btime`.
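The artifact is easy to reproduce in isolation (made-up names, using a plain `Dict` so no packages are needed): once the `-1` sentinels have been consumed, `findfirst` returns `nothing` and the loop body never runs again.

``````
function label!(state)
    n = 0                              # count how many cells we actually process
    c = findfirst(==(-1), state)       # key of the first unvisited cell, or nothing
    while !isnothing(c)
        state[c] = 1                   # consume the sentinel, as dcidx[j] = 1 does
        n += 1
        c = findfirst(==(-1), state)
    end
    n
end

state = Dict(i => -1 for i in 1:4)
println(label!(state))                 # 4: the first call does the real work
println(label!(state))                 # 0: the state was not reset, so the loop is never entered
``````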


OK, this seems plausible, but measuring in this other, equally reasonable way gives yet another result.

``````using BenchmarkTools, Dictionaries
const CI = CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
const cidx=findall(>(0),a)
function arcipelago(a, dcidx)
    i = 1
    c = findfirst(==(-1), dcidx)
    while !isnothing(c)
        cc = CartesianIndex{2}[]
        ci = [c]
        while !isempty(ci)
            for j in ci
                a[j] = i
                dcidx[j] = 1
                for n in CI
                    x = j + n
                    if haskey(dcidx, x) && dcidx[x] == -1  # only queue unvisited land cells
                        dcidx[x] = 0
                        push!(cc, x)
                    end
                end
            end
            ci = cc
            cc = CartesianIndex{2}[]
        end
        c = findfirst(==(-1), dcidx)
        i += 1
    end
    a
end
A = rand([1, 0], 8, 10)

julia> @btime arcipelago(a, dcidx) setup=(a=copy(A); dcidx=Dictionary(cidx,fill(-1, length(cidx)))) evals=1
5.800 μs (75 allocations: 6.33 KiB)
8×10 Matrix{Int64}:
``````

I’m not sure what you are comparing now but I don’t see any significant difference between my proposal and your latest post (after some minor fixes to make it run).

Yes, of course.
The difference is that I moved the other constant vector out as well and used the scheme you propose.
I was lamenting that it is not easy to make correct measurements.

Sorry, I mean I don’t see any significant difference in reported speed between those two implementations.

You are right. I had run a test and thought I was seeing 13 µs rather than 5 to 6 µs.

PS
For future reference, I did a test as you suggested, putting a debug print in a “strategic” position.

``````
function arcipelago(a)
    # cidx = findall(>(0), a)
    dcidx = Dictionary(cidx, fill(-1, length(cidx)))
    # CI = CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
    i = 1
    c = findfirst(==(-1), dcidx)
    while !isnothing(c)
        cc = CartesianIndex{2}[]
        ci = [c]
        while !isempty(ci)
            for j in ci
                a[j] = i
                dcidx[j] = 1
                for n in CI
                    x = j + n
                    if haskey(dcidx, x) && dcidx[x] == -1  # only queue unvisited land cells
                        dcidx[x] = 0
                        push!(cc, x)
                    end
                end
            end
            ci = cc
            cc = CartesianIndex{2}[]
        end
        c = findfirst(==(-1), dcidx)
        i += 1
        println(i)  # debug print: visible on every call that actually does work
    end
    a
end
``````

I know this was a performance question, but are you aware of this: Docstrings · ImageMorphology.jl?

Yes.
My problem comes from trying to “emulate” what a package function does, after reading the first post of this discussion.
I found the problem really inspiring and wanted to try and see if I could solve it.
Then I tried to improve it (in terms of running time),
making some progress and experimenting/learning various things as I changed some loops or supporting structures of the algorithm.
PS
In reality, as I imagined, the problem here was more a matter of measuring performance than of performance in the strict sense.