How to make a function 120 times faster

The function has the purpose of labeling the groups of “interconnected” cells [two cells are connected if have a 1 and are side by side or one above the other].
I had tried a first crude algorithm that did the job around 200us.
As I refined and tweaked the code I reduced the time to a few us (3 / 4us).

Instead, I find myself not understanding the result of using the @btime macro in the following cases.
When the avriables are defined inside the archipelago function I get a time 120 times greater than when these variables are defined in the main.
This is the how.
I hope someone will explain to me why.

 A=rand([1,0], 8,10)
 a=copy(A);

julia> function arcipelago(a)
           cidx=findall(>(0),a) 
           dcidx=Dictionary(cidx,fill(-1, length(cidx))) 
           CI=CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)]) 
           i=1
           c=findfirst(==(-1),dcidx)
           while !isnothing(c)
               cc=CartesianIndex{2}[]
               ci=[c]
               while !isempty(ci)
                   for j in ci
                           a[j]=i
                           dcidx[j]=1
                       for n in CI
                           x=j+n
                           if haskey(dcidx,x) &&  dcidx[x]==-1
                               dcidx[x]=0
                               push!(cc,x)
                           end
                       end
                   end
                   ci=cc
                   cc=CartesianIndex{2}[]
               end
               c=findfirst(==(-1),dcidx)
               i+=1
           end
           a
       end
arcipelago (generic function with 1 method)

julia> @btime arcipelago(a)
  6.180 μs (72 allocations: 7.75 KiB)
8×10 Matrix{Int64}:
 1  0  0  2  2  0  0  7  7  0
 0  2  2  2  0  0  0  0  7  7
 2  2  2  2  2  0  0  7  7  7
 2  2  0  0  2  0  0  0  0  0
 0  2  2  2  0  2  0  0  0  0
 3  0  0  2  2  2  2  0  0  9
 0  0  0  0  0  0  0  8  8  0
 4  4  0  5  0  6  6  0  8  0
cidx=findall(>(0),a) 
dcidx=Dictionary(cidx,fill(-1, length(cidx))) 
CI=CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])

julia> function arcipelago(a)
           # cidx=findall(>(0),a) 
           # dcidx=Dictionary(cidx,fill(-1, length(cidx))) 
           # CI=CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)]) 
           i=1
           c=findfirst(==(-1),dcidx)
           while !isnothing(c)
               cc=CartesianIndex{2}[]
               ci=[c]
               while !isempty(ci)
                   for j in ci
                           a[j]=i
                           dcidx[j]=1
                       for n in CI
                           x=j+n
                           if haskey(dcidx,x) &&  dcidx[x]==-1      
                               dcidx[x]=0
                               push!(cc,x)
                           end
                       end
                   end
                   ci=cc
                   cc=CartesianIndex{2}[]
               end
               c=findfirst(==(-1),dcidx)
               i+=1
           end
           a
       end
arcipelago (generic function with 1 method)

julia> 

julia> 

julia> @btime arcipelago(a)
  53.179 ns (0 allocations: 0 bytes)
8×10 Matrix{Int64}:
 1  0  0  2  2  0  0  7  7  0
 0  2  2  2  0  0  0  0  7  7
 2  2  2  2  2  0  0  7  7  7
 2  2  0  0  2  0  0  0  0  0
 0  2  2  2  0  2  0  0  0  0
 3  0  0  2  2  2  2  0  0  9
 0  0  0  0  0  0  0  8  8  0
 4  4  0  5  0  6  6  0  8  0

That’s because global variables cause slowdowns (because the compiler needs to assume the type could change at any time). See the performance tips.

1 Like

there is probably something I do wrong in the measures, but the problem is that I get the fastest result with global variables.

2 Likes

Anything you can do with global variables you can do faster without with a bit of rearranging

Because you have moved some code outside the function, the time setting up ‘cidx’, ‘dcidx’, and ‘CI’ are not counted by @btime in the second version of arcipelago. I think this would explain the difference in speed: the second version simply does less work.

The main problem is that @btime makes several calls to the function while timing it and the global variables are not reset between calls. In particular this goes for dcidx, causing the function to do no work at all in the calls after the first. Look into the setup functionality of BenchmarkTools and pay attention to set evals to one.

3 Likes

I had already tested this hypothesis and it is not enough to justify the big difference. At least I think so.

@btime cidx=findall(>(0),a) # 500ns

@btime dcidx=Dictionary(cidx,fill(-1, length(cidx))) # 300 ns

@btime CI=[ CartesianIndex(0, 1),
CartesianIndex(1, 0),
CartesianIndex(0, -1),
CartesianIndex(-1, 0)]  #20ns

if it is the following parameter, It seems to be that way by default already

julia> using Dictionaries, BenchmarkTools

julia> BenchmarkTools.DEFAULT_PARAMETERS.evals
1

leaving out only the definition of CI which is a read-only vector (in the sense that in the function it is not modified), even more strange numbers come out

julia> function arcipelago(a)
           cidx=findall(>(0),a)
           dcidx=Dictionary(cidx,fill(-1, length(cidx)))
           # CI=CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
           i=1
           c=findfirst(==(-1),dcidx)
           while !isnothing(c)
               cc=CartesianIndex{2}[]
               ci=[c]
               while !isempty(ci)
                   for j in ci
                           a[j]=i
                           dcidx[j]=1
                       for n in CI
                           x=j+n
                           if haskey(dcidx,x) &&  dcidx[x]==-1
                               dcidx[x]=0
                               push!(cc,x)
                           end
                       end
                   end
                   ci=cc
                   cc=CartesianIndex{2}[]
               end
               c=findfirst(==(-1),dcidx)
               i+=1
           end
           a
       end
arcipelago (generic function with 1 method)

julia> a
8×10 Matrix{Int64}:
 1  1  1  0  1  0  1  1  1  0
 1  1  1  1  0  1  0  0  1  0
 1  1  1  1  0  1  1  0  1  0
 0  0  0  0  0  0  0  0  1  0
 1  1  0  1  1  1  1  1  0  1
 1  0  0  1  1  1  1  1  1  0
 0  0  0  1  0  1  0  0  1  1
 1  1  1  0  0  1  1  1  1  1

julia> @btime arcipelago(a) #60ns
ERROR: UndefVarError: CI not defined
Stacktrace:
julia>        CI=[ CartesianIndex(0, 1),
       CartesianIndex(1, 0),
       CartesianIndex(0, -1),
       CartesianIndex(-1, 0)]
4-element Vector{CartesianIndex{2}}:
 CartesianIndex(0, 1)
 CartesianIndex(1, 0)
 CartesianIndex(0, -1)
 CartesianIndex(-1, 0)

julia> @btime arcipelago(a) #60ns
  16.300 μs (465 allocations: 21.53 KiB)
8×10 Matrix{Int64}:
 1  1  1  0  5  0  7  7  7  0
 1  1  1  1  0  6  0  0  7  0
 1  1  1  1  0  6  6  0  7  0
 0  0  0  0  0  0  0  0  7  0
 2  2  0  4  4  4  4  4  0  8
 2  0  0  4  4  4  4  4  4  0
 0  0  0  4  0  4  0  0  4  4
 3  3  3  0  0  4  4  4  4  4

I guess mutable globals don’t play well with BenchmarkTools. This is how I would do the benchmarking.

using BenchmarkTools, Dictionaries
const CI = CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
function arcipelago(a, cidx, dcidx)
    i=1
    c=findfirst(==(-1),dcidx)
    while !isnothing(c)
        cc=CartesianIndex{2}[]
        ci=[c]
        while !isempty(ci)
            for j in ci
                a[j]=i
                dcidx[j]=1
                for n in CI
                    x=j+n
                    if haskey(dcidx,x) &&  dcidx[x]==-1      
                        dcidx[x]=0
                        push!(cc,x)
                    end
                end
            end
            ci=cc
            cc=CartesianIndex{2}[]
        end
        c=findfirst(==(-1),dcidx)
        i+=1
    end
    a
end
A = rand([1, 0], 8, 10)
@btime arcipelago(a, cidx, dcidx) setup=(a=copy(A); cidx=findall(>(0),a); dcidx=Dictionary(cidx,fill(-1, length(cidx)))) evals=1

And to reiterate, all the timing artifacts are due to repeated calls where dcidx no longer is set to -1, findfirst(==(-1), dcidx) returns nothing and the while loop isn’t ever entered. You can add some debug prints to see it in action while running @btime.

1 Like

ok. this seems plausible. but measured in this other way which is equally reasonable gives a still different result.

using BenchmarkTools, Dictionaries
const CI = CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
const cidx=findall(>(0),a) 
function arcipelago(a,  dcidx)
    i=1
    c=findfirst(==(-1),dcidx)
    while !isnothing(c)
        cc=CartesianIndex{2}[]
        ci=[c]
        while !isempty(ci)
            for j in ci
                a[j]=i
                dcidx[j]=1
                for n in CI
                    x=j+n
                    if haskey(dcidx,x) &&  dcidx[x]==-1      
                        dcidx[x]=0
                        push!(cc,x)
                    end
                end
            end
            ci=cc
            cc=CartesianIndex{2}[]
        end
        c=findfirst(==(-1),dcidx)
        i+=1
    end
    a
end
A = rand([1, 0], 8, 10)julia> @btime arcipelago(a, dcidx) setup=(a=copy(A); dcidx=Dictionary(cidx,fill(-1, length(cidx)))) evals=1
  5.800 μs (75 allocations: 6.33 KiB)
8×10 Matrix{Int64}:

I’m not sure what you are comparing now but I don’t see any significant difference between my proposal and your latest post (after some minor fixes to make it run).

Yes, of course.
the difference is that I put out the other constant vector and used the scheme you propose.
I regretted that it is not easy to make the correct measurements

Sorry, I mean I don’t see any significant difference in reported speed between those two implementations.

you are right. I had done a test and I thought I was seeing 13us and not 5 / 6us.

PS
for future reference I did a test as you suggested, putting a debug print in a “strategic” position


function arcipelago(a)
    # cidx=findall(>(0),a) 
    dcidx=Dictionary(cidx,fill(-1, length(cidx))) 
    # CI=CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)]) 
    i=1
    c=findfirst(==(-1),dcidx)
    while !isnothing(c)
        cc=CartesianIndex{2}[]
        ci=[c]
        while !isempty(ci)
            for j in ci
                    a[j]=i
                    dcidx[j]=1
                for n in CI
                    x=j+n
                    if haskey(dcidx,x) &&  dcidx[x]==-1      
                        dcidx[x]=0
                        push!(cc,x)
                    end
                end
            end
            ci=cc
            cc=CartesianIndex{2}[]
        end
        c=findfirst(==(-1),dcidx)
        i+=1
        **println(i)**
    end
    a
end

I know this was a performance question, but are you aware of Docstrings · ImageMorphology.jl this?

Yes.
my problem derives from the attempt to “emulate” what a package function does after reading the first post of this discussion.
I found the problem really inspiring and wanted to try and see if I could solve it.
Then I tried to improve (in terms of running time),
making some progress and experimenting / learning various things as I changed some loops or supporting structures of the algorithm.
PS
In reality, as I imagined, the problem here was more of a measure of performance than of performance in the strict sense.