How to make a function 120 times faster

rocco_sprmnt21 · July 6, 2022, 5:17pm

The function has the purpose of labeling the groups of “interconnected” cells [two cells are connected if have a 1 and are side by side or one above the other].
I had tried a first crude algorithm that did the job around 200us.
As I refined and tweaked the code I reduced the time to a few us (3 / 4us).

Instead, I find myself not understanding the result of using the @btime macro in the following cases.
When the avriables are defined inside the archipelago function I get a time 120 times greater than when these variables are defined in the main.
This is the how.
I hope someone will explain to me why.

 A=rand([1,0], 8,10)
 a=copy(A);

julia> function arcipelago(a)
           cidx=findall(>(0),a) 
           dcidx=Dictionary(cidx,fill(-1, length(cidx))) 
           CI=CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)]) 
           i=1
           c=findfirst(==(-1),dcidx)
           while !isnothing(c)
               cc=CartesianIndex{2}[]
               ci=[c]
               while !isempty(ci)
                   for j in ci
                           a[j]=i
                           dcidx[j]=1
                       for n in CI
                           x=j+n
                           if haskey(dcidx,x) &&  dcidx[x]==-1
                               dcidx[x]=0
                               push!(cc,x)
                           end
                       end
                   end
                   ci=cc
                   cc=CartesianIndex{2}[]
               end
               c=findfirst(==(-1),dcidx)
               i+=1
           end
           a
       end
arcipelago (generic function with 1 method)

julia> @btime arcipelago(a)
  6.180 μs (72 allocations: 7.75 KiB)
8×10 Matrix{Int64}:
 1  0  0  2  2  0  0  7  7  0
 0  2  2  2  0  0  0  0  7  7
 2  2  2  2  2  0  0  7  7  7
 2  2  0  0  2  0  0  0  0  0
 0  2  2  2  0  2  0  0  0  0
 3  0  0  2  2  2  2  0  0  9
 0  0  0  0  0  0  0  8  8  0
 4  4  0  5  0  6  6  0  8  0

cidx=findall(>(0),a) 
dcidx=Dictionary(cidx,fill(-1, length(cidx))) 
CI=CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])

julia> function arcipelago(a)
           # cidx=findall(>(0),a) 
           # dcidx=Dictionary(cidx,fill(-1, length(cidx))) 
           # CI=CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)]) 
           i=1
           c=findfirst(==(-1),dcidx)
           while !isnothing(c)
               cc=CartesianIndex{2}[]
               ci=[c]
               while !isempty(ci)
                   for j in ci
                           a[j]=i
                           dcidx[j]=1
                       for n in CI
                           x=j+n
                           if haskey(dcidx,x) &&  dcidx[x]==-1      
                               dcidx[x]=0
                               push!(cc,x)
                           end
                       end
                   end
                   ci=cc
                   cc=CartesianIndex{2}[]
               end
               c=findfirst(==(-1),dcidx)
               i+=1
           end
           a
       end
arcipelago (generic function with 1 method)

julia> 

julia> 

julia> @btime arcipelago(a)
  53.179 ns (0 allocations: 0 bytes)
8×10 Matrix{Int64}:
 1  0  0  2  2  0  0  7  7  0
 0  2  2  2  0  0  0  0  7  7
 2  2  2  2  2  0  0  7  7  7
 2  2  0  0  2  0  0  0  0  0
 0  2  2  2  0  2  0  0  0  0
 3  0  0  2  2  2  2  0  0  9
 0  0  0  0  0  0  0  8  8  0
 4  4  0  5  0  6  6  0  8  0

ericphanson · July 6, 2022, 5:25pm

That’s because global variables cause slowdowns (because the compiler needs to assume the type could change at any time). See the performance tips.

rocco_sprmnt21 · July 6, 2022, 5:26pm

there is probably something I do wrong in the measures, but the problem is that I get the fastest result with global variables.

brenhinkeller · July 6, 2022, 5:32pm

Anything you can do with global variables you can do faster without with a bit of rearranging

Maarten_Keijzer · July 6, 2022, 5:49pm

Because you have moved some code outside the function, the time setting up ‘cidx’, ‘dcidx’, and ‘CI’ are not counted by @btime in the second version of arcipelago. I think this would explain the difference in speed: the second version simply does less work.

GunnarFarneback · July 6, 2022, 6:16pm

The main problem is that @btime makes several calls to the function while timing it and the global variables are not reset between calls. In particular this goes for dcidx, causing the function to do no work at all in the calls after the first. Look into the setup functionality of BenchmarkTools and pay attention to set evals to one.

rocco_sprmnt21 · July 6, 2022, 7:58pm

I had already tested this hypothesis and it is not enough to justify the big difference. At least I think so.

@btime cidx=findall(>(0),a) # 500ns

@btime dcidx=Dictionary(cidx,fill(-1, length(cidx))) # 300 ns

@btime CI=[ CartesianIndex(0, 1),
CartesianIndex(1, 0),
CartesianIndex(0, -1),
CartesianIndex(-1, 0)]  #20ns

rocco_sprmnt21 · July 6, 2022, 7:59pm

if it is the following parameter, It seems to be that way by default already

julia> using Dictionaries, BenchmarkTools

julia> BenchmarkTools.DEFAULT_PARAMETERS.evals
1

leaving out only the definition of CI which is a read-only vector (in the sense that in the function it is not modified), even more strange numbers come out

julia> function arcipelago(a)
           cidx=findall(>(0),a)
           dcidx=Dictionary(cidx,fill(-1, length(cidx)))
           # CI=CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
           i=1
           c=findfirst(==(-1),dcidx)
           while !isnothing(c)
               cc=CartesianIndex{2}[]
               ci=[c]
               while !isempty(ci)
                   for j in ci
                           a[j]=i
                           dcidx[j]=1
                       for n in CI
                           x=j+n
                           if haskey(dcidx,x) &&  dcidx[x]==-1
                               dcidx[x]=0
                               push!(cc,x)
                           end
                       end
                   end
                   ci=cc
                   cc=CartesianIndex{2}[]
               end
               c=findfirst(==(-1),dcidx)
               i+=1
           end
           a
       end
arcipelago (generic function with 1 method)

julia> a
8×10 Matrix{Int64}:
 1  1  1  0  1  0  1  1  1  0
 1  1  1  1  0  1  0  0  1  0
 1  1  1  1  0  1  1  0  1  0
 0  0  0  0  0  0  0  0  1  0
 1  1  0  1  1  1  1  1  0  1
 1  0  0  1  1  1  1  1  1  0
 0  0  0  1  0  1  0  0  1  1
 1  1  1  0  0  1  1  1  1  1

julia> @btime arcipelago(a) #60ns
ERROR: UndefVarError: CI not defined
Stacktrace:

julia>        CI=[ CartesianIndex(0, 1),
       CartesianIndex(1, 0),
       CartesianIndex(0, -1),
       CartesianIndex(-1, 0)]
4-element Vector{CartesianIndex{2}}:
 CartesianIndex(0, 1)
 CartesianIndex(1, 0)
 CartesianIndex(0, -1)
 CartesianIndex(-1, 0)

julia> @btime arcipelago(a) #60ns
  16.300 μs (465 allocations: 21.53 KiB)
8×10 Matrix{Int64}:
 1  1  1  0  5  0  7  7  7  0
 1  1  1  1  0  6  0  0  7  0
 1  1  1  1  0  6  6  0  7  0
 0  0  0  0  0  0  0  0  7  0
 2  2  0  4  4  4  4  4  0  8
 2  0  0  4  4  4  4  4  4  0
 0  0  0  4  0  4  0  0  4  4
 3  3  3  0  0  4  4  4  4  4

GunnarFarneback · July 6, 2022, 8:31pm

I guess mutable globals don’t play well with BenchmarkTools. This is how I would do the benchmarking.

using BenchmarkTools, Dictionaries
const CI = CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
function arcipelago(a, cidx, dcidx)
    i=1
    c=findfirst(==(-1),dcidx)
    while !isnothing(c)
        cc=CartesianIndex{2}[]
        ci=[c]
        while !isempty(ci)
            for j in ci
                a[j]=i
                dcidx[j]=1
                for n in CI
                    x=j+n
                    if haskey(dcidx,x) &&  dcidx[x]==-1      
                        dcidx[x]=0
                        push!(cc,x)
                    end
                end
            end
            ci=cc
            cc=CartesianIndex{2}[]
        end
        c=findfirst(==(-1),dcidx)
        i+=1
    end
    a
end
A = rand([1, 0], 8, 10)
@btime arcipelago(a, cidx, dcidx) setup=(a=copy(A); cidx=findall(>(0),a); dcidx=Dictionary(cidx,fill(-1, length(cidx)))) evals=1

And to reiterate, all the timing artifacts are due to repeated calls where dcidx no longer is set to -1, findfirst(==(-1), dcidx) returns nothing and the while loop isn’t ever entered. You can add some debug prints to see it in action while running @btime.

rocco_sprmnt21 · July 6, 2022, 8:49pm

ok. this seems plausible. but measured in this other way which is equally reasonable gives a still different result.

using BenchmarkTools, Dictionaries
const CI = CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)])
const cidx=findall(>(0),a) 
function arcipelago(a,  dcidx)
    i=1
    c=findfirst(==(-1),dcidx)
    while !isnothing(c)
        cc=CartesianIndex{2}[]
        ci=[c]
        while !isempty(ci)
            for j in ci
                a[j]=i
                dcidx[j]=1
                for n in CI
                    x=j+n
                    if haskey(dcidx,x) &&  dcidx[x]==-1      
                        dcidx[x]=0
                        push!(cc,x)
                    end
                end
            end
            ci=cc
            cc=CartesianIndex{2}[]
        end
        c=findfirst(==(-1),dcidx)
        i+=1
    end
    a
end
A = rand([1, 0], 8, 10)julia> @btime arcipelago(a, dcidx) setup=(a=copy(A); dcidx=Dictionary(cidx,fill(-1, length(cidx)))) evals=1
  5.800 μs (75 allocations: 6.33 KiB)
8×10 Matrix{Int64}:

GunnarFarneback · July 6, 2022, 9:00pm

I’m not sure what you are comparing now but I don’t see any significant difference between my proposal and your latest post (after some minor fixes to make it run).

rocco_sprmnt21 · July 6, 2022, 9:02pm

Yes, of course.
the difference is that I put out the other constant vector and used the scheme you propose.
I regretted that it is not easy to make the correct measurements

GunnarFarneback · July 6, 2022, 9:04pm

Sorry, I mean I don’t see any significant difference in reported speed between those two implementations.

rocco_sprmnt21 · July 6, 2022, 9:08pm

you are right. I had done a test and I thought I was seeing 13us and not 5 / 6us.

PS
for future reference I did a test as you suggested, putting a debug print in a “strategic” position


function arcipelago(a)
    # cidx=findall(>(0),a) 
    dcidx=Dictionary(cidx,fill(-1, length(cidx))) 
    # CI=CartesianIndex.([(0,1), (1,0), (0,-1), (-1,0)]) 
    i=1
    c=findfirst(==(-1),dcidx)
    while !isnothing(c)
        cc=CartesianIndex{2}[]
        ci=[c]
        while !isempty(ci)
            for j in ci
                    a[j]=i
                    dcidx[j]=1
                for n in CI
                    x=j+n
                    if haskey(dcidx,x) &&  dcidx[x]==-1      
                        dcidx[x]=0
                        push!(cc,x)
                    end
                end
            end
            ci=cc
            cc=CartesianIndex{2}[]
        end
        c=findfirst(==(-1),dcidx)
        i+=1
        **println(i)**
    end
    a
end

fabiangans · July 7, 2022, 6:43am

I know this was a performance question, but are you aware of Docstrings · ImageMorphology.jl this?

rocco_sprmnt21 · July 7, 2022, 10:15am

Yes.
my problem derives from the attempt to “emulate” what a package function does after reading the first post of this discussion.
I found the problem really inspiring and wanted to try and see if I could solve it.
Then I tried to improve (in terms of running time),
making some progress and experimenting / learning various things as I changed some loops or supporting structures of the algorithm.
PS
In reality, as I imagined, the problem here was more of a measure of performance than of performance in the strict sense.

Topic		Replies	Views
@time and @btime report different values? General Usage	17	691	May 1, 2022
Running small functions bundled in an outer function taking twice the time of running them separately Performance benchmarktools	7	1056	April 2, 2020
Allocations and time of running the same program twice are orders of magnitude larger than running them separately New to Julia performance , memory-allocation , benchmark	11	662	June 17, 2024
Push! versus preallocation New to Julia	17	2785	June 11, 2020
Function calls in global scope, benchmarking, etc General Usage question	6	2556	August 14, 2019

How to make a function 120 times faster

Related topics