Memory reuse problem in for loop

I was implementing an algorithm to detect edges in RGB images when I noticed very high memory usage:

function coloredge(img::Matrix{RGB{T}})::Matrix{T} where T <: AbstractFloat
    Sy, Sx = size(img)
    
    z = zeros(T, Sy, Sx)
    
    for x=2:Sx-1
        for y=2:Sy-1
            ∂img∂x = img[y, x+1] - img[y, x-1]
            ∂img∂y = img[y+1, x] - img[y-1, x]
            
            
            u = [red(∂img∂x), green(∂img∂x), blue(∂img∂x)]
            v = [red(∂img∂y), green(∂img∂y), blue(∂img∂y)]
            
            g_xx = u'*u
            g_xy = u'*v
            g_yy = v'*v
            
            θ = 1/2 * atan(2*g_xy/(g_xx - g_yy))
            
            if isnan(θ)
                θ = 1/2 * π/2
            end
            
            F_θ = sqrt(max(0, 1/2 * ((g_xx + g_yy) + (g_xx - g_yy)*cos(2*θ) + 2*g_xy*sin(2*θ))))
            
            z[y, x] = F_θ
        end
    end

    return z
end     

I tested it on testimage("mandrill") using @btime and got

  51.996 ms (520202 allocations: 57.56 MiB)

Then I changed the code to

function coloredge_noalloc(img::Matrix{RGB{T}})::Matrix{T} where T <: AbstractFloat
    Sy, Sx = size(img)
    
    z = zeros(T, Sy, Sx)
    
    u = Vector{T}(undef, 3)
    v = Vector{T}(undef, 3)
    
    for x=2:Sx-1
        for y=2:Sy-1
            ∂img∂x =  img[y, x+1] -  img[y, x-1]
            ∂img∂y =  img[y+1, x] -  img[y-1, x]
            
            u[1] = red(∂img∂x)
            u[2] = green(∂img∂x)
            u[3] = blue(∂img∂x)
            
            v[1] = red(∂img∂y)
            v[2] = green(∂img∂y)
            v[3] = blue(∂img∂y)
            
            g_xx = u'*u
            g_xy = u'*v
            g_yy = v'*v
            
            θ = 1/2 * atan(2*g_xy/(g_xx - g_yy))
            
            if isnan(θ)
                θ = 1/2 * π/2
            end
            
            F_θ = sqrt(max(0, 1/2 * ((g_xx + g_yy) + (g_xx - g_yy)*cos(2*θ) + 2*g_xy*sin(2*θ))))
            
            
            z[y, x] = F_θ
        end
    end

    return z
end           

and now it performs as expected:

  32.185 ms (4 allocations: 2.00 MiB)

Why do I have to do it component-wise? I tried u .= [red(∂img∂x), green(∂img∂x), blue(∂img∂x)], but that did not work, and neither did u = ....

(Required imports: Images, TestImages, BenchmarkTools.)

The thing is that those lines create a new mutable vector (on the right-hand side) at every iteration.
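A minimal Base-only sketch of the difference (the function names here are made up for illustration), comparing both patterns with @allocated:

```julia
# A Vector literal on the right-hand side builds a new heap array on every call.
make_fresh() = [1.0, 2.0, 3.0]

# Broadcasting into a preallocated buffer writes in place instead.
const buf = Vector{Float64}(undef, 3)
fill_buf!(u) = (u .= (1.0, 2.0, 3.0); nothing)

make_fresh(); fill_buf!(buf)    # warm up, so compilation is not measured

@allocated(make_fresh())        # > 0: a new heap array each call
@allocated(fill_buf!(buf))      # should be 0: writes in place
```

Multiply the per-call allocation by one call per pixel and you get the tens of MiB in the first benchmark.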

That can be solved using StaticArrays and doing

u = @SVector [red(...), green(...), blue(...)]

(Or sometimes just a tuple may suffice, depending on how you use it afterwards.)
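With plain tuples the dot products just need a different spelling, since adjoint is not defined for tuples (a minimal sketch, with made-up values):

```julia
u = (1.0, 2.0, 3.0)
v = (4.0, 5.0, 6.0)

# Broadcasting over tuples yields another tuple, so nothing is heap-allocated:
g_xy = sum(u .* v)   # elementwise multiply, then sum: 32.0
```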


Thank you very much for the swift response! Using a Tuple worked for me (and requires no additional packages).

P.S. I forgot to mention that I am running Julia 1.6 (in a Jupyter notebook).


I’m intrigued. Did you manage the adjoint and the product by hand there?

julia> u = (1,2,3)
(1, 2, 3)

julia> v = (1,2,3)
(1, 2, 3)

julia> u' * v
ERROR: MethodError: no method matching adjoint(::Tuple{Int64, Int64, Int64})

This is the kind of thing that works with static arrays:

julia> using StaticArrays

julia> u = @SVector [1,2,3];

julia> v = @SVector [1,2,3];

julia> u' * v
14


I preallocated u outside the for-loop

u = Vector{Float64}(undef, 3)

and inside the for loop I used .=, e.g.

u .= (1, 2, 3)
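To spell out the whole pattern (a small sketch): once u and v are ordinary Vectors filled in place, the adjoint-product syntax from the original code works again, unlike with bare tuples:

```julia
u = Vector{Float64}(undef, 3)
v = Vector{Float64}(undef, 3)

u .= (1.0, 2.0, 3.0)   # fill the preallocated buffers in place
v .= (1.0, 2.0, 3.0)

u' * v                 # 14.0; adjoint is defined for Vector, unlike for Tuple
```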

Ah, ok. Note that the static-arrays version will probably be faster (not sure by how much, or whether that will be relevant for your case).


Can confirm! This was about twice as fast as the tuple version:

using StaticArrays

function coloredge_s(img::Matrix{RGB{T}})::Matrix{T} where T <: AbstractFloat
    Sy, Sx = size(img)
    
    z = zeros(T, Sy, Sx)
    
    u = SVector{3, T}(0,0,0)
    v = SVector{3, T}(0,0,0)

    for x=2:Sx-1
        for y=2:Sy-1
            ∂img∂x = img[y, x+1] - img[y, x-1]
            ∂img∂y = img[y+1, x] - img[y-1, x]
            
            
            u =  @SVector [red(∂img∂x), green(∂img∂x), blue(∂img∂x)]  
            v =  @SVector [red(∂img∂y), green(∂img∂y), blue(∂img∂y)]
            
            g_xx = u'*u
            g_xy = u'*v
            g_yy = v'*v
            
            θ = 1/2 * atan(2*g_xy/(g_xx - g_yy))
            
            if isnan(θ)
                θ = 1/2 * π/2
            end
            
            F_θ = sqrt(max(0, 1/2 * ((g_xx + g_yy) + (g_xx - g_yy)*cos(2*θ) + 2*g_xy*sin(2*θ))))
            
            z[y, x] = F_θ
        end
    end

    return z
end           

In this case you don’t need to define the vectors before the loop. Static vectors are like plain numbers: you get a new one every time you bind one to a variable.


Out of curiosity: what is the reason, then, that these arrays avoid the huge memory allocations of regular Arrays? Does the compiler figure out how to optimize them, as it does with the scalar variables I compute in every iteration?

Yes, exactly. The compiler does not figure out on its own that the regular array is created inside the loop and unused afterwards, and therefore that it does not need to be an actually mutable array; by using static arrays you tell it that. When the array lives only in the scope of the loop, it does not need to be allocated in the (slow) heap memory at all: it can live on the stack or in processor registers, or may not be created at all if the compiler can simplify the sequence of operations.
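The same effect can be seen in Base alone, with a hand-rolled immutable type (a sketch of essentially what an SVector{3} is; the names here are made up): a fresh immutable value per iteration never touches the heap.

```julia
struct Vec3            # immutable and fixed-size: can live on the stack / in registers
    x::Float64
    y::Float64
    z::Float64
end

dot3(a::Vec3, b::Vec3) = a.x * b.x + a.y * b.y + a.z * b.z

function sum_norms(n)
    acc = 0.0
    for i in 1:n
        v = Vec3(float(i), 2.0, 3.0)   # a "new" Vec3 every iteration, yet no allocation
        acc += dot3(v, v)
    end
    acc
end

sum_norms(10)                # warm up (compilation itself allocates)
@allocated sum_norms(1000)   # should be 0 bytes
```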

In principle the compiler probably could figure that out and optimize the array away itself; why it does not in Julia is beyond my understanding.

More generally, I have these notes, written when I was learning this stuff.


My non-expert understanding is that the compiler just doesn’t know enough about arrays to perform this level of optimization on them. Certainly it’s much harder to handle escape analysis for a mutable reference type like Array than for an immutable value type (e.g. Int). That said, there has been more talk of working on escape analysis recently, so perhaps this will all be out of date soon 🙂
