@simd with array views

performance
simd

#1

Hello everyone and Happy New Year!

I have a decrease in performance in my cross-correlation function when I use array views. I am comparing functions without views, with views and one with the classical imfilter from ImageFiltering.jl:

using ImageFiltering, Statistics, BenchmarkTools

# basically, slide the template over the padded image and calculate the correlation between template and a slice of the image under it.
function crossCorrelation_raw!(out::Array{T}, image::Array{T,},
        template::Array{T, N}, outsize::Tuple{Int64, Int64}) where {T,}
        
        size_template = size(template)
    
        @inbounds for i2=1:outsize[2], i1=1:outsize[1]
            
            # this is the slice of the image under the template for the current iteration
            subImage = image[i1:i1+size_template[1]-1, i2:i2+size_template-1]
            
            # caulculate the product for each position and sum them all up, then write to out
            s = dot_simd(template, subImage)
            @inbounds out[i1, i2] += s
        end
    return out
end

# the same thing as above, only that subImage is now a view, template is also view passed as an argument in the wrapper function func_view
function crossCorrelation_raw_view!(out::Array{T}, image::Array{T},
        template::AbstractArray{T}, outsize::Tuple{Int64, Int64}) where {T}
        
        template_size = size(template)
        
        @inbounds for i2=1:outsize[2], i1=1:outsize[1]
            
            subImage = view(image, i1:i1+template_size[1]-1, i2:i2+template_size[2]-1)
            @inbounds out[i1, i2] += dot_simd(template, subImage)
        end
    return out
end

@inline function dot_simd(x::AbstractArray{T}, y::AbstractArray{T}) where {T}
    s = 0.0::T
    @simd for i in eachindex(x)
        @inbounds s += x[i] * y[i]
    end
    s
end
 
# just a wrapper function for the non-views version. Takes the images, prepares image by padding and allocates the output
function func(image::Array{T}, template::Array{T}) where {T}
     
    size_template = size(template)
    
    image_pad = padarray(image, Fill(0.0::T, (size_template[1]-1, size_template[2]-1)))
    image_pad = Array{T, 2}(image_pad[Base.axes(image_pad)[1], Base.axes(image_pad)[2]]);
    outsize = size(image) .+ size_template.-1
    out = zeros(T, outsize)
    crossCorrelation_raw!(out, image_pad, template, outsize);
end

# just a wrapper function for the views version. Takes the input images, prepares image by padding and allocates the output
function func_view(image::Array{T, N}, template::Array{T, N}) where {T, N}
    
    size_template = size(template)
    template_view = view(template, :, :)
    
    image_pad = padarray(image, Fill(0.0::T, (size_template[1]-1, size_template[2]-1)))
    image_pad = Array{T, 2}(image_pad[Base.axes(image_pad)[1], Base.axes(image_pad)[2]]);
    outsize = size(image) .+ size_template.-1
    out = zeros(T, outsize)
    
    crossCorrelation_raw_view!(out, image_pad, template_view, outsize);
end

# the imfilter version from ImageFiltering.jl
function correlate_image(img::Array{T, N}, template::Array{T, N}) where {T, N}
    img = padarray(img, Fill(0, (Int64(round((size(template)[1]-1)/2)), Int64(round((size(template)[2]-1)/2)))))
    z = imfilter(T, img, centered(template), Algorithm.FIR())
end 

Benchmark:

#creating two images to correlate and a 
n = 100
im1 = rand(1.0:2.0, n, n)
im2 = copy(im1);

@btime func(im1, im2)
@btime func_view(im1, im2)
@btime correlate_image(im1, im2)
w/o views: 702.911 ms (79212 allocations: 2.96 GiB)
w/ views: 1.882 s (12 allocations: 2.34 MiB)
imfilter: 616.570 ms (46 allocations: 1.85 MiB)

Although there are only 12 allocs for the views version, it takes almost 3 times longer to run. Why is that? w/o views allocates almost 3 GiB of data and is still so much faster than the views function…

Some notes:

  • adding @inline to dot_simd enormously decreased the allocations. Which is good I guess? Only after doing this and reading more about @inline, I got that @simd goes with @inline.
  • I benchmarked dot_simd with both arrays and views and the performance was the same. So dot_simd should not be the problem here.
  • If someone would be kind to check whether my type declarations are “the Julia way” and if they are properly set. Maybe some of them are unnecessary?

Thanks!

Edit: forgot to include the definition of correlate_image().


#2

Hi,

Note that to perform SIMD instructions (single instruction, multiple data), the CPU likes to load contiguous vectors of numbers from the underlying array, and perform the same operation on them.
When you copy an array to make a new one, then all the underlying data is contiguous, so it can be loaded quickly.

However, when you use views, there is no such guarantee.
If you’re comfortable looking at assembly, check this out

julia> @inline function dot_simd(x::AbstractArray{T}, y::AbstractArray{T}) where {T}
           s = 0.0::T
           @simd for i in eachindex(x)
               @inbounds s += x[i] * y[i]
           end
           s
       end
dot_simd (generic function with 1 method)

julia> X = rand(100,100);

julia> Y = rand(100,100);

julia> Xv = @view X[1:100,1:100];

julia> Yv = @view Y[1:100,1:100];

julia> @code_native dot_simd(X, Y)

Here, we see this as the primary loop body:

L80:
	vmovupd	(%rcx,%rdi,8), %ymm4
	vmovupd	32(%rcx,%rdi,8), %ymm5
	vmovupd	64(%rcx,%rdi,8), %ymm6
	vmovupd	96(%rcx,%rdi,8), %ymm7
; ││└
; ││┌ @ float.jl:395 within `+'
	vfmadd231pd	(%rdx,%rdi,8), %ymm4, %ymm0
	vfmadd231pd	32(%rdx,%rdi,8), %ymm5, %ymm1
	vfmadd231pd	64(%rdx,%rdi,8), %ymm6, %ymm2
	vfmadd231pd	96(%rdx,%rdi,8), %ymm7, %ymm3
; ││└
; ││ @ simdloop.jl:74 within `macro expansion'
; ││┌ @ int.jl:53 within `+'
	addq	$16, %rdi
	cmpq	%rdi, %rsi
	jne	L80

It says move 4 vectors into ymm registers (these registers are 256 bits, meaning move 4 * 4, or 16 doubles total), and then perform fused multiply add instructions (vfmadd) on another region of memory (the other input array) accumulate in other registers. Then perform a comparison, and conditionally jump back to location L80 (start of the loop).
This is what it means for the compiler to “vectorize” the loop.

Unfortunately, this vectorization requires loading contiguous blocks of memory with those vmovupd instructions. This is guaranteed with a regular old array, where A[i+1] always follows A[i], but not in a view, where who-knows how far apart they can be.
We therefore don’t see any such vectorized portion when we do @code_native dot_simd(Xv, Yv). You can also use BenchmarkTools to confirm that, even though both arrays are actually contiguous in this case, the function on the guaranteed-contiguous arrays is much faster.

EDIT:
If you’d like to recover some of the performance, here is a simple way:

julia> @inline function dot_simd(x::SubArray{T,2}, y::SubArray{T,2}) where {T}
           s = 0.0::T
           for i in 1:size(x,2)
               s += @views dot_simd(x[:,i], y[:,i])
           end
           s
       end
dot_simd (generic function with 2 methods)

These views, because they are not across columns and because the underlying arrays are dense, are guaranteed to be contiguous.

Benchmarks on Ryzen:

julia> @benchmark dot_simd($X, $Y)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.234 μs (0.00% GC)
  median time:      1.236 μs (0.00% GC)
  mean time:        1.242 μs (0.00% GC)
  maximum time:     3.361 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark dot_simd($Xv, $Yv)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.314 μs (0.00% GC)
  median time:      2.316 μs (0.00% GC)
  mean time:        2.337 μs (0.00% GC)
  maximum time:     6.536 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

and with the new specialized method for matrix-views:

julia> @benchmark dot_simd($Xv, $Yv)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.630 μs (0.00% GC)
  median time:      1.659 μs (0.00% GC)
  mean time:        1.682 μs (0.00% GC)
  maximum time:     2.511 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

EDIT:
Yikes! In @code_native dot_simd(im2_view, im2_view)
There is not a 4x unrolled block, but there is a vectorized 2x unrolled block, which I’d think really ought to be fine.

L224:
	vmovupd	-32(%rdx,%rbp,8), %ymm3
	vmovupd	(%rdx,%rbp,8), %ymm4
; ││└└└
; ││ @ simdloop.jl:73 within `macro expansion' @ float.jl:395
	vfmadd231pd	-32(%rdi,%rbp,8), %ymm3, %ymm0
	vfmadd231pd	(%rdi,%rbp,8), %ymm4, %ymm2
; ││ @ simdloop.jl:74 within `macro expansion'
; ││┌ @ int.jl:53 within `+'
	addq	$8, %rbp
	cmpq	%rbp, %rbx
	jne	L224

#3

Thank you @Elrod for your reply and insights. Especially for the detailed explanation of the code_native output.

I just benchmarked it on my PC:

n = 10000
im1 = rand(1.0:2.0, n, n)
im1_view = view(im1, :,:);

@btime dot_simd(im1, im1)
@btime dot_simd(im1_view, im1_view)
@btime dot_simd_new(im1_view, im1_view) # this is wwith your added dot_simd version

  43.473 ms (1 allocation: 16 bytes)
  42.917 ms (1 allocation: 16 bytes)
  42.961 ms (1 allocation: 16 bytes)

In all cases, it is comparable and there is no trouble with vectorization. Is this is what you meant by

even though both arrays are actually contiguous in this case

So for some reason, dot_simd is also vectorizing the views (I also checked and compared the code_native output). But why do I see such a discrepancy when I use dot_simd in the cross-correlation.

Maybe in this specific case, dot_simd is not the problem but something else? Sorry if I am being a hassle…


#4

OK, I added your dot_simd to crossCorrelation_raw_view():

Benchmark:

n = 100
im1 = rand(1.0:2.0, n, n)
im2 = copy(im1)

@btime func(im1, im2)
@btime func_view(im1, im2)
@btime correlate_image(im1, im2)

624.448 ms (79212 allocations: 2.96 GiB)
130.391 ms (39612 allocations: 4.75 MiB)
540.183 ms (46 allocations: 1.85 MiB)

It did solve the issue. Simd and views are compatible. But I am just wondering the same as above: Testing dot_simd with arrays or views gives the same performance. So why does it not keep that performance in the parent crossCorrelation_raw/_view functions?

Either way, This seems to work now and I am glad that it does. Hope I can translate it to some other simd functions I have. Thanks a lot again!


#5

There are two factors at play here.

  1. Memory bandwidth
  2. SubArrays are smart.

: has a different type from a:b. Therefore, when you create a view with : for the rows, the SubArray dispatch knows the memory is actually contiguous:

julia> n = 224;

julia> im1 = rand(1.0:2.0, n, n);

julia> im1_view = @view im1[:,:];

julia> im2_view = @view im1[1:end,:];

julia> IndexStyle(typeof(im1_view))
IndexLinear()

julia> IndexStyle(typeof(im2_view))
IndexCartesian()

In fact, when we use @code_native, we can see that calling on im1_view is vectorized, but that im2_view is not. im2_view is like the view in your function.
However, that change alone does not solve the issue. That is because, when you choose n=10000, the speed is constrained by memory bandwidth, not by CPU speed. Both are using the same amount of data, so it takes the same time to move memory around.

Shrinking n lets CPU time become the more important factor. I chose n=224 because 224 is the second largest number divisible by 32 and smaller than sqrt(512e3/8). The L2 cache on the computer I’m benchmarking on is 512e3, so this should fit into the cache of a single core. Second largest for a little head room (I think the 512 KiB is for data and instruction cache? Or does the instruction cache go straight to L1?).
This yields:

julia> @btime dot_simd($im1, $im1)
  6.124 μs (0 allocations: 0 bytes)
125230.0

julia> @btime dot_simd($im1_view, $im1_view)
  6.115 μs (0 allocations: 0 bytes)
125230.0

julia> @btime dot_simd_new($im1_view, $im1_view)
  6.811 μs (0 allocations: 0 bytes)
125230.0

julia> @btime dot_simd($im2_view, $im2_view)
  8.493 μs (0 allocations: 0 bytes)
125230.0

julia> @btime dot_simd_new($im2_view, $im2_view)
  6.795 μs (0 allocations: 0 bytes)
125230.0

Note the difference in types of the two views:

julia> typeof(im1_view)
SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Base.Slice{Base.OneTo{Int64}}},true}

julia> typeof(im2_view)
SubArray{Float64,2,Array{Float64,2},Tuple{UnitRange{Int64},Base.Slice{Base.OneTo{Int64}}},false}

You could dispatch to the original vs the new dot_simd based on whether the last parameter is true or false.


#6

Thanks again for all the details. Turns out even more complicated then I anticipated. Even the small, but important, detail of the different types from : and a:b.