Optimizing iteration over slices of multiple matrices

question
array

#1

I’m trying to optimize some of my code. A common pattern is performing an operation column-wise or row-wise across multiple arrays. Here is a very simplified piece of code:

using BenchmarkTools
using LinearAlgebra

function foo!(result::AbstractArray, p1::AbstractArray, p2::AbstractArray)
    dp = dot(p1, p2)
    result .= dp .* p1 .+ p2
end

function bar!(result::AbstractArray, array::AbstractArray)
    p1 = array[:,1]
    for i in 1:size(array, 2)
        foo!(view(result, :, i), p1, view(array, :, i))
    end
end

function test2()
    array = randn(3, 1000)
    result = zeros(3, 1000)
    @benchmark bar!($result, $array)
end

I have a lot of foo!s and I really would like to keep them simple (they are also used in other contexts). Many of foo!s are very fast (like the one I’ve written here) so I’d like to remove allocations that happen in bar! but I can’t see how I can do this. Do you have any tips?


#2

If the dimension, 3 in your example, is always small, I would consider using an array of SVector from StaticArrays.jl instead of using a matrix for storage. It would make many operations more natural (vector operations) and would not allocate slices. You would probably see improved performance in foo as well.


#3

Hello,

As far as I am aware views() are not “free” but have usually an allocation overhead of a few bytes, see for example this stackoverflow thread. In you example I see 2001 allocations in the @benchmark output which would be consistent with allocating the space for the two views in each loop iteration in bar!() and p1 once. So this would be my guess.


#4

One of the dimensions is usually small but I don’t think SArray would work for the output array. Now I’m thinking about making a mutable SubArray (so I can just change the index during iteration).

It’s a bit irritating that this problem could be solved by basic pointer arithmetic in C.


#5

Yes, views do allocate and that’s the problem. I’ve prepared mutable SubArrays and I’m changing offset while iterating and it works great:

function bar!(result::AbstractArray, array::AbstractArray)
    p1 = array[:,1]

    rv = mview(result, :, 1)
    av = mview(array, :, 1)
    for i in 1:size(array, 2)
        rv.offset1 = i
        av.offset1 = i
        foo!(rv, p1, av)
    end
end

where mview is like view but returns a mutable SubArray. 3 allocations in total and it’s quite fast. But there are no mutable views in Base so there is probably something wrong with this solution?


#6
julia> using UnsafeArrays

julia> @inline function idot(x::AbstractArray{T},y::AbstractArray{T}) where T
           out = zero(T)
           @inbounds @simd for i ∈ eachindex(x,y)
               out += x[i] * y[i]
           end
           out
       end
idot (generic function with 1 method)

julia> @inline function foo!(result::AbstractArray, p1::AbstractArray, p2::AbstractArray)
           dp = idot(p1, p2)
           @inbounds @fastmath for i ∈ eachindex(result, p1, p2)
               result[i] = dp * p1[i] + p2[i]
           end
           result
       end
foo! (generic function with 1 method)

julia> function bar!(result::AbstractArray, array::AbstractArray)
           @uviews result array begin
               @views p1 = array[:,1]
               for i in 1:size(array, 2)
                   @views foo!(result[:,i], p1, array[:,i])
               end
           end
       end
bar! (generic function with 1 method)

julia> function test2()
           array = randn(3, 1000)
           result = zeros(3, 1000)
           @benchmark bar!($result, $array)
       end
test2 (generic function with 1 method)

julia> test2()
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.032 μs (0.00% GC)
  median time:      4.039 μs (0.00% GC)
  mean time:        4.220 μs (0.00% GC)
  maximum time:     8.947 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

This is 10x faster than the original code:

julia> using BenchmarkTools

julia> using LinearAlgebra

julia> function foo!(result::AbstractArray, p1::AbstractArray, p2::AbstractArray)
           dp = dot(p1, p2)
           result .= dp .* p1 .+ p2
       end
foo! (generic function with 1 method)

julia> function bar!(result::AbstractArray, array::AbstractArray)
           p1 = array[:,1]
           for i in 1:size(array, 2)
               foo!(view(result, :, i), p1, view(array, :, i))
           end
       end
bar! (generic function with 1 method)

julia> function test2()
           array = randn(3, 1000)
           result = zeros(3, 1000)
           @benchmark bar!($result, $array)
       end
test2 (generic function with 1 method)

julia> test2()
BenchmarkTools.Trial: 
  memory estimate:  93.86 KiB
  allocs estimate:  2001
  --------------
  minimum time:     42.978 μs (0.00% GC)
  median time:      45.242 μs (0.00% GC)
  mean time:        57.263 μs (17.11% GC)
  maximum time:     47.581 ms (99.82% GC)
  --------------
  samples:          10000
  evals/sample:     1

#7

The easiest thing you can do to avoid all allocations and maximize performance is to use loops, this gives you a 10X speedup.

using BenchmarkTools

function bar!(result::AbstractArray, array::AbstractArray)
   m,n = size(array)
   @views p1 = array[:,1]
   for i in 1:n
      dp = zero(eltype(array)) 
      @simd for j = 1:m
         dp += p1[j] * array[j,i]
      end 
      @simd for j = 1:m
         result[i,j] = dp * p1[j] + array[i,j]
      end
   end
end

function test3()
    array = randn(3, 1000)
    result = zeros(3, 1000)
    @benchmark bar!($result, $array)
end

The result:

julia> test3()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.032 μs (0.00% GC)
  median time:      4.069 μs (0.00% GC)
  mean time:        4.086 μs (0.00% GC)
  maximum time:     5.718 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

and the original version:

julia> test2()
BenchmarkTools.Trial:
  memory estimate:  93.86 KiB
  allocs estimate:  2001
  --------------
  minimum time:     39.515 μs (0.00% GC)
  median time:      43.877 μs (0.00% GC)
  mean time:        51.998 μs (14.24% GC)
  maximum time:     35.854 ms (99.75% GC)
  --------------
  samples:          10000
  evals/sample:     1

Edit
As @Elrod pointed out, using @simd has no benefits here because the entries are only 3, for multiples of 4, it can make difference, though.


#8

For a ridiculously fast version, try SArrays, you will get 30X speedup:

using BenchmarkTools, StaticArrays

function bar!(result::AbstractArray, array::AbstractArray)
   n = size(array)[1]
   p1 = array[1]
   for i in 1:n
      dp = dot(p1,array[i])
      result[i] = dp .* p1 .+ array[i]
   end
   reshape(reinterpret(Float64,result),3,:)
end

function test4()
    array = [@SVector randn(3) for i=1:1000]
    result = [@SVector zeros(3) for i=1:1000]
    @time bar!($result, $array)
end

with timing:

Julia > test4()

BenchmarkTools.Trial:
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     1.257 μs (0.00% GC)
  median time:      1.257 μs (0.00% GC)
  mean time:        1.266 μs (0.00% GC)
  maximum time:     2.284 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

#9

If you can transpose the arrays, you can get even better performance if you manage to get the code to vectorize:

julia> using UnsafeArrays

julia> @inline function idot(x::AbstractArray{T},y::AbstractArray{T}) where T
           out = zero(T)
           @inbounds @simd for i ∈ eachindex(x,y)
               out += x[i] * y[i]
           end
           out
       end
idot (generic function with 1 method)

julia> @inline function foo!(result::AbstractArray, p1::AbstractArray, p2::AbstractArray)
           dp = idot(p1, p2)
           @inbounds @fastmath for i ∈ eachindex(result, p1, p2)
               result[i] = dp * p1[i] + p2[i]
           end
           result
       end
foo! (generic function with 1 method)

julia> function bar1!(result::AbstractArray, array::AbstractArray)
           @uviews result array begin
               @views p1 = array[:,1]
               for i in 1:size(array, 2)
                   @views foo!(result[:,i], p1, array[:,i])
               end
           end
       end
bar1! (generic function with 1 method)

julia> function bar2!(result::AbstractArray, array::AbstractArray)
          m,n = size(array)
          @views p1 = array[:,1]
          @inbounds for i in 1:n
             dp = zero(eltype(array)) 
             @simd for j = 1:m
                dp += p1[j] * array[j,i]
             end 
             @simd for j = 1:m
                result[j,i] = dp * p1[j] + array[j,i]
             end
          end
       end
bar2! (generic function with 1 method)

julia> using Base.Cartesian: @nexprs

julia> function bar_t!(result::AbstractArray{T}, array::AbstractArray{T}) where T
           n, m = size(array)
           @views p1 = array[1,:]
           @inbounds @fastmath for i ∈ 0:8:(n-8)
               @nexprs 8 k -> dp_k = zero(T)
               for j ∈ 1:m
                   @nexprs 8 k -> dp_k += p1[j] * array[i+k,j]
               end
               for j ∈ 1:m
                   @nexprs 8 k -> result[i+k,j] = dp_k * p1[j] + array[i+k,j]
               end
           end
           @inbounds for i in n-(n%8)+1:n
              dp = zero(T) 
              for j = 1:m
                 dp += p1[j] * array[i,j]
              end 
              for j = 1:m
                 result[i,j] = dp * p1[j] + array[i,j]
              end
           end
       end
bar_t! (generic function with 1 method)

julia> N = 1000
1000

julia> array = randn(3, N);


 Pkg.update() complete 
julia> result = zeros(3, N);

julia> array_t = copy(array');

julia> result_t = zeros(N, 3);

julia> bar1!(result, array); result
3×1000 Array{Float64,2}:
 -2.61722   1.85986  -1.48172    0.819111   0.624513   2.92976  -3.0886   -1.97158   -0.80507  …  -2.2052   -4.68454  -1.46795  -0.317441   1.36027  -2.19459  -2.17556   1.87992  -1.32306   0.604871
  6.52334  -4.04671   4.89907   -1.44343   -4.97068   -4.16391   1.87026   0.585894   2.37889      4.1093    3.31065   1.81274   1.95511   -0.87607   2.51032   7.23454  -5.00863   4.08801   3.37566 
 -3.781     3.76811  -0.388579   0.903244   2.25944    1.50477  -2.76848   0.432073  -1.13699     -1.84506  -2.43655  -1.25644  -0.197228   1.33583  -1.08789  -3.9975    2.23508  -2.13335  -1.01191 

julia> bar2!(result, array); result
3×1000 Array{Float64,2}:
 -2.61722   1.85986  -1.48172    0.819111   0.624513   2.92976  -3.0886   -1.97158   -0.80507  …  -2.2052   -4.68454  -1.46795  -0.317441   1.36027  -2.19459  -2.17556   1.87992  -1.32306   0.604871
  6.52334  -4.04671   4.89907   -1.44343   -4.97068   -4.16391   1.87026   0.585894   2.37889      4.1093    3.31065   1.81274   1.95511   -0.87607   2.51032   7.23454  -5.00863   4.08801   3.37566 
 -3.781     3.76811  -0.388579   0.903244   2.25944    1.50477  -2.76848   0.432073  -1.13699     -1.84506  -2.43655  -1.25644  -0.197228   1.33583  -1.08789  -3.9975    2.23508  -2.13335  -1.01191 

julia> bar_t!(result_t, array_t); result_t'
3×1000 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 -2.61722   1.85986  -1.48172    0.819111   0.624513   2.92976  -3.0886   -1.97158   -0.80507  …  -2.2052   -4.68454  -1.46795  -0.317441   1.36027  -2.19459  -2.17556   1.87992  -1.32306   0.604871
  6.52334  -4.04671   4.89907   -1.44343   -4.97068   -4.16391   1.87026   0.585894   2.37889      4.1093    3.31065   1.81274   1.95511   -0.87607   2.51032   7.23454  -5.00863   4.08801   3.37566 
 -3.781     3.76811  -0.388579   0.903244   2.25944    1.50477  -2.76848   0.432073  -1.13699     -1.84506  -2.43655  -1.25644  -0.197228   1.33583  -1.08789  -3.9975    2.23508  -2.13335  -1.01191 

julia> using BenchmarkTools

julia> @benchmark bar1!($result, $array)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.035 μs (0.00% GC)
  median time:      4.047 μs (0.00% GC)
  mean time:        4.215 μs (0.00% GC)
  maximum time:     9.224 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark bar2!($result, $array)BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.031 μs (0.00% GC)
  median time:      3.119 μs (0.00% GC)
  mean time:        3.230 μs (0.00% GC)
  maximum time:     7.301 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> @benchmark bar_t!($result_t, $array_t)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.352 μs (0.00% GC)
  median time:      1.389 μs (0.00% GC)
  mean time:        1.424 μs (0.00% GC)
  maximum time:     4.638 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

EDIT:
I am surprised that simd helps with only 3 rows. It normally does loop unrolling that I’d have expected to hurt performance.


#10

I see, 3 entries will not benefit from @simd, at least 4, thanks. The idea of transpose is awesome. I’m excited by the level of performance Julia can deliver if one pays more attention to what they are coding. Your last version and my SArrays one are 3X faster than Intel Fortran, I can’t believe it!!


#11

Hmm.
@code_native shows that my code actually failed to vectorize how I intended it. I’m trying to improve a macro to get that to work.

My @vectorize macro needs a lot of work, but with a little finagling:

using SIMDPirates, SLEEFwrap, BenchmarkTools, Base.Cartesian

@generated function bar_tv2!(result::AbstractArray{T}, array::AbstractArray{T},::Val{NCOL}) where {T,NCOL}
    quote
        @nexprs $NCOL j -> p_j = @inbounds array[1,j]
        @vectorize $T for i ∈ 1:size(array,1)
            dp = zero(T)
            @nexprs $NCOL j -> dp += p_j * array[i,j]
            @nexprs $NCOL j -> result[i,j] = dp * p_j + array[i,j]
        end
    end
end

function bar2!(result::AbstractArray, array::AbstractArray)
   m,n = size(array)
   @views p1 = array[:,1]
   @inbounds for i in 1:n
      dp = zero(eltype(array)) 
      @simd for j = 1:m
         dp += p1[j] * array[j,i]
      end 
      @simd for j = 1:m
         result[j,i] = dp * p1[j] + array[j,i]
      end
   end
end


N = 1000
array = randn(3, N);
result = zeros(3, N);

array_t = copy(array');
result_t = zeros(N, 3);

This yields:

julia> bar_tv2!(result_t, array_t,Val(3)); result_t'
3×1000 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 -2.22506  -3.39461  -0.89401   1.37359  -1.27377   2.85605  -1.22525  -0.387945   3.23087  -0.718255  …   1.60807   -2.39425     0.113308   1.24666   -1.2275    -0.713399  -0.601649  -0.664115   0.665728
  1.60194   3.84549  -1.26712  -1.59311  -1.31202  -2.01284   1.95004  -0.276039  -4.76663  -0.187859      0.880181   2.02346    -1.44376    0.426184   0.570249  -1.27561    0.865626   0.9036    -0.543402
 -1.68125  -1.77578  -1.45913   0.29928  -1.20614   2.13511  -1.24661  -1.1421     2.60744  -0.425533     -0.961413   0.0212073   0.423453  -0.270527  -1.72279    0.281566  -1.08591    1.90826    0.918416

julia> bar2!(result, array); result
3×1000 Array{Float64,2}:
 -2.22506  -3.39461  -0.89401   1.37359  -1.27377   2.85605  -1.22525  -0.387945   3.23087  -0.718255  …   1.60807   -2.39425     0.113308   1.24666   -1.2275    -0.713399  -0.601649  -0.664115   0.665728
  1.60194   3.84549  -1.26712  -1.59311  -1.31202  -2.01284   1.95004  -0.276039  -4.76663  -0.187859      0.880181   2.02346    -1.44376    0.426184   0.570249  -1.27561    0.865626   0.9036    -0.543402
 -1.68125  -1.77578  -1.45913   0.29928  -1.20614   2.13511  -1.24661  -1.1421     2.60744  -0.425533     -0.961413   0.0212073   0.423453  -0.270527  -1.72279    0.281566  -1.08591    1.90826    0.918416


julia> @benchmark bar_tv2!($result_t, $array_t,Val(3))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     433.854 ns (0.00% GC)
  median time:      435.020 ns (0.00% GC)
  mean time:        455.177 ns (0.00% GC)
  maximum time:     668.051 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     198

julia> @benchmark bar2!($result, $array)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.025 μs (0.00% GC)
  median time:      3.029 μs (0.00% GC)
  mean time:        3.158 μs (0.00% GC)
  maximum time:     7.316 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

Now, 100x faster than the original!
Note, I’m benchmarking on a computer with avx512, so most aren’t likely to see the same benefit.

EDIT:
Cleaned up assmebly (Discourse’s syntax highlighting doesn’t agree with the auto-inserted comments):

julia> @code_native bar_tv2!(result_t, array_t,Val(3))
	.text
	pushq	%rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	subq	$56, %rsp
	movq	%rsi, 48(%rsp)
	movq	(%rsi), %r14
	movq	8(%rsi), %rax
	movq	(%rax), %r12
	movq	24(%rax), %rbx
	vmovsd	(%r12), %xmm0           # xmm0 = mem[0],zero
	vmovaps	%xmm0, 32(%rsp)
	vmovsd	(%r12,%rbx,8), %xmm0    # xmm0 = mem[0],zero
	vmovaps	%xmm0, 16(%rsp)
	movq	%rbx, %rbp
	shlq	$4, %rbp
	vmovsd	(%r12,%rbp), %xmm0      # xmm0 = mem[0],zero
	vmovapd	%xmm0, (%rsp)
	movq	%rbx, %r13
	sarq	$63, %r13
	shrq	$61, %r13
	addq	%rbx, %r13
	andq	$-8, %r13
	movq	(%r14), %r15
	movabsq	$steprange_last, %rax
	movl	$1, %edi
	movl	$8, %esi
	movq	%r13, %rdx
	callq	*%rax
	vmovapd	32(%rsp), %xmm10
	vmovapd	(%rsp), %xmm9
	vmovapd	16(%rsp), %xmm8
	testq	%rax, %rax
	jle	L288
	vbroadcastsd	%xmm10, %zmm0
	vbroadcastsd	%xmm8, %zmm1
	vbroadcastsd	%xmm9, %zmm2
	movq	24(%r14), %rcx
	movq	%rcx, %rdx
	shlq	$4, %rdx
	addq	$7, %rax
	vxorpd	%xmm3, %xmm3, %xmm3
	movq	%r12, %rsi
	movq	%r15, %rdi
	nop
L192:
	vmovupd	(%rsi), %zmm4
	vmovupd	(%rsi,%rbx,8), %zmm5
	vmovupd	(%rsi,%rbp), %zmm6
	vmovapd	%zmm4, %zmm7
	vfmadd213pd	%zmm3, %zmm0, %zmm7
	vfmadd231pd	%zmm5, %zmm1, %zmm7
	vfmadd231pd	%zmm6, %zmm2, %zmm7
	vfmadd231pd	%zmm0, %zmm7, %zmm4
	vmovupd	%zmm4, (%rdi)
	vfmadd231pd	%zmm1, %zmm7, %zmm5
	vmovupd	%zmm5, (%rdi,%rcx,8)
	vfmadd213pd	%zmm6, %zmm2, %zmm7
	vmovupd	%zmm7, (%rdi,%rdx)
	addq	$64, %rdi
	addq	$64, %rsi
	addq	$-8, %rax
	jne	L192
L288:
	movq	%rbx, %rcx
	subq	%r13, %rcx
	testq	%rcx, %rcx
	jle	L495
	vpbroadcastq	%rcx, %zmm0
	movabsq	$140535612763840, %rax  # imm = 0x7FD0FF46EEC0
	vpaddq	(%rax), %zmm0, %zmm0
	movq	%rbx, %rax
	subq	%rcx, %rax
	vpternlogd	$255, %zmm1, %zmm1, %zmm1
	vpcmpgtq	%zmm1, %zmm0, %k1
	vmovupd	(%r12,%rax,8), %zmm0 {%k1} {z}
	vpxor	%xmm1, %xmm1, %xmm1
	leaq	(%rbx,%rbx), %rcx
	leaq	(%rbx,%rax), %rdx
	vmovupd	(%r12,%rdx,8), %zmm2 {%k1} {z}
	addq	%rax, %rcx
	vmovupd	(%r12,%rcx,8), %zmm3 {%k1} {z}
	vbroadcastsd	%xmm10, %zmm4
	vfmadd231pd	%zmm0, %zmm4, %zmm1
	vbroadcastsd	%xmm8, %zmm5
	vfmadd231pd	%zmm2, %zmm5, %zmm1
	vbroadcastsd	%xmm9, %zmm6
	vfmadd231pd	%zmm3, %zmm6, %zmm1
	vfmadd213pd	%zmm0, %zmm1, %zmm4
	vmovupd	%zmm4, (%r15,%rax,8) {%k1}
	vfmadd213pd	%zmm2, %zmm1, %zmm5
	movq	24(%r14), %rcx
	leaq	(%rcx,%rax), %rdx
	vmovupd	%zmm5, (%r15,%rdx,8) {%k1}
	vfmadd213pd	%zmm3, %zmm6, %zmm1
	leaq	(%rax,%rcx,2), %rax
	vmovupd	%zmm1, (%r15,%rax,8) {%k1}
	movabsq	$140535856324616, %rax  # imm = 0x7FD10DCB6008
	addq	$56, %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	vzeroupper
	retq
L495:
	movabsq	$140535856324616, %rax  # imm = 0x7FD10DCB6008
	addq	$56, %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	vzeroupper
	retq

a big pile of zmm (512 bit) registers and vfmadd instructions.

EDIT:
@Seif_Shebl, did you try Intel Fortran on the version with transposed array and result (ie, Nx3 instead of 3xN)?
In earlier experimenting, both ifort and flang sometimes managed to correctly vectorize such for loops. Sometimes gfortran can do it to, but it often involves playing with flags like turning -fdisable-tree-cunrolli on or off. The number of loops I tried on isn’t too large, but that llvm-based flang worked on an example loop I tried earlier while Julia did not (without @vectorize) surprised me.
On that note, I’ve been meaning to try and find the time to play with Intel’s [Single Program Multiple Data] Program Compiler (ispc) compiler, which is also LLVM-based. Here is a fascinating series of blog posts about it by the creator. From the sound of it, this is the sort of thing it should be well optimized for – although the author wishes some bigger project would work to better support it.

My success has actually been pretty low with Julia - but the ability to write complicated transformative macros means it’s actually the most robustly capable of loop vectorization in practice (for me).

And SIMD intrinsics are so much easier in Julia, even without macros.

I really need to refactor the code base behind my @vectorize macro to make it easier to work with and extend. It is fairly limited at the moment. It would be easier to add features / support more types of loops with a different code organization.

EDIT:

Actually, both Flang and gfortran do a great job on this problem without any tricks.
The Fortran implementation:

module vectorization_test

implicit none

contains

subroutine bar(result, array, N)
    integer(8),                 intent(in)  ::  N
    real(8),  dimension(N,3),   intent(in)  ::  array
    real(8),  dimension(N,3),   intent(out) ::  result

    integer(8)                              ::  i
    real(8)                                 ::  p1(3), dp

    p1 = array(1,:)

    do concurrent (i = 1:N)
        dp = sum(p1 * array(i,:))
        result(i,:) = dp * p1 + array(i,:)
    end do

end subroutine bar

end module vectorization_test

and compiler versions (latest releases):

$ gfortran --version
GNU Fortran (GCC) 8.2.1 20181215 (Red Hat 8.2.1-6)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ flang --version
clang version 7.0.1 (https://github.com/flang-compiler/flang-driver.git 24bd54da5c41af04838bbe7b68f830840d47fc03) (https://github.com/flang-compiler/llvm.git 84335cded2f2cf9de26a77194835094fc75f7bb7)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/bin

Compiling with

$ gfortran -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC vectorization_test.f90 -o libgfortran_vectorization_test.so
$ flang -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC vectorization_test.f90 -o libflang_vectorization_test.so
julia> const test_dir = "/home/chriselrod/Documents/progwork/fortran"
"/home/chriselrod/Documents/progwork/fortran"

julia> const flang_lib = joinpath(test_dir, "libflang_vectorization_test.so")
"/home/chriselrod/Documents/progwork/fortran/libflang_vectorization_test.so"

julia> const gfort_lib = joinpath(test_dir, "libgfortran_vectorization_test.so")
"/home/chriselrod/Documents/progwork/fortran/libgfortran_vectorization_test.so"

julia> function flang_test(results, array, N)
           ccall(
               (:vectorization_test_bar_, flang_lib), # function and library
               Cvoid, # return type
               (Ptr{Float64},Ptr{Float64},Ptr{Int64}), # argument types
               results, array, N # arguments
           )
       end
flang_test (generic function with 1 method)

julia> function gfortran_test(results, array, N)
           ccall(
               (:__vectorization_test_MOD_bar, gfort_lib), # function and library
               Cvoid, # return type
               (Ptr{Float64},Ptr{Float64},Ptr{Int64}), # argument types
               results, array, N # arguments
           )
       end
gfortran_test (generic function with 1 method)

julia> Nr = Ref(N)
Base.RefValue{Int64}(1000)

julia> fill!(result_t, 0);

julia> flang_test(result_t, array_t, Nr); result_t' # matches the earlier Julia results
3×1000 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 -2.22506  -3.39461  -0.89401   1.37359  -1.27377   2.85605  -1.22525  -0.387945   3.23087  -0.718255  …   1.60807   -2.39425     0.113308   1.24666   -1.2275    -0.713399  -0.601649  -0.664115   0.665728
  1.60194   3.84549  -1.26712  -1.59311  -1.31202  -2.01284   1.95004  -0.276039  -4.76663  -0.187859      0.880181   2.02346    -1.44376    0.426184   0.570249  -1.27561    0.865626   0.9036    -0.543402
 -1.68125  -1.77578  -1.45913   0.29928  -1.20614   2.13511  -1.24661  -1.1421     2.60744  -0.425533     -0.961413   0.0212073   0.423453  -0.270527  -1.72279    0.281566  -1.08591    1.90826    0.918416

julia> fill!(result_t, 0);

julia> gfortran_test(result_t, array_t, Nr); result_t'
3×1000 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 -2.22506  -3.39461  -0.89401   1.37359  -1.27377   2.85605  -1.22525  -0.387945   3.23087  -0.718255  …   1.60807   -2.39425     0.113308   1.24666   -1.2275    -0.713399  -0.601649  -0.664115   0.665728
  1.60194   3.84549  -1.26712  -1.59311  -1.31202  -2.01284   1.95004  -0.276039  -4.76663  -0.187859      0.880181   2.02346    -1.44376    0.426184   0.570249  -1.27561    0.865626   0.9036    -0.543402
 -1.68125  -1.77578  -1.45913   0.29928  -1.20614   2.13511  -1.24661  -1.1421     2.60744  -0.425533     -0.961413   0.0212073   0.423453  -0.270527  -1.72279    0.281566  -1.08591    1.90826    0.918416

These get the same answer, yet are very fast:

julia> @benchmark gfortran_test($result_t, $array_t, $Nr)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     420.372 ns (0.00% GC)
  median time:      421.322 ns (0.00% GC)
  mean time:        436.942 ns (0.00% GC)
  maximum time:     647.241 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     199

julia> @benchmark flang_test($result_t, $array_t, $Nr)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     419.518 ns (0.00% GC)
  median time:      420.523 ns (0.00% GC)
  mean time:        438.599 ns (0.00% GC)
  maximum time:     668.719 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     199

My Flang is using LLVM-7. I don’t know what Flang’s front end is doing, but it seems like Julia ought to achieve the same performance and ease of writing code, without any fancy macros.

EDIT:
I’m less familiar with ifort, but it gets the same performance too:

julia> const ifort_lib = joinpath(test_dir, "libifort_vectorization_test.so")
"/home/chriselrod/Documents/progwork/fortran/libifort_vectorization_test.so"

julia> function ifort_test(results, array, N)
           ccall(
               (:vectorization_test_mp_bar_, ifort_lib), # function and library
               Cvoid, # return type
               (Ptr{Float64},Ptr{Float64},Ptr{Int64}), # argument types
               results, array, N # arguments
           )
       end
ifort_test (generic function with 1 method)

julia> fill!(result_t, 0);

julia> ifort_test(result_t, array_t, Nr); result_t'
3×1000 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 -2.22506  -3.39461  -0.89401   1.37359  -1.27377   2.85605  -1.22525  -0.387945   3.23087  -0.718255  …   1.60807   -2.39425     0.113308   1.24666   -1.2275    -0.713399  -0.601649  -0.664115   0.665728
  1.60194   3.84549  -1.26712  -1.59311  -1.31202  -2.01284   1.95004  -0.276039  -4.76663  -0.187859      0.880181   2.02346    -1.44376    0.426184   0.570249  -1.27561    0.865626   0.9036    -0.543402
 -1.68125  -1.77578  -1.45913   0.29928  -1.20614   2.13511  -1.24661  -1.1421     2.60744  -0.425533     -0.961413   0.0212073   0.423453  -0.270527  -1.72279    0.281566  -1.08591    1.90826    0.918416

julia> @benchmark ifort_test($result_t, $array_t, $Nr)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     421.477 ns (0.00% GC)
  median time:      422.523 ns (0.00% GC)
  mean time:        443.197 ns (0.00% GC)
  maximum time:     721.663 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     199

I compiled with

$ ifort -fast -qopt-zmm-usage=high -shared -fPIC vectorization_test.f90 -o libifort_vectorization_test.so
$ ifort --version
ifort (IFORT) 19.0.1.144 20181018
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

All three Fortran compilers produce code that takes just over 420 nanoseconds, while my best Julia version took about 435 nanoseconds. So that’s pretty good, but it needed a fancy code-transforming macro, while the Fortran code was written in a (comparatively) simple and naive manner.
Even if I go “all out” with macros from base Julia, I don’t get that performance:

julia> @generated function bar_tv3!(result::AbstractArray{T}, array::AbstractArray{T},::Val{NCOL}) where {T,NCOL}
           quote
               @nexprs $NCOL j -> p_j = @inbounds array[1,j]
               @inbounds @fastmath @simd ivdep for i ∈ 1:size(array,1)
                   dp = zero(T)
                   @nexprs $NCOL j -> dp += p_j * array[i,j]
                   @nexprs $NCOL j -> result[i,j] = dp * p_j + array[i,j]
               end
           end
       end
bar_tv3! (generic function with 1 method)

julia> @benchmark bar_tv3!($result_t, $array_t,Val(3))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.091 μs (0.00% GC)
  median time:      1.101 μs (0.00% GC)
  mean time:        1.142 μs (0.00% GC)
  maximum time:     2.841 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

because it fails to vectorize across loop iterations. So, I’d call it a win for Fortran at the moment.


#12

See http://juliaarrays.github.io/StaticArrays.jl/stable/pages/api.html#Arrays-of-static-arrays-1 for conversion between static and normal arrays


#13

For the bennefit of the thread starter:

The code snippet given in that documentation throws a depreciation warning with 0.7 and therefore will probably just give an error with 1.0 (and no helpful hint at all). I guess this might be correct

function svectors(x::Matrix{Float64})
           @assert size(x,1) == 3
           reshape(reinterpret(SVector{3,Float64}, x),(size(x,2)))
       end

Edit:
On top reshape appears to be lazy:

julia> zz=zeros(3,2)
3×2 Array{Float64,2}:
 0.0  0.0
 0.0  0.0
 0.0  0.0

julia> szz=svectors(zz)
2-element reshape(reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,2}), 2) with eltype SArray{Tuple{3},Float64,1,3}:
 [0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0]

julia> collect(szz)
2-element Array{SArray{Tuple{3},Float64,1,3},1}:
 [0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0]

which might or might not be of interest.

Edit 2:
An unmerged pull request is probably the fix to the documentationn and better code anyway :slight_smile:


#14

Thank you, this is really helpful :heart:. I see that I can make my code much faster with these tricks.


#15

I really appreciate threads like these. This community is pretty unique.