# Optimizing iteration over slices of multiple matrices

I’m trying to optimize some of my code. A common pattern is performing an operation column-wise or row-wise across multiple arrays. Here is a very simplified piece of code:

``````using BenchmarkTools
using LinearAlgebra

function foo!(result::AbstractArray, p1::AbstractArray, p2::AbstractArray)
dp = dot(p1, p2)
result .= dp .* p1 .+ p2
end

function bar!(result::AbstractArray, array::AbstractArray)
p1 = array[:,1]
for i in 1:size(array, 2)
foo!(view(result, :, i), p1, view(array, :, i))
end
end

function test2()
array = randn(3, 1000)
result = zeros(3, 1000)
@benchmark bar!(\$result, \$array)
end
``````

I have a lot of `foo!`s and I really would like to keep them simple (they are also used in other contexts). Many of `foo!`s are very fast (like the one I’ve written here) so I’d like to remove allocations that happen in `bar!` but I can’t see how I can do this. Do you have any tips?

If the dimension, 3 in your example, is always small, I would consider using an array of SVector from StaticArrays.jl instead of using a matrix for storage. It would make many operations more natural (vector operations) and would not allocate slices. You would probably see improved performance in foo as well.

1 Like

Hello,

As far as I am aware views() are not “free” but have usually an allocation overhead of a few bytes, see for example this stackoverflow thread. In you example I see 2001 allocations in the `@benchmark` output which would be consistent with allocating the space for the two views in each loop iteration in `bar!()` and `p1` once. So this would be my guess.

One of the dimensions is usually small but I don’t think `SArray` would work for the output array. Now I’m thinking about making a mutable `SubArray` (so I can just change the index during iteration).

It’s a bit irritating that this problem could be solved by basic pointer arithmetic in C.

Yes, views do allocate and that’s the problem. I’ve prepared mutable `SubArray`s and I’m changing offset while iterating and it works great:

``````function bar!(result::AbstractArray, array::AbstractArray)
p1 = array[:,1]

rv = mview(result, :, 1)
av = mview(array, :, 1)
for i in 1:size(array, 2)
rv.offset1 = i
av.offset1 = i
foo!(rv, p1, av)
end
end
``````

where `mview` is like `view` but returns a mutable `SubArray`. 3 allocations in total and it’s quite fast. But there are no mutable views in Base so there is probably something wrong with this solution?

``````julia> using UnsafeArrays

julia> @inline function idot(x::AbstractArray{T},y::AbstractArray{T}) where T
out = zero(T)
@inbounds @simd for i ∈ eachindex(x,y)
out += x[i] * y[i]
end
out
end
idot (generic function with 1 method)

julia> @inline function foo!(result::AbstractArray, p1::AbstractArray, p2::AbstractArray)
dp = idot(p1, p2)
@inbounds @fastmath for i ∈ eachindex(result, p1, p2)
result[i] = dp * p1[i] + p2[i]
end
result
end
foo! (generic function with 1 method)

julia> function bar!(result::AbstractArray, array::AbstractArray)
@uviews result array begin
@views p1 = array[:,1]
for i in 1:size(array, 2)
@views foo!(result[:,i], p1, array[:,i])
end
end
end
bar! (generic function with 1 method)

julia> function test2()
array = randn(3, 1000)
result = zeros(3, 1000)
@benchmark bar!(\$result, \$array)
end
test2 (generic function with 1 method)

julia> test2()
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     4.032 μs (0.00% GC)
median time:      4.039 μs (0.00% GC)
mean time:        4.220 μs (0.00% GC)
maximum time:     8.947 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     7
``````

This is 10x faster than the original code:

``````julia> using BenchmarkTools

julia> using LinearAlgebra

julia> function foo!(result::AbstractArray, p1::AbstractArray, p2::AbstractArray)
dp = dot(p1, p2)
result .= dp .* p1 .+ p2
end
foo! (generic function with 1 method)

julia> function bar!(result::AbstractArray, array::AbstractArray)
p1 = array[:,1]
for i in 1:size(array, 2)
foo!(view(result, :, i), p1, view(array, :, i))
end
end
bar! (generic function with 1 method)

julia> function test2()
array = randn(3, 1000)
result = zeros(3, 1000)
@benchmark bar!(\$result, \$array)
end
test2 (generic function with 1 method)

julia> test2()
BenchmarkTools.Trial:
memory estimate:  93.86 KiB
allocs estimate:  2001
--------------
minimum time:     42.978 μs (0.00% GC)
median time:      45.242 μs (0.00% GC)
mean time:        57.263 μs (17.11% GC)
maximum time:     47.581 ms (99.82% GC)
--------------
samples:          10000
evals/sample:     1
``````
2 Likes

The easiest thing you can do to avoid all allocations and maximize performance is to use loops, this gives you a `10X` speedup.

``````using BenchmarkTools

function bar!(result::AbstractArray, array::AbstractArray)
m,n = size(array)
@views p1 = array[:,1]
for i in 1:n
dp = zero(eltype(array))
@simd for j = 1:m
dp += p1[j] * array[j,i]
end
@simd for j = 1:m
result[i,j] = dp * p1[j] + array[i,j]
end
end
end

function test3()
array = randn(3, 1000)
result = zeros(3, 1000)
@benchmark bar!(\$result, \$array)
end
``````

The result:

``````julia> test3()
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     4.032 μs (0.00% GC)
median time:      4.069 μs (0.00% GC)
mean time:        4.086 μs (0.00% GC)
maximum time:     5.718 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     7
``````

and the original version:

``````julia> test2()
BenchmarkTools.Trial:
memory estimate:  93.86 KiB
allocs estimate:  2001
--------------
minimum time:     39.515 μs (0.00% GC)
median time:      43.877 μs (0.00% GC)
mean time:        51.998 μs (14.24% GC)
maximum time:     35.854 ms (99.75% GC)
--------------
samples:          10000
evals/sample:     1
``````

Edit
As @Elrod pointed out, using `@simd` has no benefits here because the entries are only 3, for multiples of 4, it can make difference, though.

For a ridiculously fast version, try `SArrays`, you will get `30X` speedup:

``````using BenchmarkTools, StaticArrays

function bar!(result::AbstractArray, array::AbstractArray)
n = size(array)[1]
p1 = array[1]
for i in 1:n
dp = dot(p1,array[i])
result[i] = dp .* p1 .+ array[i]
end
reshape(reinterpret(Float64,result),3,:)
end

function test4()
array = [@SVector randn(3) for i=1:1000]
result = [@SVector zeros(3) for i=1:1000]
@time bar!(\$result, \$array)
end
``````

with timing:

``````Julia > test4()

BenchmarkTools.Trial:
memory estimate:  64 bytes
allocs estimate:  2
--------------
minimum time:     1.257 μs (0.00% GC)
median time:      1.257 μs (0.00% GC)
mean time:        1.266 μs (0.00% GC)
maximum time:     2.284 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     10
``````
3 Likes

If you can transpose the arrays, you can get even better performance if you manage to get the code to vectorize:

``````julia> using UnsafeArrays

julia> @inline function idot(x::AbstractArray{T},y::AbstractArray{T}) where T
out = zero(T)
@inbounds @simd for i ∈ eachindex(x,y)
out += x[i] * y[i]
end
out
end
idot (generic function with 1 method)

julia> @inline function foo!(result::AbstractArray, p1::AbstractArray, p2::AbstractArray)
dp = idot(p1, p2)
@inbounds @fastmath for i ∈ eachindex(result, p1, p2)
result[i] = dp * p1[i] + p2[i]
end
result
end
foo! (generic function with 1 method)

julia> function bar1!(result::AbstractArray, array::AbstractArray)
@uviews result array begin
@views p1 = array[:,1]
for i in 1:size(array, 2)
@views foo!(result[:,i], p1, array[:,i])
end
end
end
bar1! (generic function with 1 method)

julia> function bar2!(result::AbstractArray, array::AbstractArray)
m,n = size(array)
@views p1 = array[:,1]
@inbounds for i in 1:n
dp = zero(eltype(array))
@simd for j = 1:m
dp += p1[j] * array[j,i]
end
@simd for j = 1:m
result[j,i] = dp * p1[j] + array[j,i]
end
end
end
bar2! (generic function with 1 method)

julia> using Base.Cartesian: @nexprs

julia> function bar_t!(result::AbstractArray{T}, array::AbstractArray{T}) where T
n, m = size(array)
@views p1 = array[1,:]
@inbounds @fastmath for i ∈ 0:8:(n-8)
@nexprs 8 k -> dp_k = zero(T)
for j ∈ 1:m
@nexprs 8 k -> dp_k += p1[j] * array[i+k,j]
end
for j ∈ 1:m
@nexprs 8 k -> result[i+k,j] = dp_k * p1[j] + array[i+k,j]
end
end
@inbounds for i in n-(n%8)+1:n
dp = zero(T)
for j = 1:m
dp += p1[j] * array[i,j]
end
for j = 1:m
result[i,j] = dp * p1[j] + array[i,j]
end
end
end
bar_t! (generic function with 1 method)

julia> N = 1000
1000

julia> array = randn(3, N);

Pkg.update() complete
julia> result = zeros(3, N);

julia> array_t = copy(array');

julia> result_t = zeros(N, 3);

julia> bar1!(result, array); result
3×1000 Array{Float64,2}:
-2.61722   1.85986  -1.48172    0.819111   0.624513   2.92976  -3.0886   -1.97158   -0.80507  …  -2.2052   -4.68454  -1.46795  -0.317441   1.36027  -2.19459  -2.17556   1.87992  -1.32306   0.604871
6.52334  -4.04671   4.89907   -1.44343   -4.97068   -4.16391   1.87026   0.585894   2.37889      4.1093    3.31065   1.81274   1.95511   -0.87607   2.51032   7.23454  -5.00863   4.08801   3.37566
-3.781     3.76811  -0.388579   0.903244   2.25944    1.50477  -2.76848   0.432073  -1.13699     -1.84506  -2.43655  -1.25644  -0.197228   1.33583  -1.08789  -3.9975    2.23508  -2.13335  -1.01191

julia> bar2!(result, array); result
3×1000 Array{Float64,2}:
-2.61722   1.85986  -1.48172    0.819111   0.624513   2.92976  -3.0886   -1.97158   -0.80507  …  -2.2052   -4.68454  -1.46795  -0.317441   1.36027  -2.19459  -2.17556   1.87992  -1.32306   0.604871
6.52334  -4.04671   4.89907   -1.44343   -4.97068   -4.16391   1.87026   0.585894   2.37889      4.1093    3.31065   1.81274   1.95511   -0.87607   2.51032   7.23454  -5.00863   4.08801   3.37566
-3.781     3.76811  -0.388579   0.903244   2.25944    1.50477  -2.76848   0.432073  -1.13699     -1.84506  -2.43655  -1.25644  -0.197228   1.33583  -1.08789  -3.9975    2.23508  -2.13335  -1.01191

julia> bar_t!(result_t, array_t); result_t'
-2.61722   1.85986  -1.48172    0.819111   0.624513   2.92976  -3.0886   -1.97158   -0.80507  …  -2.2052   -4.68454  -1.46795  -0.317441   1.36027  -2.19459  -2.17556   1.87992  -1.32306   0.604871
6.52334  -4.04671   4.89907   -1.44343   -4.97068   -4.16391   1.87026   0.585894   2.37889      4.1093    3.31065   1.81274   1.95511   -0.87607   2.51032   7.23454  -5.00863   4.08801   3.37566
-3.781     3.76811  -0.388579   0.903244   2.25944    1.50477  -2.76848   0.432073  -1.13699     -1.84506  -2.43655  -1.25644  -0.197228   1.33583  -1.08789  -3.9975    2.23508  -2.13335  -1.01191

julia> using BenchmarkTools

julia> @benchmark bar1!(\$result, \$array)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     4.035 μs (0.00% GC)
median time:      4.047 μs (0.00% GC)
mean time:        4.215 μs (0.00% GC)
maximum time:     9.224 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     7

julia> @benchmark bar2!(\$result, \$array)BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     3.031 μs (0.00% GC)
median time:      3.119 μs (0.00% GC)
mean time:        3.230 μs (0.00% GC)
maximum time:     7.301 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     8

julia> @benchmark bar_t!(\$result_t, \$array_t)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.352 μs (0.00% GC)
median time:      1.389 μs (0.00% GC)
mean time:        1.424 μs (0.00% GC)
maximum time:     4.638 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     10

``````

EDIT:
I am surprised that simd helps with only 3 rows. It normally does loop unrolling that I’d have expected to hurt performance.

3 Likes

I see, 3 entries will not benefit from `@simd`, at least 4, thanks. The idea of transpose is awesome. I’m excited by the level of performance Julia can deliver if one pays more attention to what they are coding. Your last version and my SArrays one are `3X` faster than Intel Fortran, I can’t believe it!!

2 Likes

Hmm.
`@code_native` shows that my code actually failed to vectorize how I intended it. I’m trying to improve a macro to get that to work.

My `@vectorize` macro needs a lot of work, but with a little finagling:

``````using SIMDPirates, SLEEFwrap, BenchmarkTools, Base.Cartesian

@generated function bar_tv2!(result::AbstractArray{T}, array::AbstractArray{T},::Val{NCOL}) where {T,NCOL}
quote
@nexprs \$NCOL j -> p_j = @inbounds array[1,j]
@vectorize \$T for i ∈ 1:size(array,1)
dp = zero(T)
@nexprs \$NCOL j -> dp += p_j * array[i,j]
@nexprs \$NCOL j -> result[i,j] = dp * p_j + array[i,j]
end
end
end

function bar2!(result::AbstractArray, array::AbstractArray)
m,n = size(array)
@views p1 = array[:,1]
@inbounds for i in 1:n
dp = zero(eltype(array))
@simd for j = 1:m
dp += p1[j] * array[j,i]
end
@simd for j = 1:m
result[j,i] = dp * p1[j] + array[j,i]
end
end
end

N = 1000
array = randn(3, N);
result = zeros(3, N);

array_t = copy(array');
result_t = zeros(N, 3);
``````

This yields:

``````julia> bar_tv2!(result_t, array_t,Val(3)); result_t'
-2.22506  -3.39461  -0.89401   1.37359  -1.27377   2.85605  -1.22525  -0.387945   3.23087  -0.718255  …   1.60807   -2.39425     0.113308   1.24666   -1.2275    -0.713399  -0.601649  -0.664115   0.665728
1.60194   3.84549  -1.26712  -1.59311  -1.31202  -2.01284   1.95004  -0.276039  -4.76663  -0.187859      0.880181   2.02346    -1.44376    0.426184   0.570249  -1.27561    0.865626   0.9036    -0.543402
-1.68125  -1.77578  -1.45913   0.29928  -1.20614   2.13511  -1.24661  -1.1421     2.60744  -0.425533     -0.961413   0.0212073   0.423453  -0.270527  -1.72279    0.281566  -1.08591    1.90826    0.918416

julia> bar2!(result, array); result
3×1000 Array{Float64,2}:
-2.22506  -3.39461  -0.89401   1.37359  -1.27377   2.85605  -1.22525  -0.387945   3.23087  -0.718255  …   1.60807   -2.39425     0.113308   1.24666   -1.2275    -0.713399  -0.601649  -0.664115   0.665728
1.60194   3.84549  -1.26712  -1.59311  -1.31202  -2.01284   1.95004  -0.276039  -4.76663  -0.187859      0.880181   2.02346    -1.44376    0.426184   0.570249  -1.27561    0.865626   0.9036    -0.543402
-1.68125  -1.77578  -1.45913   0.29928  -1.20614   2.13511  -1.24661  -1.1421     2.60744  -0.425533     -0.961413   0.0212073   0.423453  -0.270527  -1.72279    0.281566  -1.08591    1.90826    0.918416

julia> @benchmark bar_tv2!(\$result_t, \$array_t,Val(3))
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     433.854 ns (0.00% GC)
median time:      435.020 ns (0.00% GC)
mean time:        455.177 ns (0.00% GC)
maximum time:     668.051 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     198

julia> @benchmark bar2!(\$result, \$array)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     3.025 μs (0.00% GC)
median time:      3.029 μs (0.00% GC)
mean time:        3.158 μs (0.00% GC)
maximum time:     7.316 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     8
``````

Now, 100x faster than the original!
Note, I’m benchmarking on a computer with avx512, so most aren’t likely to see the same benefit.

EDIT:
Cleaned up assmebly (Discourse’s syntax highlighting doesn’t agree with the auto-inserted comments):

``````julia> @code_native bar_tv2!(result_t, array_t,Val(3))
.text
pushq	%rbp
pushq	%r15
pushq	%r14
pushq	%r13
pushq	%r12
pushq	%rbx
subq	\$56, %rsp
movq	%rsi, 48(%rsp)
movq	(%rsi), %r14
movq	8(%rsi), %rax
movq	(%rax), %r12
movq	24(%rax), %rbx
vmovsd	(%r12), %xmm0           # xmm0 = mem[0],zero
vmovaps	%xmm0, 32(%rsp)
vmovsd	(%r12,%rbx,8), %xmm0    # xmm0 = mem[0],zero
vmovaps	%xmm0, 16(%rsp)
movq	%rbx, %rbp
shlq	\$4, %rbp
vmovsd	(%r12,%rbp), %xmm0      # xmm0 = mem[0],zero
vmovapd	%xmm0, (%rsp)
movq	%rbx, %r13
sarq	\$63, %r13
shrq	\$61, %r13
andq	\$-8, %r13
movq	(%r14), %r15
movabsq	\$steprange_last, %rax
movl	\$1, %edi
movl	\$8, %esi
movq	%r13, %rdx
callq	*%rax
vmovapd	32(%rsp), %xmm10
vmovapd	(%rsp), %xmm9
vmovapd	16(%rsp), %xmm8
testq	%rax, %rax
jle	L288
movq	24(%r14), %rcx
movq	%rcx, %rdx
shlq	\$4, %rdx
vxorpd	%xmm3, %xmm3, %xmm3
movq	%r12, %rsi
movq	%r15, %rdi
nop
L192:
vmovupd	(%rsi), %zmm4
vmovupd	(%rsi,%rbx,8), %zmm5
vmovupd	(%rsi,%rbp), %zmm6
vmovapd	%zmm4, %zmm7
vmovupd	%zmm4, (%rdi)
vmovupd	%zmm5, (%rdi,%rcx,8)
vmovupd	%zmm7, (%rdi,%rdx)
jne	L192
L288:
movq	%rbx, %rcx
subq	%r13, %rcx
testq	%rcx, %rcx
jle	L495
movabsq	\$140535612763840, %rax  # imm = 0x7FD0FF46EEC0
movq	%rbx, %rax
subq	%rcx, %rax
vpternlogd	\$255, %zmm1, %zmm1, %zmm1
vpcmpgtq	%zmm1, %zmm0, %k1
vmovupd	(%r12,%rax,8), %zmm0 {%k1} {z}
vpxor	%xmm1, %xmm1, %xmm1
leaq	(%rbx,%rbx), %rcx
leaq	(%rbx,%rax), %rdx
vmovupd	(%r12,%rdx,8), %zmm2 {%k1} {z}
vmovupd	(%r12,%rcx,8), %zmm3 {%k1} {z}
vmovupd	%zmm4, (%r15,%rax,8) {%k1}
movq	24(%r14), %rcx
leaq	(%rcx,%rax), %rdx
vmovupd	%zmm5, (%r15,%rdx,8) {%k1}
leaq	(%rax,%rcx,2), %rax
vmovupd	%zmm1, (%r15,%rax,8) {%k1}
movabsq	\$140535856324616, %rax  # imm = 0x7FD10DCB6008
popq	%rbx
popq	%r12
popq	%r13
popq	%r14
popq	%r15
popq	%rbp
vzeroupper
retq
L495:
movabsq	\$140535856324616, %rax  # imm = 0x7FD10DCB6008
popq	%rbx
popq	%r12
popq	%r13
popq	%r14
popq	%r15
popq	%rbp
vzeroupper
retq
``````

a big pile of `zmm` (512 bit) registers and `vfmadd` instructions.

EDIT:
@Seif_Shebl, did you try Intel Fortran on the version with transposed `array` and `result` (ie, Nx3 instead of 3xN)?
In earlier experimenting, both `ifort` and `flang` sometimes managed to correctly vectorize such for loops. Sometimes `gfortran` can do it to, but it often involves playing with flags like turning `-fdisable-tree-cunrolli` on or off. The number of loops I tried on isn’t too large, but that llvm-based `flang` worked on an example loop I tried earlier while Julia did not (without `@vectorize`) surprised me.
On that note, I’ve been meaning to try and find the time to play with Intel’s [Single Program Multiple Data] Program Compiler (ispc) compiler, which is also LLVM-based. Here is a fascinating series of blog posts about it by the creator. From the sound of it, this is the sort of thing it should be well optimized for – although the author wishes some bigger project would work to better support it.

My success has actually been pretty low with Julia - but the ability to write complicated transformative macros means it’s actually the most robustly capable of loop vectorization in practice (for me).

And SIMD intrinsics are so much easier in Julia, even without macros.

I really need to refactor the code base behind my `@vectorize` macro to make it easier to work with and extend. It is fairly limited at the moment. It would be easier to add features / support more types of loops with a different code organization.

EDIT:

Actually, both Flang and gfortran do a great job on this problem without any tricks.
The Fortran implementation:

``````module vectorization_test

implicit none

contains

subroutine bar(result, array, N)
integer(8),                 intent(in)  ::  N
real(8),  dimension(N,3),   intent(in)  ::  array
real(8),  dimension(N,3),   intent(out) ::  result

integer(8)                              ::  i
real(8)                                 ::  p1(3), dp

p1 = array(1,:)

do concurrent (i = 1:N)
dp = sum(p1 * array(i,:))
result(i,:) = dp * p1 + array(i,:)
end do

end subroutine bar

end module vectorization_test
``````

and compiler versions (latest releases):

``````\$ gfortran --version
GNU Fortran (GCC) 8.2.1 20181215 (Red Hat 8.2.1-6)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

\$ flang --version
clang version 7.0.1 (https://github.com/flang-compiler/flang-driver.git 24bd54da5c41af04838bbe7b68f830840d47fc03) (https://github.com/flang-compiler/llvm.git 84335cded2f2cf9de26a77194835094fc75f7bb7)
Target: x86_64-unknown-linux-gnu
InstalledDir: /usr/local/bin
``````

Compiling with

``````\$ gfortran -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC vectorization_test.f90 -o libgfortran_vectorization_test.so
\$ flang -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC vectorization_test.f90 -o libflang_vectorization_test.so
``````
``````julia> const test_dir = "/home/chriselrod/Documents/progwork/fortran"
"/home/chriselrod/Documents/progwork/fortran"

julia> const flang_lib = joinpath(test_dir, "libflang_vectorization_test.so")
"/home/chriselrod/Documents/progwork/fortran/libflang_vectorization_test.so"

julia> const gfort_lib = joinpath(test_dir, "libgfortran_vectorization_test.so")
"/home/chriselrod/Documents/progwork/fortran/libgfortran_vectorization_test.so"

julia> function flang_test(results, array, N)
ccall(
(:vectorization_test_bar_, flang_lib), # function and library
Cvoid, # return type
(Ptr{Float64},Ptr{Float64},Ptr{Int64}), # argument types
results, array, N # arguments
)
end
flang_test (generic function with 1 method)

julia> function gfortran_test(results, array, N)
ccall(
(:__vectorization_test_MOD_bar, gfort_lib), # function and library
Cvoid, # return type
(Ptr{Float64},Ptr{Float64},Ptr{Int64}), # argument types
results, array, N # arguments
)
end
gfortran_test (generic function with 1 method)

julia> Nr = Ref(N)
Base.RefValue{Int64}(1000)

julia> fill!(result_t, 0);

julia> flang_test(result_t, array_t, Nr); result_t' # matches the earlier Julia results
-2.22506  -3.39461  -0.89401   1.37359  -1.27377   2.85605  -1.22525  -0.387945   3.23087  -0.718255  …   1.60807   -2.39425     0.113308   1.24666   -1.2275    -0.713399  -0.601649  -0.664115   0.665728
1.60194   3.84549  -1.26712  -1.59311  -1.31202  -2.01284   1.95004  -0.276039  -4.76663  -0.187859      0.880181   2.02346    -1.44376    0.426184   0.570249  -1.27561    0.865626   0.9036    -0.543402
-1.68125  -1.77578  -1.45913   0.29928  -1.20614   2.13511  -1.24661  -1.1421     2.60744  -0.425533     -0.961413   0.0212073   0.423453  -0.270527  -1.72279    0.281566  -1.08591    1.90826    0.918416

julia> fill!(result_t, 0);

julia> gfortran_test(result_t, array_t, Nr); result_t'
-2.22506  -3.39461  -0.89401   1.37359  -1.27377   2.85605  -1.22525  -0.387945   3.23087  -0.718255  …   1.60807   -2.39425     0.113308   1.24666   -1.2275    -0.713399  -0.601649  -0.664115   0.665728
1.60194   3.84549  -1.26712  -1.59311  -1.31202  -2.01284   1.95004  -0.276039  -4.76663  -0.187859      0.880181   2.02346    -1.44376    0.426184   0.570249  -1.27561    0.865626   0.9036    -0.543402
-1.68125  -1.77578  -1.45913   0.29928  -1.20614   2.13511  -1.24661  -1.1421     2.60744  -0.425533     -0.961413   0.0212073   0.423453  -0.270527  -1.72279    0.281566  -1.08591    1.90826    0.918416
``````

These get the same answer, yet are very fast:

``````julia> @benchmark gfortran_test(\$result_t, \$array_t, \$Nr)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     420.372 ns (0.00% GC)
median time:      421.322 ns (0.00% GC)
mean time:        436.942 ns (0.00% GC)
maximum time:     647.241 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     199

julia> @benchmark flang_test(\$result_t, \$array_t, \$Nr)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     419.518 ns (0.00% GC)
median time:      420.523 ns (0.00% GC)
mean time:        438.599 ns (0.00% GC)
maximum time:     668.719 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     199
``````

My Flang is using LLVM-7. I don’t know what Flang’s front end is doing, but it seems like Julia ought to achieve the same performance and ease of writing code, without any fancy macros.

EDIT:
I’m less familiar with `ifort`, but it gets the same performance too:

``````julia> const ifort_lib = joinpath(test_dir, "libifort_vectorization_test.so")
"/home/chriselrod/Documents/progwork/fortran/libifort_vectorization_test.so"

julia> function ifort_test(results, array, N)
ccall(
(:vectorization_test_mp_bar_, ifort_lib), # function and library
Cvoid, # return type
(Ptr{Float64},Ptr{Float64},Ptr{Int64}), # argument types
results, array, N # arguments
)
end
ifort_test (generic function with 1 method)

julia> fill!(result_t, 0);

julia> ifort_test(result_t, array_t, Nr); result_t'
-2.22506  -3.39461  -0.89401   1.37359  -1.27377   2.85605  -1.22525  -0.387945   3.23087  -0.718255  …   1.60807   -2.39425     0.113308   1.24666   -1.2275    -0.713399  -0.601649  -0.664115   0.665728
1.60194   3.84549  -1.26712  -1.59311  -1.31202  -2.01284   1.95004  -0.276039  -4.76663  -0.187859      0.880181   2.02346    -1.44376    0.426184   0.570249  -1.27561    0.865626   0.9036    -0.543402
-1.68125  -1.77578  -1.45913   0.29928  -1.20614   2.13511  -1.24661  -1.1421     2.60744  -0.425533     -0.961413   0.0212073   0.423453  -0.270527  -1.72279    0.281566  -1.08591    1.90826    0.918416

julia> @benchmark ifort_test(\$result_t, \$array_t, \$Nr)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     421.477 ns (0.00% GC)
median time:      422.523 ns (0.00% GC)
mean time:        443.197 ns (0.00% GC)
maximum time:     721.663 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     199
``````

I compiled with

``````\$ ifort -fast -qopt-zmm-usage=high -shared -fPIC vectorization_test.f90 -o libifort_vectorization_test.so
\$ ifort --version
ifort (IFORT) 19.0.1.144 20181018
``````

All three Fortran compilers produce code that takes just over 420 nanoseconds, while my best Julia version took about 435 nanoseconds. So that’s pretty good, but it needed a fancy code-transforming macro, while the Fortran code was written in a (comparatively) simple and naive manner.
Even if I go “all out” with macros from base Julia, I don’t get that performance:

``````julia> @generated function bar_tv3!(result::AbstractArray{T}, array::AbstractArray{T},::Val{NCOL}) where {T,NCOL}
quote
@nexprs \$NCOL j -> p_j = @inbounds array[1,j]
@inbounds @fastmath @simd ivdep for i ∈ 1:size(array,1)
dp = zero(T)
@nexprs \$NCOL j -> dp += p_j * array[i,j]
@nexprs \$NCOL j -> result[i,j] = dp * p_j + array[i,j]
end
end
end
bar_tv3! (generic function with 1 method)

julia> @benchmark bar_tv3!(\$result_t, \$array_t,Val(3))
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.091 μs (0.00% GC)
median time:      1.101 μs (0.00% GC)
mean time:        1.142 μs (0.00% GC)
maximum time:     2.841 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     10
``````

because it fails to vectorize across loop iterations. So, I’d call it a win for Fortran at the moment.

5 Likes

See http://juliaarrays.github.io/StaticArrays.jl/stable/pages/api.html#Arrays-of-static-arrays-1 for conversion between static and normal arrays

1 Like

For the bennefit of the thread starter:

The code snippet given in that documentation throws a depreciation warning with 0.7 and therefore will probably just give an error with 1.0 (and no helpful hint at all). I guess this might be correct

``````function svectors(x::Matrix{Float64})
@assert size(x,1) == 3
reshape(reinterpret(SVector{3,Float64}, x),(size(x,2)))
end
``````

Edit:
On top `reshape` appears to be lazy:

``````julia> zz=zeros(3,2)
3×2 Array{Float64,2}:
0.0  0.0
0.0  0.0
0.0  0.0

julia> szz=svectors(zz)
2-element reshape(reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,2}), 2) with eltype SArray{Tuple{3},Float64,1,3}:
[0.0, 0.0, 0.0]
[0.0, 0.0, 0.0]

julia> collect(szz)
2-element Array{SArray{Tuple{3},Float64,1,3},1}:
[0.0, 0.0, 0.0]
[0.0, 0.0, 0.0]
``````

which might or might not be of interest.

Edit 2:
An unmerged pull request is probably the fix to the documentationn and better code anyway

Thank you, this is really helpful . I see that I can make my code much faster with these tricks.

I really appreciate threads like these. This community is pretty unique.

1 Like