Understanding type stability

Philippe_Maincon · November 8, 2017, 8:13am

For my FEM work, I want to be able to multiply higher dimension arrays with each other, as in “cartensian tensors”. I know there are packages, I want to learn.

Function dots2 is specialised to multiply 3-Arrays with vectors. It is fast.

Function dots1 is more general, it sums over the last index of first argument and first index of last argument. It’s slow because it’s time-unstable, even though I shielded for instability in the function tenprod! A factor 4 or 5 loss in performance.

I want it general, and I want it fast. Why is my dots1 code below type unstable? Given Ta and Tb, I’d say all types are clearly defined. What did I miss?

: )

Philippe

function dots1(a::Array{Ta},b::Array{Tb}) where {Ta,Tb}
    Tc = promote_type(Ta,Tb)
    n  = 1
    sc = (size(a)[1:end-1]...,size(b)[1+1:end]...)
	c  = zeros(Tc,sc)
	r1 = CartesianRange(size(a)[1:end-n])
	r2 = CartesianRange(size(b)[1:n])
	r3 = CartesianRange(size(b)[n+1:end])
    tenprod!(c,a,b,r1,r2,r3)
    return c
end

function dots2(a::Array{T,3},b::Vector{T}) where {T}
    c = zeros(T,size(a,1),size(a,2))
    for k=indices(a,3),j=indices(a,2),i=indices(a,1)
        @inbounds @fastmath c[i,j] += a[i,j,k]*b[k]
    end
    return c
end
@noinline function tenprod!(c,a,b,r1,r2,r3)
	for i3 in r3,i2 in r2,i1 in r1
        @inbounds @fastmath c[i1,i3] += a[i1,i2]*b[i2,i3]
	end
end

function runit(b,c)
    for i = 1:100000
        a=dots1(b,c)
    end
end

b = randn(2,3,4)
c = randn(4)
B = randn(20,30,40)
C = randn(40)

# warm up
a=dots1(b,c)
a=dots2(b,c)
A=dots1(B,C)
A=dots2(B,C)
runit(b,c)

# go!
@time a=dots1(b,c)
@time a=dots2(b,c)

@time A=dots1(B,C)
@time A=dots2(B,C)

if false
Base.Profile.init(delay=0.000001)
Base.Profile.clear()
Base.Profile.clear_malloc_data()
tic()
@profile @time runit(b,c)
toc()
Base.Profile.print()
using ProfileView
ProfileView.view()
end

P.S.

I also tried

function dots1(a::Array{Ta,Na},b::Array{Tb,Nb}) where {Ta,Tb,Na,Nb}
    Tc = promote_type(Ta,Tb)
    n  = 1
    sc = (size(a)[1:end-1]...,size(b)[1+1:end]...)
	#c  = zeros(Tc,sc)
    c = Array{Tc,Na+Nb-2}(sc...)
    fill!(c,zero(Tc))
	r1 = CartesianRange(size(a)[1:end-n])
	r2 = CartesianRange(size(b)[1:n])
	r3 = CartesianRange(size(b)[n+1:end])
    tenprod!(c,a,b,r1,r2,r3)
    return c
end

which is as bad.

Elrod · November 8, 2017, 8:34am

What is N for c? That is, what is the dimension of the output?

@generated function dots3(a::Array{Ta,N1},b::Array{Tb,N2}) where {Ta,Tb,N1,N2}
    N3 = N1 + N2 - 2
    n  = 1
    N1mn = N1 - n
    N2mn = N2 - n
    quote
        Tc = promote_type(Ta,Tb)
        sc = Base.@ntuple $N3 i -> begin
            i < $N1 ? size(a)[i] : size(b)[i + 1 - $N1]
        end
        c  = zeros(Tc,sc)
        r1 = CartesianRange( (Base.@ntuple $N1mn i -> size(a)[i]) )
        r2 = CartesianRange( (Base.@ntuple $n i -> size(b)[i]) )
        r3 = CartesianRange( (Base.@ntuple $N2mn i -> size(b)[i+$n]) )
        tenprod!(c,a,b,r1,r2,r3)
        c
    end
end

julia> @benchmark dots1($B, $C)
@benchmark dots2($BBenchmarkTools.Trial: 
  memory estimate:  5.45 KiB
  allocs estimate:  16
  --------------
  minimum time:     40.516 μs (0.00% GC)
  median time:      51.496 μs (0.00% GC)
  mean time:        50.949 μs (0.41% GC)
  maximum time:     1.148 ms (92.82% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark dots2($B, $C)
@benchmark dots3($b, $C)
BenchmarkTools.Trial: 
  memory estimate:  4.81 KiB
  allocs estimate:  1
  --------------
  minimum time:     9.939 μs (0.00% GC)
  median time:      10.310 μs (0.00% GC)
  mean time:        10.516 μs (0.00% GC)
  maximum time:     46.647 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark dots3($B, $C)
BenchmarkTools.Trial: 
  memory estimate:  4.81 KiB
  allocs estimate:  1
  --------------
  minimum time:     37.190 μs (0.00% GC)
  median time:      48.972 μs (0.00% GC)
  mean time:        50.388 μs (0.00% GC)
  maximum time:     107.222 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Philippe_Maincon · November 8, 2017, 8:39am

Yet faster???
I will have to study your code very carefully.

Philippe

Philippe_Maincon · November 8, 2017, 8:47am

No wait,

You benchmark dots3($b, $C) (lowercase b).
Interestingly, dots3 is equaly fast as dots2 for small arrays a b c, but is still somewhat slow for A B C.

I still have to study and understand your code.

Elrod · November 8, 2017, 8:50am

Yikes! I replaced b with B, and you’re right – it’s as slow as dots1.
@code_warntype does confirm that dots3 is type stable, though. Because tenprod! is the workhorse, a the function calling it being slower didn’t make too much of a difference. Instead, a large part seems to be iterating over the Cartesian rangers:

julia> function loop1(r1,r2,r3)
           out = 0
           @inbounds for i3 in r3,i2 in r2,i1 in r1
               out += 1
           end
           out
       end
loop1 (generic function with 1 method)

julia> function loop2(a::Array{T,3},b::Vector{T}) where {T}
           c = zeros(T,size(a,1),size(a,2))
           out = 0
           @inbounds for k=indices(a,3),j=indices(a,2),i=indices(a,1)
               out += 1
           end
           return out
       end
loop2 (generic function with 1 method)

julia> n = 1;

julia> r1 = CartesianRange(size(B)[1:end-n]);

julia> r2 = CartesianRange(size(C)[1:n]);

julia> r3 = CartesianRange(size(C)[n+1:end]);

julia> loop2(B, C)
24000

julia> loop1(r1,r2,r3)
24000

julia> @benchmark loop2($B, $C)
BenchmarkTools.Trial: 
  memory estimate:  4.81 KiB
  allocs estimate:  1
  --------------
  minimum time:     562.843 ns (0.00% GC)
  median time:      616.103 ns (0.00% GC)
  mean time:        823.896 ns (21.42% GC)
  maximum time:     6.623 μs (82.31% GC)
  --------------
  samples:          10000
  evals/sample:     185

julia> @benchmark loop1($r1, $r2, $r3)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.186 μs (0.00% GC)
  median time:      19.246 μs (0.00% GC)
  mean time:        19.686 μs (0.00% GC)
  maximum time:     65.613 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

We can safely blame tenprod! for the poor performance here:

D = Array{Float64}((size(B)[1:end-1]...,size(C)[1+1:end]...));

julia> @benchmark tenprod!($D,$B,$C,$r1,$r2,$r3)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     47.069 μs (0.00% GC)
  median time:      47.589 μs (0.00% GC)
  mean time:        48.637 μs (0.00% GC)
  maximum time:     91.642 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Anyway, I suggest you look at Base.Cartesian for developing multidimensional algorithms. @nloops is great.

As for what dots3 does:
The @generated means the function uses the parameters of the arguments to define a new expression. The expression it returns is what actually gets compiled into the function that runs.
All I did was move the original expression inside a quote block (which turns it into an expression) and have the function return that – no changes at that point.
Then I moved the type unstable parts – which were based on parameter arguments of the function – outside of the quote block. There were the sizes of the matrices and cartesian indexes. I then interpolated these sizes into the quoted expression, so that the code that gets generated is type stable.

Elrod · November 8, 2017, 9:26am

Looking at tenprod!, one simple change is swapping the order of iteration. That gets us most of the way there:

julia> function tenprod2!(c,a,b,r1,r2,r3)
               for i1 in r1, i2 in r2, i3 in r3
                   @inbounds c[i1,i3] += a[i1,i2]*b[i2,i3]
               end
       end
tenprod! (generic function with 1 method)

julia> @benchmark tenprod2!($D,$B,$C,$r1,$r2,$r3)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.589 μs (0.00% GC)
  median time:      15.850 μs (0.00% GC)
  mean time:        16.336 μs (0.00% GC)
  maximum time:     43.542 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @generated function dots3(a::Array{Ta,N1},b::Array{Tb,N2}) where {Ta,Tb,N1,N2}
           N3 = N1 + N2 - 2
           n  = 1
           N1mn = N1 - n
           N2mn = N2 - n
           quote
               Tc = promote_type(Ta,Tb)
               sc = Base.@ntuple $N3 i -> begin
                   i < $N1 ? size(a)[i] : size(b)[i + 1 - $N1]
               end
               c  = zeros(Tc, sc)
               r1 = CartesianRange( (Base.@ntuple $N1mn i -> size(a)[i]) )
               r2 = CartesianRange( (Base.@ntuple $n i -> size(b)[i]) )
               r3 = CartesianRange( (Base.@ntuple $N2mn i -> size(b)[i+$n]) )
               tenprod2!(c,a,b,r1,r2,r3)
               c
           end
       end
dots3 (generic function with 1 method)

julia> @benchmark dots3($B, $C)
BenchmarkTools.Trial: 
  memory estimate:  4.81 KiB
  allocs estimate:  1
  --------------
  minimum time:     15.859 μs (0.00% GC)
  median time:      16.180 μs (0.00% GC)
  mean time:        16.673 μs (0.00% GC)
  maximum time:     61.777 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Philippe_Maincon · November 8, 2017, 10:11am

So you have two points
a) use @generated to generate code for each combination of input dimensions.
b) tenprod!: I made the wrong assumption of the order of the loops in a multi-variable for loop!

I will study Base.Cartesian and @nloops.

Thank you!

Elrod · November 8, 2017, 10:28am

Here is a version based on @nloops. Pretty close, now actually featuring the same matrices ;).

julia> @generated function dots4(a::Array{Ta,N1},b::Array{Tb,N2},::Val{n} = Val{1}() ) where {Ta,Tb,N1,N2,n}
           N4 = N1 + N2 - n
           N3 = N1 + N2 - 2n
           N1mn = N1 - n
           N2mn = N2 - n
           quote
               Tc = promote_type(Ta,Tb)
               sc = Base.@ntuple $N3 i -> begin
                   i <= $N1mn ? size(a)[i] : size(b)[i - $N1mn]
               end
               c  = zeros(Tc, sc)
               @inbounds @nloops $N4 i j -> begin
                   j <= $N1 ? indices(a,j) : indices(b,j - $N1mn)
               end begin
                   (@nref $N3 c j -> j <= $N1mn ? i_{j} : i_{j+$n} ) += 
                               (@nref $N1 a i) * (@nref $N2 b j -> i_{j+$N1mn})
               end
               c
           end
       end
dots4 (generic function with 2 methods)

julia> @benchmark dots4($B,$C)
BenchmarkTools.Trial: 
  memory estimate:  4.81 KiB
  allocs estimate:  1
  --------------
  minimum time:     10.269 μs (0.00% GC)
  median time:      10.630 μs (0.00% GC)
  mean time:        10.861 μs (0.00% GC)
  maximum time:     45.245 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark dots2($B,$C)
BenchmarkTools.Trial: 
  memory estimate:  4.81 KiB
  allocs estimate:  1
  --------------
  minimum time:     9.929 μs (0.00% GC)
  median time:      10.380 μs (0.00% GC)
  mean time:        10.644 μs (0.00% GC)
  maximum time:     155.863 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> dots2(B,C)
20×30 Array{Float64,2}:
  -2.40947     3.6615      3.56994   -0.911009   0.0911455  …   2.18485   -0.949204   4.5201      -5.71501  
   4.68563     9.24024    10.3936    -1.57826    3.22617        4.09229    7.35908    8.64156     -0.705054 
   8.25224   -10.562       5.60107    2.90849   -5.29132        6.55552   -0.623461  -1.36057      0.626666 
   0.310542   -7.73264     5.22549    8.63678    1.49464       -0.924901   3.78898    4.14533      1.52636  
   9.78047     4.47329    -3.88946    4.35931    3.60309       -3.0165     1.04233   -3.14091     -2.4922   
  -3.63024    10.8072     -8.35788   -3.47152   -3.33291    …   1.77698   16.0327     3.29105     -9.57342  
 -15.8024     -1.71647     3.35925   -0.633621   2.25073       -0.826728  -0.899735   1.26402      8.13161  
  -3.52586     1.7255     -2.70327   -3.91099   -8.72436        6.12261   -4.55655   -2.62929      1.30809  
  -5.53543     2.64004    13.7924    -6.60311    5.5823         6.17945   -1.90693   -2.23296     -8.36336  
  -3.07699     3.21454    -2.80323    6.87718   -2.36181       -0.957616   7.7965     0.162212    -0.0444603
  -0.774692   -4.10606    -0.694363  -3.9598     6.64839    …   0.966243  -8.6036     6.34869    -14.4297   
   2.43154    -4.81324    -2.91524    1.74523    4.14951        5.15643    3.44661   -0.560517     1.29516  
  -4.86107     4.43378     9.55482    8.25905   -1.59396       -3.91577   -0.965676  -8.28942     -2.7246   
  -1.49863     0.0738993   9.96601    5.8386    -0.564399       9.96989    4.75849    3.38899      1.93741  
   1.97493     5.69018     8.42027    5.62482   -2.08138       -0.759205   4.35716   -1.66558      1.94865  
  -7.57431    -4.38863    -6.6241     1.61544    4.30493    …   4.26025   -2.79349   -7.3108      -5.56746  
  -3.10108    -2.45831    -3.9693    -0.306877   6.90586       -6.6463     4.08884    1.95554     -1.34872  
  -4.54627    -4.7708     -0.752273   1.99907   -3.94219        3.35114    3.15166   -0.0680876    3.64763  
  -0.679031   12.6421      5.34694    6.32281   -2.84106       -6.2259    -6.71613    1.57312      7.22193  
   4.61162    -3.18943     8.11373    4.95927   -4.14198       11.2687    -0.52962   -4.85911     -6.33192  

julia> dots4(B,C)
20×30 Array{Float64,2}:
  -2.40947     3.6615      3.56994   -0.911009   0.0911455  …   2.18485   -0.949204   4.5201      -5.71501  
   4.68563     9.24024    10.3936    -1.57826    3.22617        4.09229    7.35908    8.64156     -0.705054 
   8.25224   -10.562       5.60107    2.90849   -5.29132        6.55552   -0.623461  -1.36057      0.626666 
   0.310542   -7.73264     5.22549    8.63678    1.49464       -0.924901   3.78898    4.14533      1.52636  
   9.78047     4.47329    -3.88946    4.35931    3.60309       -3.0165     1.04233   -3.14091     -2.4922   
  -3.63024    10.8072     -8.35788   -3.47152   -3.33291    …   1.77698   16.0327     3.29105     -9.57342  
 -15.8024     -1.71647     3.35925   -0.633621   2.25073       -0.826728  -0.899735   1.26402      8.13161  
  -3.52586     1.7255     -2.70327   -3.91099   -8.72436        6.12261   -4.55655   -2.62929      1.30809  
  -5.53543     2.64004    13.7924    -6.60311    5.5823         6.17945   -1.90693   -2.23296     -8.36336  
  -3.07699     3.21454    -2.80323    6.87718   -2.36181       -0.957616   7.7965     0.162212    -0.0444603
  -0.774692   -4.10606    -0.694363  -3.9598     6.64839    …   0.966243  -8.6036     6.34869    -14.4297   
   2.43154    -4.81324    -2.91524    1.74523    4.14951        5.15643    3.44661   -0.560517     1.29516  
  -4.86107     4.43378     9.55482    8.25905   -1.59396       -3.91577   -0.965676  -8.28942     -2.7246   
  -1.49863     0.0738993   9.96601    5.8386    -0.564399       9.96989    4.75849    3.38899      1.93741  
   1.97493     5.69018     8.42027    5.62482   -2.08138       -0.759205   4.35716   -1.66558      1.94865  
  -7.57431    -4.38863    -6.6241     1.61544    4.30493    …   4.26025   -2.79349   -7.3108      -5.56746  
  -3.10108    -2.45831    -3.9693    -0.306877   6.90586       -6.6463     4.08884    1.95554     -1.34872  
  -4.54627    -4.7708     -0.752273   1.99907   -3.94219        3.35114    3.15166   -0.0680876    3.64763  
  -0.679031   12.6421      5.34694    6.32281   -2.84106       -6.2259    -6.71613    1.57312      7.22193  
   4.61162    -3.18943     8.11373    4.95927   -4.14198       11.2687    -0.52962   -4.85911     -6.33192

EDIT: I would remove the @inbounds from dots4 before experimenting with arguments of other sizes. I did not test this thoroughly.
Also, for reference:

julia> using TensorOperations

julia> @benchmark @tensor $D[i,j] := $B[i,j,k] * $C[k]
BenchmarkTools.Trial: 
  memory estimate:  5.75 KiB
  allocs estimate:  20
  --------------
  minimum time:     11.051 μs (0.00% GC)
  median time:      11.572 μs (0.00% GC)
  mean time:        12.144 μs (3.01% GC)
  maximum time:     1.902 ms (96.44% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark @tensor $D[i,j] = $B[i,j,k] * $C[k]
BenchmarkTools.Trial: 
  memory estimate:  960 bytes
  allocs estimate:  19
  --------------
  minimum time:     10.600 μs (0.00% GC)
  median time:      10.921 μs (0.00% GC)
  mean time:        11.120 μs (0.00% GC)
  maximum time:     60.487 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> using Einsum

julia> @benchmark @einsum $D[i,j] := $B[i,j,k] * $C[k]
BenchmarkTools.Trial: 
  memory estimate:  15.78 KiB
  allocs estimate:  70
  --------------
  minimum time:     99.310 μs (0.00% GC)
  median time:      101.313 μs (0.00% GC)
  mean time:        104.696 μs (0.86% GC)
  maximum time:     2.203 ms (83.66% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark @einsum $D[i,j] = $B[i,j,k] * $C[k]
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     13.496 μs (0.00% GC)
  median time:      13.617 μs (0.00% GC)
  mean time:        13.976 μs (0.00% GC)
  maximum time:     37.391 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Looking forward to when Tensors.jl support third order tensors. The package promises speed, and supports automatic differentiation. Definitely worth keeping an eye on.

kristoffer.carlsson · November 8, 2017, 12:49pm

We didn’t know there was any demand for third order tensors. It shouldn’t be hard to implement

Also, note that Tensors.jl is only for tensors up to three dimensions (so maximum size 3x3x3x3).

Philippe_Maincon · November 8, 2017, 2:31pm

Thank you for input and codes. I will look into it.

I think tensors of arbitrary (well, in my case, up to 4) order and arbitrary dimension is interesting:
Think of Hooke’s tensor, sigma2= C4:epsilon2, or shape functions:
eps2 = B3*X2
And of course automatic differentiation will generate higher order tensors (d sigma / d X) …
Note the “simple” tensor products: the last n indices of left argument collapse with the n first indices of the right argument. Not as flexible as “Cartesian”, but sometimes enough, and much more readable.

kristoffer.carlsson · November 8, 2017, 2:46pm

Both hooke and dsigma / dx will yield fourth order tensors which are fully supported. With dimension, I mean the spatial dimension (which st least for engineering applications is unlikely to exceed 3).

In fact, Tensors.jl is the “backend” for the tensor computations in https://github.com/KristofferC/JuAFEM.jl and we are very happy with the performance of the package.

Philippe_Maincon · November 8, 2017, 2:50pm

OK got it. Well, tensors operate on vectors, or operations on operation on vectors, and vectors can be anything including, e.g. the degrees of freedom of a finite element.

Philippe_Maincon · November 8, 2017, 2:57pm

dots4 rules!

platawiec · November 8, 2017, 4:34pm

+1 for third-order tensors. Very useful in nonlinear optics

tim.holy · November 9, 2017, 1:37pm

I’d recommend Multidimensional algorithms and iteration first and Base.Cartesian only if you really, really need it.

Philippe_Maincon · November 16, 2017, 7:34am

Hi Tim,

“Iteration” code is without doubt a lot more readable than “Cartesian”. However in this case, it is slower, by a factor between 10 and 100, and “dots” has to be real fast.

On the good side: this is a well contained use of Cartesian. If tomorrow a readable and fast way of coding “dots” emerges, it will be easy to reimplement. I hope you will be able to make “iterations” that fast, that would be really nice!

Philippe

tim.holy · November 16, 2017, 10:12am

Looking more closely, I see you tried that, but that (as you noticed) you’re plagued by type-instability. The root of the problem is that CartesianRange(size(a)[1:end-n]) is not inferrable because of the 1:end-n. This then has a cascading effect because the call to tenprod! has to use dynamic dispatch, which is slow.

If you don’t need general n, then this

julia> function dots1a(a::Array{Ta},b::Array{Tb}) where {Ta,Tb}
           Tc = promote_type(Ta,Tb)
           szaf, szbt = Base.front(size(a)), Base.tail(size(b))
           sc = (szaf...,szbt...)
               c  = zeros(Tc,sc)
               r1 = CartesianRange(szaf)
               r2 = CartesianRange((size(b, 1),))
               r3 = CartesianRange(szbt)
           tenprod!(c,a,b,r1,r2,r3)
           return c
       end

gives much better performance (without making any other changes):

julia> using Compat, BenchmarkTools

julia> @btime dots1($b, $c)
  4.250 μs (16 allocations: 784 bytes)
2×3 Array{Float64,2}:
 -0.141967  -8.49691  1.66413
 -0.714566  -2.14979  2.13949

julia> @btime dots1a($b, $c)
  96.650 ns (1 allocation: 128 bytes)
2×3 Array{Float64,2}:
 -0.141967  -8.49691  1.66413
 -0.714566  -2.14979  2.13949

No macro magic and yet a 40x boost (for these small arrays).

The core idea is that you modify tuples only in ways that the compiler can infer. Unfortunately t[i:j] is not currently among the inferrable methods, so it’s better to use tools like ntuple(f, Val), Base.tail, Base.front, Base.IteratorsMD.split, and Base.fill_to_length.