Hi everyone,
I have been using `LoopVectorization.jl` with very satisfactory results to speed up some calculations.
Now I have hit a problem: combining it with AD!
Although my use case is not very complicated (it is just an array contraction), I am not able to differentiate through it.
I have read Zygote's documentation, so I am aware of its limitations when dealing with mutating functions.
I will consider the simpler case of a matrix multiplication, but the same applies to more complex scenarios (the ones I actually care about).

For instance, let us consider the following (which I just copied from the `LoopVectorization` docs):

```julia
function mygemmavx!(C, A, B)
    @turbo for m ∈ axes(A,1), n ∈ axes(B,2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end
```
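As written, this cannot be differentiated by Zygote because it mutates `C` in place. A minimal reproduction of the failure (a sketch; `f` is a hypothetical wrapper, and the exact error text depends on the Zygote version):

```julia
using Zygote, LoopVectorization

function mygemmavx!(C, A, B)
    @turbo for m ∈ axes(A,1), n ∈ axes(B,2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

A = rand(4, 4); B = rand(4, 4)

# Hypothetical wrapper: allocate C, fill it by mutation, reduce to a scalar.
f(A, B) = (C = similar(A); mygemmavx!(C, A, B); sum(C))

# Zygote.gradient(f, A, B)  # errors: "Mutating arrays is not supported"
```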

I tried adding `Zygote.Buffer` in an allocating `mygemmavx`, but it is several orders of magnitude slower.
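For reference, the kind of `Zygote.Buffer` variant I mean looks roughly like this (a sketch, not my exact code; note that `@turbo` cannot be used on a `Buffer`, and every `setindex!` is tracked by Zygote, which explains the slowdown):

```julia
using Zygote

function mygemm_buffer(A, B)
    C = Zygote.Buffer(A, size(A,1), size(B,2))  # mutable container Zygote can track
    for m in axes(A,1), n in axes(B,2)
        Cmn = zero(eltype(A))
        for k in axes(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    return copy(C)  # copy(::Buffer) freezes it back into an ordinary Array
end

# gradient(A -> sum(mygemm_buffer(A, B)), A) works, but is very slow.
```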
I then tried `Tullio.jl`. Although a bit slower than `LoopVectorization`, it worked nicely for the gradients!

```julia
using Tullio, LoopVectorization, BenchmarkTools
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]
W = rand(100, 100); x = rand(100, 100);
@benchmark sum(mul(W, x))

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  19.080 μs …   4.181 ms  ┊ GC (min … max):  0.00% … 95.39%
 Time  (median):     43.110 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   47.226 μs ± 140.766 μs  ┊ GC (mean ± σ):  11.02% ±  3.68%

  [histogram omitted]
  19.1 μs         Histogram: frequency by time         63.5 μs <

 Memory estimate: 80.72 KiB, allocs estimate: 52.
```
```julia
@benchmark gradient(W -> sum(mul(W,x)), W)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   74.694 μs …  16.137 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     148.622 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   161.961 μs ± 292.893 μs  ┊ GC (mean ± σ):  9.82% ± 6.28%

  [histogram omitted]
  74.7 μs          Histogram: frequency by time          321 μs <

 Memory estimate: 242.59 KiB, allocs estimate: 159.
```

While standard `*` gave

```julia
@benchmark gradient(W -> sum(W*x), W)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  41.687 μs …   3.389 ms  ┊ GC (min … max):  0.00% … 96.11%
 Time  (median):     51.237 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   71.793 μs ± 206.830 μs  ┊ GC (mean ± σ):  18.93% ±  6.48%

  [histogram omitted]
  41.7 μs       Histogram: log(frequency) by time       236 μs <

 Memory estimate: 236.11 KiB, allocs estimate: 35.
```

`Tullio` is within a factor of two of the standard `*` (which is actually great).
For more complicated contractions, `Tullio` still works, but the gradient computation becomes around 15 times slower than the forward pass.
So, my question is: is this currently the best we can do in `Julia` to quickly contract arrays and obtain gradients? Is there an option I am missing, or am I doing something wrong? Should I implement custom rules? (For the weird tensor contraction they should be straightforward to derive, especially following the guide in `ChainRules`.)

This is used within `Turing.jl`: I want to do HMC, so quick gradients are a requirement for good computational performance.
(Tagging @Elrod @mcabbott but insights from anyone are very welcome)
Marco


Do you have some examples of your complicated contractions? If they are allowed by TensorOperations.jl, then you should also try that. It now has built-in support for taking gradients.

Otherwise your best bet is probably to write gradient rules, which themselves call Tullio. The macro can derive the expressions you need:

```julia
julia> @tullio C[i,k] := A[i,j] * B[j,k] verbose=1
│   inbody =
│    2-element Vector{Any}:
│     :(𝛥A[i, j] = 𝛥A[i, j] + 𝛥ℛ[i, k] * conj(B[j, k]))
│     :(𝛥B[j, k] = 𝛥B[j, k] + 𝛥ℛ[i, k] * conj(A[i, j]))
```

Then your `rrule` will contain something like `back(dC) = (NoTangent(), @tullio(dA[i, j] := dC[i, k] * conj(B[j, k])), @tullio(dB[j, k] := dC[i, k] * conj(A[i, j])))`.
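Spelled out, such a rule might look like this (a sketch for the real-valued case, assuming `mul` as defined above; untested):

```julia
using ChainRulesCore, Tullio, LoopVectorization

mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

function ChainRulesCore.rrule(::typeof(mul), A, B)
    C = mul(A, B)
    function mul_pullback(dC)
        # Each gradient gets its own loop nest, so each can be threaded freely.
        @tullio dA[i,j] := dC[i,k] * conj(B[j,k])
        @tullio dB[j,k] := dC[i,k] * conj(A[i,j])
        return NoTangent(), dA, dB
    end
    return C, mul_pullback
end
```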

This is not how Tullio's gradients work. Instead, like the forward pass, it always writes one loop nest, writing into `𝛥A` and `𝛥B` simultaneously. This limits parallelism (it can safely multi-thread only over `j`) and seems to play less well with LoopVectorization.jl. Unfortunately, changing this would be quite messy, and I'm unlikely to get around to it.


Hi @mcabbott ,
So, a more complicated contraction is, for instance, the following (I am also trying the package you suggested):

```julia
using TensorOperations

function tullio_conv(W, v)
    return @tullio C[i,k] := W[i,j,k,l] * v[j,l]
end

function tensor_conv(W, v)
    return @tensor C[i,k] := W[i,j,k,l] * v[j,l]
end
```

```julia
W = rand(2, 2, 37, 1400)
x = rand(2, 1400)
@benchmark sum(tullio_conv(W, x))

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  36.650 μs … 147.037 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.334 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.074 μs ±   2.253 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  36.6 μs         Histogram: frequency by time         45.3 μs <

 Memory estimate: 688 bytes, allocs estimate: 2.

@benchmark sum(tensor_conv(W, x))

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  141.516 μs …   3.715 ms  ┊ GC (min … max):  0.00% … 92.18%
 Time  (median):     203.635 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   297.734 μs ± 433.301 μs  ┊ GC (mean ± σ):  26.07% ± 15.99%

  [histogram omitted]
  142 μs        Histogram: log(frequency) by time       2.61 ms <

 Memory estimate: 1.59 MiB, allocs estimate: 137.
```

Regarding AD, only `Tullio` seems to work:

```julia
@benchmark gradient(W -> sum(tullio_conv(W, x)), W)

BenchmarkTools.Trial: 8735 samples with 1 evaluation.
 Range (min … max):  406.598 μs …   4.155 ms  ┊ GC (min … max):  0.00% … 86.13%
 Time  (median):     467.586 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   570.568 μs ± 404.231 μs  ┊ GC (mean ± σ):  12.25% ± 13.93%

  [histogram omitted]
  407 μs        Histogram: log(frequency) by time       2.65 ms <

 Memory estimate: 1.61 MiB, allocs estimate: 59.
```
Error stacktrace for TensorOperations

MethodError: no method matching StridedViews.StridedView(::FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}})

Closest candidates are:
StridedViews.StridedView(::PermutedDimsArray{T, N, P}) where {T, N, P}
@ StridedViews ~/.julia/packages/StridedViews/dcnHM/src/stridedview.jl:51
StridedViews.StridedView(::Base.ReshapedArray)
@ StridedViews ~/.julia/packages/StridedViews/dcnHM/src/stridedview.jl:50
StridedViews.StridedView(::SubArray)
@ StridedViews ~/.julia/packages/StridedViews/dcnHM/src/stridedview.jl:49
β¦

Stacktrace:
[1] tensorcontract!(C::Array{Float64, 4}, pC::Tuple{NTuple{4, Int64}, Tuple{}}, A::FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, pA::Tuple{Tuple{Int64, Int64}, Tuple{}}, conjA::Symbol, B::Matrix{Float64}, pB::Tuple{Tuple{}, Tuple{Int64, Int64}}, conjB::Symbol, α::VectorInterface.One, β::VectorInterface.Zero, #unused#::TensorOperations.Backend{:StridedBLAS})
@ TensorOperations ~/.julia/packages/TensorOperations/7VyQe/src/implementation/abstractarray.jl:63
[2] tensorcontract!(C::Array{Float64, 4}, pC::Tuple{NTuple{4, Int64}, Tuple{}}, A::FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, pA::Tuple{Tuple{Int64, Int64}, Tuple{}}, conjA::Symbol, B::Matrix{Float64}, pB::Tuple{Tuple{}, Tuple{Int64, Int64}}, conjB::Symbol, α::VectorInterface.One, β::VectorInterface.Zero)
@ TensorOperations ~/.julia/packages/TensorOperations/7VyQe/src/implementation/abstractarray.jl:35
[3] (::TensorOperationsChainRulesCoreExt.var"#58#65"{Tuple{Tuple{Int64, Int64}, Tuple{}}, FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, Array{Float64, 4}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Symbol, Matrix{Float64}, Tuple{Tuple{Int64, Int64}, Tuple{}}, Symbol, VectorInterface.One, Tuple{}, ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ProjectTo{Float64, NamedTuple{(), Tuple{}}}, NTuple{4, Base.OneTo{Int64}}}}}})()
@ TensorOperationsChainRulesCoreExt ~/.julia/packages/TensorOperations/7VyQe/ext/TensorOperationsChainRulesCoreExt.jl:99
[4] unthunk
@ ~/.julia/packages/ChainRulesCore/7MWx2/src/tangent_types/thunks.jl:204 [inlined]
[5] wrap_chainrules_output
@ ~/.julia/packages/Zygote/YYT6v/src/compiler/chainrules.jl:110 [inlined]
[6] map (repeats 4 times)
@ ./tuple.jl:276 [inlined]
[7] wrap_chainrules_output
@ ~/.julia/packages/Zygote/YYT6v/src/compiler/chainrules.jl:111 [inlined]
[8] ZBack
@ ~/.julia/packages/Zygote/YYT6v/src/compiler/chainrules.jl:211 [inlined]
[9] Pullback
@ ./In[35]:25 [inlined]
[10] (::Zygote.Pullback{Tuple{typeof(tensor_conv), Array{Float64, 4}, Matrix{Float64}}, …})(Δ::FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}})
@ Zygote ~/.julia/packages/Zygote/YYT6v/src/compiler/interface2.jl:0
[11] Pullback
@ ./In[38]:1 [inlined]
[12] (::Zygote.Pullback{Tuple{var"#47#48", Array{Float64, 4}}, …})(Δ::Float64)
@ Zygote ~/.julia/packages/Zygote/YYT6v/src/compiler/interface2.jl:0
[13] (::Zygote.var"#75#76"{…})(Δ::Float64)
@ Zygote ~/.julia/packages/Zygote/YYT6v/src/compiler/interface.jl:45
@ Zygote ~/.julia/packages/Zygote/YYT6v/src/compiler/interface.jl:97
[15] var"##core#1252"()
@ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:489
[16] var"##sample#1253"(::Tuple{}, __params::BenchmarkTools.Parameters)
@ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:495
[17] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:99
[18] #invokelatest#2
@ ./essentials.jl:821 [inlined]
[19] invokelatest
@ ./essentials.jl:816 [inlined]
[20] #run_result#45
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:34 [inlined]
[21] run_result
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:34 [inlined]
[22] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:117
[23] run (repeats 2 times)
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:117 [inlined]
[24] #warmup#54
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:169 [inlined]
[25] warmup(item::BenchmarkTools.Benchmark)
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:168
[26] top-level scope
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:393

From your message, I understand that I am not doing anything wrong. Is this right, or am I missing something?

The error might be avoided by using `sum(abs2, ...)` instead of `sum`.

Zygote's rule for `sum` uses FillArrays, which this PR may remove. It's an optimisation to save a small allocation, and it causes all kinds of issues (normally in toy problems like this). Here `@tensor` doesn't understand this kind of lazy array; see e.g. this question.


Thanks, that fixed it.
It is still slower than `Tullio`'s gradient, though.

```julia
@benchmark gradient(W -> sum(abs2, tullio_conv(W, x)), W)

BenchmarkTools.Trial: 3244 samples with 1 evaluation.
 Range (min … max):  966.187 μs …  17.276 ms  ┊ GC (min … max):  0.00% …  0.00%
 Time  (median):       1.160 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.538 ms ± 903.611 μs  ┊ GC (mean ± σ):  21.29% ± 22.59%

  [histogram omitted]
  966 μs        Histogram: log(frequency) by time       3.88 ms <

 Memory estimate: 7.99 MiB, allocs estimate: 624.
```

Regarding the performance, is there anything (in principle) that might be done to improve it?
If the problem is "just" with parallelization, I might still be happy (if it doesn't parallelize, I can simply run more chains in parallel), but I would like to understand whether there is anything I can do better.

Ok. `@tensor` needs `permutedims` here, which is quite expensive. Things you might try are (1) writing `rrule`s for Tullio, and, perhaps more importantly, (2) re-ordering array indices to minimise permutations.
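For point (2), one possible reordering for the contraction above (a hypothetical sketch; whether it helps depends on the sizes and should be benchmarked):

```julia
using TensorOperations

# Group the open indices (i,k) and the contracted indices (j,l) contiguously,
# so @tensor can reshape W into a matrix and hit BLAS without an expensive
# permutedims on every call.
Wp = permutedims(W, (1, 3, 2, 4))   # W[i,j,k,l] -> Wp[i,k,j,l], done once
tensor_conv2(Wp, v) = @tensor C[i,k] := Wp[i,k,j,l] * v[j,l]
```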

Answering later with more precise benchmarks.

1. Array index reordering puts `TensorOperations.jl` at the same level as `Tullio` for the last example.
2. Adding `Zygote.@adjoint` rules improves performance.

For the first scenario (the standard matrix multiplication) I get a considerable speedup.
Without adding the rule, I have

```julia
@benchmark gradient(W -> sum(abs2, mul(W,x)), W)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   73.643 μs …  16.139 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     161.584 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   175.547 μs ± 237.904 μs  ┊ GC (mean ± σ):  7.80% ± 7.28%

  [histogram omitted]
  73.6 μs          Histogram: frequency by time          245 μs <

 Memory estimate: 320.59 KiB, allocs estimate: 160.
```

With the rule, I get

```julia
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  22.003 μs …   2.901 ms  ┊ GC (min … max):  0.00% … 94.00%
 Time  (median):     52.789 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   59.898 μs ± 122.754 μs  ┊ GC (mean ± σ):  11.28% ±  5.44%

  [histogram omitted]
  22 μs           Histogram: frequency by time          102 μs <

 Memory estimate: 160.44 KiB, allocs estimate: 80.
```

I will do the same later for the weirder contraction.
So, the lessons learnt are:

1. Remember the performance-tips section of the Julia docs (array index ordering matters).
2. Write custom rules to boost performance.

Will update later with the more complex scenario.
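For reference, the kind of rule meant in point 2 can be sketched like this (real-valued case; untested, and it overrides the gradient Tullio would otherwise derive):

```julia
using Zygote, Tullio, LoopVectorization

mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

# Each gradient gets its own Tullio expression, hence its own loop nest.
Zygote.@adjoint function mul(A, B)
    C = mul(A, B)
    back(dC) = (@tullio(dA[i,j] := dC[i,k] * B[j,k]),
                @tullio(dB[j,k] := dC[i,k] * A[i,j]))
    return C, back
end
```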


Itβs not productive/human time efficient to need to implement all these rules, but it is a fairly reliable way to get good performance.
SimpleChains.jl does this to hit its performance targets.

LoopVectorization.jl should automatically split those into separate loops. If you have an example where it fails to do that, you could share it.
Although I'm unlikely to get around to addressing it, I can answer questions and point someone toward how to figure out what is going on, why, and how to fix it.