On a different machine, I tried Julia 1.5.2 and got very good runtimes with 3 residual blocks:
julia> include("/home/ulg/gher/abarth/Julia/share/test_zygote_perf.jl")
# forward pass (1st and 2nd call)
 19.787238 seconds (51.32 M allocations: 2.547 GiB, 4.63% gc time)
  0.492617 seconds (439.07 k allocations: 21.413 MiB)
# backward pass (1st and 2nd call)
 28.111849 seconds (51.57 M allocations: 2.611 GiB, 3.74% gc time)
  0.069456 seconds (50.17 k allocations: 2.577 MiB)
That would be a factor of ~50 000 between Julia 1.7.0-rc1/1.6.1 and Julia 1.5.2 for the 2nd call of the backward pass!
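For anyone who wants to reproduce the "1st and 2nd call" comparison, here is a minimal sketch of how such timings can be collected. It is not the content of test_zygote_perf.jl; `time_twice` and `work` are hypothetical names, and `work` is only a stand-in for the model's forward pass or gradient call. It uses Base Julia only, so it should run on any of the versions mentioned.

```julia
# Hypothetical helper (not from the original script): time the same call twice
# so that the 1st call (which includes JIT compilation) and the 2nd call
# (which reuses the compiled code) are reported separately.
function time_twice(f)
    first_call = @elapsed f()    # includes compilation of f and its callees
    second_call = @elapsed f()   # compiled code is reused; mostly pure runtime
    return (first_call = first_call, second_call = second_call)
end

# Stand-in workload (an assumption); replace it with the forward pass
# or the gradient computation you want to measure.
work() = sum(abs2, rand(1000))

t = time_twice(work)
println("1st call: ", t.first_call, " s, 2nd call: ", t.second_call, " s")
```

The 2nd-call number is the one to compare across Julia versions, since the 1st call is dominated by compilation.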
(Flux-0.12) pkg> st --manifest
Status `/home/users/a/b/abarth/.julia/environments/Flux-0.12/Manifest.toml`
  [621f4979] AbstractFFTs v1.0.1
  [1520ce14] AbstractTrees v0.3.4
  [79e6a3ab] Adapt v3.3.1
  [56f22d72] Artifacts v1.3.0
  [ab4f0b2a] BFloat16s v0.1.0
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v2.4.3
  [082447d4] ChainRules v0.7.70
  [d360d2e6] ChainRulesCore v0.9.45
  [944b1d66] CodecZlib v0.7.0
  [3da002f7] ColorTypes v0.11.0
  [5ae59095] Colors v0.12.8
  [bbf7d656] CommonSubexpressions v0.3.0
  [34da2185] Compat v3.37.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.4+0
  [9a962f9c] DataAPI v1.9.0
  [864edb3b] DataStructures v0.18.10
  [163ba53b] DiffResults v1.0.3
  [b552c78f] DiffRules v1.3.1
  [ffbed154] DocStringExtensions v0.8.5
  [e2ba6199] ExprTools v0.1.6
  [1a297f60] FillArrays v0.11.9
  [53c48c17] FixedPointNumbers v0.8.4
  [587475ba] Flux v0.12.1
  [f6369f11] ForwardDiff v0.10.19
  [d9f16b24] Functors v0.2.5
  [0c68f7d7] GPUArrays v6.4.1
  [61eb1bfa] GPUCompiler v0.8.3
  [7869d1d1] IRTools v0.4.3
  [92d709cd] IrrationalConstants v0.1.0
  [692b3bcd] JLLWrappers v1.3.0
  [e5e0dc1b] Juno v0.8.4
  [929cbde3] LLVM v3.9.0
  [2ab3a3ac] LogExpFunctions v0.3.0
  [1914dd2f] MacroTools v0.5.8
  [e89f7d12] Media v0.5.0
  [e1d29d7a] Missings v1.0.2
  [872c559c] NNlib v0.7.19
  [77ba4419] NaNMath v0.3.5
  [05823500] OpenLibm_jll v0.7.1+0
  [efe28fd5] OpenSpecFun_jll v0.5.3+4
  [bac558e1] OrderedCollections v1.4.1
  [21216c6a] Preferences v1.2.2
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.1.3
  [6c6a2e73] Scratch v1.1.0
  [a2af1166] SortingAlgorithms v1.0.1
  [276daf66] SpecialFunctions v1.6.2
  [90137ffa] StaticArrays v1.2.12
  [82ae8749] StatsAPI v1.0.0
  [2913bbd2] StatsBase v0.33.10
  [fa267f1f] TOML v1.0.3
  [a759f4b9] TimerOutputs v0.5.12
  [3bb67fe8] TranscodingStreams v0.9.6
  [a5390f91] ZipFile v0.9.4
  [83775a58] Zlib_jll v1.2.11+18
  [e88e6eb3] Zygote v0.6.12
  [700de1a5] ZygoteRules v0.2.1
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8bb1440f] DelimitedFiles
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [44cfe95a] Pkg
  [de0858da] Printf
  [9abbd945] Profile
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_REVISE_POLL = 1