# Strange gradient behavior related to function applications

I ran into a baffling problem.

I am using a sparse matrix `sm` as the weight matrix of a Dense layer.

It is a 2000×2000 sparse matrix with 100,000 non-zero Float32 entries.
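A matrix with these statistics can be generated with `sprand` (the density is chosen so the expected non-zero count is 100,000; this is a stand-in for reproducing the experiment, not my exact matrix):

```julia
using SparseArrays

# expected 100_000 non-zeros out of 2000^2 entries (density = 0.025)
sm = sprand(Float32, 2000, 2000, 100_000 / 2000^2)
```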

I created two Dense units:

```julia
using Flux

b = zeros(Float32, 2000)
dsm  = Dense(sm, b, σ)         # sparse weights, sigmoid activation
dsm1 = Dense(sm, b, identity)  # same weights, identity activation
```

I did the following gradient computation experiment:

```julia
v = rand(Float32, 2000)
```

The first benchmark (taking the gradient through `dsm`, with `σ`) yields:
```text
BenchmarkTools.Trial: 648 samples with 1 evaluation.
Range (min … max): 4.885 ms … 14.790 ms ┊ GC (min … max): 0.00% … 50.23%
Time (median): 6.471 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.710 ms ± 2.547 ms ┊ GC (mean ± σ): 23.01% ± 22.54%
```

The second one (through `dsm1`, with `identity`) yields:

```text
BenchmarkTools.Trial: 59 samples with 1 evaluation.
Range (min … max): 82.632 ms … 96.847 ms ┊ GC (min … max): 0.00% … 5.40%
Time (median): 83.272 ms ┊ GC (median): 0.00%
Time (mean ± σ): 85.369 ms ± 3.387 ms ┊ GC (mean ± σ): 1.96% ± 2.75%
```

As you can see, the second one, with the identity function, is far slower than the first.

Just to make sure that this is not due to a peculiar implementation detail of Dense,
I ran an equivalent but expanded set of benchmarks:

1. `@benchmark gradient(() -> sum(σ.(sm*v .+ b)), params(sm, b))`
2. `@benchmark gradient(() -> sum(identity.(sm*v .+ b)), params(sm, b))`
3. `@benchmark gradient(() -> sum(sm*v .+ b), params(sm, b))`
4. `@benchmark gradient(() -> sum(id.(sm*v .+ b)), params(sm, b))` (with `id(x) = x`)

Again, case 1 is about a dozen times faster than case 2.
What is really strange, however, is that case 4 is as fast as case 1, while case 3 is as slow as case 2.
Cases 2–4 are conceptually identical, yet case 4 far outperforms cases 2 and 3.
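One back-of-the-envelope observation (not an explanation of Zygote's internals, which I don't know): mathematically, the gradient of `sum(sm*v .+ b)` with respect to `sm` is the dense outer product `ones(2000) * v'`, so any pullback that materializes it naively touches all 4 million entries, even though `sm` stores only 100,000. The size gap looks like this (stdlib only; the `sprand` call is a stand-in for the actual `sm`):

```julia
using SparseArrays, LinearAlgebra

sm = sprand(Float32, 2000, 2000, 0.025)  # stand-in for the sparse weight matrix
v  = rand(Float32, 2000)

Δ = ones(Float32, 2000)  # gradient of sum with respect to its input: all ones
dense_grad = Δ * v'      # gradient of sum(sm*v .+ b) w.r.t. sm: dense 2000×2000

length(dense_grad)  # 4_000_000 entries to materialize
nnz(sm)             # only ~100_000 entries actually stored in sm
```

Whether this is what makes cases 2 and 3 slow, or why the `id.` broadcast sidesteps it, I can't say.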

The results are shown in the following:

```text
@benchmark gradient(() -> sum(identity.(sm*v .+ b)), params(sm,b))
BenchmarkTools.Trial: 59 samples with 1 evaluation.
Range (min … max): 82.749 ms … 90.370 ms ┊ GC (min … max): 0.00% … 5.81%
Time (median): 83.316 ms ┊ GC (median): 0.00%
Time (mean ± σ): 85.017 ms ± 2.540 ms ┊ GC (mean ± σ): 2.02% ± 2.86%
```

```text
@benchmark gradient(() -> sum(sm*v .+ b), params(sm,b))
BenchmarkTools.Trial: 59 samples with 1 evaluation.
Range (min … max): 82.547 ms … 94.709 ms ┊ GC (min … max): 0.00% … 5.62%
Time (median): 83.258 ms ┊ GC (median): 0.00%
Time (mean ± σ): 85.295 ms ± 3.072 ms ┊ GC (mean ± σ): 1.99% ± 2.80%
```

```text
@benchmark gradient(() -> sum(id.(sm*v .+ b)), params(sm,b))
BenchmarkTools.Trial: 629 samples with 1 evaluation.
Range (min … max): 5.031 ms … 13.410 ms ┊ GC (min … max): 0.00% … 48.90%
Time (median): 6.474 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.936 ms ± 2.449 ms ┊ GC (mean ± σ): 21.98% ± 21.90%
```

I also did it with an ordinary full (dense) matrix:

```julia
mm = Matrix(sm)
```

```text
@benchmark gradient(() -> sum(identity.(mm*v .+ b)), params(mm,b))
BenchmarkTools.Trial: 269 samples with 1 evaluation.
Range (min … max): 12.983 ms … 25.442 ms ┊ GC (min … max): 0.00% … 31.76%
Time (median): 18.970 ms ┊ GC (median): 19.15%
Time (mean ± σ): 18.618 ms ± 3.126 ms ┊ GC (mean ± σ): 18.53% ± 14.12%
```

```text
@benchmark gradient(() -> sum(mm*v .+ b), params(mm,b))
BenchmarkTools.Trial: 270 samples with 1 evaluation.
Range (min … max): 12.835 ms … 25.682 ms ┊ GC (min … max): 0.00% … 28.89%
Time (median): 19.019 ms ┊ GC (median): 19.10%
Time (mean ± σ): 18.560 ms ± 3.185 ms ┊ GC (mean ± σ): 18.67% ± 14.17%
```

```text
@benchmark gradient(() -> sum(id.(mm*v .+ b)), params(mm,b))
BenchmarkTools.Trial: 601 samples with 1 evaluation.
Range (min … max): 5.376 ms … 14.213 ms ┊ GC (min … max): 0.00% … 46.20%
Time (median): 6.843 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.299 ms ± 2.484 ms ┊ GC (mean ± σ): 21.02% ± 21.12%
```

It is not as bad as in the `sm` case, but using `id` is still 2–3× faster.

Using `@btime` yields slightly different (shorter, in my case) numbers than `@benchmark`,
but it does not change the fact that using `id` considerably speeds things up:

```text
@btime gradient(() -> sum(id.(mm*v .+ b)), params(mm,b))
5.265 ms (71 allocations: 15.32 MiB)

@btime gradient(() -> sum(identity.(mm*v .+ b)), params(mm,b))
12.886 ms (68 allocations: 30.55 MiB)

@btime gradient(() -> sum(mm*v .+ b), params(mm,b))
12.989 ms (67 allocations: 30.55 MiB)
```

I even tried `sqrt` instead of `σ` and `id`. It takes about the same time as the other two.

Finally, I converted all the data (Float32 numbers) to Float64 and reran the experiment
for the `id`, `identity`, and no-function cases, but the results were basically the same.

Is there any reason why this is the case?
It seems that the `gradient` function has some interesting quirk.