Efficient way to chain a series of `diff` calls

My bad, I posted a buggy version of these functions; I have amended the previous post.

The results below refer to the up-to-date functions.

Your `ndiff3` shows a marginal improvement under the same tests:

julia> @btime ndiff3(A, 10);
  44.090 μs (18 allocations: 686.45 KiB)

However, as mentioned, I would prefer not to add further dependencies, unless strictly necessary.

I have also tried a different option from that link (edited to support keyword arguments):

ndiff4(n) = ∘(ntuple(_ -> diff, n)...);

This was supposed to be faster for small `n`, but the results are still of the same order of magnitude:

julia> @btime ndiff4(10)(A, dims=2);
  47.632 μs (28 allocations: 761.03 KiB)
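
One thing to keep in mind with the composition approach: `∘` hands keyword arguments only to the innermost function, so a variant that applies `dims` at every stage needs a closure around `diff`. A minimal sketch of that idea follows; the name `ndiff_kw` and the example matrix are made up for illustration and are not the functions or the `A` benchmarked above.

```julia
# Hypothetical helper (not from the earlier posts): compose n copies of a
# closure that forwards the same keyword arguments (e.g. dims) to every
# diff call, instead of relying on ∘ to pass them along.
ndiff_kw(n; kw...) = ∘(ntuple(_ -> (x -> diff(x; kw...)), n)...)

A_demo = rand(20, 30)                 # example input, chosen arbitrarily
B_demo = ndiff_kw(3; dims=2)(A_demo)  # third-order difference along dimension 2

# Reference result: three explicit diff calls along the same dimension.
B_ref = diff(diff(diff(A_demo; dims=2); dims=2); dims=2)
```

This uses only Base, so it adds no dependencies. I have not benchmarked it against the versions above.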