My bad, I posted a buggy version of these functions; I have since amended the previous post.
The results below refer to the up-to-date functions.
Your ndiff3 shows a marginal improvement under the same tests:
julia> @btime ndiff3(A, 10);
44.090 μs (18 allocations: 686.45 KiB)
However, as mentioned, I would prefer not to add further dependencies, unless strictly necessary.
I have also tried a different option reported in that link (appropriately edited to support keyword arguments):
ndiff4(n) = ∘(ntuple(_ -> diff, n)...);
This was supposed to be faster for small n, but the timings are still of the same order of magnitude:
julia> @btime ndiff4(10)(A, dims=2);
47.632 μs (28 allocations: 761.03 KiB)
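One thing worth checking with the composed version: per the Base documentation, (f ∘ g)(args...; kwargs...) means f(g(args...; kwargs...)), so keyword arguments only reach the innermost function. If dims has to reach every diff in the chain, one option is to capture it in a closure. The following is only a minimal sketch under that assumption; the name ndiff_dims and the test matrix are made up for illustration and are not from the link or from the earlier posts.

```julia
# Hypothetical helper (not from the thread): every stage closes over `dims`,
# so each diff in the composed chain receives the keyword explicitly.
# Assumes n >= 1, since ∘ needs at least one function.
ndiff_dims(n; dims) = ∘(ntuple(_ -> (x -> diff(x; dims=dims)), n)...)

A = rand(4, 12)                # made-up test matrix
B = ndiff_dims(3; dims=2)(A)   # third-order difference along the second dimension
size(B)                        # (4, 9): each diff drops one column
```

It can be benchmarked the same way as the others, e.g. `@btime ndiff_dims(10; dims=2)($A)`; I would not expect the closures by themselves to change the order of magnitude.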