Maybe I do the measurements wrong, but I’m surprised that partition performs so badly in this case.

using IterTools, BenchmarkTools
a=rand(1:10,10^5)
function looppart(a)
res=zeros(Int,20)
for (a1,a2) in partition(a,2,1)
res[a1+a2]+=1
end
res
end
function loopzip(a)
res=zeros(Int,20)
for (a1,a2) in zip((@view a[1:end-1]), @view a[2:end])
res[a1+a2]+=1
end
res
end
julia> @btime looppart(a);
16.412 ms (399998 allocations: 21.36 MiB)
julia> @btime loopzip(a);
71.300 μs (1 allocation: 224 bytes)

How do you upload the version with the optimized function, to do this test?
If it’s not too complicated, what are the changes made to make the function more efficient?
My question comes from trying to solve this problem

here are the new measures after loading the #master

julia> a=rand(1:10,10^5);
julia> function looppart(a)
res=zeros(Int,20)
for (a1,a2) in partition(a,2,1)
res[a1+a2]+=1
end
res
end
looppart (generic function with 1 method)
julia> function loopzip(a)
res=zeros(Int,20)
for (a1,a2) in zip((@view a[1:end-1]), @view a[2:end])
res[a1+a2]+=1
end
res
end
loopzip (generic function with 1 method)
julia> function loopzip1(a)
res=zeros(Int,20)
for (a1,a2) in zip(a, @view a[2:end])
res[a1+a2]+=1
end
res
end
loopzip1 (generic function with 1 method)
julia> function loopzip2(a)
res=zeros(Int,20)
for i in 1:10^5-1
res[a[i]+a[i+1]]+=1
end
res
end
loopzip2 (generic function with 1 method)
julia> @btime looppart(a);
64.000 μs (1 allocation: 224 bytes)
julia> @btime loopzip(a);
70.500 μs (1 allocation: 224 bytes)
julia> @btime loopzip1(a);
60.500 μs (1 allocation: 224 bytes)
julia> @btime loopzip2(a);
76.400 μs (1 allocation: 224 bytes)

I’m afraid it’s too difficult for me, especially if done in a very general/abstract way
However, I would like to understand at least the principle on which the improvements are based

Another way for this specific problem (not general partition iteration) is:

julia> function looppeek(a)
res=zeros(Int,20)
itr = Iterators.Stateful(a)
@inbounds for v1 in itr
isnothing(peek(itr)) && break
res[v1+peek(itr)]+=1
end
res
end
looppeek (generic function with 1 method)
julia> @btime looppeek($a);

It achieves the fast performance with only one allocation.

Looked a bit into the code, and it uses @generated functions to produce a function specific for the number of elements in the returned tuples (2 in this discussion). Generating code with known length of partitions allows the use of tuples (which must have compile-time known lengths and types) and avoids the allocations. Elegant code, which is worth looking at.

in this specific case, a simple for loop assisted by the @inbounds macro seems to do better than the others

a=rand(1:10,10^5)
julia> function looppeek(a)
res=zeros(Int,20)
itr = Iterators.Stateful(a)
@inbounds for v1 in itr
isnothing(peek(itr)) && break
res[v1+peek(itr)]+=1
end
res
end
looppeek (generic function with 1 method)
julia> function loopindex(a)
res=zeros(Int,20)
@inbounds for i in 1:10^5-1
res[a[i]+a[i+1]]+=1
end
res
end
loopindex (generic function with 1 method)
julia> @btime looppeek($a);
60.500 μs (1 allocation: 224 bytes)
julia> @btime loopzip(a);
71.100 μs (1 allocation: 224 bytes)
julia> @btime loopzip1(a);
60.400 μs (1 allocation: 224 bytes)
julia> @btime loopindex(a);
42.400 μs (1 allocation: 224 bytes)