Collecting homogenized vectors: modify iterator vs modify collector

Nemo is a computer algebra package for the Julia programming language. For a multivariate polynomial p there is an iterator exponent_vectors(p) whose values are the exponent vectors of the terms of p (in some preset monomial order).
I want to collect all the exponents of a huge polynomial in the rows of a matrix and add a first column of ones (this is to compute the Newton polytope of p in Polymake.jl).
I tried two approaches. The first is to create a new iterator whose values are the exponent vectors of p with an extra first component set to one, and then collect this new iterator. The second is to create a matrix of ones of the right size and write the exponent vectors of p into all but its first row (transposing at the end).
I observed a big performance difference: the second is roughly two times faster than the first.
I wonder whether this is to be expected, or whether the problem is my naive implementation of the first approach, which, I think, is more intricate.
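For concreteness, here is a small sketch of the target layout (the toy exponent vectors and the pushfirst!-based one-liner are illustrative, not Nemo's exponent_vectors API):

```julia
# Toy exponent vectors; each inner vector holds one term's exponents.
exps = [[2, 0], [0, 3]]

# Prepend a 1 to each vector, then stack them as the rows of a matrix.
A = transpose(reduce(hcat, [pushfirst!(copy(e), 1) for e in exps]))
# A == [1 2 0; 1 0 3]
```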

struct Homogenize{I}
    itr::I
end

homogenize(iter) = Homogenize(iter)

Base.length(h::Homogenize) = Base.length(h.itr)
Base.size(h::Homogenize) = Base.size(h.itr)
Base.eltype(::Type{Homogenize{I}}) where {I} = eltype(I)
Base.IteratorSize(::Type{Homogenize{I}}) where {I} = Base.IteratorSize(I)
Base.IteratorEltype(::Type{Homogenize{I}}) where {I} = Base.IteratorEltype(I)

Base.@propagate_inbounds function Base.iterate(h::Homogenize, state)
    n = iterate(h.itr, state)
    n === nothing && return n
    return [oneunit(Base.eltype(n[1])); n[1]], n[2]
end

Base.@propagate_inbounds function Base.iterate(h::Homogenize)
    n = iterate(h.itr)
    n === nothing && return n
    return [oneunit(Base.eltype(n[1])); n[1]], n[2]
end
function collect_homoiter(iter)
    ## First iteration outside the loop, to learn the element type and size
    (u, state) = iterate(iter)
    ## Preallocate v with a first row of ones
    v = ones(eltype(u), size(u, 1) + 1, length(iter))
    # v = Matrix{eltype(u)}(undef, size(u, 1) + 1, length(iter))
    # v[1, :] .= one(eltype(u))
    v[2:end, 1] = u
    i = 1
    next = iterate(iter, state)
    ## Fill the remaining columns, keeping an explicit column counter
    ## (the iterator state is not, in general, a valid column index)
    while next !== nothing
        (u, state) = next
        i += 1
        v[2:end, i] = u
        next = iterate(iter, state)
    end
    return transpose(v)
end
julia> using BenchmarkTools

julia> M = rand(100,1000);

julia> @benchmark transpose(reduce(hcat,collect(homogenize(eachrow(M)))))

BenchmarkTools.Trial: 
  memory estimate:  1.63 MiB
  allocs estimate:  2404
  --------------
  minimum time:     554.539 μs (0.00% GC)
  median time:      590.871 μs (0.00% GC)
  mean time:        668.858 μs (7.42% GC)
  maximum time:     3.596 ms (73.20% GC)
  --------------
  samples:          7422
  evals/sample:     1

julia> @benchmark collect_homoiter(eachrow(M))
BenchmarkTools.Trial: 
  memory estimate:  786.92 KiB
  allocs estimate:  106
  --------------
  minimum time:     292.677 μs (0.00% GC)
  median time:      317.791 μs (0.00% GC)
  mean time:        338.971 μs (4.16% GC)
  maximum time:     1.498 ms (63.73% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> transpose(reduce(hcat,collect(homogenize(eachcol(M))))) == collect_homoiter(eachcol(M))
true

I think the problem with the first implementation is that you are allocating too many temporary arrays in the iterate function. One option would be to do

Base.@propagate_inbounds function Base.iterate(h::Homogenize, state)
    n = iterate(h.itr, state...)
    n === nothing && return n
    return pushfirst!(Vector(n[1]), oneunit(Base.eltype(n[1]))), n[2]
end

This does not work with your toy example (since one cannot push into the slices of a matrix), but I think in Nemo the exponent vectors are proper arrays, so this would work.
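To illustrate the point about slices (a minimal sketch, separate from Nemo): pushfirst! is not defined on a view, but it works after copying the slice into a fresh Vector:

```julia
M = [1 2; 3 4]
r = view(M, 1, :)    # a slice of a matrix: fixed size, cannot grow
# pushfirst!(r, 1) would throw a MethodError
v = pushfirst!(Vector(r), oneunit(eltype(r)))  # copy first, then prepend
# v == [1, 1, 2]
```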

(But I don’t think it will beat the second implementation. There is nothing faster than preallocating and then filling, although reduce(hcat, ...) is probably quite optimized in Base.)

Thank you so much. There is indeed some improvement. I adapted the toy example to pushfirst!; here are the new times. An aside question now: why is the memory estimate bigger for pushfirst!?

julia> N=100; M = rand(N,10*N); C=[M[i,:] for i in 1:N];

julia> @benchmark collect_homoiter(C)
BenchmarkTools.Trial: 
  memory estimate:  782.19 KiB
  allocs estimate:  4
  --------------
  minimum time:     178.375 μs (0.00% GC)
  median time:      195.195 μs (0.00% GC)
  mean time:        213.096 μs (4.10% GC)
  maximum time:     982.328 μs (78.91% GC)
  --------------
  samples:          10000
  evals/sample:     1

## Using pushfirst!
julia> @benchmark transpose(reduce(hcat,collect(homogenize(C))))
BenchmarkTools.Trial: 
  memory estimate:  3.08 MiB
  allocs estimate:  603
  --------------
  minimum time:     316.035 μs (0.00% GC)
  median time:      354.262 μs (0.00% GC)
  mean time:        446.933 μs (15.52% GC)
  maximum time:     1.969 ms (61.69% GC)
  --------------
  samples:          10000
  evals/sample:     1

## Using [oneunit(Base.eltype(n[1])); n[1]] 
julia> @benchmark transpose(reduce(hcat,collect(homogenize(C))))
BenchmarkTools.Trial: 
  memory estimate:  1.62 MiB
  allocs estimate:  2203
  --------------
  minimum time:     369.206 μs (0.00% GC)
  median time:      398.459 μs (0.00% GC)
  mean time:        465.969 μs (9.88% GC)
  maximum time:     3.922 ms (78.63% GC)
  --------------
  samples:          10000
  evals/sample:     1

I am just guessing here, but pushfirst! calls something called growbeg, which in turn calls the C function here: https://github.com/JuliaLang/julia/blob/0a77e07c4d964bcec172bde2918097111814a07e/src/array.c#L974. I can imagine that it allocates more space than just for one element (similar to how push! will make room for more than one element).
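As a small sanity check (a sketch, not a claim about growbeg internals): both spellings give the same result, and the difference is only in how the memory is obtained:

```julia
v = collect(1:5)
w = pushfirst!(copy(v), 0)   # mutate a copy in place; may reserve extra front capacity
u = vcat(0, v)               # allocate a fresh array of exactly the right size
# w == u == [0, 1, 2, 3, 4, 5]
```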
