Push!, and interfacing to the runtime library

performance

#1

When looking over my `@code_native` output, I recently saw that every push! costs me a call into the runtime library. This appears excessive, at least in the standard case where the push can be accommodated without copying.
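The call is visible in the generated IR as well (a minimal sketch; `jl_array_grow_end` is, to my understanding, the runtime routine that push! ccalls in the C implementation, but the exact name in the output may differ by version):

```julia
using InteractiveUtils  # provides @code_llvm on Julia >= 0.7

A = Int[]
# prints the LLVM IR for this push! method; it contains a call into
# libjulia's array-growing routine even when the existing buffer
# already has spare capacity
@code_llvm push!(A, 1)
```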

So, let's compare push! and a hand-rolled fastpush!.

First, let's look at what a vector actually is:

struct array_T
    data::Ptr{Void}
    length::UInt64
    flags::UInt16
    elsize::UInt16
    offset::UInt32
    nrows::UInt64
    ncols_maxsz::UInt64
    stuff1::UInt64
    stuff2::UInt64
    stuff3::UInt64
end

This gives us the fastpush:

@inline function fastpush!(A::Vector{T}, val::T) where T
    if isbits(T)
        arr = unsafe_load(convert(Ptr{array_T}, pointer_from_objref(A)))
        if arr.length + arr.offset < arr.ncols_maxsz
            # fast path: bump length (UInt64 slot 2) and nrows (slot 4)
            # in the header directly, then write the element
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length + 1, 2)
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length + 1, 4)
            @inbounds A[arr.length + 1] = val
        else
            push!(A, val)  # slow path: buffer must grow
        end
    else
        push!(A, val)  # non-bits element types need GC rooting
    end
    A
end

And let's run a race:

function pushtest(A,n)
    resize!(A, 0);
    for i = 1:n push!(A,i) end
    nothing
end

function fastpushtest(A,n)
    resize!(A, 0)
    for i = 1:n fastpush!(A,i) end
    nothing
end

function settest(A,n)
    resize!(A, n)
    for i = 1:n  A[i]=i end
    nothing
end

yielding:

Nm = 100_000_000; A = Vector{Int}(Nm); resize!(A, 0);
#after warmup:

@time settest(A,Nm); @time settest(A,Nm);
  0.356040 seconds (4 allocations: 160 bytes)
  0.326640 seconds (4 allocations: 160 bytes)

@time pushtest(A,Nm); @time pushtest(A,Nm);
  0.899436 seconds (4 allocations: 160 bytes)
  0.895555 seconds (4 allocations: 160 bytes)

@time fastpushtest(A,Nm); @time fastpushtest(A,Nm);
  0.577715 seconds (4 allocations: 160 bytes)
  0.606671 seconds (4 allocations: 160 bytes)

It would perhaps be nicer if it were somehow possible to make LLVM aware of Julia's runtime library. For example: compile it to LLVM IR (using clang or dragonegg?), call into a nice library of LLVM code that includes optimization-relevant metadata, and possibly inline current ccall-ed functions.

Not sure whether this performance problem is worth trying to fix (or how much more expensive the call becomes if we lose important registers across it). It is easy to avoid, though, by implementing the checks by hand (for tight loops that need to push a lot). In a perfect world, the optimizer would reduce unrolled pushing loops to a single check. In my applications I do enough other work between push!/pop! calls that this overhead ends up being benign.
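The "checks by hand" workaround can be sketched as follows (a hypothetical `bulkpush!` helper, not from the post above; it assumes a bits element type and that the number of pushes is known up front):

```julia
# Hypothetical sketch: pay for one runtime call (resize!) instead of n
# push! calls, then fill the new slots with plain @inbounds stores.
function bulkpush!(A::Vector{Int}, n::Int)
    len = length(A)
    resize!(A, len + n)      # single runtime call grows the buffer once
    @inbounds for i in 1:n
        A[len + i] = i       # no ccall, no capacity check per element
    end
    A
end
```

This only covers the case where the final length is predictable; for genuinely data-dependent pushing, the per-iteration check remains.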


#2

Nice analysis. There’s been a certain amount of discussion about implementing Array in pure-Julia code, and that would avoid the problem you’re noting here. It almost certainly won’t happen for 1.0, in part because it introduces difficult bootstrap issues (you need arrays to run inference, but you have to infer the array code…). But it seems possible that your optimization might be acceptable as a PR, if you’d consider submitting it.


#3

I think using this code would be a terrible idea, in view of yuyichao’s comments in https://discourse.julialang.org/t/why-so-many-internals-in-c/6119, and in view of https://discourse.julialang.org/t/unsafe-store-sometimes-slower-than-arrays-setindex/7166.

That is, my example is just a proof of concept showing that a speedup of at least 30% for push! is available if we manage to inline the fast path (avoiding the ccall, and avoiding the array-flag checks because type inference already tells us everything). The same should apply to pop!.

We gain even more if we acknowledge that Vector and higher-dimensional Arrays are different: Vectors no longer need to duplicate length as nrows, and tiny vectors (up to 4*sizeof(Ptr)) can be stored inline, where they fit into the same cache line as the header. That avoids a miss when iterating over a large number of Vectors, many of which are small; this is actually something I do a lot: variable-arity trees where most nodes have very few or zero children. Alternatively, one could use the shrunk vector_T to store two headers per cache line (but I think inline storage gives a bigger win).

Higher-dimensional arrays free the offset field for an additional 32 bits of information. Not sure what to do with that; maybe store ndims, so that nospecialize on the number of dimensions becomes possible? Maybe make it user-available for locks?

Unfortunately I don’t feel comfortable enough with the C base to even attempt to separate array and vector or do a clean improvement.

I think the most pragmatic way to get a fastpush would be to copy the approach used for setindex! with non-bits types and for length in grow_at_end and shrink_at_end: that is, a special case in code generation emits the fast path for inlining, and if that fails we fall back to a library call that handles rooting of the stored object.

PS: this also needs to be different on 32-bit systems.