When looking over @code_native output for my code, I recently saw that every push! costs a call into the runtime library. This seems excessive, at least for the common case where the push can be accommodated without copying.
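You can see the call for yourself with a small wrapper (a minimal check; on the versions I know, the native code contains a call to jl_array_grow_end, though the exact symbol may differ):

pushwrap(A, x) = push!(A, x)
@code_native pushwrap(Int[], 1)   # look for the call into the runtime library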
So, let's compare push! and a hand-rolled fastpush!.
First, let's look at what a vector actually is:
# Roughly mirrors jl_array_t from the Julia runtime.
struct array_T
    data::Ptr{Void}        # pointer to the element buffer
    length::UInt64         # number of elements
    flags::UInt16          # allocation/ownership bits
    elsize::UInt16         # size of one element in bytes
    offset::UInt32         # elements skipped at the front of the buffer
    nrows::UInt64          # == length for a Vector
    ncols_maxsz::UInt64    # for a Vector: the allocated capacity
    stuff1::UInt64
    stuff2::UInt64
    stuff3::UInt64
end
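As a quick sanity check that this layout matches the running Julia (the same reinterpretation trick fastpush! uses below; this is an implementation detail, not an official API):

A = collect(1:10)
arr = unsafe_load(convert(Ptr{array_T}, pointer_from_objref(A)))
@assert arr.length == length(A)
@assert arr.elsize == sizeof(Int)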
This gives us fastpush!:
@inline function fastpush!(A::Vector{T}, val::T) where T
    if isbits(T)
        arr = unsafe_load(convert(Ptr{array_T}, pointer_from_objref(A)))
        if arr.length + arr.offset < arr.ncols_maxsz
            # There is room: bump the length and nrows fields in place
            # (the 2nd and 4th UInt64-sized slots of the header), then store.
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length + 1, 2)
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length + 1, 4)
            @inbounds A[arr.length + 1] = val
        else
            # Buffer is full: fall back to the runtime call, which reallocates.
            push!(A, val)
        end
    else
        # Non-bitstype elements need the write barrier; let push! handle them.
        push!(A, val)
    end
    A
end
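A quick check that it behaves like push! (a minimal sanity test, separate from the benchmark below):

B = Int[]
for i = 1:5
    fastpush!(B, i)
end
@assert B == [1, 2, 3, 4, 5]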
And let's run a race:
function pushtest(A, n)
    resize!(A, 0)
    for i = 1:n push!(A, i) end
    nothing
end
function fastpushtest(A, n)
    resize!(A, 0)
    for i = 1:n fastpush!(A, i) end
    nothing
end
function settest(A, n)
    resize!(A, n)
    for i = 1:n A[i] = i end
    nothing
end
yielding:
Nm = 100_000_000; A = Vector{Int}(Nm); resize!(A, 0);
# after warmup:
@time settest(A,Nm); @time settest(A,Nm);
0.356040 seconds (4 allocations: 160 bytes)
0.326640 seconds (4 allocations: 160 bytes)
@time pushtest(A,Nm); @time pushtest(A,Nm);
0.899436 seconds (4 allocations: 160 bytes)
0.895555 seconds (4 allocations: 160 bytes)
@time fastpushtest(A,Nm); @time fastpushtest(A,Nm);
0.577715 seconds (4 allocations: 160 bytes)
0.606671 seconds (4 allocations: 160 bytes)
It would perhaps be nicer if it were somehow possible to make LLVM aware of Julia's runtime library. For example, compile the runtime to LLVM IR (using clang or dragonegg?), call into a nice library of LLVM code that carries optimization-relevant metadata, and possibly inline what are currently ccall-ed functions.
I am not sure whether this performance problem is worth trying to fix (or how much more expensive the call is if we lose important registers across it). It is easy to avoid, though, by implementing the checks by hand in tight loops that push a lot, as sketched below. In a perfect world, the optimizer would reduce an unrolled pushing loop to a single capacity check. In my applications I do enough other work between push!/pop! calls that the overhead ends up being benign.
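For reference, a hedged sketch of that hand-rolled pattern, without any pointer tricks (bulk_append! is an illustrative name, not an existing function): grow once, then fill with plain stores, as settest does.

function bulk_append!(A::Vector{Int}, n)
    m = length(A)
    resize!(A, m + n)        # one runtime call instead of n
    @inbounds for i = 1:n
        A[m + i] = i         # plain indexed stores, as in settest
    end
    A
end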