I have a sorted array a of positive integers (related to integer partitions), and I need to repeatedly find the first index with a value >= k, for various k uniformly distributed between 1 and a[end]. This is exactly the description of searchsortedfirst, but there doesn’t appear to be a way to take advantage of the fact that I know the resulting indices will be concentrated around √(2n), where n is the length of the array, which in this case is 80. In the sample below, the target array I’m searching has typical behavior.
using BenchmarkTools

# The number of partitions of n with largest part at
# most k is stored at table[ (n*n-n)>>1 + k ]
struct PNums
    table::Array{Int64,1}
end

function PNums(max_n::Int64)
    pnum = Array{Int64,1}(undef, (max_n*max_n+max_n)>>1)
    for n=1:max_n
        noff = (n*n-n)>>1
        pnum[ noff + 1 ] = 1
        for k=2:n
            pnum[ noff + k ] = pnum[ noff + k - 1 ] + pnum[ min(max(n-k,1),k) + ((n-k)*(n-k-1))>>1 ]
        end
    end
    return PNums(pnum)
end

n = 80
target = PNums(n).table[(n*n-n)>>1+1:(n*n+n)>>1]

function f1(a::Array{Int64,1}, b::Array{Int64,1})
    for k=1:a[end]
        b[k] = searchsortedfirst(a,k)
    end
end

dest = Array{Int64}(undef, target[end])
@btime f1(target,dest) # Gives 164.066 ms (0 allocations: 0 bytes)
I tried a few methods to take advantage of the known concentration of values, but my best result came from a two-phase search: first an exponential search to find a starting point, then a linear search from there.
function f2(a::Array{Int64,1}, b::Array{Int64,1})
    for k=1:a[end]
        i=1 ; @inbounds while i<<1 < 80 && a[ i<<1 ] < k i=i<<1 end
        @inbounds while a[ i ] < k i+=1 end
        b[k] = i
    end
end

dest2 = Array{Int64}(undef, target[end])
@btime f2(target,dest2) # Gives 100.187 ms (0 allocations: 0 bytes)
A large portion of the speed-up seems related to the fact that 80 is hard coded into f2.
This leads to a few questions:
Is there a standard way to take advantage of value concentration for searching a sorted array?
Should there be a standard way?
Is there a simple way to do better than f2 in this particular case? What if I need a few (say 20) different length arrays?
How much of the improvement is just from eliminating the overhead of a function call?
I don’t know how close your example is to your actual use-case, but:
You seem to want something like [searchsortedfirst(haystack, needle) for needle in needles]. In this case, you should sort needles first, and then do a single mergesort-style iteration over both haystack and needles (depending on data statistics, you probably want to do this branch-free, i.e. via CMOV / ifelse). In your example, needles is already sorted.
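Something like this rough sketch, for the plain branchy version (bulk_searchsortedfirst! is just an illustrative name, not an existing function):

# Merge-style sweep: one pass over haystack serves all (sorted) needles,
# so total work is O(length(haystack) + length(needles)).
function bulk_searchsortedfirst!(out::Vector{Int}, haystack::Vector{Int}, needles::Vector{Int})
    i = 1
    for (j, needle) in enumerate(needles)
        # Advance until haystack[i] >= needle; later needles resume from here.
        while i <= length(haystack) && haystack[i] < needle
            i += 1
        end
        out[j] = i
    end
    return out
end

# Against the original example, where the needles 1:target[end] happen to be sorted already:
# bulk_searchsortedfirst!(dest, target, collect(1:target[end]))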
Of course this only helps for bulk searches, i.e. if roughly length(haystack) + length(needles) < length(needles) * log(length(haystack)), i.e. if length(haystack)/log(length(haystack)) < length(needles).
Another comment: If this is just an artifact of setting up benchmarking code, then you should benchmark with needles shuffled, and long enough to stymie the sneaky branch predictor (1<<16 is reasonable).
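For example, a minimal sketch of that benchmark setup, reusing target from the post above (random needles serve the same purpose as shuffled ones):

# 1<<16 random needles, so the branch predictor cannot learn the access pattern.
# The broadcast allocates its result vector, which is fine for a quick comparison.
needles = rand(1:target[end], 1<<16)
@btime searchsortedfirst.(Ref($target), $needles);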
I am not sure. Binary search is log2(n). If your Vector has length 1024 and you restrict the search to the first quarter of the Vector, you reduce the effort from log2(1024) == 10 to log2(256) == 8, in other words by 20%, and this percentage gets smaller and smaller the larger your Vector is, so the binary search has to be a hell of a bottleneck in your code for this to be worth it.
I am not sure if I understand. I mean, you could avoid hardcoding 80 and instead compute a threshold based on the distribution you observe. But for a very small vector this computation may end up offsetting any gains.
I think the Julia compiler often inlines calls to small functions, so I would not bet too much on saving time by avoiding function calls (by expanding the code by hand).
That’s an artifact of the benchmarking setup. In practice, I won’t have the needles available at the same time. They’ll be thrown at me one at a time; I just know that they’ll be uniformly distributed between 1 and target[end], in this case about 2^24. The time savings for f2 remained when I used randomly generated needles from this range.
I was comparing f2 with f3
function f3(a::Array{Int64,1}, b::Array{Int64,1}; n::Int64=length(a))
    for k=1:a[end]
        i=1 ; @inbounds while i<<1 < n && a[ i<<1 ] < k i=i<<1 end
        @inbounds while a[ i ] < k i+=1 end
        b[k] = i
    end
end

@btime f3(target,dest2)
The only difference is n as a parameter instead of a literal, and I consistently see a 5% slowdown, but that’s minor compared to the difference between f1 and f2.
My vector is even smaller than that. In this application, 256 would be astronomically huge, so the asymptotics don’t really get a chance to come into play, and I’m running into the constant term for the binary search. Each loop of the linear search cuts out a branch and an assignment compared to the binary search, so I’m seeing a 30% speed improvement even though I’m averaging 9 loops instead of 7. It just feels wrong, and non-generic.
Well, yes and no. The difference is also that one function takes a keyword parameter, and having one keyword parameter introduces more overhead than adding another positional parameter.
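For example, a hypothetical variant along these lines (f3p is just an illustrative name) takes the length as an optional positional argument instead of a keyword:

# Same body as f3; only the way n is passed differs.
function f3p(a::Array{Int64,1}, b::Array{Int64,1}, n::Int64=length(a))
    for k=1:a[end]
        i=1 ; @inbounds while i<<1 < n && a[ i<<1 ] < k i=i<<1 end
        @inbounds while a[ i ] < k i+=1 end
        b[k] = i
    end
end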
I think this makes a lot of sense if your vector is small and there is a high probability of the searched value being close to the beginning. You are also probably hitting both memory access (the linear access is easier to predict and may cause fewer cache misses than starting at the middle of the vector) and branch prediction (the binary search will have a more diverse sequence of results in its conditionals than the linear search, which is just a lot of falses until a single true).
Knuth (TAOCP 3, 6.2.1) discusses this in some detail (the issue is about the trees being unbalanced). See the section about Interpolation search in particular.
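For reference, a rough sketch of interpolation search adapted to this problem (the function name is mine, not from Knuth): instead of probing the midpoint, it probes where the target would sit if the values were spread roughly evenly over their range, which is when the method works best.

# Returns the first index i with a[i] >= k, like searchsortedfirst.
# Assumes a is sorted in increasing order; beware overflow of the
# interpolation product for very large values (not an issue at this scale).
function interp_searchsortedfirst(a::Vector{Int64}, k::Int64)
    k <= a[1] && return 1
    k > a[end] && return length(a) + 1
    lo, hi = 1, length(a)            # invariant: a[j] < k for j < lo, and a[hi] >= k
    while lo < hi
        a[hi] == a[lo] && return lo  # flat range: a[lo] == a[hi] >= k
        # Probe where k would fall if values were uniform between a[lo] and a[hi].
        mid = lo + (k - a[lo]) * (hi - lo) ÷ (a[hi] - a[lo])
        mid = clamp(mid, lo, hi - 1) # keep the probe inside [lo, hi-1] so we always make progress
        if a[mid] < k
            lo = mid + 1
        else
            hi = mid
        end
    end
    return lo
end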
This is a small word of caution from a tangentially related problem I recently had to tackle in a different language (Matlab). So take it with a grain of salt.
I had a similar problem where I essentially needed to do a repeated searchsortedfirst-type operation, and there would often be cases where I knew that, out of the ~50k values in the vector, I only needed to search between, say, positions 1,000 and 2,000 (this was variable but known in advance). I put a lot of optimization time into this, and in the end I saw basically no improvement in performance over just doing the search over all values without the pre-check.