I would like to contribute some performance optimization of the julia base search function: _searchindex. I have prepared a benchmark package and experiments with several code variants here:
My current suggestion is function SearchBenchmarks.bloom_best, which improves runtime performance in almost all tested scenarios by 20% and more. I put all details and benchmark results in the package. Before preparing a pull request, I would like to discuss my suggestions in this topic, hopefully resulting in further improvements by the community.
In addition to the performance-driven changes, I suggest to add an API which separates search initialization and search kernel into two methods. In many applications, the same string (or byte sequence) is searched again and again. In such a scenario, separating initialization and kernel further improves performance, and allows to use algorithms which have an initialization too expensive for a “one-shot search”. In particular, I have an implementation of a Boyer-Moore search variant in mind, which can double search throughput in many scenarios, compared to my experiments with bloom filter based implementations like the current one in julia base.
As a teaser, here the runtime in mSec in relation to pattern size for searches with the julia base implementation and some variations (bloom_*) and my Boyer-Moore variant. All details are in the package.
I think that this is a great idea, but it may be better to put it in a package instead of base/stdlib. This will allow you to refine the API with breaking changes, and also decouple releases from Julia.
That said, if after some testing, the API is still compatible with base’s string search, it would be great to improve the performance there by replacing the implementation with this one.
Do you need a different API for this, or just a new type? e.g. our Regex type already separates search initialization from search, and is then used with the ordinary API like findfirst. Similarly, you could just wrap the string in a new type that does the initialization, and then pass it to the existing search functions (with new methods).
Right. But decoupling from Julia means its usage will be much smaller (only people actively looking for such an API will take notice of it and probably use it), and without a critical mass of maintainers chances are high that it will get incompatible with some later Julia versions and/or operating systems.
My current focus is improving Julia (base library), with as little changes to current Julia APIs as possible (anything else should be rejected by Julia maintainers). See following replies for my API suggestion.
To start with minimal impact on Julia base: only code optimization by replacing a few methods without changing its signature, no initialization/kernel separation. I suggest to replace only the “workhorse” in julia base search API. Am I right that it is located in the following two methods?
function _searchindex(s::ByteArray, t::ByteArray, i::Integer)
function _rsearchindex(s::ByteArray, t::ByteArray, k::Integer)
Pro: search benefit for all applications using Julia base API, without any code change
Contra: initialization effort per search remains
If agreed, I will supply the two methods as a drop-in replacement. What is the best way for it? I have no experience so far forking Julia base to make a pull request. Is it hassle-free (in Windows 10 64bit)?