Suggestion for more general/performant sentinel for find* functions

ScottPJones · January 28, 2018, 8:21pm

I’m concerned about the loss of type-stability in the find* etc. functions, changing everything to return nothing for the not found case.
While that is important for abstract arrays and associative structures, for strings, where there is a contract that AbstractStrings are always 1-based indexed, returning nothing instead of 0 just makes the code more difficult.
In my quick testing with some short strings and findfirst, the version using nothing is 3.3x slower.

Instead, why not add a function that returns the sentinel value needed for that particular type?

notfound(::Type{<:Any}) = nothing
notfound(::Type{<:AbstractString}) = 0
notfound(val::T) where {T} = notfound(T)

For arrays also, those could still define a sentinel value of zero, if they are not OffsetArrays, to get back the performance lost by the find* changes. Also, even for OffsetArrays, if typemin(Int) is not allowed as the first index, then that could be used as the sentinel, keeping the return values type-stable.

CC: @jeff.bezanson, @nalimilan

CC: @mbauman, @tim.holy

Note: I also think that seeing in code:

     ret = findfirst(a, b)
     ret == notfound(b) && ...

would be more readable than either the old way with 0, or the new way with nothing

mbauman · January 28, 2018, 8:54pm

Please do not ping multiple core contributors because you have language or API design opinions. Most of us are already active here and will read it in our own time.

~~I’m not terribly surprised there are~~ I wouldn’t be terribly surprised if there were performance regressions here, but I expect this to improve. The compiler team is fully supportive of performant small unions, and I don’t see any reason why Julia couldn’t figure out concrete types with the idiom you proposed ret == nothing && …. I have a hunch that we’ll slowly start calling inferred small unions “type stable.”

Most helpful here would be the actual code that you benchmarked as being 3.3x slower.

xiaodai · January 28, 2018, 8:55pm

I haven’t posted about it yet but having a fast find amd searchsortedfirst for strings is key to unlocking fast InternedString sorting

ScottPJones · January 28, 2018, 9:05pm

I pinged because this seemed pretty critical to me, and was surprised that a change with such large performance repercussions was merged without much more discussion, and because from other posts from Stefan understood that things might be really frozen even this week.

I understand perfectly why 0 is not a good sentinel for all cases, but that’s not a good reason to make a change that hurts performance in such a way, when other alternatives (such as my suggestion) are possible.

On v0.6.2:

julia> ff(z) = findfirst(z, "foobar") == 0
ff (generic function with 1 method)

julia> @btime ff('z')
  6.034 ns (0 allocations: 0 bytes)
false

julia> @code_native ff('z')
	.section	__TEXT,__text,regular,pure_instructions
Filename: REPL[19]
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 1
	movabsq	$findnext, %rax
	movabsq	$4950739376, %rsi       ## imm = 0x1271649B0
	movl	$1, %edx
	callq	*%rax
	testq	%rax, %rax
	sete	%al
	popq	%rbp
	retq
	nopw	(%rax,%rax)

On master: (Version 0.7.0-DEV.3618 (2018-01-28 17:48 UTC), Commit 743d487 (0 days old master))

julia> ff(z) = findfirst(equalto(z), "foobar") == nothing
ff (generic function with 1 method)

julia> @btime ff('z')
  20.847 ns (0 allocations: 0 bytes)
true

julia> @code_native(ff('z'))
	.section	__TEXT,__text,regular,pure_instructions
; Function ff {
; Location: REPL[21]:1
; Function Type; {
; Location: REPL[21]:1
	pushq	%rbx
	subq	$32, %rsp
	movl	%edi, (%rsp)
;}
; Function findfirst; {
; Location: array.jl:1702
	movabsq	$findnext, %rax
	leaq	24(%rsp), %rbx
	leaq	(%rsp), %rsi
	movabsq	$4677124368, %rdx       ## imm = 0x116C74110
	movl	$1, %ecx
	movq	%rbx, %rdi
	callq	*%rax
;}
	movl	%edx, %ecx
	andb	$127, %cl
	cmpb	$2, %cl
	je	L76
	cmpb	$1, %cl
	jne	L105
	movabsq	$"==", %rax
	callq	*%rax
	jmp	L97
; Function findfirst; {
; Location: array.jl:1702
L76:
	testb	%dl, %dl
	cmovnsq	%rbx, %rax
;}
	movq	(%rax), %rdi
	movabsq	$"==", %rax
	callq	*%rax
L97:
	andb	$1, %al
	addq	$32, %rsp
	popq	%rbx
	retq
L105:
	movabsq	$jl_system_image_data, %rax
	movq	%rax, 8(%rsp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, 16(%rsp)
	movabsq	$jl_apply_generic, %rax
	leaq	8(%rsp), %rdi
	movl	$2, %esi
	callq	*%rax
	ud2
	nop
;}

That’s also a pretty large increase in code size.

mbauman · January 28, 2018, 9:14pm

You’re testing something nonsensical. The real tests are:

0.6:

julia> ff(z) = findfirst(x->x==z, "foobar") == 0
ff (generic function with 1 method)

julia> @btime ff($('z'))
  80.433 ns (0 allocations: 0 bytes)
true

0.7:

julia> ff(z) = findfirst(equalto(z), "foobar") == nothing
ff (generic function with 1 method)

julia> @btime ff($('z'))
  39.565 ns (0 allocations: 0 bytes)
true

ScottPJones · January 28, 2018, 9:40pm

You saw the message before I fixed an incorrect copy/paste.

mbauman · January 28, 2018, 9:42pm

It’s still an incorrect usage of the API. You want findfirst(x->x==z, "foobar") or findfirst("foobar", z) on 0.6.

ScottPJones · January 28, 2018, 9:44pm

Ah, forgot about the other breaking change to the order of arguments, I’ll retest when I get home.

joa-quim · January 28, 2018, 10:24pm

Yesterday while working on migrating a package to 0.7 I also had bad surprises with this

julia> typeof(findfirst(equalto('a'), "abc"))
Int64

julia> typeof(findfirst(equalto('d'), "abc"))
Nothing

julia> first(findfirst(equalto('d'), "abc"))
ERROR: MethodError: no method matching start(::Nothing)
Closest candidates are:
  start(::SimpleVector) at essentials.jl:550
  start(::Base.MethodList) at reflection.jl:659
  start(::ExponentialBackOff) at error.jl:171
  ...
Stacktrace:
 [1] first(::Nothing) at .\abstractarray.jl:198
 [2] top-level scope

ScottPJones · January 29, 2018, 3:38am

OK, a simpler example, on v0.6:

julia> @btime in($('z'), $("foobar"))
  12.568 ns (0 allocations: 0 bytes)
false
julia> @btime in($('f'), $("foobar"))
  12.206 ns (0 allocations: 0 bytes)
true

On v0.7:

julia> @btime in($('z'), $("foobar"))
  18.075 ns (0 allocations: 0 bytes)
false
julia> @btime in($('f'), $("foobar"))
  17.575 ns (0 allocations: 0 bytes)
true

So, not as bad as I’d originally thought, but still quite bad, around 44% slower.

I also find it interesting that the string code ends up wrapping things like this:

nothing_sentinel(i) = i == 0 ? nothing : i

The definition of in is actually very similar to the function I was benchmarking.

in(c::Char, s::AbstractString) = (findfirst(equalto(c),s)!==nothing)

nalimilan · January 29, 2018, 8:48am

A 44% percent penalty sounds perfectly reasonable as a temporary situation until the compiler improves, given that the API is simpler and more consistent by using nothing everywhere to represent the absence of a match.

The nothing_sentinel trick is only used 7 times, all in the same file. It’s also not strictly needed and is mostly due to the fact that the old code was designed around the assumption that 0 indicated no match. Since we’re short on time we kept it that way in order to get it merged quickly (see discussion on the PR.

ScottPJones · January 29, 2018, 12:34pm

The problem is that it will not be possible to remove the penalty completely, and it may be worse that than, in cases where the compiler will be unable to remove it.
Think of the case of creating a vector of the offsets of the first (or last, or whatever) occurrence of a substring in a set of strings, the compiler will never be able to remove the penalty of that case.

Just as you need to use zero(T) in places to make things type-stable, with my suggestion you just need to use notfound(T). What happens if nothing itself can be a member of the collection? It seems like the method of having a fixed return value of nothing doesn’t work in that case, but notfound(T) would work just fine.

nalimilan · January 29, 2018, 12:43pm

And what does notfound(Any) return?

ScottPJones · January 29, 2018, 12:57pm

You could always add other arguments to notfound, to specialize even further, for those sorts of edge cases,
For example, Dict{Any, Any} might return a type that is specifically not allowed in it, a NotFound() type.

Another possible approach would be to have a pair of functions, that operate on the return value,
found(T, ret) could return a boolean if it were found or not, and getloc(T, ret) would return the value.

For the vast majority of use cases (strings and 1-based arrays), the functions would be simply as follows:
found(::AbstractString, ret) = ret != 0, and getloc(::AbstractString, ret) = ret
and there would be zero performance penalty.

For cases where T is some collection of Any, the return value could be a tuple (foundflag, val), so that (false, nothing) would indicate that the value was not found, and (true, nothing), would indicate that the value was found, and was nothing.

That approach would always work, whether or not an in-band sentinel value was possible for the type T.

mbauman · January 29, 2018, 4:26pm

Again, I ask for benchmarks here. I’m not totally convinced that the performance regression in in is entirely related to Union splitting, particularly given that other code shows even larger speedups (which are also likely unrelated).

julia> using BenchmarkTools

julia> @btime in($('z'), $("foobar"))
  37.067 ns (0 allocations: 0 bytes)

julia> @eval Base begin
       # A poor man's "revert" — it leaves Base horribly broken, but it's a small test
       function findnext(pred::EqualTo{Char}, s::String, i::Integer)
           if i < 1 || i > sizeof(s)
               i == sizeof(s) + 1 && return 0
               throw(BoundsError(s, i))
           end
           @inbounds isvalid(s, i) || string_index_err(s, i)
           c = pred.x
           c ≤ '\x7f' && return _search(s, c % UInt8, i)
           while true
               i = _search(s, first_utf8_byte(c), i)
               i == 0 && return 0
               s[i] == c && return i
               i = next(s, i)[2]
           end
       end
       in(c::Char, s::AbstractString) = (findfirst(equalto(c),s)!==0)
       end

julia> @btime in($('z'), $("foobar"))
  35.578 ns (0 allocations: 0 bytes)
false

You could almost imagine foundflag being a switch that says whether val is an Int or Void. You could even call it a “type tag”. What’s this starting to sound like?

Yes, storing a Vector{Union{Int,Void}} is going to take up more space than just a Vector{Int}, but it’d be comparable to storing a Vector{Tuple{Bool,Int}}.

ScottPJones · January 29, 2018, 4:40pm

You seemed to have missed that I said that would only be for collections where nothing was a member of T for the collection of T.
In the vast majority of cases, you’d just be storing things of type T (strings, 1-based arrays, etc.), and that will take up substantially less space.
(as well as make doing things like counting the # of sentinel values in the result vector much faster, something that could be SIMD optimized easily).

mbauman · January 29, 2018, 5:45pm

So your proposal here would be to sometimes return an in-band element of the type of the indices, and sometimes to return a Tuple{Bool, X}? If I wanted to write generic code, I’d have to check to see if I got back a tuple? What about Dict((false, 1) => 1, 1=>2)? Yes, the status quo means you cannot find Dict(nothing=>3), but there is a virtue to simplicity.

ScottPJones · January 29, 2018, 6:00pm

Please read what I proposed above.
I proposed to have two functions, with methods based on the type of the collections.
You would never explicitly checking for a tuple, and things would be type stable.

For example:

    found(a, ret = findfirst(needle, a)) || return do_something_if_not_found()
    val = getval(ret)

For strings and 1-based arrays, ret would be simply an Int, found would check for != 0, and getval would be a no-op.

Very simple. Presumably the type of this Dict is of type Any?
If it finds the first index, it would return (true, (false, 1)), the second, it would return (true, 1), and if not found, it would return (false, nothing).

I wouldn’t consider that a virtue at all, that would break our code, where we were using nothing to indicate a JSON null, and at times used a Dict to count unique values in the JSON structure.
With my approach, a key of nothing would simply be returned as (true, nothing).

Topic		Replies	Views
Findfirst for Dicts with `nothing` keys General Usage	33	3409	March 4, 2018
The new output of findfirst is dangerous General Usage	21	4176	February 24, 2018
New findfirst behaviour - type unstable or no? General Usage	5	1182	March 1, 2018
New find* methods and type stability General Usage	6	1083	February 1, 2018
Findfirst() with Dict for which nothing is a valid key Internals & Design	2	845	February 18, 2018

Suggestion for more general/performant sentinel for find* functions

Related topics