Get first match in a dictionary of tuples


#1

I have a dictionary of tuples. Inside one of those tuples is one thing I care about. What’s the best way to find that one thing? How can I make it fast? I feel like I have some things to learn about associative collections.

Here’s a mockup of my code, where I have a dictionary of pet owners and their pets. I know there’s one and only one Cat, and I’d like to find that cat. I have some code to do it, but it feels really bulky compared to the rest of my code.

abstract type Pet end
struct Dog <: Pet; name::String; end
struct Cat <: Pet; name::String; end
struct Ox  <: Pet; name::String; end

# Create an example dictionary.
pets = Dict("Cindy" => (Dog("Fido"), Cat("Captain Snuggly Pants")),
            "Tom"   => (Ox("Socks"), Dog("Chewbacca")))

# Find the cat.
cat = Cat("empty cat") # Defined so that cat has global scope.
for these_pets in values(pets) # Get tuple of pets for each entry.
    cats = filter(e -> isa(e, Cat), these_pets) # Look for a Cat.
    if !isempty(cats) # If we found one, accept it and break.
        cat = first(cats)
        break
    end
end

I can definitely make the code shorter by vcating all of those tuples together into a literal collection and then calling first(filter(p -> isa(p, Cat), ...)) on it, but that seems hugely wasteful. Time is critical here; there will be many pet owners and many pets, the whole thing will need to run many times on different collections, and who wouldn’t want to find Captain Snuggly Pants as soon as possible?


#2

How about (cat is a function in base, so I just changed it to kat):

kat = Cat("Nemo")
for val in values(pets)
    katind = findfirst(v->isa(v, Cat), val)
    !iszero(katind) && (kat = val[katind]; break)
end

Timing:

julia> @btime begin
       kat = Cat("Nemo")
       for val in values($pets)
           katind = findfirst(v->isa(v, Cat), val)
           !iszero(katind) && (kat = val[katind]; break)
       end
       end
  203.837 ns (0 allocations: 0 bytes)

#3

Possibly something like

function first_matching(pred, iter)
    state = start(iter)
    while !done(iter, state)
        item, next_state = next(iter, state)
        pred(item) && return Nullable(item)
        state = next_state
    end
    return Nullable(state, false)
end

first_matching(p -> isa(p, Cat), Base.Iterators.flatten(values(pets)))

#4

Thanks @DNF and @Tamas_Papp!

As written, @DNF’s answer is by far the fastest. However, it has more hard-coded stuff than @Tamas_Papp’s. Modifying Papp’s so as to compare apples to apples (building in the flatten and isa(...) and leaving out the Nullables), I have:

function first_matching(pets, pet_type)
    iter = Base.Iterators.flatten(pets)
    state = start(iter)
    while !done(iter, state)
        item, next_state = next(iter, state)
        isa(item, pet_type) && return item
        state = next_state
    end
    return pet_type("[empty pet]")
end

and in fact this modified version is the fastest so far. It comes in at 26us, where’s @DNF’s comes in at 32us. So you both contributed to the fastest answer. (My original one comes in at a terrifying 700us and is sufficient evidence that I am still a beginner in Julia.)

Interestingly (to me), the flatten call has essentially no cost. I wrote another version where instead of using flatten, I used a loop for iter in pets; (same as above); end;, and this saved no time.

Thanks again, all. This is going to be one of the functions called most frequently by my program, and speed is critical!


#5

That’s odd. I timed mine at 200 nanoseconds.


#6

I think this is a cleaner version of your answer, and I time it at about 1.3 us.

function first_matching_flatten(pets, PetType)
    for pet in Base.Iterators.flatten(pets)
        pet isa PetType && return pet
    end
    return PetType("none")
end
@btime first_matching_flatten($pets, Cat)

As for DNF’s example, I get roughly 200ns as well, but if I turn it into a function it is much slower 1.6 us instead. I don’t understand what is going on there.

println("DNF")
@btime begin
       kat = Cat("Nemo")
       for val in values($pets)
           katind = findfirst(v->isa(v, Cat), val)
           !iszero(katind) && (kat = val[katind]; break)
       end
end
println("DNF function")
function first_matching_dnf(pets, PetType)
    kat = PetType("none")
    for val in values(pets)
        katind = findfirst(v->isa(v, Cat), val)
        !iszero(katind) && (kat::PetType = val[katind]; break)
    end
    kat
end
@btime first_matching_dnf($pets, Cat)

In case you don’t know @btime comes from BenchmarkTools.jl and will give different timings from @time. The BenchmarkTools timings will generally avoid penalties associated with global scope.


#7

Thanks for the @btime heads up. In fact, I didn’t know and had just been using @time.

I would definitely be curious to understand why your timing changes when DNF’s answer is made into a function. I would think it would be faster as a function. Maybe something having to do with pets being Any in the function?


#8

Using @btime from the BenchmarkTools, I see that there is a cost to that flatten and that this is in fact fastest so far:

function first_pet_of_type_3(pets, pet_type)
    for these_pets in values(pets)
        for pet in these_pets
            if pet isa pet_type
                return pet
            end
        end
    end
    return pet_type("[no pet]")
end

I’m getting 1.2us for this. Replacing the outer loop with flatten(values(pets)) makes the function take 4.5us.


#9

My timings are decidedly different:

function findcat(pets)
    for val in values(pets)
        katind = findfirst(v->isa(v, Cat), val)
        !iszero(katind) && return val[katind]
    end
    return Cat("Nemo")
end
julia> @btime findcat($pets)
  185.125 ns (0 allocations: 0 bytes)
Cat("Captain Snuggly Pants")
julia> @btime first_pet_of_type_3($pets, Cat)
  1.741 μs (4 allocations: 128 bytes)
Cat("Captain Snuggly Pants")

#10

BTW, I thought you wanted something less bulky in terms of code length, and presumed it would be inline, hence the hardcoding.


#11

@DNF, right on all counts. Especially when the type (Cat) is hard-coded, the code is fastest. Making the type an input slows it down, but my tests are now showing your function above to be the fastest (twice as fast as it was earlier). Not sure what changed.


#12

One thing I changed is to not pre-assign the output with the default (Cat("Nemo")). Then, if a cat is actually found, you avoid that assignment.