Fast string comparisons for strings of fixed length

I want to leave here an interesting benchmark and see if someone comes with a new approach that provides faster results.

The problem is:

Given two arrays list and blocklist, find which elements in list also belong to blocklist and return them. The elements of both arrays are constructed with 10 ASCII symbols (for example “aBcD012BdF”).

Followup problem :

I want to propose the same benchmark but asuming there are less symbols. That is, the elements of both arrays are constructed with 10 symbols [A-Z][0-9] (for example “ABCD012BDF”). This is very similar to the previous problem but the number of symbols is reduced to be a subset of ASCII, hence, potentially, one could store the same information using less bits.


using BenchmarkTools, InlineStrings, JSON, Random

n_list = 10_000_000
n_blocklist = 100_000
list = [randstring(10) for i in 1:n_list];
blocklist = list[1:n_blocklist];

#### 1) Simple solution ####


function check_list_in_blocklist(list, blocklist)
    blocklist_set = Set{String}(blocklist)
    result = Array{String}([])
    for l in list
        if l in blocklist_set
            push!(result,l)
        end
    end
    return result
end

@btime check_list_in_blocklist(list, blocklist)


#### 2) InlineString solution ####

function check_list_in_blocklist_inlinestring(list, blocklist)
    blocklist_set = Set{InlineString15}(blocklist)
    result = Array{InlineString15}([])
    for l in list
        if l in blocklist_set
            push!(result,l)
        end
    end
    return result
end

list_inline = InlineString15.(list)
blocklist_inline = InlineString15.(blocklist);
@btime check_list_in_blocklist_inlinestring(list_inline, blocklist_inline)


#### 3)InlineString with custom hash solution ####
struct ASCII_ID
    x::InlineString15
end
 
Base.:(==)(x::ASCII_ID, y::ASCII_ID) = x.x == y.x
aword(x::ASCII_ID) = (bswap(reinterpret(UInt128, x.x)) % UInt)
Base.hash(x::ASCII_ID, h::UInt64) = hash(aword(x),h)
 
function check_list_in_blocklist(list, blocklist)
    blocklist_set = Set{ASCII_ID}(blocklist)
    result = Array{ASCII_ID}([])
    for l in list
        if l in blocklist_set
            push!(result,l)
        end
    end
    return result
end
 
list_ascii_id = ASCII_ID.(list)
blocklist_ascii_id = ASCII_ID.(blocklist);
@btime check_list_in_blocklist(list_ascii_id, blocklist_ascii_id)

This benchmark produces

  271.374 ms (19 allocations: 4.08 MiB)
  254.197 ms (19 allocations: 7.91 MiB)
  129.662 ms (19 allocations: 7.91 MiB)

Note that even though the last solution is the fastest, the cost of casting the elments from list is big. Therefore, if the data is not already in that format second and third solution would actually be slower, besides beeing more complex.

@btime list_inline = InlineString15.(list)
 423.740 ms (4 allocations: 152.59 MiB)

I want to thank @oxinabox and @kristoffer.carlsson for the discussions about this in Slack.

As usual, making things more specialized and unsafe makes things faster. The code uses a custom open-addressing hash table. The size is hard coded for the case in question, but can be reasonably automatically calculated.
Note that it doesn’t use @inbounds, which could also help.
This solution benchmarks faster on my computer, but it is best to benchmark in the OP’s computer to relate to post timings.

       function check_list_in_blocklist_fast(list, blocklist)
           n_list, n_blocklist = length(list), length(blocklist)
           hashtable = zeros(UInt8, 1<<19)
           indtable = zeros(Int, 1<<19)
           mask = (UInt64(1)<<19)-1
           for i in 1:n_blocklist
               h = hash(unsafe_load(reinterpret(Ptr{UInt64}, pointer(blocklist[i]))))
               loc = (h & mask)+1
               val = ((h >> 19) & 0x7F)+1
               while hashtable[loc] != 0
                   loc += 1
                   loc == (1<<19) && (loc = 1)
               end
               hashtable[loc] = val
               indtable[loc] = i
           end
           preres, preres2 = Int[], Int[]
           for i in 1:n_list
               h = hash(unsafe_load(reinterpret(Ptr{UInt64}, pointer(list[i]))))
               loc = (h & mask)+1
               val = ((h >> 19) & 0x7F)+1
               while hashtable[loc] != 0
                   if hashtable[loc] == val
                       push!(preres, i)
                       push!(preres2, indtable[loc])
                   end
                   loc += 1
                   loc == (1<<19) && (loc = 1)
               end
           end
           [ list[preres[i]] for i in 1:length(preres) if list[preres[i]]==blocklist[preres2[i]]]
       end

Use with:

check_list_in_blocklist_fast(list, blocklist)

Again, several safety checks omitted and the function can fail to terminate if hashtable over subscribed.

Very interesting Dan, thank you for stopping by and leaving your version. It seems it does not need casting the string type and produces an interesting speedup!

  278.953 ms (19 allocations: 4.08 MiB)
  265.552 ms (19 allocations: 7.91 MiB)
  133.321 ms (19 allocations: 7.91 MiB)
  76.317 ms (38 allocations: 9.98 MiB)   # Dan solution

I want to comment/ask a couple of details (if you or anyone else has time!):

  • You mention making things more specialized and unsafe makes things faster why is your solution unsafe?

    • Here we assume list and blocklist allways verify the conditions written above (the elements will always be 10 ASCII chars long).
  • When you use 1<<19 you could replace it with 2^19 right ?

    • I tested the speed of both and they seem to be the same (2^19 seems more common)
  • It seems you use the mask mask = (UInt64(1)<<19)-1 which ends up beeing 0x000000000007ffff ? why did you choose the mask of this size? is it because when loc = (h & mask)+1 is computed it ensures that, loc is at most 2^19 ? This seems to ensure that hashtable[loc] will always be a valid access.

  • You mention the size 2^19 is hardcoded but it can be reasonably calculated. I guess this calculation could depend on the length of blocklist (not on the length of list). Therefore, replacing the values of 2^19 by n_blocklist*factor (where factor=5 for example ) we would ensure all possible interesting strings can be stored, right? this seems to be safe.

  • I am not sure I understand this logic: it seems that if the position loc is already occupied it starts a linear scan increasing loc on line 2, but why do you have line 3?

      while hashtable[loc] != 0                     # line 1
         loc += 1                                   # line 2
         loc == (1<<19) && (loc = 1)                # line 3
      end
    

The unsafe casting depends on specific string type (perhaps should be specified by type conversion/specification). If the strings are all the same, quadratic time possible. If hashtable fills due to some bug, function doesn’t terminate. And probably some more glitches - I had a few when writing the function already.

Yes.

Yes.

Yes, but making it a power of 2 would be easier to manage.

This line makes the buffer circular (wraps around), for obvious reasons.

Perhaps you can add the @inbounds specifiers to improve speed. Enlarging the val entries in hashtable might also speed up things (UInt8 chosen without optimization).

Separating some of the hashtable code into smaller functions might make the function more readable.

Another optimization might be to use UInt64 for hashtable and indtable combined - lower bits used for hashtable and upper 32-bits for indtable. This layout saves memory access time.

This benchmark doesn’t include the prep time for InlineStrings.

1 Like