# Fast string comparisons for strings of fixed length

I want to leave here an interesting benchmark and see if someone comes with a new approach that provides faster results.

The problem is:

Given two arrays `list` and `blocklist`, find which elements in `list` also belong to `blocklist` and return them. The elements of both arrays are constructed with 10 ASCII symbols (for example “aBcD012BdF”).

Followup problem :

I want to propose the same benchmark but asuming there are less symbols. That is, the elements of both arrays are constructed with 10 symbols [A-Z][0-9] (for example “ABCD012BDF”). This is very similar to the previous problem but the number of symbols is reduced to be a subset of ASCII, hence, potentially, one could store the same information using less bits.

``````
using BenchmarkTools, InlineStrings, JSON, Random

n_list = 10_000_000
n_blocklist = 100_000
list = [randstring(10) for i in 1:n_list];
blocklist = list[1:n_blocklist];

#### 1) Simple solution ####

function check_list_in_blocklist(list, blocklist)
blocklist_set = Set{String}(blocklist)
result = Array{String}([])
for l in list
if l in blocklist_set
push!(result,l)
end
end
return result
end

@btime check_list_in_blocklist(list, blocklist)

#### 2) InlineString solution ####

function check_list_in_blocklist_inlinestring(list, blocklist)
blocklist_set = Set{InlineString15}(blocklist)
result = Array{InlineString15}([])
for l in list
if l in blocklist_set
push!(result,l)
end
end
return result
end

list_inline = InlineString15.(list)
blocklist_inline = InlineString15.(blocklist);
@btime check_list_in_blocklist_inlinestring(list_inline, blocklist_inline)

#### 3)InlineString with custom hash solution ####
struct ASCII_ID
x::InlineString15
end

Base.:(==)(x::ASCII_ID, y::ASCII_ID) = x.x == y.x
aword(x::ASCII_ID) = (bswap(reinterpret(UInt128, x.x)) % UInt)
Base.hash(x::ASCII_ID, h::UInt64) = hash(aword(x),h)

function check_list_in_blocklist(list, blocklist)
blocklist_set = Set{ASCII_ID}(blocklist)
result = Array{ASCII_ID}([])
for l in list
if l in blocklist_set
push!(result,l)
end
end
return result
end

list_ascii_id = ASCII_ID.(list)
blocklist_ascii_id = ASCII_ID.(blocklist);
@btime check_list_in_blocklist(list_ascii_id, blocklist_ascii_id)
``````

This benchmark produces

``````  271.374 ms (19 allocations: 4.08 MiB)
254.197 ms (19 allocations: 7.91 MiB)
129.662 ms (19 allocations: 7.91 MiB)
``````

Note that even though the last solution is the fastest, the cost of casting the elments from `list` is big. Therefore, if the data is not already in that format second and third solution would actually be slower, besides beeing more complex.

``````@btime list_inline = InlineString15.(list)
423.740 ms (4 allocations: 152.59 MiB)
``````

As usual, making things more specialized and unsafe makes things faster. The code uses a custom open-addressing hash table. The size is hard coded for the case in question, but can be reasonably automatically calculated.
Note that it doesn’t use @inbounds, which could also help.
This solution benchmarks faster on my computer, but it is best to benchmark in the OP’s computer to relate to post timings.

``````       function check_list_in_blocklist_fast(list, blocklist)
n_list, n_blocklist = length(list), length(blocklist)
hashtable = zeros(UInt8, 1<<19)
indtable = zeros(Int, 1<<19)
for i in 1:n_blocklist
val = ((h >> 19) & 0x7F)+1
while hashtable[loc] != 0
loc += 1
loc == (1<<19) && (loc = 1)
end
hashtable[loc] = val
indtable[loc] = i
end
preres, preres2 = Int[], Int[]
for i in 1:n_list
val = ((h >> 19) & 0x7F)+1
while hashtable[loc] != 0
if hashtable[loc] == val
push!(preres, i)
push!(preres2, indtable[loc])
end
loc += 1
loc == (1<<19) && (loc = 1)
end
end
[ list[preres[i]] for i in 1:length(preres) if list[preres[i]]==blocklist[preres2[i]]]
end
``````

Use with:

``````check_list_in_blocklist_fast(list, blocklist)
``````

Again, several safety checks omitted and the function can fail to terminate if hashtable over subscribed.

Very interesting Dan, thank you for stopping by and leaving your version. It seems it does not need casting the string type and produces an interesting speedup!

``````  278.953 ms (19 allocations: 4.08 MiB)
265.552 ms (19 allocations: 7.91 MiB)
133.321 ms (19 allocations: 7.91 MiB)
76.317 ms (38 allocations: 9.98 MiB)   # Dan solution
``````

I want to comment/ask a couple of details (if you or anyone else has time!):

• You mention `making things more specialized and unsafe makes things faster` why is your solution unsafe?

• Here we assume list and blocklist allways verify the conditions written above (the elements will always be 10 ASCII chars long).
• When you use `1<<19` you could replace it with `2^19` right ?

• I tested the speed of both and they seem to be the same (`2^19` seems more common)
• It seems you use the mask `mask = (UInt64(1)<<19)-1` which ends up beeing `0x000000000007ffff` ? why did you choose the mask of this size? is it because when `loc = (h & mask)+1` is computed it ensures that, `loc` is at most `2^19` ? This seems to ensure that `hashtable[loc] ` will always be a valid access.

• You mention the size `2^19` is hardcoded but it can be reasonably calculated. I guess this calculation could depend on the length of `blocklist` (not on the length of `list`). Therefore, replacing the values of `2^19` by `n_blocklist*factor` (where `factor=5` for example ) we would ensure all possible interesting strings can be stored, right? this seems to be safe.

• I am not sure I understand this logic: it seems that if the position `loc` is already occupied it starts a linear scan increasing `loc` on line 2, but why do you have line 3?

``````  while hashtable[loc] != 0                     # line 1
loc += 1                                   # line 2
loc == (1<<19) && (loc = 1)                # line 3
end
``````

The `unsafe` casting depends on specific string type (perhaps should be specified by type conversion/specification). If the strings are all the same, quadratic time possible. If hashtable fills due to some bug, function doesn’t terminate. And probably some more glitches - I had a few when writing the function already.

Yes.

Yes.

Yes, but making it a power of 2 would be easier to manage.

This line makes the buffer circular (wraps around), for obvious reasons.

Perhaps you can add the `@inbounds` specifiers to improve speed. Enlarging the `val` entries in `hashtable` might also speed up things (UInt8 chosen without optimization).

Separating some of the `hashtable` code into smaller functions might make the function more readable.

Another optimization might be to use `UInt64` for `hashtable` and `indtable` combined - lower bits used for `hashtable` and upper 32-bits for `indtable`. This layout saves memory access time.

This benchmark doesn’t include the prep time for InlineStrings.

1 Like