I want to leave here an interesting benchmark and see if someone comes with a new approach that provides faster results.
The problem is:
Given two arrays list
and blocklist
, find which elements in list
also belong to blocklist
and return them. The elements of both arrays are constructed with 10 ASCII symbols (for example “aBcD012BdF”).
Followup problem :
I want to propose the same benchmark but asuming there are less symbols. That is, the elements of both arrays are constructed with 10 symbols [A-Z][0-9] (for example “ABCD012BDF”). This is very similar to the previous problem but the number of symbols is reduced to be a subset of ASCII, hence, potentially, one could store the same information using less bits.
using BenchmarkTools, InlineStrings, JSON, Random
n_list = 10_000_000
n_blocklist = 100_000
list = [randstring(10) for i in 1:n_list];
blocklist = list[1:n_blocklist];
#### 1) Simple solution ####
function check_list_in_blocklist(list, blocklist)
blocklist_set = Set{String}(blocklist)
result = Array{String}([])
for l in list
if l in blocklist_set
push!(result,l)
end
end
return result
end
@btime check_list_in_blocklist(list, blocklist)
#### 2) InlineString solution ####
function check_list_in_blocklist_inlinestring(list, blocklist)
blocklist_set = Set{InlineString15}(blocklist)
result = Array{InlineString15}([])
for l in list
if l in blocklist_set
push!(result,l)
end
end
return result
end
list_inline = InlineString15.(list)
blocklist_inline = InlineString15.(blocklist);
@btime check_list_in_blocklist_inlinestring(list_inline, blocklist_inline)
#### 3)InlineString with custom hash solution ####
struct ASCII_ID
x::InlineString15
end
Base.:(==)(x::ASCII_ID, y::ASCII_ID) = x.x == y.x
aword(x::ASCII_ID) = (bswap(reinterpret(UInt128, x.x)) % UInt)
Base.hash(x::ASCII_ID, h::UInt64) = hash(aword(x),h)
function check_list_in_blocklist(list, blocklist)
blocklist_set = Set{ASCII_ID}(blocklist)
result = Array{ASCII_ID}([])
for l in list
if l in blocklist_set
push!(result,l)
end
end
return result
end
list_ascii_id = ASCII_ID.(list)
blocklist_ascii_id = ASCII_ID.(blocklist);
@btime check_list_in_blocklist(list_ascii_id, blocklist_ascii_id)
This benchmark produces
271.374 ms (19 allocations: 4.08 MiB)
254.197 ms (19 allocations: 7.91 MiB)
129.662 ms (19 allocations: 7.91 MiB)
Note that even though the last solution is the fastest, the cost of casting the elments from list
is big. Therefore, if the data is not already in that format second and third solution would actually be slower, besides beeing more complex.
@btime list_inline = InlineString15.(list)
423.740 ms (4 allocations: 152.59 MiB)
I want to thank @oxinabox and @kristoffer.carlsson for the discussions about this in Slack.