I have always lazily assumed that IdDict would be faster whenever keying by identity is desired, but it seems that this is not the case:
julia> using BenchmarkTools

julia> function benchmarkdict(DType, ks)
           d = DType(k => k for k in ks)
           @benchmark $d[x] setup=(x = rand($ks))
       end
benchmarkdict (generic function with 1 method)
julia> benchmarkdict(Dict, 1:10)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.900 ns (0.00% GC)
  median time:      15.000 ns (0.00% GC)
  mean time:        12.836 ns (0.00% GC)
  maximum time:     106.400 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> benchmarkdict(IdDict, 1:10)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.737 ns (0.00% GC)
  median time:      27.756 ns (0.00% GC)
  mean time:        25.097 ns (0.00% GC)
  maximum time:     109.118 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
For objects where one would expect IdDict to be significantly faster, it is, of course:
julia> benchmarkdict(Dict, [randn(100) for _ in 1:10])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     880.769 ns (0.00% GC)
  median time:      900.000 ns (0.00% GC)
  mean time:        945.038 ns (0.00% GC)
  maximum time:     3.644 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     52

julia> benchmarkdict(IdDict, [randn(100) for _ in 1:10])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.325 ns (0.00% GC)
  median time:      19.238 ns (0.00% GC)
  mean time:        18.848 ns (0.00% GC)
  maximum time:     114.729 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
Is there an easy-to-formulate rule or recommendation here? Primitive keys => use Dict? Hash function faster than xxx => use Dict? Mutable keys => use IdDict?
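One way to make such a rule concrete, if it helps: compare the cost of hash (which Dict keys on) with objectid (which IdDict keys on) for the key type at hand. A quick sketch:

x = randn(100)
@btime hash($x)       # Dict hashes the full contents, so cost grows with size
@btime objectid($x)   # IdDict only needs the object's identity, independent of size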
As a side note, is there a faster way than Dict to access large structs by integer keys (I don't care about the speed of storing)? The keys are unfortunately big numbers and sparse, so I don't think an Array is an option.
“unfortunately keys are big numbers”
How big is the largest possible key? Roughly, how many keys at most?
How large are your structs likely to be?
With an IdDict, every key is hashed by its objectid, which is a UInt64; keys are compared by identity (===), not by value.
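A small illustration of that difference, using two equal but distinct arrays:

a = [1, 2]
b = [1, 2]            # equal contents, different object

d  = Dict(a => 1)
id = IdDict(a => 1)

haskey(d, b)          # true:  Dict compares keys with isequal
haskey(id, b)         # false: IdDict compares keys with ===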
With a Dict, keys may share a single type, but that is not required; they may have different types, whatever makes sense in the context of that dictionary. A mapping of names to addresses might be kept as a Dict with the names (as Strings) as keys and a struct that holds an address as the value type. That flexibility does not carry over to using an IdDict.
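As a minimal sketch of that name-to-address example (the Address struct and its fields are made up for illustration):

struct Address
    street::String
    city::String
end

directory = Dict{String,Address}(
    "Alice" => Address("1 Main St", "Springfield"),
    "Bob"   => Address("2 Elm St", "Shelbyville"),
)

directory["Alice"].city   # "Springfield"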
There are a few places where I need this lookup, and they vary a little.
Case 1: keys seem to be in the 10-digit range and there are not many of them in use at a time, typically a single-digit number of keys. The objects are somewhat large, a couple of megabytes.
Case 2: keys are a bit smaller, typically in the 4-digit range, and there is normally a low double-digit number of them. The objects are typically large, up to gigabytes but more typically tens to hundreds of megabytes.
More background on what the whole thing is supposed to do is in here, if you are in a reading mood.
Another mistake in the previous post: I think the numbers were off by an order of magnitude, as I lazily just took ndigits to give the exponent, so 1e10 really means a 10-digit integer. Corrected now.
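Concretely, this is the off-by-one I mean:

julia> ndigits(1_234_567_890)   # a 10-digit key
10

julia> ndigits(10^10)           # but 1e10 itself already has 11 digits
11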
(a) With each separate run of your software, you have a relatively small number of these large files.
(b) With each file is associated an integer.
(1) that number is determined by the file's content
(2) that number is assigned after the content is reviewed
Is it possible for two different files to be given the same number?
Almost. The number comes from a protocol header, and the file is fetched from a database using the number and cached in a Dict, which is where the dict lookup comes in.
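The pattern is roughly this; fetch_from_db is a hypothetical stand-in for the real database call, and the key/value types are illustrative:

const cache = Dict{Int,Vector{UInt8}}()   # file id => file contents

function fetch_file(id::Int)
    get!(cache, id) do
        fetch_from_db(id)   # only runs on a cache miss
    end
end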
For a single datastream I can't imagine that the headers would have vastly different such integers.
In many cases one needs to process several streams (in extreme cases up to 1000ds), typically in file-dump format. Even in this case I expect the number of unique encountered file ids to be relatively small.
Are you saying two different files can have the same protocol header number?
Or, are you saying the opposite, that all protocol headers have distinct numbers?
I have read your other notes.
More fundamentally, you get handed these files.
What is it that you do with them: Summarize? Analyze? Reorganize? …?
In “1000ds” what is “ds”?
You can manage the many-to-few quandary (numbers come from a large possible set but are assigned to a dramatically smaller number of distinct files) by some careful, colorful folding of integers.
You want to reduce the span of possible numbers to obtain a smaller, more task-opportune set. However, any remapping of protocol numbers must be reversible [a careful folding], and you need to guard against mapping distinct values together indistinguishably [a careful, colorful folding].
Fortunately, much is known about ways (there are more than one) of accomplishing that.
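For concreteness, one such folding might look like the following minimal sketch, not a recommendation: each sparse, large id is folded into a small dense range, and the original id is stored alongside so distinct values that fold together remain distinguishable. The table size, probing scheme, and the assumptions that ids are positive and the table never fills are all illustrative.

# Fold sparse, large ids into a small dense range; keep the original id so
# collisions stay detectable (assumes ids are > 0 and the table never fills).
struct FoldedTable{V}
    ids::Vector{Int}   # 0 marks an empty slot
    vals::Vector{V}
end

FoldedTable{V}(n::Integer) where {V} =
    FoldedTable{V}(zeros(Int, n), Vector{V}(undef, n))

slot(id, n) = mod(id, n) + 1   # the folding step

function Base.setindex!(t::FoldedTable, v, id::Int)
    n = length(t.ids)
    i = slot(id, n)
    while t.ids[i] != 0 && t.ids[i] != id
        i = mod(i, n) + 1      # linear probing on collision
    end
    t.ids[i] = id
    t.vals[i] = v
    return t
end

function Base.getindex(t::FoldedTable, id::Int)
    n = length(t.ids)
    i = slot(id, n)
    while t.ids[i] != id
        t.ids[i] == 0 && throw(KeyError(id))
        i = mod(i, n) + 1
    end
    return t.vals[i]
end

Note that with a handful of 10-digit ids and a small table, this is essentially what a Dict already does internally, which is part of why beating Dict here takes real effort.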
Rather than going further down that path first, let's better understand your task from a design perspective.
There may be commonality of purpose that proves to simplify choice or clarify workflow.
Are you saying two different files can have the same protocol header number?
Uhm, they can have the same number, but it can also differ. If I set out to process a thousand files, I would not expect there to be a thousand different versions of the spec, though it could very well be more than one.
What is it that you do with them: Summarize? Analyze? Reorganize? …?
Sorry if I'm dragging you into a wild goose chase here; I thought my question had a narrow scope.
The main purpose is to make the data available so it can be analyzed and/or summarized. It is a generic tool with the single purpose of translating this opaque and crazy format as quickly as possible and providing the data to the user for whatever purpose.
The reason for asking about faster options is that I got about a 15% speedup when changing all IdDicts to Dicts, which was a bit surprising but seems to be in line with the simpler benchmarks above (the program obviously does more things than dict lookups).
For others to help with the data: how best to cull, catch, consider, catalogue and codify the content; and to assist with a better understanding of how to bring forward more usefulness and data facility …
it helps to share more of the thought, as you are doing
Thanks for all your time! This sounds like how a Dict handles the keys, so maybe it takes some substantial effort to beat it. I was thinking someone would just post a link to IntDict.jl or SparseFastArrays.jl.
I must say I'm really happy with my current performance, as I have managed to take a task which used to take 10-100 minutes and do it in 1-10 seconds. I guess I'm just being a bit greedy now that those 10 seconds feel like an eternity, as well as a bit inspired by that free 15% speedup.
Fully agree! But it is also a balance of not giving too much information at once, as it then becomes difficult to parse out the issues.
There are a couple more things which I think can be made faster that don't have anything to do with the cache lookups, but I'm thinking that is best handled in a separate topic, and I'm going to do some experiments myself first to be able to provide a better MWE.