Performance of IdDict vs Dict

I have lazily assumed that IdDict would always be faster when keying by identity is desired, but it seems this is not the case:

julia> function benchmarkdict(DType, ks)
           d = DType(k => k for k in ks)
           @benchmark $d[x] setup=(x = rand($ks))
       end
benchmarkdict (generic function with 1 method)

julia> benchmarkdict(Dict, 1:10)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.900 ns (0.00% GC)
  median time:      15.000 ns (0.00% GC)
  mean time:        12.836 ns (0.00% GC)
  maximum time:     106.400 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> benchmarkdict(IdDict, 1:10)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.737 ns (0.00% GC)
  median time:      27.756 ns (0.00% GC)
  mean time:        25.097 ns (0.00% GC)
  maximum time:     109.118 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

For objects where one would expect IdDict to be significantly faster, it is, of course:

julia> benchmarkdict(Dict, [randn(100) for _ in 1:10])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     880.769 ns (0.00% GC)
  median time:      900.000 ns (0.00% GC)
  mean time:        945.038 ns (0.00% GC)
  maximum time:     3.644 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     52

julia> benchmarkdict(IdDict, [randn(100) for _ in 1:10])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.325 ns (0.00% GC)
  median time:      19.238 ns (0.00% GC)
  mean time:        18.848 ns (0.00% GC)
  maximum time:     114.729 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
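A plausible explanation for the asymmetry (my reading of the key semantics, not a claim about either implementation's internals): `Dict` hashes a key's *contents*, so hashing a 100-element array touches every element, while `IdDict` keys on `objectid`, whose cost does not depend on the object's size:

```julia
# hash is content-based: two freshly allocated but equal arrays
# hash identically, and computing it walks all elements.
hash([1.0, 2.0]) == hash([1.0, 2.0])            # true

# objectid is identity-based: the same two arrays get different
# ids, and computing it is cheap regardless of array length.
objectid([1.0, 2.0]) == objectid([1.0, 2.0])    # false

# For a primitive Int key, hash is only a few bit operations,
# which is why Dict wins the first benchmark.
hash(42)
```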

Is there an easy-to-formulate rule or recommendation here? Primitive keys => use Dict? Hash function faster than xxx => use Dict? Mutable keys => use IdDict?

As a side note, is there a faster way than Dict to access (I don’t care about the speed of storing) large structs by integer keys? The keys are unfortunately big numbers and sparse, so I don’t think an Array is an option.

“unfortunately keys are big numbers”
How big is the largest possible key? Roughly, how many keys at most?
How large are your structs likely to be?

With an IdDict, keys are compared by identity and hashed by their objectid, which is a UInt64, regardless of the key’s type.
With a Dict, keys may share a single type, but that is not required; they may have different types – whatever makes sense in the context of that dictionary. A mapping of names to addresses might be kept as a Dict with the names (as Strings) being the keys and a struct that holds an address being the type used for its values. That flexibility does not carry over to an IdDict.
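A small sketch of the practical consequence: `Dict` keys by `isequal`/`hash`, while `IdDict` keys by `===` (via `objectid`), so equal-but-distinct mutable objects behave differently as keys:

```julia
a = [1.0, 2.0]
b = [1.0, 2.0]      # equal contents, but a distinct object (a !== b)

d = Dict(a => "first")
d[b] = "second"     # b isequal a, so this overwrites a's entry
length(d)           # 1

id = IdDict(a => "first")
id[b] = "second"    # b !== a, so this adds a second entry
length(id)          # 2
```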

Thanks,

There are a few places where I need this lookup, and they vary a little.

Case 1: keys seem to be in the 10-digit range and there are not many of them in use at a time, typically a single-digit number of keys. The objects are somewhat large, a couple of megabytes.

Case 2: keys are a little smaller, typically in the 4-digit range, and there is normally a low double-digit number of them. Objects are typically large, up to gigabytes but usually tens to hundreds of megabytes.

More background with what the whole thing is supposed to do is in here if you are in a reading mood :slight_smile:

Case 1: “there are quite few of them in use at the time, typically single digit number of keys”
typo? quite a few of them sounds like more than 9

Hehe, poor English on my side; “quite few” was meant as “not so many”. Will edit the post to reflect this.

Another mistake in my previous post: I think the numbers were off by an order of magnitude, as I lazily took ndigits as the exponent, so 1e10 really meant a 10-digit integer. Corrected now.

Is this accurate (as far as it goes)?

(a) With each separate run of your software, 
      you have a relatively small number of these large files.
(b) With each file is associated an integer.
     (1)  that number is determined by the file's content 
     (2)  that number is assigned after the content is reviewed

Is it possible for two different files to be given the same number?

Almost. The number comes from a protocol header; the file is fetched from a database using that number and cached in a Dict, which is where the Dict lookup comes in.
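So the access pattern is a Dict-backed cache in front of the database. A minimal sketch of that pattern (`fetchfromdb` and the String payload are hypothetical stand-ins, not names from the actual code):

```julia
# Hypothetical stand-in for the expensive database fetch.
fetchcount = Ref(0)
function fetchfromdb(id::Integer)
    fetchcount[] += 1
    return "file payload for $id"   # placeholder for the large object
end

const filecache = Dict{Int,String}()

# get! only calls the fetch function on a cache miss, so repeated
# lookups with the same protocol-header number hit the Dict.
getfile(id::Integer) = get!(() -> fetchfromdb(id), filecache, id)

getfile(1234567890)   # miss: fetches from the "database"
getfile(1234567890)   # hit: served from the Dict
fetchcount[]          # 1
```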

For a single data stream I can’t imagine the headers would have vastly different such integers.

In many cases one needs to process several streams (in extreme cases up to 1000ds), typically in file-dump format. Even in this case I expect the number of unique encountered file IDs to be relatively small.

Are you saying two different files can have the same protocol header number?
Or, are you saying the opposite, that all protocol headers have distinct numbers?

I have read your other notes.

More fundamentally, you get handed these files.
What is it that you do with them: Summarize? Analyze? Reorganize? …?

In “1000ds” what is “ds”?


You can manage the many-to-few quandary (numbers come from a large possible set but are assigned to a dramatically smaller number of distinct files) by some careful, colorful folding of integers.
You want to reduce the span of possible numbers to obtain a smaller, more task-opportune set. However, any remapping of protocol numbers must be reversible [a careful folding], and you need to guard against mapping distinct values together indistinguishably [a careful, colorful folding].
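One simple way to get such a reversible, collision-free folding (a sketch of the general idea, assuming Int keys; not necessarily the scheme alluded to above): intern each sparse key into a dense index the first time it is seen, and keep the reverse map so nothing is lost and distinct keys can never collide:

```julia
# Forward map: sparse big key -> dense small index.
const fold = Dict{Int,Int}()
# Reverse map: dense index -> original key, making the folding reversible.
const unfold = Int[]

function foldkey(k::Int)
    get!(fold, k) do
        push!(unfold, k)
        length(unfold)          # next free dense index
    end
end

i = foldkey(9_876_543_210)      # big sparse key gets a small index
j = foldkey(9_876_543_210)      # same key, same index (i == j)
unfold[i]                       # 9876543210: reversible
```

Once keys are dense small integers, a plain Vector can hold the values, which is about as fast as lookup gets.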

Fortunately, much is known about ways (there are more than one) of accomplishing that.
Rather than first jump into that – let’s better understand your task from a design perspective.
There may be commonality of purpose that proves to simplify choice or clarify workflow.

Are you saying two different files can have the same protocol header number?

Uhm, they can have the same number, but it can also differ. If I set out to process a thousand files I would not expect there to be a thousand different versions of the spec. It could very well be more than one, though.

What is it that you do with them: Summarize? Analyze? Reorganize? …?

Sorry if I’m dragging you into a wild goose chase here; I thought my question had a narrow scope :confused:

The main purpose is to make the data available so it can be analyzed and/or summarized. It is a generic tool with the single purpose of translating this opaque and crazy format as quickly as possible and providing the data to the user for whatever purpose.

The reason for asking about faster options is that I got about a 15% speedup when changing all IdDicts to Dicts, which was a bit surprising but seems in line with the simpler benchmarks above (the program obviously does more than Dict lookups).

In “1000ds” what is “ds”?

Thousands, sorry for sloppy terminology.

no worries

For others to help with the data: how best to cull, catch, consider, catalogue and codify the content; and to assist with a better understanding of how to bring forward more usefulness and data facility …
it helps to share more of the thought, as you are doing

Thanks for all your time! This sounds like how a Dict already handles the keys, so maybe it would take substantial effort to beat it. I was thinking someone would just post a link to IntDict.jl or SparseFastArrays.jl :slight_smile:

I must say I’m really happy with my current performance, as I have managed to take a task which used to take 10-100 minutes and do it in 1-10 seconds. I guess I’m just being a bit greedy now that those 10 seconds feel like an eternity, as well as a bit inspired by that free 15% speedup.

Fully agree! But it is also a balance of not giving too much information at once as it becomes difficult to parse out the issues.

There are a couple more things which I think can be made faster that don’t have anything to do with the cache lookups, but I think that is best handled in a separate topic, and I’m gonna do some experiments myself first to be able to provide a better MWE.