(Efficiently) construct a Dictionary from a Set

In our package we have some data structures that act as wrappers around Set{Int} to keep track of various index sets. We now want to attach a ‘weight’ to each item in the set, i.e. construct a Dict{Int,Float64}, and we want to do this without carrying each index twice in memory.

Currently, we just create a new dictionary with the appropriate weights; for simplicity, let’s say it looks like this:

set = Set{Int}(rand(Int, 10^4))
newdict = Dict{Int,Float64}(ind => complicatedfunction(ind) for ind in set)

but this means that we are storing each Int twice. Our goal therefore is to remove the duplicate storage.

So the question: what is the most efficient way to do this?

I figure that since a Set{Int} is stored as a Dict{Int,Nothing} in memory, there must be an efficient way. Maybe instead of using Set{Int} we could use some sort of Dict{Int,Union{Float64,<64-bit pointer>}} which points to some dummy object when there is no weight? But as I understand it, using non-concrete types like Union is inefficient.
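
A minimal sketch of a variant of that idea, using Union{Nothing,Float64} (Nothing in place of a dummy pointer) and leaning on the fact that Julia stores such small Unions inline with a one-byte type tag rather than as pointers; the code is illustrative, not from the package under discussion:

# One Dict doubles as the index set (value === nothing) and as the
# weight table (value isa Float64); the values array stays compact
# thanks to Julia’s small-Union optimization.
d = Dict{Int,Union{Nothing,Float64}}()
for i in rand(Int, 10^4)
    d[i] = nothing               # membership only, weight unknown
end
d[first(keys(d))] = 0.5          # attach a weight in place later
count(isnothing, values(d))      # entries that still lack a weight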

You could reverse the order by initially creating the weight Dict and getting its keys when you need a Set. Maybe store NaN for an unknown weight, though storing unnecessary weights will “waste” memory. Otherwise, anything you do to try to reuse the values of the Set as keys of a Dict will likely use more memory than the duplication you seek to prevent. If your Ints fit in Int32 (or smaller), you can simply save half (or more) of the memory.
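
A sketch of the NaN-placeholder variant, reusing the complicatedfunction placeholder from the original post:

# Build the Dict first; NaN marks weights that haven’t been computed yet.
d = Dict{Int,Float64}(i => NaN for i in rand(Int, 10^4))
s = keys(d)                        # set-like view of the indices, no copy
for i in s
    d[i] = complicatedfunction(i)  # fill in real weights only when needed
end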

Are you truly concerned that duplicating the Ints will exceed your available memory, or do you want to avoid the possibility of inconsistency, or is it just a matter of principle?

One more consideration: computer memory is cheaper than the time it takes you to create and maintain a complex solution.

I like the idea of reversing the problem; it was my first idea too. The issue is that the Dicts only need to be constructed in a portion of our algorithms, meaning our storage would be doubled for the majority of our use cases.

For low-dimensional problems, storing data as Int32 works well. But once we want to solve large PDEs, for example, the indices won’t fit in Int32 anymore.

I suppose it’s mostly a matter of principle, and maybe a bit to reduce GC pressure. These Dicts have to be produced at every stage of an iteration.

Frankly, if it were up to me, I would leave the duplicated memory storage. But my professor wants this project to be perfect, since it’s sort of his magnum opus.

I don’t know if this is feasible, but you could define all your behaviour on an abstract supertype and have two concrete implementations: one that uses only a Set, and one that uses a Dict from the start. You would then determine whether the weights are needed at the beginning and create the appropriate concrete type (see the sketch below).
But type stability could become a problem with this approach.
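
A minimal sketch of that design; all names here (AbstractIndexSet and friends) are illustrative, not taken from the package under discussion:

abstract type AbstractIndexSet end

struct PlainIndexSet <: AbstractIndexSet      # Set only, no weights
    inds::Set{Int}
end

struct WeightedIndexSet <: AbstractIndexSet   # Dict from the start
    weights::Dict{Int,Float64}
end

Base.in(i::Int, s::PlainIndexSet) = i in s.inds
Base.in(i::Int, s::WeightedIndexSet) = haskey(s.weights, i)
Base.length(s::PlainIndexSet) = length(s.inds)
Base.length(s::WeightedIndexSet) = length(s.weights)

# Decide up front whether weights will be needed; NaN marks “not yet computed”.
make_index_set(inds; weighted::Bool = false) =
    weighted ? WeightedIndexSet(Dict{Int,Float64}(i => NaN for i in inds)) :
               PlainIndexSet(Set{Int}(inds))

As long as the choice is made once at construction time and the concrete type is passed through a function barrier, the type-stability concern should be manageable.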

Here’s a function which converts a set to a dictionary without copying the underlying data. It unfortunately depends on the internal representation of these data structures in Julia 1.7.3, so only use it as a last resort, and always test whether it still works with any other version of Julia.

function set_to_dict(s::Set{KeyType}, ValueType, mapping_function) where {KeyType}
    d = s.dict  # a Set is a thin wrapper around a Dict{KeyType,Nothing}
    # Build a values array parallel to the set's existing slots/keys arrays.
    new_vals = Vector{ValueType}(undef, length(d.slots))
    for i in 1:length(d.slots)
        if Base.isslotfilled(d, i)  # skip empty and deleted slots
            new_vals[i] = mapping_function(d.keys[i])
        end
    end
    # Reuse the slots and keys arrays of the set; only the values are new.
    Dict{KeyType, ValueType}(d.slots, d.keys, new_vals, d.ndel, d.count, d.age, d.idxfloor, d.maxprobe)
end

Here’s a simple test:

julia> s = delete!(Set([1,2,3]), 3) # important to test if the function can handle sets with deletion markers
Set{Int64} with 2 elements:
  2
  1

julia> @assert set_to_dict(s, Float64, a->a/2) == Dict([1=>1/2, 2=>2/2])

Updating the dictionary will leave the original set in a broken state since some underlying data are shared. So treat the dictionary as read-only unless it’s OK to destroy the original set.


Thank you! This is exactly the type of construction I was hoping would exist. I’ve been trying to understand the source code for Dicts and didn’t quite get it, but I had a feeling you might be able to do something like this!

As someone with many years of experience maintaining code: using this seems like a contradiction of the goal.


Note that Int would be Int32 on 32-bit platforms (which you can maybe safely ignore), but I would suggest Dict{UInt32, Float16}, or even Dict{UInt32, some 8-bit type}, e.g. an 8-bit fixed-point type, if you’re this concerned about size.

Note that the keys and vals (at least in Base’s Dict) are implemented as separate arrays, and no rule states that both must be the same size, e.g. both 32-bit: https://github.com/JuliaLang/julia/blob/3653d3d05ffb8a652b12ff22e550ec8fe07edd1d/base/dict.jl#L57

I suppose you need the weight 0.0, which you have in fixed-point, but also 1.0, which I’m not sure you have exactly, only approximately? And likely no NaN, so just keep that in mind. And all values in between might not map cleanly back and forth to floats, but at least you don’t waste bits on values outside of [0.0, 1.0], which you would never need to represent…
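
For reference, a sketch using FixedPointNumbers.jl (my assumption for the fixed-point package, since none is named above); its N0f8 type steps through [0.0, 1.0] in increments of 1/255 and represents both endpoints exactly:

using FixedPointNumbers          # assumed package; provides N0f8 = Normed{UInt8,8}

weights = Dict{UInt32,N0f8}()
weights[UInt32(1)] = N0f8(0.0)   # exact (0/255)
weights[UInt32(2)] = N0f8(1.0)   # also exact (255/255)
weights[UInt32(3)] = N0f8(0.3)   # rounded to the nearest multiple of 1/255

sizeof(UInt32) + sizeof(N0f8)    # 5 bytes per entry vs. 16 for Int64 => Float64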

Do you need the Set at all? Can’t the info you get from it be inferred from your Dict, with all operations still fast enough? I see Set is actually implemented with dict::Dict{T,Nothing} (and is that the most compact form possible?): https://github.com/JuliaLang/julia/blob/4bc549ed397a5e919caca6bd3adc7ff199717ea3/base/set.jl#L40

There are other Dicts available in that package and others (also Sets elsewhere?); LittleDict is from a different package (and faster for small collections):

RobinDict (implemented with Robin Hood Hashing)
SwissDict (inspired from SwissTables)
Dictionaries with Defaults […]
Sorted Dict, Sorted Multi-Dict and Sorted Set

You likely don’t need an ordered dict (or do you? I would want ordered as the default for Julia, though I’m still conflicted about it, as are others). An ordered dict wasn’t chosen as Julia’s default since it could be slower. But I don’t know how space-efficient any of the alternatives are; I suspect there’s a correlation, though, between fastest and most space-efficient.

The Dictionaries.jl package implements its own dict and set types, and they support this kind of efficient construction.
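
A sketch of what that looks like, based on my reading of the Dictionaries.jl README (so treat the details as an assumption): map over an Indices returns a Dictionary that shares the index storage instead of copying it.

using Dictionaries

inds = Indices([1, 2, 3])    # Dictionaries.jl’s set type
d = map(i -> i / 2, inds)    # Dictionary{Int64, Float64}; shares its
                             # indices with inds rather than copying them
d[2]                         # 1.0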


Note that if you start with the Dict instead of the Set, you can recover a set-like object using the keys function. This function does not copy the data.

julia> d = Dict(1=>"a",2=>"b")
Dict{Int64, String} with 2 entries:
  2 => "b"
  1 => "a"

julia> s = keys(d)
KeySet for a Dict{Int64, String} with 2 entries. Keys:
  2
  1

julia> isa(s, AbstractSet{Int})
true

julia> 1 in s
true

julia> 3 in s
false

Haha yes, but you have to understand my professor grew up in a time when computers had only a few MB of RAM.


Unfortunately that’s another question that’s up in the air: we might end up switching to OrderedDicts from OrderedCollections.jl, again due to memory duplication problems. It seems OrderedDict doesn’t implement the same inner constructor, so the method of @greatpet might not work… maybe I’ll open a PR adding that inner constructor.

OrderedCollections.jl currently has a few open bugs and is seemingly not well maintained, just FYI. See Issues · JuliaCollections/OrderedCollections.jl · GitHub

Going off topic, it’s also a little concerning that both Dictionaries.jl and UnorderedCollections.jl have to ccall the undocumented (as far as I know) C function jl_arrayunset declared in julia.h. The basic problem seems to be that if the keys (or values) of a dictionary are of a mutable type and are stored in some underlying array, deleting a key/value pair requires telling the Julia runtime that a certain element of the underlying array should be reset to undef and can therefore be garbage-collected. Anyone trying to implement a fundamental data structure from scratch will run into this problem.
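
For illustration, a minimal sketch of the kind of call involved; jl_arrayunset is an internal runtime function, so the signature used here is an assumption based on current internals and may change between Julia versions:

v = Vector{String}(undef, 3)
v[1] = "a"; v[2] = "b"; v[3] = "c"

# Tell the runtime slot 2 is unset again, so the string it held can be
# garbage-collected; note the zero-based index expected on the C side.
ccall(:jl_arrayunset, Cvoid, (Any, UInt), v, UInt(2 - 1))

isassigned(v, 2)  # false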

Maybe Base should export and document a few of these essential methods, so that users and package writers don’t have to depend on this kind of internal detail? Of course I trust the authors of these established packages, since they’re experts and possibly Base devs themselves, but the situation is at least a little unsatisfying aesthetically. Implementing a low-level data structure may not be a common use case for Julia, but it’s still something that should be supported.


I think I understand what the need is, though I’m not sure I fully understand everything I’m writing about.
But something comes to mind along the lines of the following expressions (in any case the error case needs fixing; for now it’s only a placeholder):

s=Set([1,2,3])


w=rand(3)

Base.getindex(s::Set, i) = i ∈ s ? nothing : error("key not found")  # error branch is a placeholder

julia> [s[i] = w[i] for i in s]
ERROR: MethodError: no method matching setindex!(::Set{Int64}, ::Float64, ::Int64)

The problem is to define an appropriate setindex! function (and I don’t know if that is possible under the given conditions).



s=Set([1,2,3])

Base.getindex(s::Set, i::Int) = i ∈ s ? 1.0 : nothing
Base.setindex!(s::Set{Int64}, w::Float64, i::Int) = Dict(s[i] => w)

w=rand(3)

[s[i]=w[i] for i in s]

If you want to save space and my proposal for 8-bit fixed point isn’t good enough, there’s also a different number format that’s simple (though I’ve not yet seen it implemented in Julia), and building on it there’s the new format “dynamic quantization”, which I just discovered:

Dynamic Tree Quantization

By moving the indicator bit, numbers can have a large exponent 10^-7 or precision as high as 1/63. […]
Dynamic tree quantization is strictly defined to quantize numbers in the range [-1.0, 1.0]

See chapter 1.3 and fig 2 at:

In your case, the sign bit could be sacrificed for better precision, up to 1/127.

So 8-bit precision absolutely is being used, even for gradients, and the method in the paper, which builds on the above number format, is nice:

In this paper, we develop the first optimizers that use 8-bit statistics […] As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT’14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.
[…]
For GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers, as we show in Section 3.
[…]

2.2 Dynamic Quantization

In this work, we extend dynamic tree quantization Section 1.3 for non-signed input tensors by re-purposing the sign bit. Since the second Adam state is strictly positive, the sign bit is not needed. Instead of just removing the sign bit, we opt to extend dynamic tree quantization with a fixed bit for the fraction. This extension is motivated by the observation that the second Adam state varies around 3-5 orders of magnitude during the training of a language model. In comparison, dynamic tree quantization already has a range of 7 orders of magnitude. We refer to this quantization as dynamic quantization to distinguish it from dynamic tree quantization in our experiments.

Note also that there is a package for a logarithmic number format, e.g. ULogFloat16 (which could be extended to a ULogFloat8), that would be better than Float32 (and might be better than Float16; for your case I’m not sure, though at least some 8-bit format would be better). That package stores a sign (which could be omitted here), and it can support 0.0 (even though log(0) isn’t defined) by special-casing.