Fast Dict value modification by accessing Dict.vals


#1

I have a use case that needs to blindly change the values stored in a Dict without worrying about the keys and do so quickly. The solution is effective:

function dict_value_shift!(d::Dict{Ti,Tv},shift::Tv) where Ti where Tv <: Signed
    # In version v1.2+ this can use hasfield(Dict,:vals) instead
    @assert isdefined(d,:vals) "If this fails implimentation of Dict has changed"
    @inbounds for n in 1:length(d.vals)
        d.vals[n]-=shift
    end
end

function dict_value_shift!(d::Dict{Ti,Vector{Tv}},shift::Tv) where Ti where Tv <: Signed
    # In version v1.2+ this can use hasfield(Dict,:vals) instead
    @assert isdefined(d,:vals) "If this fails implimentation of Dict has changed"
    for n in 1:length(d.vals)
        if isassigned(d.vals,n)
            vec=d.vals[n]
            @inbounds for i in 1:length(vec)
                vec[i]-=shift
            end
        end
    end
end

Now we could iterate through Dict, but when you get to 10000+ element this is more than 10x faster.

It would be great if there was an interface in Julia that allowed this ( something line nonzeros in SparseArrays), but in the absence of that how bad of an idea is this?

Is the Dict type likely to ever lose the vals field or something equivalent?

Some people seem hesitant to rely on a hidden field.


#2

I’m not a fan of using internal fields like this for professional projects. If the performance gain is 5 %, 10 %, I’d say it’s probably not worth it. For a 10x performance gain, and assuming that this method also makes up a significant portion of your overall runtime, I’d say go ahead!

I agree, it’d be great with an interface that allowed this. There are other dictionary-based operations that can be sped up considerably by exploiting the internal structure, most notably a combined set/update, you can see how this is done in StatsBase for example.

One thing I’d change in your implementation is to write a couple of unit tests for this instead of relying on @assert. This is a safer approach IMO, since you’d get it tested during build for every commit, and your test can ensure that all values are actually shifted. Whether vals is removed in a future version is one issue, but it could also be that it’s still there but has a different meaning.


#3

For context, here is the underlying discussion: https://github.com/JuliaOpt/LinQuadOptInterface.jl/pull/95

The naive implementation is

function dict_value_shift!(d::Dict{Ti, Tv}, shift::Tv) where {Ti, Tv <: Signed}
    for (key, value) in d
        dict[key] -= shift
    end
end

#4

I think that in this use case, element counts >>1000 are not uncommon.
2x @ n=1000 -> 25x @ n=1_000_000


#5

Note that this will fail with a Dict{Int,Signed}:

julia> d = Dict(1=>2, 3=>Int8(4))
Dict{Int64,Signed} with 2 entries:
  3 => 4
  1 => 2

julia> function dict_value_shift!(d::Dict{Ti,Tv},shift::Tv) where Ti where Tv <: Signed
           # In version v1.2+ this can use hasfield(Dict,:vals) instead
           @assert isdefined(d,:vals) "If this fails implimentation of Dict has changed"
           @inbounds for n in 1:length(d.vals)
               d.vals[n]-=shift
           end
       end
dict_value_shift! (generic function with 1 method)

julia> dict_value_shift!(d, 1)
ERROR: UndefRefError: access to undefined reference
Stacktrace:
 [1] getindex at ./array.jl:729 [inlined]
 [2] dict_value_shift!(::Dict{Int64,Signed}, ::Int64) at ./REPL[5]:5
 [3] top-level scope at none:0

Or even just BigInts:

julia> d = Dict(1=>big(2), 3=>big(4))
Dict{Int64,BigInt} with 2 entries:
  3 => 4
  1 => 2

julia> dict_value_shift!(d, big(1))
ERROR: UndefRefError: access to undefined reference
Stacktrace:
 [1] getindex at ./array.jl:729 [inlined]
 [2] dict_value_shift!(::Dict{Int64,BigInt}, ::BigInt) at ./REPL[5]:5
 [3] top-level scope at none:0

Also you have a typo in @inbounds for i in 1:length(vals) in the vector method. You want vec there.

I’d just define it for Int or a small set of concrete integer types (with an explicit metaprogramming loop).


#6

@mbauman thank you the typo has been fixed in the original post.

In the actual application, the value type is a concrete type Int. I just had it be Signed so that people didn’t worry about negative issues.


#7

In what form do you need the mapping after this is done — just a table of key => value pairs, or O(1) random lookup?

I agree with @bennedich about accessing internals; depending on your use case perhaps you can just collect the contents and work on Vectors.

Otherwise, the right way to solve this would be introducing an interface that just modifies the values, eg mapvalues!(f, ::Dict) and perhaps a non-modifying mapvalues which creates a new Dict. This would make a nice PR to Base.


#8

@Tamas_Papp I am in the process of figuring out how to modify Base on my computer.

But the solution I came up with is as follows, and the magic of Julia means that non-mutating version is actually twice as fast as the fast for loop.
The mutating version is about 6x slower than the typesafe version if anyone has any recommendations on how to remedy that.

"""
    mapdict!(f, dict) -> dict
Takes the function f(value) and transforms the stored values with essentially value=f(value).

If the value returned by f(value) is of a differnet type than the input function used
must take the for dict = map(f,dict)  which will mutate dict in a memory efficent manner
"""
function mapdict!(f, d::Dict)
    isempty(d) && return d
    f_return_type = typeof(f(first(d)[2]))
    _mapdict!(f_return_type,f,d)
end
"""
    mapdict(f, dict) -> dict
Takes the function f(value) and returns a new Dict with the values transfor with new_value=f(value).

If the value returned by f(value) is of a differnet type than the input function used
must take the for dict = map(f,dict)  which will mutate dict in a memory efficent manner
"""
function mapdict(f, d::Dict)
    isempty(d) && return d
    f_return_type = typeof(f(first(d)[2]))
    _mapdict(f_return_type,f,d)
end



# This is the typesafe version
function _mapdict!(::Type{TV}, f, d::Dict{TK,TV}) where TV where TK
    _mapdict_apply!(f, d, d.vals)
end
function _mapdict(::Type{TV}, f, d::Dict{TK,TV}) where TV where TK
    new_d=Dict(d)
    _mapdict_apply!(f, new_d, new_d.vals)
end

# Mutating verion
function _mapdict!(::Type{TVnew}, f, d::Dict{TK,TV}) where TV where TK where TVnew
    L = length(d.vals)
    new_vals = Vector{TVnew}(undef,L)

    new_d = Dict{TK, TVnew}(d.slots, d.keys, new_vals, d.ndel, d.count, d.age, d.idxfloor, d.maxprobe)

    _mapdict_apply!(f, new_d, d.vals)
    return new_d
end
function _mapdict(::Type{TVnew}, f, d::Dict{TK,TV}) where TV where TK where TVnew
    L = length(d.vals)
    new_vals = Vector{TVnew}(undef,L)

    new_d = Dict{TK, TVnew}(copy(d.slots), copy(d.keys), new_vals, d.ndel, d.count, d.age, d.idxfloor, d.maxprobe)

    _mapdict_apply!(f, new_d, d.vals)
    return new_d
end

function _mapdict_apply!(f,d::Dict{TK,TV}, old_vals::Vector{OV}) where TV where TK where OV
    L = length(d.vals)
    i = d.idxfloor
    vals = d.vals
    Base.@inbounds while i < L
        Base.isslotfilled(d, i) && (vals[i] = f(old_vals[i]))
        i += 1
    end
    return d
end


#9

I think it could be much simpler, and the in-place version should just assume that the result type can be contained.

There should be no need to calculate f_return_type. The only concern is perhaps not mapping the unused values in the vals vector.


#10

The original simple version I had was:

function mapdict!(f, d::Dict)
    L = length(d.vals)
    i = d.idxfloor
    vals = d.vals
    Base.@inbounds while i < L
        Base.isslotfilled(d, i) && (vals[i] = f(vals[i]))
        i += 1
    end
end

Is only about 10% faster without the functionality.


#11

I also don’t think that there is a way to get to the unused values without them first being assigned in a proper way. Dict uses undef in its constructor so I think that it is safe in both locations.

You are right there is no need to f_return_type if we don’t have to worry about the mutating version. But even if we said mapdict! had to be non-mutating, the copy version would still want to dispatch whether it mutates or not.


#12

My main concern is that you cannot decide the type just by invoking the function on the first element. A correct implementation would work for

f(i::Integer) = i + (isodd(i) ? 1 : 1.0)

#13

So this looks a bit more complicated but it should cover input functions which aren’t type safe.

For the type-safe condition there is only about 5% over head for these changes.

The mutating version is still like 6x slower than the non mutating version, the type broadening path is then about 2x slower than that.

"""
    mapdict!(f, dict) -> dict
Takes the function f(value) and transforms the stored values with essentially value=f(value).

If the value returned by f(value) is of a differnet type than the input function used
must take the for dict = map(f,dict)  which will mutate dict in a memory efficent manner
"""
function mapdict!(f, d::Dict{K,V}) where K where V
    isempty(d) && return d
    f_return_type = typeof(f(first(d)[2]))
    if f_return_type == V
        new_d=d
    else
        L=length(d.vals)
        new_vals = Vector{f_return_type}(undef,L)
        new_d = Dict{K, f_return_type}(d.slots, d.keys, new_vals, d.ndel, d.count, d.age, d.idxfloor, d.maxprobe)
    end
    ret =  _mapdict_apply!(f, new_d,d.vals) # This function will return a Tuple if it isn't typestable
    ret isa Dict && return ret# If ret is a Dict the function complete in a typesafe way
    # If it didn't we have to broaden the type
    ret= _mapdict_unioning!(f,d,ret)
    return ret
end



"""
    mapdict(f, dict) -> dict
Takes the function f(value) and returns a new Dict with the values transfor with new_value=f(value).
"""
function mapdict(f, d::Dict{K,V}) where K where V
    isempty(d) && return d

    f_return_type = typeof(f(first(d)[2]))
    if f_return_type == V
        new_d=Dict(d)
    else
        L=length(d.vals)
        new_vals = Vector{f_return_type}(undef,L)
        new_d = Dict{K, f_return_type}(copy(d.slots), copy(d.keys), new_vals, d.ndel, d.count, d.age, d.idxfloor, d.maxprobe)
    end
    ret =  _mapdict_apply!(f, new_d,d.vals)
    ret isa Dict && return ret # If ret is a Dict the function complete in a typesafe way
    # If it didn't we have to broaden the type
    ret= _mapdict_unioning!(f,d,ret)
    return ret
end



@inline function _mapdict_apply!(f,d::Dict{K,V}, old_vals::Vector{Vold},i_start=d.idxfloor) where V where K where Vold
    L = length(d.vals)
    type_test(v)=false
    if isconcretetype(V)
        @inline type_test(v)= typeof(v)===V
    else
        @inline type_test(v)= typeof(v) <: V
    end

    i = i_start
    vals = d.vals
    @inbounds while i < L
        (Base.isslotfilled(d, i) || (i+=1; continue )) && #This first line is to check the slot and iterate if it isn't used
        (new_val = f(old_vals[i]); true) && type_test(new_val) && # This line is getting the new val checking the type
        (vals[i] = new_val; true) && (i += 1; true) && continue #this line is finishing the iteration
        # If anything fails above we return a tuple with the new type,dict, and the current index so we don't reapply the function
        return (typeof(new_val),d,i)
    end
    return d
end


# This function will keep broadening the new type for 5 iterations then assume the output should be any
function _mapdict_unioning!(f, old_d::Dict{K,Vold},ret::Tuple{DataType,Dict{K,Vnew},Int}) where K where Vnew where Vold
    num_types_before_any = 5
    type_counter=0
    union_type=Vnew
    d=ret[2]
    while ret isa Tuple && type_counter <= num_types_before_any
        type_counter+=1
        new_type=ret[1]
        union_type=type_counter < num_types_before_any ? Union{union_type,new_type} : Any
        d = Dict{K, union_type}(d.slots, d.keys, convert(Vector{union_type}, d.vals), d.ndel, d.count, d.age, d.idxfloor, d.maxprobe)
        ret=_mapdict_apply!(f, d, old_d.vals,ret[3])
    end
    if ret isa Dict
        return ret
    else
        #Function should never get here because Any should catch everything
        error("mapdict could not converge on function output type")
    end
end

#14

For reference the PR:
https://github.com/JuliaLang/julia/pull/31223

Has been created to cover this.


#15

And is now merged. The new API is really nice and clean, for example:

map!(x -> x + 1, values(dict))

Thanks for the great work, @ndinsmore!


#16

I wonder if a function for mapping just the values of a Dict would make sense, with the same effect as

mapvalues(f, dict) = Dict(zip(keys(dict), map(f, values(dict))))

but avoiding rebuilding the table.

It would be straightforward, except for the fact that it is tricky to figure out what to map unused slots in the vals vector to — they may not even be valid values.


#17

I’m not sure we even need a different API. We can just specialize

Dict(z::Zip{<:Tuple{KeySet,Any}})

to copy the table structure of z.is[1].dict and put the the values from z.is[2] into the the table in the slots of the dict. The API would just be writing this:

Dict(zip(keys(d), f(x) for x in values(d)))

It’s also more general since there’s no reason why the second argument has to have anything to do with the original Dict—it’s just anything that provides a value for each key in the dictionary. The only thing that’s required to avoid rebuilding the hash table is to know that you’re zipping the keys of a dict with something else and passing it to the Dict constructor.