Large number of memory allocations using Dict with custom key type

As part of the code I’m working on, I need to find the unique elements of a collection of small integer arrays and assign a unique identifier to each one. To do this, I could write something like the following:

arrays = [[1,2], [3,4,5], [3,1,2], [1,2]]
U = Dict{Vector{Int64}, Int64}()
for arr in arrays
    get!(U, arr, length(U)+1)   # first occurrence of arr gets the next ID
end

This gives [1,2] => 1, [3,4,5] => 2, [3,1,2] => 3. To avoid allocating a large number of small arrays, what I want to do instead is write all the data into one large buffer and represent the small arrays as pairs of numbers indicating their positions in that buffer. For this I have a version of the following code:

struct CustomView
    data::Vector{UInt32}
    rb::UInt32
    re::UInt32
end

function Base.hash(x::CustomView, h::UInt64)
    # mix in the index span first, then each element
    h = hash(x.re - x.rb, h)
    for i=x.re:-1:x.rb
        h = hash(x.data[i], h)
    end
    return h
end

function Base.isequal(x::CustomView, y::CustomView)
    if (x.re - x.rb) != (y.re - y.rb)
        return false
    end
    for (v1, v2) in zip(x.data[x.rb:x.re], y.data[y.rb:y.re])
        if v1 != v2
            return false
        end
    end
    return true
end

Now I can create the array data = UInt32[1,2,3,4,5,3,1,2,1,2] (UInt32 so it matches the struct’s field type) and replace the small arrays above with CustomView(data, 1, 2), CustomView(data, 3, 5), CustomView(data, 6, 8), and CustomView(data, 9, 10). Dicts keyed by CustomView work correctly.
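As a minimal sketch of the intended usage (the variable names here are only illustrative):

data = UInt32[1, 2, 3, 4, 5, 3, 1, 2, 1, 2]
views = [CustomView(data, 1, 2), CustomView(data, 3, 5),
         CustomView(data, 6, 8), CustomView(data, 9, 10)]
U = Dict{CustomView, Int64}()
for vw in views
    get!(U, vw, length(U) + 1)   # the two [1,2] views get the same ID
end

However, these dicts seem to perform a lot of memory allocations that I would think should not be necessary, and they run slower than comparable approaches. Consider the following test: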

using Random

function test_dicts_tuple(n::Integer, data)
    U = Dict{NTuple{3,UInt32}, UInt32}()
    sizehint!(U, n)
    for i=1:n
        ind = 3 * (i-1) + 1                          # start of the i-th length-3 block
        tp = (data[ind], data[ind+1], data[ind+2])   # the block as a tuple
        get!(U, tp, length(U)+1)
    end
    return U
end

function test_dicts_view(n::Integer, data)
    U = Dict{CustomView, UInt32}()
    sizehint!(U, n)
    for i=1:n
        ind = 3 * (i-1) + 1
        vw = CustomView(data, ind, ind+2)   # view of the same length-3 block
        get!(U, vw, length(U)+1)
    end
    return U
end

function test_dicts(n::Integer, m::Integer)
    data = Vector{UInt32}(rand(1:m, n*3))
    U1 = @time test_dicts_tuple(n, data)
    U2 = @time test_dicts_view(n, data)
    @show length(U1), length(U2)
end

let
    test_dicts(1000, 100)   # warm-up run so compilation is not timed below
    for n in [1000, 10000, 100000, 1000000, 10000000]
        test_dicts(n, 1000)
        test_dicts(n, 5000)
        test_dicts(n, 10000)
    end
end

I won’t post the entire output here, but the very last test case produces the following on my machine:

  0.858638 seconds (7 allocations: 272.008 MiB, 0.84% gc time)
  1.197626 seconds (172.67 k allocations: 343.033 MiB, 4.10% gc time)
(length(U1), length(U2)) = (9999948, 9999948)

The pattern in the allocation counts is consistent across the test cases: using tuples, only 7 allocations are performed, while my CustomView requires hundreds of thousands. I cannot reconcile this with my mental picture of how the code above should work, so I would appreciate some help here.

This doesn’t address your specific question, but maybe the non-allocating types in SmallCollections.jl can help you, for example SmallVector or PackedVector, or even SmallBitSet if it’s OK to convert the vectors to sets.
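For example, a minimal sketch of using SmallVector keys (assuming the arrays fit in the chosen capacity of 8):

using SmallCollections

arrays = [[1, 2], [3, 4, 5], [3, 1, 2], [1, 2]]
U = Dict{SmallVector{8,Int8}, Int}()
for arr in arrays
    # a SmallVector is stored inline, so the key itself needs no heap allocation
    get!(U, SmallVector{8,Int8}(arr), length(U) + 1)
end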

Also, there may be alternatives to a dictionary: fasthash, say for SmallVector{8,Int8}, is faster than a dictionary look-up, so one might use the hash as the identifier of a vector.
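For instance (a sketch: this assumes the two-argument form of fasthash with an explicit seed, and accepts the small risk that two distinct vectors collide):

# identify a small array by its fasthash value rather than a dict entry
id(arr) = fasthash(SmallVector{8,Int8}(arr), zero(UInt))

id([1, 2]) == id([1, 2])      # true
id([1, 2]) == id([3, 4, 5])   # false, barring a hash collision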

One could also use the bit representation of the FixedVector underlying the SmallVector:

julia> f(v::AbstractSmallVector) = bits(v.b);

julia> v = SmallVector{8,Int8}(1:3)
3-element SmallVector{8, Int8}:
 1
 2
 3

julia> f(v)
0x0000000000030201

This is a bit of a hack, but it’s extremely fast, and it works as long as all vector elements are non-zero (presumably because the unused slots of the fixed-size buffer are zero, so a vector with trailing zero elements would collide with a shorter one).

Aside: It’s not clear to me how this data type is different from a regular view (except it has fewer features).
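Indeed, a regular view seems to do the same job; a quick sketch (the key type is spelled out so the Dict stays concrete):

data = UInt32[1, 2, 3, 4, 5, 3, 1, 2, 1, 2]
V = typeof(@view data[1:2])   # concrete SubArray type for UnitRange slices
U = Dict{V, Int}()
for r in [1:2, 3:5, 6:8, 9:10]
    get!(U, @view(data[r]), length(U) + 1)
end
U[@view data[9:10]]   # same ID as @view data[1:2], since the contents match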

Your isequal is just making a lot of unnecessary copies: slicing with x.data[x.rb:x.re] allocates a new array every time. The copies happen in

for (v1, v2) in zip(x.data[x.rb:x.re], y.data[y.rb:y.re])

If you replace that by

for (v1, v2) in zip(@view(x.data[x.rb:x.re]), @view(y.data[y.rb:y.re]))

(or iterate manually over x.rb:x.re), I get

  0.938994 seconds (7 allocations: 272.000 MiB, 3.26% gc time)
  1.231973 seconds (7 allocations: 336.000 MiB, 11.27% gc time)
(length(U1), length(U2)) = (9999954, 9999954)

as the last test case’s output.
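For completeness, a sketch of the manual-iteration variant, which avoids temporary arrays entirely:

function Base.isequal(x::CustomView, y::CustomView)
    (x.re - x.rb) == (y.re - y.rb) || return false
    # compare elements in place; no temporary slices are created
    for (i, j) in zip(x.rb:x.re, y.rb:y.re)
        x.data[i] == y.data[j] || return false
    end
    return true
end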

Why not just store the end indices of the view, i.e. the (rb, re) pair, as dict keys?

When I was working on this piece of code initially (which was a few months ago), the built-in SubArray was the first thing I tried. I had some issues with it back then (ensuring type stability, IIRC), and since my use case was simple enough, I just added this small type. I should probably go back and see what went wrong there, though.

This is correct, and it seems that I have an older version of the same code that does precisely this.

I know this goes beyond my original question, but I’m still bothered by the speed difference between the two functions, and by the fact that all those little allocations don’t seem to have much of an effect. Do you have any idea why that might be?

How would I identify the small arrays by their contents doing this?

Why do you need to identify them by their contents?

Well, that’s just the problem I’m trying to solve. As in the example in my original post, the two [1,2] (sub-)arrays should both have the same ID.

One probably relevant difference between Dict{NTuple{3, UInt32}, UInt32} and Dict{CustomView, UInt32} is that in the former case the length of each key is statically known to be exactly 3, so no loops are needed in hash or isequal.
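To illustrate (a hypothetical variant, not code from this thread): if the length were a type parameter, hash and isequal could work on a tuple that the compiler unrolls, just as it does for the NTuple{3,UInt32} keys:

struct FixedView{N}
    data::Vector{UInt32}
    rb::UInt32   # start index; the length N lives in the type
end

# materialize the N elements as a tuple; for a small constant N the
# compiler unrolls this, so hash and isequal need no runtime loop
elems(x::FixedView{N}) where {N} = ntuple(i -> x.data[x.rb + i - 1], Val(N))

Base.hash(x::FixedView, h::UInt) = hash(elems(x), h)
Base.isequal(x::FixedView{N}, y::FixedView{N}) where {N} = elems(x) == elems(y)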

I’m also not sure. I suppose the compiler could reason that x.data[x.rb:x.re] is only used locally, so maybe it could be placed on the stack (though without size information that seems unlikely, as far as I understand), or perhaps the memory could be reused. But then I would also expect the number of allocations reported by @time to be lower.