Looking for a data structure with O(log n) insert/delete/mutate and O(1) max

I’m looking for a data structure that can insert, delete and mutate (given a key) in O(log n), and look at the max element in O(1).

This is either a SortedDict or one of the trees in DataStructures.jl. My issue is that those always (?) seem to allocate on insertion, which leads me to believe that they’re implemented as nodes with a key, a value, and pointers to other nodes.

Instead, my use case is that I know that “most of the time” this tree will have on the order of 10 elements, and I would rather allocate an array of size 20, leave it mostly empty, and allocate more only if/when needed. Googling around, I think this is called an “array implementation of a binary tree”. This post (http://mishadoff.com/blog/dfs-on-binary-tree-array/) looks to have all the information I need to implement it myself, but if someone has already done it, even better!
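To make the idea concrete, here is a rough sketch of the layout I have in mind (ArrayTree and its fields are made-up names, not an existing package):

# Preallocated Vectors holding the tree in “array form” (root at
# index 1, children of node i at 2i and 2i+1); grow only when the
# capacity is exceeded.
struct ArrayTree{K,V}
    keys::Vector{K}
    values::Vector{V}
    used::Vector{Bool}   # which slots are occupied
end

ArrayTree{K,V}(capacity::Integer = 20) where {K,V} =
    ArrayTree{K,V}(Vector{K}(undef, capacity),
                   Vector{V}(undef, capacity),
                   fill(false, capacity))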

Does any of this make sense? And if so, does anyone know of such an implementation in Julia?

This smells like the place for a linked list. I think you will get better advice if you can specify a little bit further the actual application.

A linked list will allocate every time you insert. I probably misunderstood your answer…?

That will depend on what you know in advance about your data. For example, if you know the total number of elements of all the trees in advance (or a reasonable upper limit for that), you could allocate that. But it is hard to tell without more information about the actual application. For example, if you are dealing with a single tree of at most 20 elements, the approach is probably different than if you are dealing with thousands of those.

The application would be that these data structures would hold order books (https://en.wikipedia.org/wiki/Order_book).

In my case I would have on the order of 10,000 of these, each with on the order of 10 elements.

I just ran an experiment comparing SortedDict with Vector with sort! (for insertion) and deleteat! (for deletion). The latter wins, but not by much. I think that’s because Vector probably deallocates when it becomes small. So a possible alternative to the “array implementation of a binary tree” would be if I could tell Vector to never shrink. Do you know if that’s possible?
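For concreteness, the Vector variant was along these lines (a sketch; vec_insert! and vec_delete! are just illustrative names):

# Keep (key, value) pairs in a Vector, re-sort after each insertion,
# and delete by scanning for the key’s index.
v = Tuple{Float64,Float64}[]

function vec_insert!(v, key, val)
    push!(v, (key, val))
    sort!(v; by = first)   # keep the vector ordered by key
end

function vec_delete!(v, key)
    i = findfirst(p -> first(p) == key, v)
    i === nothing || deleteat!(v, i)
end

vec_insert!(v, 3.0, 30.0)
vec_insert!(v, 1.0, 10.0)
vec_delete!(v, 3.0)        # v == [(1.0, 10.0)]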


Is it possible to you to provide a minimal working example? If you can, I am sure that you will get good advice. If not from me from someone else.

Very broadly speaking, my approach would be to allocate vectors of 10*10,000 elements and store all the data there, with associated vectors of indices for classification, and I would only reallocate as a function of the total number of elements. But I am guessing here; that depends a lot on what you want to do with the data. I would also avoid adding and deleting elements, instead replacing them by zeros, or nothing.

So I’ve learned about sizehint!.
As it turns out, DataStructures.SortedDict is just a wrapper around DataStructures.BalancedTree23, which uses Vectors as its data backend. So I thought, maybe I can just sizehint! those vectors. I defined:

function Base.sizehint!(balancedtree::DataStructures.BalancedTree23, n::Integer)
    # hint both backing Vectors of the 2-3 tree
    sizehint!(balancedtree.data, n)
    sizehint!(balancedtree.tree, n)
    return balancedtree
end
Base.sizehint!(sorteddict::SortedDict, n::Integer) = (sizehint!(sorteddict.bt, n); sorteddict)

Surprisingly, this doesn’t affect the number of allocations…? Here’s some sample code, with and without sizehint! on the SortedDicts:

k = Float64.(1:40);
v = Float64.(1:40);
function test_alloc()
    d = SortedDict()
    for i in 1:length(k)
        d[k[i]] = v[i]
    end
    d
end
function test_alloc2()
    d = SortedDict()
    sizehint!(d, 100)
    for i in 1:length(k)
        d[k[i]] = v[i]
    end
    d
end
@time d1 = test_alloc();
@time d2 = test_alloc2();

On my system this shows

0.000112 seconds (809 allocations: 25.297 KiB)
0.000100 seconds (805 allocations: 27.953 KiB)

However, the sizehint! seems to have worked correctly:

ccall(:jl_array_size, Int, (Any, UInt), d1.bt.data, 1) # 64
ccall(:jl_array_size, Int, (Any, UInt), d2.bt.data, 1) # 100

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
Environment:
  JULIA_EDITOR = atom  -a
  JULIA_NUM_THREADS = 4

Does anyone have any idea why we see allocations in this case?


I will shamelessly offer my package TrackingHeaps as an alternative. However, I am forced to point out it is very similar to DataStructures.MutableBinaryHeap. See this discussion for the differences. Both structures use Vectors inside them and support sizehint! (however, they just delegate the sizehint! to the internal Vectors, and I think Base makes no guarantees about how it works).

I think @lmiq’s suggestion about linked lists is a little off. Your description of what you want is basically the textbook definition of a heap/priority queue implemented with a vector/array, plus a tracking system to be able to make changes anywhere in the tree. However, the fact that you have thousands of very small heaps may open the possibility of alternative solutions. Such small heaps will fit entirely in registers (or in the L1 cache) if implemented with vectors, so this is a must. Depending on your specifics (if you have memory for allocating the worst-case size of each heap and are losing much time to allocation and loss of locality), then maybe allocating a huge vector (as @lmiq suggested) and treating each 20-element range as a different heap (using nothing for empty spaces or keeping their sizes in another vector) may be a good solution, but it is very low level and will be hard to work with.
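A very rough sketch of that low-level layout, with all names hypothetical:

# One big vector: book b occupies slots (b-1)*cap+1 : b*cap, and a
# separate vector tracks how many slots of each book are in use.
const cap = 20                       # worst-case size per heap
nbooks = 10_000
storage = fill(NaN, cap * nbooks)    # all heaps side by side
sizes = zeros(Int, nbooks)           # current size of each heap

# index of the j-th element of book b inside storage
slot(b, j) = (b - 1) * cap + j

# e.g., appending a key to book b (the heap-ordering logic is omitted):
function rawpush!(storage, sizes, b, key)
    sizes[b] += 1
    storage[slot(b, sizes[b])] = key
end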


I’m not sure that’s the case… The reason why I originally rejected heaps is that (correct me if I’m wrong) I can’t access them by a key in log time. I can access them by a “handle”, which from what I understand is an Int that the heap returns to you after you insert, not something you choose, right? That’s Int → value. I need Float → value.

Yeah, I also imagined that something along the lines of “no guarantees” was to blame. That’s why I was extra surprised when afterwards I called :jl_array_size and it returned the right thing…

First, the handles of TrackingHeaps give O(1) access anywhere in the tree (not O(log n)); I believe the ones from DataStructures probably have the same property.

Why would you need the handles to be Floats? This is a very strange requirement; Floats are rarely used for indexing (in any structure) as (i) either you will save the handle somewhere and its specific type will be irrelevant (at most its size may be relevant), or (ii) your handle is something you will not save but want to compute again; however, Floats are notorious for giving slightly different values after small changes, so they are not great indexes.

Are you confusing the priority key and the access handle? TrackingHeaps takes into account the values stored to determine the order of the elements; if you want to store a non-Float value ordered by a Float weight, then you can just insert Tuples of (Float, non-Float), so the Float is used for the ordering (and if tied, the non-Float is used to break the tie).
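For illustration, a sketch with DataStructures’ MutableBinaryHeap, which has a similar handle mechanism (not TrackingHeaps’ exact API, which differs slightly):

using DataStructures

# Order by a Float64 priority and carry an Int payload along; Tuples
# compare lexicographically, so the Float64 decides the order and
# the Int only breaks ties.
h = MutableBinaryMinHeap{Tuple{Float64,Int}}()

handle = push!(h, (0.5, 42))    # push! returns an Int handle
push!(h, (0.1, 7))

first(h)                        # (0.1, 7): the minimum priority
update!(h, handle, (0.05, 42))  # change a priority via its handle
first(h)                        # now (0.05, 42)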

You may be talking about the need to associate your Float value with the Int key (this will always be necessary, but sometimes a structure like SortedDict abstracts it away from you; in the end, the handle needs to become a position in memory and, therefore, an Integer). This depends on how your code is structured, but if you kept a vector of Floats lying around to be used as keys, then you should have a vector of Ints with the same size and store the corresponding tracker there.

I don’t need the handles to be Floats, I need the (priority) keys to be Floats, but then I can’t access the keys in log time. Is this correct?

Ah… you’re saying: keep a sorted vector of the Float keys alongside a vector of handles. Then, given a Float, binary search that vector (log time) and use the corresponding handle to access the heap in O(1) time. Is that right? I think that might work.

However, this all depends on sizehint! doing the thing we expect, which from my experiment above seems to be failing; in that case we will still have allocations on vector resizing.

More or less. If you never remove elements from a heap, then the handles for X insertions will be the numbers between 1 and X, and you can just have the values outside the heap in a Vector and use the index of this external Vector when you already know it (i.e., if traversing the Vector linearly), or search for the position/index of a specific key (using binary search, as you pointed out) when you do not know it. TrackingHeaps allows you to reuse handles, so you can change the values but keep the handle valid. However, if you add and delete values, then you need to have an extra Int Vector of handles which you keep synchronized with your Float64 Vector of keys; then you do the same as before, but instead of using the position directly as a handle, you use the position to index the handles and the result to index the heap.
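In code, the synchronized-vectors idea is something like this (a sketch; remember! and handle_for are made-up names):

# sorted_keys is kept sorted and heap_handles[i] holds the heap
# handle for sorted_keys[i]; the two vectors always move together.
sorted_keys = Float64[]
heap_handles = Int[]

function remember!(sorted_keys, heap_handles, key, handle)
    i = searchsortedfirst(sorted_keys, key)
    insert!(sorted_keys, i, key)
    insert!(heap_handles, i, handle)
end

function handle_for(sorted_keys, heap_handles, key)
    i = searchsortedfirst(sorted_keys, key)   # binary search, O(log n)
    i <= length(sorted_keys) && sorted_keys[i] == key || return nothing
    return heap_handles[i]                    # then O(1) heap access
end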

However, considering the very small size of your heaps, and how computer architectures work (i.e., cache and memory locality), I sincerely think the best solution may be just having a single Float Vector with your keys in heap order and then searching it linearly when necessary (i.e., when you do not need the top value). 10–20 elements will be searched linearly faster than doing a binary search in a reverse index (which will double the memory used and add an extra layer of indirection). The big-O complexities are only relevant for “large enough” input. For very small sequences, for example, O(n^2) sorting methods like insertion sort will beat O(n log n) methods like merge sort.


If I may digress a little, I am surprised not to see any C functions in your package. I have no idea how to control memory usage at that level using only Julia. Where is the guarantee that the vectors inside your heap structs will effectively be allocated on the heap? From my shallow understanding, these immutable structures will in general live on the heap, but in the end the compiler decides that.

Sorry, are you speaking with me or with freeman? About mine or his package?

@Henrique_Becker yours, sorry

Hmmm, it seems to me that there is some misunderstanding here, maybe because “heap” is both a data structure and a memory concept. I never intended to say that TrackingHeap objects have any guarantee of being allocated on the heap; they just wrap three Vectors. Vectors are often allocated on the heap, and objects with references to Vectors (or any heap-allocated objects) often behave the same way (i.e., also live in heap memory). I believe a single TrackingHeap used in a limited scope may also be elided by the compiler (the code will instead just work over the three vectors directly), but a collection of TrackingHeaps, especially if it is passed around or stored inside another structure, will probably be allocated on the heap (the memory region). Maybe I have given the wrong idea somehow? I am sorry for that. Can you point me to what I said which seems to suggest memory guarantees in my package?

Sorry, I inferred that from the name and the performance gains you discuss in the other thread. In that case, isn’t your package an efficient implementation of one type of linked list? (I am not trying to diminish its relevance; I am trying to learn. Keep in mind that my background in computer science is not great, and I may even be misusing the nomenclature.)

@Henrique_Becker don’t mind answering. Actually, seen from the perspective that it is not about memory management, I can figure out the idea behind your package. Thank you for sharing.

Implementing a Vector with linear scan plus sizehint! works just as expected.
On my own dataset of 100k inserts, updates, and deletes, the straightforward SortedDict does

0.025114 seconds (165.46 k allocations: 2.529 MiB)

whereas the linear scan on sizehint!ed Vector does

0.008066 seconds (9 allocations: 3.547 KiB)

Note that despite the dataset being large-ish, with 100k inserts, updates, and deletes, at no point is the size of the data structure bigger than our guesstimate; that’s why this idea works.
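The structure itself is nothing fancy; roughly along these lines (a sketch with made-up names: upsert!, remove!, maxkey; the real code has a bit more bookkeeping):

# Parallel key/value Vectors with reserved capacity; keys are kept
# sorted, so the max is the last element in O(1), and insert/delete
# just shift a handful of elements.
struct Book
    ks::Vector{Float64}
    vs::Vector{Float64}
end

function Book(capacity::Integer = 20)
    b = Book(Float64[], Float64[])
    sizehint!(b.ks, capacity)   # reserve capacity up front ...
    sizehint!(b.vs, capacity)   # ... so growth rarely reallocates
    return b
end

function upsert!(b::Book, k::Float64, v::Float64)   # insert or mutate
    i = searchsortedfirst(b.ks, k)
    if i <= length(b.ks) && b.ks[i] == k
        b.vs[i] = v             # mutate existing key
    else
        insert!(b.ks, i, k)     # shift-insert keeps keys sorted
        insert!(b.vs, i, v)
    end
    return b
end

function remove!(b::Book, k::Float64)
    i = searchsortedfirst(b.ks, k)
    if i <= length(b.ks) && b.ks[i] == k
        deleteat!(b.ks, i)
        deleteat!(b.vs, i)
    end
    return b
end

maxkey(b::Book) = b.ks[end]     # O(1) access to the maximum key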

So, having some idea of what your data looks like allows you to improve performance - nothing new here 🙂

But in fact, even without sizehint!, a linear scan on default Vectors does really well. This is largely (I’m guessing) because Julia Vectors don’t shrink aggressively. For example, if you remove the sizehint!, the end state of the above data structure has 5 elements, but calling :jl_array_size you see that the size is 16, not 8 as you’d expect if the shrinking were as aggressive as the growing (i.e., always a factor of 2x).

Anyway, thanks for the helpful discussions.


It is Christmas and I have no obligations, so it is not a problem for me to answer.

Basically, a heap is a kind of tree that is very good at keeping track of the smallest (or largest) element stored in it, but not very good at keeping track of the rest of the elements. Trees can be represented in many ways; the one that freeman wanted to avoid is:

basically something like

struct TreeNode{T}
    data::T
    # each node explicitly references its children (nothing = no child)
    right_child::Union{TreeNode{T}, Nothing}
    left_child::Union{TreeNode{T}, Nothing}
end

And maybe this is what you understand as a “linked list”, but I would not use this term, as it is often used to refer to another (simpler) data structure.

However, “linking” may not be necessary: many kinds of trees, including heap trees, are “balanced”. What exactly this means may vary with context, but for heaps it means that if you remove a node anywhere in the tree it self-adjusts to remove a leaf node instead (i.e., a node at the largest depth possible). This regularity allows for a representation with much less overhead. Basically, you can have the whole tree in a Vector and the parent-child relations are hard-coded: the first position is always the root, then the second and third positions are its children, positions 4 to 7 are the children of the children, and so on (in fact, if X is the parent position, the children in a binary heap will be at positions 2*X and 2*X + 1).
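In code, the index arithmetic is all there is to it (a minimal sketch):

# Implicit binary-heap layout in a 1-indexed Vector: the root is at
# position 1 and parent/child relations are pure index arithmetic,
# so no pointers are stored at all.
heap_parent(i) = i ÷ 2
heap_left(i)   = 2i
heap_right(i)  = 2i + 1

h = [1, 3, 2, 7, 5]                # a valid min-heap: parent ≤ children
h[heap_left(1)], h[heap_right(1)]  # (3, 2): the root’s children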

My code basically implements this last representation, but with a twist: it also keeps two reverse indexes synchronized, which allow O(1) access to any element in the heap (not only the root/maximal one). Obviously, this is not my invention, but I think I did a pretty good implementation of the concept.
