Sorted Array implementation?

Is there any implementation of sorted array?
I need to store a sorted list of integer. I will frequently check what is the k-th value in the list and in what position a particular value is in the list.
I checked SortedSet in DataStructures.jl, but it seems that it does not have easy way to do these tasks.

There is sort!, which can of course be used to sort any array vector. Precisely what are the shortcomings you need to be addressed? Do you need to mutate the array over time, while expecting it remains sorted? Do you have big-O efficiency requirements for certain operations (insertion, deletion, lookup, etc.)?

DataStructures.jl has SortedSet if you don’t mind not having duplicates:

julia> using DataStructures

julia> s = SortedSet{Int}()
SortedSet(Int64[],
Base.Order.ForwardOrdering())

julia> push!(s,10)
SortedSet([10],
Base.Order.ForwardOrdering())

julia> push!(s,2)
SortedSet([2, 10],
Base.Order.ForwardOrdering())

julia> push!(s,4)
SortedSet([2, 4, 10],
Base.Order.ForwardOrdering())

julia> push!(s,9)
SortedSet([2, 4, 9, 10],
Base.Order.ForwardOrdering())

julia> collect(s)
4-element Array{Int64,1}:
  2
  4
  9
 10

Yes, I need to mutate the array overtime. That’s why sort! is not so useful.
I need to frequently run a[k] to find the value of k-th smallest integer and find(a, v) to find the rank of a particular value in the array. Therefore I want to have an efficiency in insertion and lookup.

I tried SortedSet, but it doesn’t seem to have an easy way to run a[k] (to find the value of k-th smallest integer) and find(a, v) (to find the rank of a particular value in the array)
Running find(a, v) results in the following:

julia> find(s, 4)
DataStructures.Tokens.IntSemiToken(4)

I expected to get the rank of it in the array (2).

Running collect each time before find doesn’t seem to be a good idea.

Yup. It’s a set, so it’s not indexable.

This sound like you need fast random access and lookup, but not insertion.

If so and you have relatively little insertions, ordinary arrays should work fine: just insert a new value into appropriate position and use a[k] and find(a, v) as is.

If insertion is relatively frequent, you may use an array of arrays (i.e. Vector{Vector{T}}) to mitigate insertion cost by inserting into smaller array. I don’t think there’s something ready-to-use, but it should be pretty easy to implement.

Alternatively, DataStructures.jl has an internal implementation of balanced trees. Given that the Sorted* datastructures use it, it should be pretty well tested, and could serve as the basis for an implementation for SortedArray.

Which datastructure is best depends a lot on the relative frequency of insertions and lookups. Also, depending on problem sizes, constant factors may dominate asymptotic performance considerations. I would just start @dfdx’s suggestion and go from there.

Moreover, if you can put a bound on k ex ante, you can just use an array big enough for that.

Thanks for the suggestions.
The frequency of random access and lookup are roughly 2x-3x the frequency of insertion. The size of problem is roughly about a thousand item, so roughly 1000x insertion, 2000-3000x random access and lookup.

What is the complexity of insert!(s, i, val) function in Julia’s standard array? I could not find the information in the docs.

insert!(s, i, val) on a Vector is an O(n) operation because all elements are stored contiguously in memory, and all elements with index i or higher must be displaced to make room for the inserted element.

Also, instead of find (which performs linear search, taking O(n) time), you may wish to use Base.Sort.searchsorted (which performs binary search, taking O(log(n)) time).

I have an open pull request at DataStructures.jl for a SortedVector type. You might want to try it out.

Slight correction: Julia arrays grow at both the beginning and the end, so (after the first growth at the beginning), insert!() will choose the smaller amount of items to move.

Cheers!
Kevin

3 Likes

If you want to invest time and effort into this project, you can augment the underlying balanced tree data structure that is used by SortedSet so that each internal tree node contains one additional field, namely, the number of leaves that descend from that node. The insertion and deletion operations on the balanced tree follow a path from a leaf to the root, and the new field can be modified appropriately on the internal tree nodes encountered during the traversal of this path. With this additional field, the kth largest element can be found in O(log n) operations, and the rank of an arbitrary element can also be found in O(log n) operations, where n is the total number of entries in the set. You would need to have a detailed understanding of the operations on a 2-3 tree and how the code implements them.

7 Likes

Wow, what a pleasant surprise; I did not realize that a Vector is really an array deque! In the past, I’ve always stored a vector in reverse if I planned to grow it (only) from the beginning, but now I know this to be unnecessary. I think this is something to emphasize in the documentation, somewhere.

2 Likes

I actually did something like this as a sort-of “learn Julia” project. I haven’t looked at it in a few years, but the github repo is here. From memory, it has an immutable for SortedVector and SortedUniqueVector, and some functions from Base, like the deques, intersect, union, e.t.c. have methods for these types that exploit the sort order in the algorithm used.

Fair warning though: I was learning Julia at the time, so the code contains a lot of unnecessary functions. For example, I extended Base.getindex for lots of different input types, all of which could probably be reduced to one line.

Maybe a sorted array isn’t best (trees with array-like properties may help):

Streaming Cache-Oblivious B-Trees
http://publications.csail.mit.edu/abstracts/abstracts07/jfineman/jfineman.html

The shuttle tree, our main result, retains the same asymptotic search cost of the cache-oblivious B-tree while improving the insert cost.
[…]
We give another data structure that we call a lookahead array. The lookahead array is reminiscent of static-to-dynamic transformations [7] and fractional cascading [11]. […] then the lookahead array is cache-oblivious and matches the performance of the BRT. We call this version the cache-oblivious lookahead array (COLA).
[…]
We next show how efficiently the COLA performs. We implemented a COLA and compared it with a B-tree. For databases in external memory, the COLA was 90 times faster than the B-tree for random inserts, 2.5 times slower for sorted inserts, and 1.7 times slower for searches.

See also (where I found the above): https://www.quora.com/What-are-some-data-structures-that-can-maintain-a-sorted-list-allow-insertion-and-deletion-in-sub-linear-time-and-allow-iteration-through-the-list-as-quickly-as-possible

I’m not sure if the above is patented, but this one is, from memory… and seems similar):

I’m not sure if a Kinetic sorted list - Wikipedia helps you (directly) and this paper:

http://www.sciencedirect.com/science/article/pii/S092577210600068X?via%3Dihub

1 Like

Wow, what a pleasant surprise; I did not realize that a Vector is really an array deque!

Not quite! Adding elements to the left is still quite slow, as by default there is no space left at the beginning. After you removed some elements from the start it becomes fast, though.

As far as I know there is no “direct” (non pointer-manipulating) way of allocating space at the start of the Vector, i.e. of setting the offset in the corresponding C struct.

A way that kinda works is to allocate your array, and then deteteat! a part of the beginning, and then deleteat! the rest. Now you have an adequately sized array, which has quite a bit of room to grow at the start. However, the room at the start is hard limited so ~half of the array capacity (need to look up the array.c); if you exceed this limit, julia will memmove your data to the left and steal your offset.

This looks ugly but should actually have no overhead if you put bitstype content into your array. If you want to fill your array with object references, then you might pay a spurious memset, because julia needs to initialize all the object references to C_NULL (“undef” in julia parlance).

I am not entirely sure whether the offset will be preserved if the array needs to be moved because you grow beyond the capacity; one would need to look this up in array.c.

1 Like

Adding elements to the left will also reserve space.

2 Likes

Thank you for all comments and suggestions. It’s really helpful.
I just found that I can modify my algorithm so that it does not require SortedArray.
It has been a great discussion though.

I find myself in the same situation; I need to operate in-place on very sparse vectors, and cannot use hashmaps because I need to find pivots.

Julia sparse vectors are useless for this task as inserting an element into an N-element vector costs O(N) instead of O(log N).

A pretty fast library and algorithm is judy (Judy array - Wikipedia), available on almost all linux distros. Judy is radix-based, so you cannot use your own comparison function.

Unfortunately there is no current public julia wrapper for the absolutely atrocious C api of judylib. There is a currently defunct JudyDict package. Starting from there, I wrote my own paper-thin low-quality segfault-on-mistake wrapper (you want to start from JudyDict because “man judy” is kinda ambiguous about the API and the judy source code is unreadable).

Most Judy implementations don’t directly allow arbitrary length keys (they take UInt64 or C strings, but the latter get Null-terminated). This is rather unfortunate, since I would be very very happy if I could plug in longer keys; however, I am loathe to touch any judy code with a 10 foot pole.

When considering speed comparisons between Judy and hash-tables, you will see that judy is almost competitive, but slightly slower. On the other hand, for our applications (range queries! Nth element by sort order! Privot!) hash tables just don’t cut the cake, and judy will beat every red-black-tree, as long as your array/tree fits into main memory.

In your case, you want a sorted dict with Key::Int64, Value::Bool. There is Judy1 available for this usecase. You will see that judy automatically compresses common prefixes from your keys; e.g. your Int64s which might mostly share 4 zero bytes at the start are almost as fast and memory efficient as proper Int32s (which is nice because my distro’s judylib does not support Int32 keys). Common postfixes are not compressed, so don’t left-shift your ints.

Unfortunately, judyset does not support deletion of ranges, even though it “should” be simple to implement. Alas, nothing about judy is “simple”; hence I will refrain from trying to implement range-delete until I absolutely need it.

1 Like