Is there a readymade Array Type that stores data as Run-length-encoding of delta?

I wonder in Julia, is there a ready-made array type that stores data internally as run-length-encoded deltas?


This is literally a question from a colleague working in Python. I am not sharing any secrets as it’s just a general question about Python. And you can see that you can store data much more efficiently if you just delta it and then RLE it.

I know there is RLE vectors. But I think this is quite a common problem in TimeSeries so I am wondering if there’s a ready-made vector type so I don’t have to make it myself :slight_smile:

General question:
Assume I have some very large vector, but the actual count of unique values is very low.
e.g.: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3] Obviously, one can greatly reduce the memory required to represent this vector given the unique values and the fact that they come in consecutive blocks.So my question is: Is there some mechanism in numpy or scipy or other thing that supports these array-like structures but with efficient storage in memory?"

Something like sparse([first(x), diff(x)...])?

1 Like

x = sparse([first(x), diff(x)...])

But now

x[2] will give 0 instead of 1

The point is that the the sparse one is just the internal rep that user doesn’t see.
It’s internally stored very efficiently but what the user sees is just [1,1,....,2,.....3,...] so no different to a normal array.

It would be interesting to see applications which uses such a representation, because while memory efficient, getindex and setindex will be very slow, compared to usual array manipulations. Only in a very special cases this representation can provide some benefits, e.g. when you are not mutating vector and only iterate over it.

Yeah, you would need

struct DiffVector <: AbstractVector
DiffVector(x) = new(sparse([first(x), diff(x)...]))

getindex(x::AbstractVector, i) = sum(x.x[1:i])

or something. Sorry, the real answer is I don’t know of any ready made implementations, this is how I would start implementing it.

Perhaps RLEVectors.jl might be relevant, although I wonder if this is something that you have already considered?

julia> v = RLEVector([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3])
RLEVector{Int64, Int64}
 Run values: [1, 2, 3]
 Run ends: [30, 175, 208]

julia> v[30]

julia> v[31]

I’m not familiar with the package though


I had mentioned that it could help. I guess it doesn’t exists yet such a package.

Built a MVP

using RLEVectors

mutable struct DiffRLEVector{T} <: AbstractVector{T}
    DiffRLEVector(v::AbstractVector{T}) where T = new{T}(v[1], RLEVector(diff(v)))

import Base

function Base.getindex(v::DiffRLEVector, i::Integer)
    i == 1 ? v._orig_val : v._orig_val + sum(v._arr[1:i-1])

function Base.size(v::DiffRLEVector)
    (1 + size(v._arr)[1], )

v = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3]

dv = DiffRLEVector(v)

Base.summarysize(dv) # much smaller

v == dv #true

BTW this is surprisingly hard to do in python. Probably because no one has one a rle vector type and extending numpy’s array seems very convoluted process.

1 Like

I guess OpenVDB might be helpful here, it’s a hierarchical fixed-size array mainly designed for representation of signed distance field, something like quad-tree. It supports for constant-time getindex and setindex!. I currently try to port it to Julia, but still in progress.

1 Like

Actually, a quick (and bad performance) constant-time implementation will be:

const FixedSize = 16
# A compressed string with length FixedSize, if it consists of only one unique character, then we represent it by a Char to save memory 
struct CompressedString

struct RLEString

function getindex(s::RLEString,i) 
    cs = s.s[i>>4]
    if isa(cs.val,Char)
         return cs.val
         return cs.val[i & (FixedSize-1)]