Is there a readymade Array Type that stores data as Run-length-encoding of delta?

xiaodai · June 30, 2021, 3:12am

I wonder in Julia, is there a ready-made array type that stores data internally as run-length-encoded deltas?

E.g.

This is literally a question from a colleague working in Python. I am not sharing any secrets as it’s just a general question about Python. And you can see that you can store data much more efficiently if you just delta it and then RLE it.

I know there is RLE vectors. But I think this is quite a common problem in TimeSeries so I am wondering if there’s a ready-made vector type so I don’t have to make it myself

General question:
Assume I have some very large vector, but the actual count of unique values is very low.
e.g.: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3] Obviously, one can greatly reduce the memory required to represent this vector given the unique values and the fact that they come in consecutive blocks.So my question is: Is there some mechanism in numpy or scipy or other thing that supports these array-like structures but with efficient storage in memory?"

gustaphe · June 30, 2021, 4:14am

Something like sparse([first(x), diff(x)...])?

xiaodai · June 30, 2021, 5:41am

x = sparse([first(x), diff(x)...])

But now

x[2] will give 0 instead of 1

The point is that the the sparse one is just the internal rep that user doesn’t see.
It’s internally stored very efficiently but what the user sees is just [1,1,....,2,.....3,...] so no different to a normal array.

Skoffer · June 30, 2021, 6:03am

It would be interesting to see applications which uses such a representation, because while memory efficient, getindex and setindex will be very slow, compared to usual array manipulations. Only in a very special cases this representation can provide some benefits, e.g. when you are not mutating vector and only iterate over it.

gustaphe · June 30, 2021, 6:03am

Yeah, you would need

struct DiffVector <: AbstractVector
x::SparseVector
DiffVector(x) = new(sparse([first(x), diff(x)...]))
end

getindex(x::AbstractVector, i) = sum(x.x[1:i])

or something. Sorry, the real answer is I don’t know of any ready made implementations, this is how I would start implementing it.

jishnub · June 30, 2021, 8:18am

Perhaps RLEVectors.jl might be relevant, although I wonder if this is something that you have already considered?

julia> v = RLEVector([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3])
RLEVector{Int64, Int64}
 Run values: [1, 2, 3]
 Run ends: [30, 175, 208]

julia> v[30]
1

julia> v[31]
2

I’m not familiar with the package though

xiaodai · June 30, 2021, 11:03pm

I had mentioned that it could help. I guess it doesn’t exists yet such a package.

xiaodai · June 30, 2021, 11:54pm

Built a MVP

using RLEVectors

mutable struct DiffRLEVector{T} <: AbstractVector{T}
    _orig_val::T
    _arr::RLEVector{T}
    DiffRLEVector(v::AbstractVector{T}) where T = new{T}(v[1], RLEVector(diff(v)))
end

import Base

function Base.getindex(v::DiffRLEVector, i::Integer)
    i == 1 ? v._orig_val : v._orig_val + sum(v._arr[1:i-1])
end

function Base.size(v::DiffRLEVector)
    (1 + size(v._arr)[1], )
end

v = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3]

dv = DiffRLEVector(v)


Base.summarysize(v)
Base.summarysize(dv) # much smaller

v == dv #true

BTW this is surprisingly hard to do in python. Probably because no one has one a rle vector type and extending numpy’s array seems very convoluted process.

anon56330260 · July 1, 2021, 7:19am

I guess OpenVDB might be helpful here, it’s a hierarchical fixed-size array mainly designed for representation of signed distance field, something like quad-tree. It supports for constant-time getindex and setindex!. I currently try to port it to Julia, but still in progress.

anon56330260 · July 1, 2021, 7:34am

Actually, a quick (and bad performance) constant-time implementation will be:

const FixedSize = 16
# A compressed string with length FixedSize, if it consists of only one unique character, then we represent it by a Char to save memory 
struct CompressedString
    val::Union{String,Char}
end

struct RLEString
    s::Dict{Index,CompressedString}
end

function getindex(s::RLEString,i) 
    cs = s.s[i>>4]
    if isa(cs.val,Char)
         return cs.val
    else
         return cs.val[i & (FixedSize-1)]
    end
end

Topic		Replies	Views
N dimensional array with different type General Usage	4	201	July 27, 2023
How to implement fast range reads for my custome-defined array? Performance array	2	726	January 20, 2018
Documentation suggestion - noteworthy differences to other languages page General Usage documentation	0	322	February 21, 2021
TimeSeries Arrays in Julialang General Usage statistics	2	809	May 30, 2020
Deinterleave data Performance arrays	12	954	November 25, 2021

Is there a readymade Array Type that stores data as Run-length-encoding of delta?

Related topics