Mmap Symmetric Matricies

I have a matrix that is too large to fit into memory, so I’ve decided to use Mmap to store it on disk which works perfectly. However, the matrix I’m storing is symmetric and it seems like a waste to use twice the amount of storage for the duplicated values.

Is there a way to use Mmap with an UpperTriangular or sparse matrix? I’ve tried:

using Mmap, LinearAlgebra

#Make up some data
n = 1000
X = rand(n,n)

io = open("myMatrix.bin", "w+");
Mmap.mmap(io, UpperTriangular{Float64, Matrix{Float64}}, (n,n))

but it leads to a method error.

Usually I wouldn’t mind the duplicated storage but the matricies I’m working with can be extremely large (e.g. 350,000 × 350,000 which takes up ~900 Gb of space).

Is there a way to reduce disk storage while still maintaining Mmap’s functionality?

Can’t say I’ve done it … but look into storing your triangular matrix with HDF5, then use mmap.

1 Like

Oh would HDF5 take up less space, compared to a bin file?

Another thought I had was saving the triangular matrix to disk as a vector and reshaping it somehow after reading in the file.

I can also store everything as Float32 to use half the storage :rofl:

1 Like

Just as a follow-up, I came up with this solution:

  • Store the upper triangular matrix as a Float32 vector onto disk
  • Create a subtype of AbstractMatrix to properly index the vector
#How many data points are being stored?
n = 5050

#Open an Mmap connection
io = open("X.bin", "w+");
X = Mmap.mmap(io, Vector{Float32}, n)

#Fill with some random data
X .= rand(Float32,n)

#Sync and close the connection
Mmap.sync!(X)
close(io)

Next I read the data back in (just to be thorough)

io = open("X.bin", "r+");
Xread = Mmap.mmap(io, Vector{Float32}, n);

Finally I create an array to properly index with (i.e. the hard part):

#A symmetric upper triangular matrix
mutable struct SymUpperTri{T} <: AbstractMatrix{T}
    data::AbstractVector{T}
    dim::Int

    function SymUpperTri(data::AbstractVector{T}) where T
        #Check if data is the right size to fit into upper triangular matrix
        n = length(data)
        sq = sqrt(8*n+1) #if this is a perfect square, then n is a triangular number
        floor(sq) == ceil(sq) || error("Data is not the correct length to fit into an upper triangular matrix")

        dim = (sq - 1) ÷ 2 # solving n = dim*(dim+1)/2 with quadratic formula
        return new{T}(data, dim)
    end
end

#Julia needs to know how to size and index
Base.size(A::SymUpperTri) = (A.dim, A.dim)

function Base.getindex(A::SymUpperTri{T}, i::Int, j::Int) where T
    return i > j ? A.data[i*(i-1)÷2 + j] : A.data[j*(j-1)÷2 + i]
end

# Found this fancy piece of code from the LinearAlgebra standard library
function Base.replace_in_print_matrix(A::SymUpperTri, i::Integer, j::Integer, s::AbstractString)
    return i ≤ j ? s : Base.replace_with_centered_mark(s)
end

#Check if it works (it does!)
A = SymUpperTri(Xread)   # 100×100 SymUpperTri{Float32}:

The combination of using Float32 and storing as a vector means the data (350,000 × 350,000) takes up about 225 Gb instead of 900 Gb. Hopefully this helps someone in the future ~

1 Like