Mmap Symmetric Matricies

I have a matrix that is too large to fit into memory, so I’ve decided to use `Mmap` to store it on disk which works perfectly. However, the matrix I’m storing is symmetric and it seems like a waste to use twice the amount of storage for the duplicated values.

Is there a way to use Mmap with an UpperTriangular or sparse matrix? I’ve tried:

``````using Mmap, LinearAlgebra

#Make up some data
n = 1000
X = rand(n,n)

io = open("myMatrix.bin", "w+");
Mmap.mmap(io, UpperTriangular{Float64, Matrix{Float64}}, (n,n))
``````

but it leads to a method error.

Usually I wouldn’t mind the duplicated storage but the matricies I’m working with can be extremely large (e.g. 350,000 × 350,000 which takes up ~900 Gb of space).

Is there a way to reduce disk storage while still maintaining Mmap’s functionality?

Can’t say I’ve done it … but look into storing your triangular matrix with HDF5, then use mmap.

1 Like

Oh would HDF5 take up less space, compared to a bin file?

Another thought I had was saving the triangular matrix to disk as a vector and reshaping it somehow after reading in the file.

I can also store everything as Float32 to use half the storage

1 Like

Just as a follow-up, I came up with this solution:

• Store the upper triangular matrix as a Float32 vector onto disk
• Create a subtype of `AbstractMatrix` to properly index the vector
``````#How many data points are being stored?
n = 5050

#Open an Mmap connection
io = open("X.bin", "w+");
X = Mmap.mmap(io, Vector{Float32}, n)

#Fill with some random data
X .= rand(Float32,n)

#Sync and close the connection
Mmap.sync!(X)
close(io)
``````

Next I read the data back in (just to be thorough)

``````io = open("X.bin", "r+");
``````

Finally I create an array to properly index with (i.e. the hard part):

``````#A symmetric upper triangular matrix
mutable struct SymUpperTri{T} <: AbstractMatrix{T}
data::AbstractVector{T}
dim::Int

function SymUpperTri(data::AbstractVector{T}) where T
#Check if data is the right size to fit into upper triangular matrix
n = length(data)
sq = sqrt(8*n+1) #if this is a perfect square, then n is a triangular number
floor(sq) == ceil(sq) || error("Data is not the correct length to fit into an upper triangular matrix")

dim = (sq - 1) ÷ 2 # solving n = dim*(dim+1)/2 with quadratic formula
return new{T}(data, dim)
end
end

#Julia needs to know how to size and index
Base.size(A::SymUpperTri) = (A.dim, A.dim)

function Base.getindex(A::SymUpperTri{T}, i::Int, j::Int) where T
return i > j ? A.data[i*(i-1)÷2 + j] : A.data[j*(j-1)÷2 + i]
end

# Found this fancy piece of code from the LinearAlgebra standard library
function Base.replace_in_print_matrix(A::SymUpperTri, i::Integer, j::Integer, s::AbstractString)
return i ≤ j ? s : Base.replace_with_centered_mark(s)
end

#Check if it works (it does!)
A = SymUpperTri(Xread)   # 100×100 SymUpperTri{Float32}:
``````

The combination of using Float32 and storing as a vector means the data (350,000 × 350,000) takes up about 225 Gb instead of 900 Gb. Hopefully this helps someone in the future ~