How to read a file and initialize tensors with its values without having them in memory?

I have a simple (but possibly enormous) text file with five columns: the first holds a floating-point value and the other four hold integer values. It looks like

F p q r s

where basically I want to write a tensor to disk which has M[p,q,r,s] = F. After this, all the entries of the tensor that could be filled from the text file have been filled; the others are filled in using permutational symmetries. Here is how I am doing it now:

    using DelimitedFiles
    using Serialization

    begin
        # Open the file
        file = open(pathtofcidump, "r")

        # Loop over each line in the file
        for line in eachline(file)
            # Check if the line starts with " &FCI NORB="
            if startswith(line, " &FCI NORB=")
                fields = split(line, ",")
                words1 = split(fields[1], " ")
                words2 = split(fields[2], " ")
                norb = parse(Int, words1[end])
                nelec = parse(Int, words2[end])
                break
            end
        end

        # Close the file
        close(file)
    end
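    # `linenum` below is assumed to hold the number of header lines to skip;
    # it is not set in this snippet.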
    data = readdlm(pathtofcidump, skipstart=linenum)
    hnuc::Float64 = data[end, 1]


    data = copy(data[1:end-1, :])
    h::Array{Float64,2} = fill(0.0, norb, norb)
    g::Array{Float64,4} = fill(0.0, norb, norb, norb, norb)
    l::Int64 = length(data[:, 1])
    non_redundant_indices = []
    for i in 1:l
        if (data[i, 4] == 0 && data[i, 5] == 0)
            I = round(Int, data[i, 2])
            J = round(Int, data[i, 3])
            h[I, J] = data[i, 1]
        else
            I = round(Int, data[i, 2])
            J = round(Int, data[i, 3])
            K = round(Int, data[i, 4])
            L = round(Int, data[i, 5])
            push!(non_redundant_indices, [I, J, K, L])
            g[I, J, K, L] = data[i, 1]
        end
    end
    # Record the non-redundant indices once, then fill the symmetry-related entries
    open("non_redundant_indices.txt", "a") do f
        for (I, J, K, L) in non_redundant_indices
            println(f, I, " ", J, " ", K, " ", L, "   ", g[I, J, K, L])
            g[K, L, I, J] = g[I, J, K, L]
            g[J, I, L, K] = g[I, J, K, L]
            g[L, K, J, I] = g[I, J, K, L]
            g[J, I, K, L] = g[I, J, K, L]
            g[L, K, I, J] = g[I, J, K, L]
            g[I, J, L, K] = g[I, J, K, L]
            g[K, L, J, I] = g[I, J, K, L]
        end
    end
    serialize("g.jlbin", g)

But this has the obvious disadvantage of having the whole g::Array{Float64,4} in memory at once. We want to avoid this at all costs, because norb can take values upwards of 100. I want to reach the end of this process (which entails having the entire tensor on disk) without ever reading the full thing into memory. Any ideas on how to do that? A sample file that I am trying to read can be found here.

I am a bit confused about what you are asking. So, is the tensor too large to fit into memory even for just the “unique” elements, i.e. when exploiting the symmetry and storing only the necessary information (like in the text file)? Is the goal to work with this tensor (e.g. perform multiplications or contractions or something)?

Or are you just asking how to transform the flat list of index/value pairs into a matrix stored as a serialized binary? Why would you want to do this transformation in the first place if the resulting file is too big to work with comfortably? Isn’t the reduced representation just superior?

That is correct. However, for doing the contractions we will not use the full g tensor being defined in the code in the original question. The goal is to work with smaller blocks of the tensor (which we would be comfortable keeping in memory), defined as follows:

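    # Context (inferred from the indexing): no = number of occupied orbitals,
    # nv = number of virtual orbitals; f and the f_xx / g_xxxx blocks are
    # assumed to be preallocated arrays.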
    for a in 1:nv, b in 1:nv
        f_vv[a, b] = f[no+a, no+b]

        for i in 1:no, j in 1:no
            g_vvoo[a, b, i, j] = g[no+a, i, no+b, j]
            g_voov[a, i, j, b] = g[no+a, j, i, no+b]
            g_vovo[a, i, b, j] = g[no+a, no+b, i, j]
            g_oovv[i, j, a, b] = g[i, no+a, j, no+b]
            f_oo[i, j] = f[i, j]
        end
    end

Ideally I am looking for a way to define the g_xxxx tensors without having to have the entire g tensor in memory. Does that make the question clearer? If not, please let me know.

This sounds like a job for mmap: Memory-mapped I/O · The Julia Language

It sounds like you want some kind of sparse array or sparse tensor package. These exist.

That said, let’s go through the exercise of implementing a sparse array via a Dict. The reason I think this may be worthwhile is that your case is perhaps very specialized, so you may need further customizations than any existing package can provide.

julia> struct MySparseTensor <: AbstractArray{Float64,4}
           size::Dims{4}
           data::Dict{Dims{4}, Float64}
           function MySparseTensor(size::Dims{4})
               dict = Dict{Dims{4}, Float64}()
               new(size, dict)
           end
       end

julia> Base.size(t::MySparseTensor) = t.size

julia> function Base.getindex(
           t::MySparseTensor,
           I::Vararg{Int,4}
       )
           checkbounds(t, I...)
           get(t.data, I, 0.0)
       end

julia> function Base.setindex!(
           t::MySparseTensor,
           v,
           I::Vararg{Int,4}
       )
           checkbounds(t, I...)
           t.data[I] = v
       end

julia> function Base.show(
           io::IO,
           ::MIME"text/plain",
           t::MySparseTensor
       )
           print(io, MySparseTensor)
           print(io, " of size ")
           print(io, t.size)
           print(io, " with ")
           print(io, length(t.data))
           print(io, " non-zero element(s).")
       end

Now let’s try to use it.

julia> t = MySparseTensor((10, 100, 1000, 10000))
MySparseTensor of size (10, 100, 1000, 10000) with 0 non-zero element(s).

julia> t[1,1,1,1] = 500
500

julia> t
MySparseTensor of size (10, 100, 1000, 10000) with 1 non-zero element(s).

julia> t[1,1,1,10000] = 500
500

julia> t[1,1,1,:]
10000-element Vector{Float64}:
 500.0
   0.0
   0.0
   0.0
   0.0
   0.0
   0.0
   0.0
   0.0
   0.0
   0.0
   ⋮
   0.0
   0.0
   0.0
   0.0
   0.0
   0.0
   0.0
   0.0
   0.0
   0.0
 500.0

julia> sum(t[1,1,1,:])
1000.0
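For the FCIDUMP case specifically, such a tensor could be filled straight from the parsed rows. A minimal sketch, assuming the data matrix and norb from the original post (names are illustrative, untested against the actual file):

    g = MySparseTensor((norb, norb, norb, norb))
    for i in 1:size(data, 1)
        I, J, K, L = round.(Int, data[i, 2:5])
        # Skip one-electron and scalar lines (K = L = 0)
        (K == 0 && L == 0) && continue
        # Store only the non-redundant element
        g[I, J, K, L] = data[i, 1]
    end

To avoid storing the seven symmetric copies, getindex could additionally map any index quadruple to a canonical representative before the Dict lookup; that is exactly the kind of specialization that is easier with a hand-rolled type than with a generic package.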

Going through the preliminary documentation, it seems like mmap is useful when we want to do something like this (quoting their example):

# Create a file for mmapping
# (you could alternatively use mmap to do this step, too)
using Mmap
A = rand(1:20, 5, 30)
s = open("/tmp/mmap.bin", "w+")
# We'll write the dimensions of the array as the first two Ints in the file
write(s, size(A,1))
write(s, size(A,2))
# Now write the data
write(s, A)
close(s)

# Test by reading it back in
s = open("/tmp/mmap.bin")   # default is read-only
m = read(s, Int)
n = read(s, Int)
A2 = mmap(s, Matrix{Int}, (m,n))

Now A2 is associated with the stream s instead of being present in memory. But we still had to have A in memory in order to write its data to s in Matrix{Int} form. This would correspond to me having to put the g tensor in memory at least once, which would be disastrous.

You can initialize A as A = mmap("storage.bin", Array{Float64, 4}, (2,3,4,5)). This will create the file storage.bin and give you a 4D array back. Then initialize it (A .= 0), and then loop over the lines in your file, updating A as you go, thus updating the file. The OS will take care of paging data in and out of memory and of synchronizing the disk file, based on what you access/update in A.
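To make that concrete, here is a minimal sketch under the assumptions of the earlier posts (norb and linenum already extracted from the header, the same eight permutational images written as in the original loop; the file name g.bin and the streaming parse are illustrative, not tested against the actual file):

    using Mmap

    # Open a fresh file read-write and map a disk-backed 4D array onto it;
    # nothing of this size is ever allocated in RAM.
    io = open("g.bin", "w+")
    g = mmap(io, Array{Float64,4}, (norb, norb, norb, norb))
    g .= 0.0                            # initialize the file contents

    # Stream over the data lines instead of loading them all with readdlm.
    for line in Iterators.drop(eachline(pathtofcidump), linenum)
        cols = split(line)
        v = parse(Float64, cols[1])
        I, J, K, L = parse.(Int, cols[2:5])
        (K == 0 && L == 0) && continue  # one-electron / scalar lines
        g[I, J, K, L] = v               # the OS pages these writes to disk
        g[K, L, I, J] = v
        g[J, I, L, K] = v
        g[L, K, J, I] = v
        g[J, I, K, L] = v
        g[L, K, I, J] = v
        g[I, J, L, K] = v
        g[K, L, J, I] = v
    end

    Mmap.sync!(g)                       # flush outstanding writes to the file
    close(io)

The g_vvoo-style blocks from the earlier post can then be built by mmapping the same file again (read-only) and copying out just the slices that are needed; only the pages that are actually touched get read back from disk.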
