How to read large matrices in Julia?

I’m trying to read text files defining matrices of the form

q = Int[1 0 0 1; 0 1 1 1; 1 0 1 0; 0 0 1 1]

via julia matrix.jl or julia -L matrix.jl. It breaks down for matrices slightly larger than 1000x1000. For example, for 1020x1020 I get

/tmp$ julia -L matrix-1020.jl
Internal error: stack overflow in type inference of typed_hvcat(Type{Int64}, NTuple{1020, Int64}, Int64, Int64...).
This might be caused by recursion over very long tuples or argument lists.
Internal error: stack overflow in type inference of hvcat_fill!(Array{Int64, 2}, NTuple{1040400, Int64}).
This might be caused by recursion over very long tuples or argument lists.

How can one read such moderately sized matrices into Julia (without writing my own function to parse them)?

For completeness, here is one way to create such matrices:

n = 1020
q = rand((0, 1), n, n)
f = open("matrix-$n.jl", "w")
print(f, "q = Int")
println(f, q)
close(f)

I did notice this thread. It looks related, but there doesn’t seem to be a solution.

You shouldn’t be storing a matrix as Julia source code in a text file…

If you only work with Julia, you can use:

Text files have their advantages: they are easy to create, easy to modify by hand, and usually make it easy to move data between different programs.

Then at least use something like DelimitedFiles.jl (JuliaData/DelimitedFiles.jl on GitHub, a package for reading and writing files with delimited values, originally a Julia stdlib), so you’re not loading Julia source code.

How exactly do you intend to go about modifying a 1000x1000 matrix by hand? :wink:

I use DataDrop.jl (PetrKryslUCSD/DataDrop.jl on GitHub: numbers, matrices, and strings stored to disk and retrieved again), and I wouldn’t go back to ASCII files.

Right, there are many, many options for storing data sets, but source code is almost never the best way. It’s much better to use a standard format, because that will be readable by many tools (in many different languages), will often have highly optimized readers/writers, and will avoid the security vulnerability of running a program from a file just to read data, and other benefits.

Some popular options are:

  • Human-readable text formats: CSV files (use the DelimitedFiles stdlib, the highly optimized CSV.jl package, or many other tools), JSON files (e.g. JSON.jl or JSON2.jl or JSON3.jl), and others …

  • Binary files (which are more compact, can be faster to read/write, and can store metadata in addition to the raw numbers): HDF5 (use HDF5.jl or JLD.jl or JLD2.jl), NetCDF (NetCDF.jl), BSON (BSON.jl), FlatBuffers (FlatBuffers.jl), Arrow (Arrow.jl), and many others.

There are many, many formats to choose from, but the above are some of the most popular in Julia.
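As a minimal sketch of the delimited-text option, the following writes a 1020x1020 matrix (the size that failed as source code) with the DelimitedFiles standard library and reads it back. The file name and the temporary directory are assumptions made for illustration only.

```julia
using DelimitedFiles  # standard library for delimited text files

# A random 0/1 matrix of the size that failed when stored as source code.
n = 1020
q = rand((0, 1), n, n)

# Hypothetical file name inside a fresh temporary directory.
path = joinpath(mktempdir(), "matrix-$n.txt")

# Write one matrix row per line, entries separated by single spaces.
writedlm(path, q, ' ')

# Read the file back, requesting Int entries instead of the default Float64.
q2 = readdlm(path, ' ', Int)

println(size(q2))
println(q2 == q)
```

Because the file is read as data rather than evaluated as code, nothing has to be parsed as one huge matrix literal.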

Some program (from outside the Julia world) might output matrices as

1 2 3
4 5 6

With a text editor or sed, I can easily convert it, no matter how large the matrix is. Imagine that I do some computation with some other program. Then I want to check a result in Julia, maybe because I suspect that there is a bug somewhere. This may happen only once, so I don’t want to write code.

As an analogy, take the Unix command line tools. One reason that they are so powerful is that one can easily feed the output of any one program into another program. (The inventors loved compositionality, like Julians.) And that works because it’s all based on text files.

I agree. But sometimes the format is dictated by some other program. Or I just want a quick-and-dirty solution, as explained above.

If you insist on storing it as a large text file, perhaps the best way is to “unstack” or “melt” the matrix first. That means storing each element on its own line, together with its row and column. That makes for a more traditional text file, or even a CSV file, which is somewhat more portable. Note that the file will have 1_000_000 lines for a 1_000 by 1_000 matrix.

Here is some code to convert a matrix to this form:

julia> A = rand(-100:100,4,4)
4×4 Matrix{Int64}:
 -100   96   82    2
   83  -72   38  -18
   -6   79   72   25
   67   64  -70   62

julia> println("row, col, val\n"*join(["$(I[1]), $(I[2]), $v" for (I,v) in pairs(A)],"\n"))
row, col, val
1, 1, -100
2, 1, 83
3, 1, -6
4, 1, 67
1, 2, 96
:
1, 4, 2
2, 4, -18
3, 4, 25
4, 4, 62

Reading can be done with the help of the CSV.jl package.

ADDED: To read this file:

using CSV, Tables

s = "row, col, val\n"*join(["$(I[1]), $(I[2]), $v" for (I,v) in pairs(A)],"\n")
iob = IOBuffer(s)

q = CSV.read(iob, Tables.matrix)
M = zeros(eltype(q), maximum(@view q[:, 1]), maximum(@view q[:,2]))
foreach(eachrow(q)) do row
    M[row[1],row[2]] = row[3]
end

# and possibly faster (this relies on the entries being listed
# in column-major order, as they are in s above):
#
# M = reshape(q[:,3], (maximum(@view q[:,1]), maximum(@view q[:,2])))

and now M has the matrix back (note that s and iob are just there to make testing this post easier).

That’s whitespace-delimited text — you can read that directly with DelimitedFiles or CSV.jl (with delim=' ', header=false, optionally with ignorerepeated=true if you might have multiple spaces between columns). For example:

julia> using DelimitedFiles

julia> data = """
       1 2 3
       4 5 6
       """
"1 2 3\n4 5 6\n"

julia> readdlm(IOBuffer(data), ' ')
2×3 Matrix{Float64}:
 1.0  2.0  3.0
 4.0  5.0  6.0

or

julia> using CSV

julia> CSV.File(IOBuffer(data), delim=' ', header=false) |> CSV.Tables.matrix
2×3 Matrix{Int64}:
 1  2  3
 4  5  6

If you can easily convert it to Julia source code, then you could also easily convert it to space- or comma-delimited text (if it isn’t in such a format already, as in the above example).

Such data “wrangling” is indeed often required for data generated from external sources that you don’t control! But wrangle it into a standard data format, not into source code. Code generation / metaprogramming is rarely the software-engineering approach of choice.

Note also that there is no need to use sed or a text editor. Julia has a powerful regular-expression transformation library built-in.
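For a file that already holds a matrix literal, one possible conversion in Julia itself is sketched below. It assumes the file contains a single literal of the form shown in the question (one opening bracket, one closing bracket, rows separated by semicolons); the string src stands in for the file contents.

```julia
using DelimitedFiles

# Stand-in for the contents of a file such as matrix.jl.
src = "q = Int[1 0 0 1; 0 1 1 1; 1 0 1 0; 0 0 1 1]\n"

# Keep only the text between the first '[' and the last ']'.
# The input is assumed to be ASCII, so byte indices are valid character indices.
body = src[findfirst('[', src)+1 : findlast(']', src)-1]

# Replace each ';' row separator (and surrounding whitespace) with a newline,
# which yields plain space-delimited text with one matrix row per line.
rows = replace(body, r"\s*;\s*" => "\n")

# Parse the text as data instead of evaluating it as source code.
q = readdlm(IOBuffer(rows), ' ', Int)

println(q)
```

Since the literal is never evaluated, this avoids the very long argument list that triggered the stack overflow in type inference.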

That’s COO or “Coordinate” format. This format makes sense for sparse matrices, but for a dense matrix it will be a lot more compact to store it in CSV or similar with one row per line, as well as being faster and easier to read and write, and CSV is also understood by more software tools.

Also, CSV’s “cousin” of space-delimited text directly corresponds to the format used by literal matrices in Julia source code, so it’s essentially the format that @matthias314 wants to use.
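To make the size difference concrete, here is a rough sketch that builds both text forms of the same dense matrix in memory and compares their byte counts. The 100x100 size is an arbitrary choice for illustration.

```julia
using DelimitedFiles

A = rand(-100:100, 100, 100)

# Coordinate (COO) text: a header plus one "row, col, val" line per entry.
coo = "row, col, val\n" *
      join(["$(I[1]), $(I[2]), $v" for (I, v) in pairs(A)], "\n")

# Dense text: one matrix row per line, entries separated by commas.
dense = sprint(io -> writedlm(io, A, ','))

println("COO bytes:   ", sizeof(coo))
println("dense bytes: ", sizeof(dense))

# The dense form reads back directly into a matrix.
B = readdlm(IOBuffer(dense), ',', Int)
println(B == A)
```

The coordinate form repeats a row and a column index for every entry, so it is always the longer of the two for a dense matrix.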