Huge matrix declared as const in a package leading to StackOverflowError

Hi guys!

I have some big constant matrices that are used to compute Earth precession and nutation. They can be seen here.

The problem is that when I wrote the matrices without the Float64 element type, the builds on Linux and Windows crashed with a StackOverflowError. The error disappeared when I added the Float64.
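For illustration, a minimal sketch of the two forms with made-up values (the real matrices have over a thousand rows):

# Untyped literal: the element type is computed by promoting over all
# elements; with the full-size matrices this form crashed the builds:
const M_untyped = [1 -12.02  0.00;
                   2 -12.96  5.21]

# Typed literal: the element type is given up front and each entry is
# converted to Float64 directly; this form builds fine:
const M_typed = Float64[1 -12.02  0.00;
                        2 -12.96  5.21]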

Questions: Is this a bug? Is it a bad idea to have such a big matrix of constant values inside a package? Any suggestions to improve it?

I think it’s fine. Check out this file from Sobol.jl as an extreme example.

Also, the values of the matrix can still be changed, since const only fixes the binding, not the array contents (although presumably no code will actually do that). Accessing the elements of the matrix from inside functions will simply be type stable, thanks to const.
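A quick sketch of both points:

const M = Float64[1.0 2.0; 3.0 4.0]

M[1, 1] = 99.0   # allowed: const fixes the binding, not the contents

# Element accesses from functions are type stable, because the compiler
# knows the binding always refers to a Matrix{Float64}:
f() = M[1, 1] + 1.0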


My guess is that when you don’t specify the element type, there will be something like a promote_type(x...) with splatting of all the elements of the matrix. And since the length is known at compile time, it will try to do all the allocations necessary for the splatting on the stack, which has a limited size, leading to the StackOverflowError.
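A rough sketch of the mechanism I have in mind, using the Base functions that untyped vector literals lower to (the functions are real; whether this is the exact failing code path is only my guess):

# An untyped literal like [1, -12.02, 0.00] lowers to a splatted call:
v = Base.vect(1, -12.02, 0.00)        # gives a Vector{Float64}

# which promotes over all the splatted elements:
Base.promote_typeof(1, -12.02, 0.00)  # Float64

# With thousands of elements, that splatted argument list gets huge,
# which is where I suspect the stack runs out.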


I don’t know how that matrix is used, but it seems to me that it is actually four entities awkwardly glued together.
It seems the first column should just be a constant UnitRange, 1:1306. Or, even better: Base.OneTo(1306).
Then the second and third columns should be two separate vectors.

Then the last 14 columns should be a sparse matrix.

But this is just a wild guess for what this constant matrix is used for. I might be completely wrong.
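To make it concrete, something along these lines (names and values are placeholders for three rows; the real table has 1306):

using SparseArrays

const row_ids   = Base.OneTo(3)          # first column: just the row index
const coefs_sin = [-12.02, 5.21, 0.00]   # second column
const coefs_cos = [0.00, -1.30, 2.45]    # third column

# Last 14 columns: small integers, mostly zero, hence a sparse matrix,
# built here from a few made-up (row, column, value) triples:
const multipliers = sparse([1, 2, 3], [1, 3, 14], [2, -1, 1], 3, 14)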

Thanks for the tip!

In fact, this matrix holds coefficients obtained from here:

ftp://maia.usno.navy.mil/conv2010/2010/2010_update/chapter5/additional_info/tab5.2a.txt

Indeed, the first column is not used and the last 14 columns are integers. I chose to just copy and paste from the original file because, when a new version of the model becomes available, it will be easier to update.

However, your suggestion makes sense. I will test to see if I can gain speed.

I’d say that many would consider it best practice to separate code and data. Although I don’t think it’s a huge deal in your example, I would put the data in its own file or files, using e.g. JSON (or BSON), CSV, XML, or Protobuf. Some advantages of using separate files:

  • With data separated, you and others can access it more easily from other languages and tools. You yourself had to manually copy and paste the different parts of this data from an unstructured source, right? Wouldn’t it have been convenient if all you had to do was download a JSON file?
  • Similarly, if you want to automate downloading fresh data, it’s easier to keep it separate than having to generate source code intermixed with data.
  • Storing data in code might make your repository bigger, which makes some operations slower (although if it’s just 300 KB, it’s not that bad). With data separated, there are a number of options: store it compressed in the repository, store it in a separate repository, make your script download it and cache it.
  • Looking at git history, you can easily separate changes in data vs changes in source code.
  • With thousands of lines of data in source files, measures like code coverage could be messed up.
  • A source file with several structures with thousands of lines of data can be hard to read and work with. For example, changing all array types to be Float64 was probably unnecessarily complicated. (Although that might depend on how used you are to working with it; with an editor like Atom for example you can fold large matrices for better overview.)

Nice idea!

This data will not change for some years. Hence, to make life easier, I prefer to ship the input file together with the repo. My question is: how can I load all the data and put it in a matrix during package initialization without leading to type instability?

How about using a more structured approach, e.g.:

using JSON

struct Coefs
    id::Int
    f1::Float64
    f2::Float64
    k1::Int
    k2::Int
end

# put this in your actual code:
#   json_data = JSON.parsefile("nut_coefs_iau2006.json")
# the below is just an example:
json_str = """{"nut_coefs_iau2006_X0":[[1, -12.02, 0.00, 0, 2],[23, -12.96, 5.21, -2, 1]]}"""
json_data = JSON.parse(json_str)

coefs_x0 = map(e -> Coefs(e...), json_data["nut_coefs_iau2006_X0"])

I think that storing the data outside code is a sound proposition, but many of these formats are problematic for various reasons. XML is probably overkill, Protobuf is not meant for long-term storage, and BSON and JSON add dependencies.

CSV is OK, but I would just go with the standard library DelimitedFiles, which handles what can be considered a subset of CSV but is entirely adequate for the purpose.


Good point. I would have no problem loading it using DelimitedFiles. However, I am not sure how I can load this in the package, store it as a global variable, and use it without type-stability and performance problems.

readdlm has a type argument. If you use

import DelimitedFiles
const this_matrix = DelimitedFiles.readdlm(path, ',', Int)

the performance of using this should be equivalent to just specifying it in code.


I think, if you don’t want the constant to become part of the precompiled image, you could use __init__ (see Modules · The Julia Language) to read the CSV at package load time.
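A minimal sketch of that pattern (module name, file path, and element type are just assumptions):

module MyPackage  # hypothetical package

using DelimitedFiles

# Bound once; filled in at load time, so the data stays out of the
# precompiled image while element access remains type stable:
const NUT_COEFS = Ref{Matrix{Float64}}()

function __init__()
    path = joinpath(@__DIR__, "..", "data", "tab5.2a.csv")  # hypothetical path
    NUT_COEFS[] = readdlm(path, ',', Float64)
end

end # module

Functions would then use NUT_COEFS[], which the compiler knows is a Matrix{Float64}.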


Agreed, DelimitedFiles sounds like a good option, especially since that would allow copying and pasting OP’s source data without any modification. Although I think JSON would be an even better format, it’s more work to generate the data file in the first place. As for having a dependency on JSON, that seems like no big issue to me; it’s very lightweight.

Does DelimitedFiles have built-in support for mixed types, though? The input data consists of a mix of ints and floats. (Although I guess you can treat everything as float, which seems to be what OP is currently doing.)
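For what it’s worth, one way around the mixed types, as a sketch (assuming the integer columns come last, as in OP’s table; the file name is made up):

using DelimitedFiles

# Read everything as Float64 (these small integers are represented
# exactly), then convert the integer-valued columns back explicitly:
raw   = readdlm("tab5.2a.txt", Float64)
coefs = raw[:, 2:3]            # the genuinely floating-point columns
ks    = Int.(raw[:, 4:end])    # the integer columns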


What is the issue with protobuf, or even just Mmap of the Array{Coefs}? As long as everything is bitstype, this is a perfectly well-defined storage format that interacts well with C.

This should also give the fastest possible package load time (just a syscall), and it costs no memory at all until the large constant array is accessed and the kernel faults it in.

If you Mmap, then you should use explicit Int64 instead of Int, in case people try to run your code on 32-bit systems.
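A sketch of the Mmap variant, reusing the Coefs layout from the JSON example above but with explicit Int64 fields (the file name is made up):

using Mmap

struct Coefs64
    id::Int64
    f1::Float64
    f2::Float64
    k1::Int64
    k2::Int64
end

# Write once, e.g. in a build script; isbits structs have a fixed
# binary layout, so the file format is well defined and C-compatible:
open("coefs.bin", "w") do io
    write(io, [Coefs64(1, -12.02, 0.0, 0, 2), Coefs64(23, -12.96, 5.21, -2, 1)])
end

# Load: essentially just a syscall; the kernel faults pages in only
# when the array is actually accessed:
const COEFS = Mmap.mmap("coefs.bin", Vector{Coefs64}, (2,))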