Huge matrix declared as const in a package leading to StackOverflowError

Hi guys!

I have some big constant matrices that are used to compute Earth precession and nutation. They can be seen here.

The problem is that when I wrote the matrices without the Float64 element type, the builds on Linux and Windows crashed with a StackOverflowError. The error disappeared when I added the Float64.
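For illustration, a minimal sketch of the two forms with made-up values (the real matrices have over a thousand rows):

# Untyped literal: the element type is computed by promoting over all
# elements; with the full-size matrices this form crashed the builds:
const M_untyped = [1 -12.02  0.00;
                   2 -12.96  5.21]

# Typed literal: the element type is given up front and each entry is
# converted to Float64 directly; this form builds fine:
const M_typed = Float64[1 -12.02  0.00;
                        2 -12.96  5.21]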

Questions: Is this a bug? Is it a bad idea to have such a big matrix of constant values inside a package? Any suggestions to improve it?

I think it’s fine. Check out this file from Sobol.jl as an extreme example.

Also, the values of the matrix can still be changed, since const only fixes the binding, not the array contents (although presumably no code will actually do that). Accessing the elements of the matrix from inside functions will simply be type stable, thanks to const.
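A quick sketch of both points:

const M = Float64[1.0 2.0; 3.0 4.0]

M[1, 1] = 99.0   # allowed: const fixes the binding, not the contents

# Element accesses from functions are type stable, because the compiler
# knows the binding always refers to a Matrix{Float64}:
f() = M[1, 1] + 1.0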


My guess is that when you don’t specify the element type, there will be something like a promote_type(x...) with splatting of all the elements of the matrix. And since the length is known at compile time, it will try to do all the allocations necessary for the splatting on the stack, which has a limited size, leading to the StackOverflowError.
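A rough sketch of the mechanism I have in mind, using the Base functions that untyped vector literals lower to (the functions are real; whether this is the exact failing code path is only my guess):

# An untyped literal like [1, -12.02, 0.00] lowers to a splatted call:
v = Base.vect(1, -12.02, 0.00)        # gives a Vector{Float64}

# which promotes over all the splatted elements:
Base.promote_typeof(1, -12.02, 0.00)  # Float64

# With thousands of elements, that splatted argument list gets huge,
# which is where I suspect the stack runs out.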


I don’t know how that matrix is used, but it seems to me that it is actually four entities awkwardly glued together.
It seems the first column should just be a constant UnitRange, 1:1306. Or, even better: Base.OneTo(1306).
Then the second and third columns should be two separate vectors.

Then the last 14 columns should be a sparse matrix.

But this is just a wild guess for what this constant matrix is used for. I might be completely wrong.
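To make it concrete, something along these lines (names and values are placeholders for three rows; the real table has 1306):

using SparseArrays

const row_ids   = Base.OneTo(3)          # first column: just the row index
const coefs_sin = [-12.02, 5.21, 0.00]   # second column
const coefs_cos = [0.00, -1.30, 2.45]    # third column

# Last 14 columns: small integers, mostly zero, hence a sparse matrix,
# built here from a few made-up (row, column, value) triples:
const multipliers = sparse([1, 2, 3], [1, 3, 14], [2, -1, 1], 3, 14)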

Thanks for the tip!

In fact, this matrix holds coefficients obtained from here:

ftp://maia.usno.navy.mil/conv2010/2010/2010_update/chapter5/additional_info/tab5.2a.txt

Indeed, the first column is not used and the last 14 columns are integers. I chose to just copy and paste from the original file because, when a new version of the model becomes available, it will be easier to update.

However, your suggestion makes sense. I will test to see if I can gain speed.

I’d say that many would consider it best practice to separate code and data. Although I don’t think it’s a huge deal in your example, I would put the data in its own file or files, using e.g. JSON (or BSON), CSV, XML, or Protobuf. Some advantages of using separate files:

  • With data separated, you and others can access it more easily from other languages and tools. You yourself had to manually copy and paste the different parts of this data from an unstructured source, right? Wouldn’t it have been convenient if all you had to do was download a JSON file?
  • Similarly, if you want to automate downloading fresh data, it’s easier to keep it separate than having to generate source code intermixed with data.
  • Storing data in code might make your repository bigger, which makes some operations slower (although if it’s just 300 KB, it’s not that bad). With data separated, there are a number of options: store it compressed in the repository, store it in a separate repository, make your script download it and cache it.
  • Looking at git history, you can easily separate changes in data vs changes in source code.
  • With thousands of lines of data in source files, measures like code coverage could be messed up.
  • A source file with several structures with thousands of lines of data can be hard to read and work with. For example, changing all array types to be Float64 was probably unnecessarily complicated. (Although that might depend on how used you are to working with it; with an editor like Atom for example you can fold large matrices for better overview.)

Nice idea!

This data will not change for some years. Hence, to make life easier, I prefer to ship the input file together with the repo. My question is: how can I load all the data and put it in a matrix during package initialization without leading to type instability?

How about using a more structured approach, e.g.:

using JSON

struct Coefs
    id::Int
    f1::Float64
    f2::Float64
    k1::Int
    k2::Int
end

# put this in your actual code:
#   json_data = JSON.parsefile("nut_coefs_iau2006.json")
# the below is just an example:
json_str = """{"nut_coefs_iau2006_X0":[[1, -12.02, 0.00, 0, 2],[23, -12.96, 5.21, -2, 1]]}"""
json_data = JSON.parse(json_str)

coefs_x0 = map(e -> Coefs(e...), json_data["nut_coefs_iau2006_X0"])

I think that storing the data outside code is a sound proposition, but many of these formats are problematic for various reasons. XML is probably overkill, Protobuf is not meant for long-term storage, and BSON and JSON add dependencies.

CSV is OK, but I would just go with the standard library DelimitedFiles, which handles what can be considered a subset of CSV but is entirely adequate for the purpose.


Good point. I would have no problem loading it using DelimitedFiles. However, I am not sure how I can load this in the package, store it as a global variable, and use it without type-stability and performance problems.

readdlm has a type argument. If you use

import DelimitedFiles
const this_matrix = DelimitedFiles.readdlm(path, ',', Int)

the performance of using this should be equivalent to just specifying it in code.


I think, if you don’t want the constant to become part of the precompiled image, you could use __init__ (see Modules · The Julia Language) to read the CSV at package load time.
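A minimal sketch of that pattern (module name, file path, and element type are just assumptions):

module MyPackage  # hypothetical package

using DelimitedFiles

# Bound once; filled in at load time, so the data stays out of the
# precompiled image while element access remains type stable:
const NUT_COEFS = Ref{Matrix{Float64}}()

function __init__()
    path = joinpath(@__DIR__, "..", "data", "tab5.2a.csv")  # hypothetical path
    NUT_COEFS[] = readdlm(path, ',', Float64)
end

end # module

Functions would then use NUT_COEFS[], which the compiler knows is a Matrix{Float64}.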


Agreed, DelimitedFiles sounds like a good option, especially since that would allow copying and pasting OP’s source data without any modification. Although I think JSON would be an even better format, it’s more work to generate the data file in the first place. As for having a dependency on JSON, that seems like no big issue to me; it’s very lightweight.

Does DelimitedFiles have built-in support for mixed types, though? The input data consists of a mix of ints and floats. (Although I guess you can treat everything as float, which seems to be what OP is currently doing.)
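For what it’s worth, one way around the mixed types, as a sketch (assuming the integer columns come last, as in OP’s table; the file name is made up):

using DelimitedFiles

# Read everything as Float64 (these small integers are represented
# exactly), then convert the integer-valued columns back explicitly:
raw   = readdlm("tab5.2a.txt", Float64)
coefs = raw[:, 2:3]            # the genuinely floating-point columns
ks    = Int.(raw[:, 4:end])    # the integer columns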


What is the issue with protobuf, or even just Mmap of the Array{Coefs}? As long as everything is bitstype, this is a perfectly well-defined storage format that interacts well with C.

This should also give the fastest possible package load time (just a syscall), and it costs no memory at all until the large constant array is accessed and the kernel faults it in.

If you Mmap, then you should use explicit Int64 instead of Int, in case people try to run your code on 32-bit systems.
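A sketch of the Mmap variant, reusing the Coefs layout from the JSON example above but with explicit Int64 fields (the file name is made up):

using Mmap

struct Coefs64
    id::Int64
    f1::Float64
    f2::Float64
    k1::Int64
    k2::Int64
end

# Write once, e.g. in a build script; isbits structs have a fixed
# binary layout, so the file format is well defined and C-compatible:
open("coefs.bin", "w") do io
    write(io, [Coefs64(1, -12.02, 0.0, 0, 2), Coefs64(23, -12.96, 5.21, -2, 1)])
end

# Load: essentially just a syscall; the kernel faults pages in only
# when the array is actually accessed:
const COEFS = Mmap.mmap("coefs.bin", Vector{Coefs64}, (2,))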