The following is based on a real-world case, where I got a huge file containing univariate polynomials, but I’ve tried to simplify it for easier reproduction. Note that in my original test case I actually got a ReadOnlyMemoryError instead of a StackOverflowError, but I figured the example here is relevant enough by itself.
BTW, I am of course aware that there are “better” ways to serialize univariate polynomials, e.g. as a list of coefficients – but (a) this is the data I get, not my choice, and (b) it is less clear what to do when, e.g., the coefficients come from special fields where you in turn need special code to serialize them. In short, while my long-term “solution” would be to store this data differently, right now I have to deal with what I have, so there’s no point in telling me to use a different data format.
The following helper function generates the problematic code (originally there was of course just a .jl “data file”):
function badcode(N::Int)
    io = IOBuffer()
    println(io, "x = 2")
    println(io, "ll = [")
    for i in 1:N
        println(io, "[$i, x^7+x^6+x^5+x^4+x+1],")
    end
    println(io, "]")
    return String(take!(io))
end
So for example:
julia> print(badcode(5))
x = 2
ll = [
[1, x^7+x^6+x^5+x^4+x+1],
[2, x^7+x^6+x^5+x^4+x+1],
[3, x^7+x^6+x^5+x^4+x+1],
[4, x^7+x^6+x^5+x^4+x+1],
[5, x^7+x^6+x^5+x^4+x+1],
]
Now let’s try to parse and evaluate this code for different values of N – this produces a Vector{Vector{Int64}} (I played with changing the inner Vector to a tuple, but that made no difference). Storage for that seems to be 16 N bytes, based on sizeof.
This is what I see:
julia> bad=badcode(10000) ; @time include_string(Main,bad);
2.950018 seconds (708.01 k allocations: 42.467 MiB, 0.13% gc time)
julia> bad=badcode(20000) ; @time include_string(Main,bad);
5.690603 seconds (1.42 M allocations: 84.963 MiB, 0.54% gc time)
julia> bad=badcode(30000) ; @time include_string(Main,bad);
8.769406 seconds (2.13 M allocations: 127.459 MiB, 0.91% gc time)
julia> bad=badcode(40000) ; @time include_string(Main,bad);
11.465964 seconds (2.84 M allocations: 169.954 MiB, 0.91% gc time)
julia> bad=badcode(50000) ; @time include_string(Main,bad);
14.838581 seconds (3.55 M allocations: 212.450 MiB, 0.83% gc time)
julia> bad=badcode(60000) ; @time include_string(Main,bad);
16.992089 seconds (4.26 M allocations: 254.946 MiB, 1.44% gc time)
julia> bad=badcode(70000) ; @time include_string(Main,bad);
ERROR: StackOverflowError:
Stacktrace:
[1] top-level scope
@ none:1
[2] eval
@ ./boot.jl:360 [inlined]
[3] eval(x::Expr)
@ Base.MainInclude ./client.jl:446
[4] top-level scope
@ ./timing.jl:210 [inlined]
[5] top-level scope
@ ./REPL[96]:0
Extrapolating from this, it seems that ~71 allocations are required per additional line read. My guess is that the parser first builds one huge syntax tree and then collapses it.
Question 1: Why the crash? Is this “normal” or a bug in the Julia parser?
Question 2: Is there any chance this parsing can be sped up in Julia itself?
As it is, Julia is disappointingly slow here. The previous system that was used to parse these files does so in ~250 milliseconds (for a file with ~65000 lines) and evaluates it in another 2.5 seconds.
My guess is that a lot of time is wasted producing machine code that is then thrown away. I wonder if one could add an “immediate execution mode” to deal with such cases (I know that other language implementations have such a thing; I’ve even worked on something like that in the past). I know there is a JuliaInterpreter.jl package… so now I wonder:
Question 3: Is there another, faster way to read this data in Julia, e.g. using JuliaInterpreter.jl or some other parser package?
I’ve contemplated just implementing my own parser for simple polynomials with integer coefficients, but perhaps this has already been done by someone?
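For reference, a minimal hand-rolled parser along those lines could look roughly like the sketch below. It assumes each data line has exactly the shape "[i, poly]," where the polynomial is a sum of terms of the form c, x, c*x, x^e, or c*x^e, and it evaluates the polynomial directly at a given integer value of x. All function names here are my own invention, not an existing package:

```julia
# Hypothetical sketch: evaluate one term of the form c, x, c*x, x^e, or c*x^e.
function parsepolyterm(term::AbstractString, x::Int)
    t = strip(term)
    m = match(r"^(?:(\d+)\*?)?(x(?:\^(\d+))?)?$", t)
    (m === nothing || t == "") && error("unsupported term: $term")
    c = m.captures[1] === nothing ? 1 : parse(Int, m.captures[1])
    if m.captures[2] === nothing
        return c                                        # constant term
    else
        e = m.captures[3] === nothing ? 1 : parse(Int, m.captures[3])
        return c * x^e                                  # c * x^e
    end
end

# Hypothetical sketch: parse one data line "[i, x^7+...+1]," into [i, value].
function parseline(line::AbstractString, x::Int)
    # Strip the surrounding brackets and trailing comma, then split
    # off the leading index at the first comma.
    body = strip(strip(line), ['[', ']', ','])
    idxstr, polystr = split(body, ','; limit = 2)
    i = parse(Int, strip(idxstr))
    v = sum(parsepolyterm(t, x) for t in split(polystr, '+'))
    return [i, v]
end

parseline("[5, x^7+x^6+x^5+x^4+x+1],", 2)   # yields [5, 243]
```

This sidesteps the Julia parser entirely for the bulk of the file; only the assumption about the line shape makes it this simple.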
For now, my solution is a gross hack: I read these files line by line and then pass all the “interesting” lines to eval(Meta.parse(...)). This works, but is still disappointingly slow. E.g. with such a file with ~65000 lines, it takes 48 seconds on my computer: about 15 seconds for parsing and 30 for evaluation. A similar strategy is to modify the file data to look more like this:
x = 2
ll = Vector{Vector{Int}}(undef, 5)
ll[1] = [1, x^7+x^6+x^5+x^4+x+1]
ll[2] = [2, x^7+x^6+x^5+x^4+x+1]
ll[3] = [3, x^7+x^6+x^5+x^4+x+1]
ll[4] = [4, x^7+x^6+x^5+x^4+x+1]
ll[5] = [5, x^7+x^6+x^5+x^4+x+1]
This solves the crash, but is even slower:
julia> function goodcode(N::Int)
           io = IOBuffer()
           println(io, "x = 2")
           println(io, "ll = Vector{Vector{Int}}(undef, $N)")
           for i in 1:N
               println(io, "ll[$i] = [$i, x^7+x^6+x^5+x^4+x+1]")
           end
           return String(take!(io))
       end
goodcode (generic function with 1 method)
julia> str=goodcode(10000) ; @time include_string(Main, str);
4.311825 seconds (1.08 M allocations: 67.464 MiB, 0.33% gc time)
julia> str=goodcode(20000) ; @time include_string(Main, str);
8.783035 seconds (2.16 M allocations: 134.985 MiB, 1.44% gc time)
julia> str=goodcode(30000) ; @time include_string(Main, str);
13.460448 seconds (3.24 M allocations: 202.505 MiB, 1.14% gc time)
julia> str=goodcode(40000) ; @time include_string(Main, str);
19.910083 seconds (4.32 M allocations: 270.025 MiB, 1.31% gc time)
julia> str=goodcode(50000) ; @time include_string(Main, str);
20.537545 seconds (5.40 M allocations: 337.545 MiB, 1.24% gc time)
julia> str=goodcode(60000) ; @time include_string(Main, str);
24.248670 seconds (6.48 M allocations: 405.065 MiB, 1.25% gc time)
julia> str=goodcode(70000) ; @time include_string(Main, str);
29.532085 seconds (7.56 M allocations: 472.585 MiB, 1.28% gc time)
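The line-by-line eval(Meta.parse(...)) hack mentioned above can be sketched roughly like this. It assumes every non-empty line is a complete statement (as in the reshaped "ll[i] = ..." format), and evalbylines is a name I made up for illustration:

```julia
# Rough sketch of the line-by-line hack: parse and evaluate each line
# separately, so the parser never sees one gigantic array literal.
# Assumes each non-empty line is a complete statement on its own.
function evalbylines(code::AbstractString)
    for line in eachline(IOBuffer(code))
        isempty(strip(line)) && continue     # skip blank lines
        ex = Meta.parse(line)
        ex === nothing && continue           # e.g. comment-only lines
        Core.eval(Main, ex)
    end
end
```

Running e.g. evalbylines(goodcode(70000)) avoids the StackOverflowError, since no single expression grows with N, though each Meta.parse/eval round trip still carries per-line overhead.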
System info:
julia> versioninfo()
Julia Version 1.6.4
Commit 35f0c911f4 (2021-11-19 03:54 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.5.0)
CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)