Help with 0.5 slowdown

I upgraded my code (~5000 lines) from 0.4.6 to to 0.5.0, and it has slowed down by a factor of >60. You can see below from the output of @time that at least one issue is a substantial increase in memory allocation. I have two questions:

(1) The track-allocation feature of Julia in 0.5.0 is not working on my code. (I already opened an issue about that:: https://github.com/JuliaLang/julia/issues/19502), Is there another way to find the culprit?

(2) Are there any known issues in 0.5.0 that could account for this slow-down and excessive allocation? My code is a finite element solver. It consists of matrix assembly, which involves many operations on small dense vectors, matrices and tensors, followed by operations on large sparse matrices (calls to sparse Cholesky mainly). I hand-coded all of the small tensor/matrix/vector operations with explicit loops (constructed by a macro) in order to avoid overhead from allocating and deallocating small objects.

0.4.6:
11.678110 seconds (10.09 M allocations: 4.090 GB, 8.31% gc time)

0.5.0:
755.822319 seconds (3.19 G allocations: 79.052 GB, 3.12% gc time)

Following a suggestion on GitHub from Yichao Yu, and following up on my own posting, I also downloaded the 0.6.0-dev 0-day master and tried my code in that. Here are the stats:

0.6.0-dev:
286.804961 seconds (62.08 M allocations: 4.928 GB, 0.62% gc time)

So the performance problem is still obvious, but slightly improved (20x slowdown instead of 60x). The memory allocation problem is significantly improved - same amount of memory, more calls to the allocator, fewer calls to GC. It appears from the .mem file that track-allocation=user is working correctly. Unfortunately, there is no longer an obvious route for determining the cause of the performance issue other than poring over profiler output.

The most common one I’ve seen is performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub. You’ll have to be more specific (post code, either full or reduced) about your problem before anyone can give you any advice.

Yichao,

Thanks for your continued help. It appears that my code is being affected by two separate performance problems:

In 0.6.0-dev, the performance problem seems to lie with taking the negative of a sparse matrix. I have opened an issue on GitHub about this: huge performance hit in 0.6: negative of a sparse matrix · Issue #19503 · JuliaLang/julia · GitHub

In 0.5.0, profiling seems to show that a large amount of time in my code is being spent on line 499 of the function to_indexes in operators.jl. This line is as follows:

to_indexes(i1, I…) = (to_index(i1), to_indexes(I…)…)

I have no idea what this line does, or why my code calls it. I do have some high-dimensional small arrays in an inner loop, e.g., 6-way-tensors whose dimensions are (2, 2, 6, 2, 6, 2).

I didn’t open an issue on this because I’m not sure what is going on. There are many different execution paths in my code that are reaching to_indexes, and the profiler does not seem to combine them.

There were some index inlining problem that should have fixed on master.

Yichao,

Sorry for belaboring the point, but when you say “fixed on master”, you are referring to 0.6.0-dev, right? On the download website Download Julia there is no nightly build for 0.5; only release 0.5.0 and 0.6.0-dev. So is there an 0.5.1 under development? Or is all the development effort going into 0.6 right now?

In particular, should I forget about porting my code to 0.5 and wait for 0.6 instead?

– Steve

Right.

There are never “development” on a release branch. We backport fixes and backportable improvements onto the release branch from master.

No. You should not use the master branch in general.

Does your code have any high dimensional(6 dimensional or greater) arrays? There is a performance issue there on 0.5.

(Edit: Yes. I posted this before you wrote about your high dimensional arrays, but it took a while to be approved.)

Have you tried defining specialized methods for to_indexes, from 2 to 6 arguments?
i.e.
Base.to_indexes(t1,t2) = (to_index(t1), to_index(t2))
Base.to_indexes(t1,t2,t3) = (to_index(t1), to_index(t2), to_index(t3))
Base.to_indexes(t1,t2,t3,t4) = (to_index(t1), to_index(t2), to_index(t3), to_index(t4))
Base.to_indexes(t1,t2,t3,t4,t5) = (to_index(t1), to_index(t2), to_index(t3), to_index(t4), to_index(t5))
Base.to_indexes(t1,t2,t3,t4,t5,t6) = (to_index(t1), to_index(t2), to_index(t3), to_index(t4), to_index(t5), to_index(t6))
Base.to_indexes(t1,t2,t3,t4,t5,t6,t…) = (to_index(t1), to_index(t2), to_index(t3), to_index(t4), to_index(t5), to_index(t6), to_indexes(t…)…)

That might solve that part of the problem in v0.5

Scott,

Thanks for the suggestion; unfortunately, the problem was not fixed. The exact code insert and timings are below. As a sanity check, I also tried putting in a error(...) statement for the 6-argument to_indexes, but the error was not triggered, indicating that the redefined 6-argument to_indexes was never invoked.

if VERSION == v"0.5.0"
    println("Redefining to_indexes to fix 0.5.0 performance bug")
    import Base.to_indexes
    import Base.to_index
    Base.to_indexes(t1,t2) = (to_index(t1), to_index(t2))
    Base.to_indexes(t1,t2,t3) = (to_index(t1), to_index(t2), to_index(t3))
    Base.to_indexes(t1,t2,t3,t4) = (to_index(t1), to_index(t2), to_index(t3), 
                                    to_index(t4))
    Base.to_indexes(t1,t2,t3,t4,t5) = (to_index(t1), to_index(t2), 
                                       to_index(t3), to_index(t4), 
                                       to_index(t5))
    Base.to_indexes(t1,t2,t3,t4,t5,t6) = (to_index(t1), to_index(t2), 
                                          to_index(t3), to_index(t4), 
                                          to_index(t5), to_index(t6))
    Base.to_indexes(t1,t2,t3,t4,t5,t6,t...) = (to_index(t1), to_index(t2), to_index(t3), 
                                               to_index(t4), to_index(t5), to_index(t6), 
                                               to_indexes(t...)...)
end

0.4.6:

 3.954943 seconds (4.05 M allocations: 1.424 GB, 9.00% gc time)

0.5.0:

 44.352322 seconds (382.90 M allocations: 12.754 GB, 6.76% gc time)

Do NOT redefine base methods like that!

I believe it’s fixed on master so you should check if this part of the problem is fixed for you there. I’m 90% sure that the corresponding PR is already scheduled to be backported to 0.5.

Yichao,

Veering a bit off-topic, I’m wondering about your advice not to modify Base as in my previous code snippet. Here is a different snippet intended to make the code simultaneously compatible with 0.4.6 and 0.5+ Is this also inadvisable, and if so, why? It seems to work.

Thanks,
Steve

if VERSION < v"0.5"
    typealias String ASCIIString
    identity(s::String) = copy(s)
    import Base.SparseMatrix.CHOLMOD.change_factor!
    import Base.SparseMatrix.normestinv
else    
    import Base.SparseArrays.CHOLMOD.change_factor!
    import Base.SparseArrays.normestinv
end

@yuyichao Why not? Also, there is no redefinition going on there - it is simply adding more specific optimized methods.
@Stephen_Vavasis Could the code already have been compiled with the less efficient definition from base/operators.jl? (i.e. shades of #265 raising it’s head)

I would agree that it’s not normally a good idea to redefine things in Base (although in some cases, like to fix a bug, it’s the only way, such as the long-standing bug in cmp() I found recently), but extending functions in Base is very Julian.

@ScottPJones,

How can I determine whether the code is already compiled with the less efficient definition? And how can I prevent it?

– Steve

This should not be necessary. identity(::String) should do exactly the same thing as copy(::String) at least on 0.4.

And other than the identity definition, the rest only introduce changes in your local namespace and is therefore ok. All of them are handled by Compat.jl so it’s usually better to use the package instead though.

How would that work in v0.4.x, as the old abstract String was already deprecated by then, and the new concrete type String wasn’t added until v0.5.x.

julia> VERSION
v"0.4.7"
julia> String
WARNING: Base.String is deprecated, use AbstractString instead.
likely near no file:0

That I don’t know how to do at the moment. I’m starting to think that because of #265, we may wait until v0.6 is out (and at least as a bug-fix release) and jump to that, skipping v0.5.x entirely. (We are currently deploying on v0.4.7)

Determining which definition is used where would likely be a painful adventure with code_llvm.

Prevention may be easier, if you modify the base implementation rather than trying to overwrite it.

  1. replace the offending line in ${INSTALLDIR}/share/julia/base/operators.jl :
    to_indexes(i1, I...) = (to_index(i1), to_indexes(I...)...)
with the version from master
    to_indexes(i1, I...) = (@_inline_meta; (to_index(i1), to_indexes(I...)...))
(save a pristine copy if you want to be careful)
  1. build a custom version of sysimg.[so|dll] (see under internals in the manual).
  2. try your code, starting with
julia --sysimage ${wherever_you_saved_trial_sysimg}

(I hope this approach meets with @yuyichao’s approval.)