Performance regression

question

#1

Hi,
I’m new to this so please forgive me if I am violating protocols. I would also like to report a performance regression from 0.4.5 to 0.5.0. I have looked through the discussion forums and I think my problem is related to what others have reported about indexing into arrays, but I am not sure. The program is pretty naive in its use of features. What I liked about Julia is that I didn’t need to work hard to get fast code.

Below are the outputs from @benchmark. The code is on my GitHub repository https://github.com/ccc1685/transcription-model. Using @profile, it seems that Julia spends a lot of time in map(). My code uses a lot of arrays of arrays. Here is an example for line 222 of my code from 0.5.0 using @profile.

653 ...master/speedtest.jl:222; speedtest()
        602 ./array.jl:392; getindex
         602 ./abstractarray.jl:284; checkbounds
          1   ...lib/julia/sys.dylib:?; checkbounds_indices(::Type{Bo...
          1   ...lib/julia/sys.dylib:?; checkbounds_indices(::Type{Bo...
          542 ...lib/julia/sys.dylib:?; map(::Type{T}, ::Tuple{Int64})
           542 ./tuple.jl:92; map(::Type{T}, ::Tuple{Int64})
            3 ...lib/julia/sys.dylib:?; Base.OneTo{T<:Integer}(::Int64)

This is what I get for 0.4.5

   26  ...n/master/speedtest.jl; speedtest; line: 222
    11 array.jl; getindex; line: 288
     10 array.jl; unsafe_getindex; line: 291
    9  arraymath.jl; cumsum; line: 450

@benchmark outputs

Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
@benchmark speedtest()
BenchmarkTools.Trial: 
  memory estimate:  477.61 mb
  allocs estimate:  11397433
  --------------
  minimum time:     531.212 ms (11.01% GC)
  median time:      543.016 ms (10.63% GC)
  mean time:        568.141 ms (10.90% GC)
  maximum time:     724.565 ms (13.12% GC)
  --------------
  samples:          9
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@benchmark speedtest()
BenchmarkTools.Trial: 
  memory estimate:  553.81 mb
  allocs estimate:  13027802
  --------------
  minimum time:     3.210 s (0.00% GC)
  median time:      3.243 s (0.00% GC)
  mean time:        3.243 s (0.00% GC)
  maximum time:     3.275 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

#2

I don’t know the cause of the regression, but a very simple way to boost the performance of your code is to avoid the declaration az = Array(Array, nalleles). The problem is that Julia cannot deduce the type of az[1] at compile time, so it can’t compile the correct operations for az[1] and instead has to dispatch at run time. Instead, use az = Array(Array{Float64,1}, nalleles) (or whatever type you want for the inner array). Another way to boost performance is to rewrite the cumsum statement with a hand-coded “in-place” function. I can explain this in more detail if you want to pursue it.
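
As a rough sketch of both suggestions: the snippet below is written in current Julia syntax, where the Array(T, n) constructor used in this thread is spelled Vector{T}(undef, n). The names az_loose, az_tight and cumsum_inplace! are made up for illustration and are not taken from the repository.

```julia
nalleles = 3

# Abstract element type: each az_loose[i] is only known to be "some Array",
# so operations on it have to be dispatched at run time.
az_loose = Vector{Array}(undef, nalleles)

# Concrete element type: the compiler knows az_tight[i] is a Vector{Float64}.
az_tight = Vector{Vector{Float64}}(undef, nalleles)

for i in 1:nalleles
    az_loose[i] = rand(4)
    az_tight[i] = rand(4)
end

# Hand-coded in-place cumulative sum (hypothetical helper): it overwrites
# `out` instead of allocating a new array on every call, as cumsum(v) would.
function cumsum_inplace!(out::Vector{Float64}, v::Vector{Float64})
    length(out) == length(v) || throw(DimensionMismatch("out and v must have equal length"))
    s = 0.0
    for k in eachindex(v)
        s += v[k]
        out[k] = s
    end
    return out
end

buf = zeros(4)                      # allocated once, reused across calls
cumsum_inplace!(buf, az_tight[1])
```

Base also ships cumsum!(out, v), which does the same job without a hand-written loop; the explicit version is shown only to make the idea of reusing a preallocated buffer visible.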

[Edit, 1 hour later.] A second way to get full performance without modifying Array(Array,nalleles) is to break your function into pieces: the code that processes az[i] should be in a different function that takes a simple array argument. In this case, Julia will know the type of az[i] when the second function is invoked and therefore can specialize to the type of the inner array at compile time and give you full performance. This is a more “Julian” way to boost performance since it allows your code to remain more generic.
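
A minimal sketch of that split, assuming the per-array work is something like a sum of squares (sum_squares and process_all are hypothetical names, not functions from the repository):

```julia
# Inner function: when it is called, Julia compiles a version specialized
# on the concrete type of `v`, so the loop itself runs at full speed.
function sum_squares(v)
    s = zero(eltype(v))
    for x in v
        s += x * x
    end
    return s
end

# Outer function: the element type of `az` may be abstract, so the type of
# az[i] is only known at run time. That costs a dynamic dispatch per inner
# array, not per element.
function process_all(az)
    total = 0.0
    for i in eachindex(az)
        total += sum_squares(az[i])
    end
    return total
end

az = Vector{Array}(undef, 2)   # deliberately loose container type
az[1] = [1.0, 2.0]
az[2] = [3.0, 4.0]
process_all(az)                # 1 + 4 + 9 + 16 = 30.0
```

The outer loop still pays for looking up the right method each time, so this helps most when the work done per inner array is large compared with the number of inner arrays.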


#3

Hi Stephen,

Thanks for the suggestion. I implemented it for all of my arrays of arrays and it immediately solved the problem. In fact, 0.5 is a little faster than 0.4.5 now. This is great, but I wonder why this was not an issue in 0.4.5? I actually knew that I didn’t declare the inner array in my previous code, but I found that it didn’t matter for speed before, so I got lazy. In fact, that is why I really liked Julia: it compensated for any sloppiness. In my lab, we simulate lots of models all the time, so being able to quickly write code that runs fast is quite important. This is one of the reasons I abandoned Matlab. Julia had similar syntax and was so much faster. Is the necessity of being careful in declaring variables the future of Julia? If so, that makes it much less useful for me personally. I appreciate the elegance that Julia possesses, but I would rather have something less beautiful and more robust. Again, thanks for the help.
Carson


#4

This is a good question, and I don’t have a complete answer. There are a few basic rules to get high performance in Julia: aim for type stability; containers should always contain concrete types; don’t use matrix-vector operations on short vectors in an inner loop. All of these points and more are described in the section of the manual on performance. These three rules cover maybe 85% of the cases.
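
To illustrate the first rule, here is a small sketch along the lines of the example in the manual's performance section (pos_unstable and pos_stable are illustrative names). A function is type-stable when the type of its result depends only on the types of its arguments, not on their values:

```julia
# Type-unstable: for a Float64 argument this returns either the Int 0
# or a Float64, depending on the value of x.
pos_unstable(x) = x < 0 ? 0 : x

# Type-stable: zero(x) has the same type as x, so the result type is
# determined by the argument type alone.
pos_stable(x) = x < 0 ? zero(x) : x

typeof(pos_unstable(-1.0))  # Int64 on a 64-bit machine
typeof(pos_unstable(1.0))   # Float64
typeof(pos_stable(-1.0))    # Float64
typeof(pos_stable(1.0))     # Float64
```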

Declaring variables is usually unnecessary for obtaining performance, with a few exceptions: the fields in a composite type should be declared, and for containers of containers, the inner container types should be declared. (Or else you can farm out all the work on the inner containers to other functions.) Even in the case of composite types and containers of containers, you can write generic high-performance code by using type parameters. If you have a basic understanding of how dispatch works in Julia, then you can develop intuition about what kinds of code will cause a loss of performance.
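
A rough sketch of the composite-type case, in current syntax (struct replaces the type/immutable keywords of the 0.4/0.5 era). The type names and the rates field are invented for illustration:

```julia
# Undeclared field: stored as `Any`, so every use of m.rates is resolved
# at run time.
struct LooseModel
    rates
end

# Declared concrete field: fast, but only accepts Vector{Float64}.
struct FixedModel
    rates::Vector{Float64}
end

# Type parameter: the field type is concrete for each instance, yet the
# definition stays generic over the kind of vector stored.
struct Model{T<:AbstractVector}
    rates::T
end

total_rate(m) = sum(m.rates)

m64 = Model([0.1, 0.2])           # Model{Vector{Float64}}
m32 = Model(Float32[0.1, 0.2])    # Model{Vector{Float32}}

fieldtype(LooseModel, :rates)     # Any
fieldtype(typeof(m64), :rates)    # Vector{Float64}
```

With the parametric version, total_rate is compiled separately for each concrete Model{T}, so nothing has to be hard-coded to Float64 to keep the field access fast.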

The tools provided for checking and boosting performance (profiling, @code_warntype, etc.) are really helpful.
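
For example, a minimal way to check a function for inference problems, assuming a current Julia version (first_sum, loose and tight are illustrative names):

```julia
using InteractiveUtils  # provides @code_warntype outside the REPL
using Test              # provides @inferred

first_sum(az) = sum(az[1])

loose = Array[[1.0, 2.0]]            # Vector{Array}: abstract element type
tight = Vector{Float64}[[1.0, 2.0]]  # Vector{Vector{Float64}}: concrete

# Prints the inferred types; entries the compiler could not pin down to a
# concrete type (such as Any) are highlighted in the output.
@code_warntype first_sum(loose)
@code_warntype first_sum(tight)

# @inferred returns the result if the inferred return type matches the
# actual one, and throws an error otherwise.
@inferred first_sum(tight)
```

Running @code_warntype on the hot function reported by the profiler is usually the quickest way to see whether a loosely typed container is the culprit.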