Why does Juno make my code run slower than in VScode?


#1

I have the below code and I run it in Juno it takes 3.7 seconds and I run it in VScode it’s 1.48 seconds.

Is the issue Juno? Does Juno intercept the results in such a way to make it run slower?

Both are running Julia 0.6.2 on Windows 10 64 bit.

# Pkg.clone("https://github.com/xiaodaigh/FastGroupBy.jl.git")
using FastGroupBy

const M=10_000_000; const K=100
srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:8))...) for k in 1:M÷K], M)
# using FastGroupBy.radixsort! to sort strings of length 8
@time radixsort!(svec1) # 3 seconds on 10m
issorted(svec1)

[ANN] Juno 0.6.7
#2

I’m not 100% up to date with juno, but when I was testdriving it vs VSCode, it seemed to me that there was more stuff happening in the background with debugging in mind, whereas in VSCode its just sending lines to a bare repl/terminal window. For example one of the features is highlighting the line of a thrown error in the source code. This implies that there are some hooks into the running processes = performance losses.


#3

Juno does run some things in-process, but not while your code is running (you’ll notice that things like autocomplete don’t change while Julia is busy). Unless your code performs IO or otherwise yields, it will run until completion on its own.

So I’m not sure why this is happening. It may have to do with Julia being run as a subprocess; it’d be good to figure out what’s going on here anyhow.


#4

BTW really appreciate the work that goes into making Juno, I can see alot of potential there.


#5

Alright, I looked into this for the last couple of hours, but couldn’t really figure out the cause for this.

Benchmark script:

using FastGroupBy, BenchmarkTools

const M=10_000_000; const K=100

function bench()
  rng = MersenneTwister(1)
  sv = rand(rng, [string(rand(rng, Char.(32:126), rand(rng, 1:8))...) for k in 1:M÷K], M)
  b = @benchmarkable radixsort!(v) setup = (v = copy($sv)) seconds = 60 gcsample = true
  run(b)
end

In Juno:

julia> bench()
BenchmarkTools.Trial:
  memory estimate:  305.20 MiB
  allocs estimate:  61
  --------------
  minimum time:     3.640 s (26.60% GC)
  median time:      3.812 s (26.02% GC)
  mean time:        3.791 s (26.11% GC)
  maximum time:     3.899 s (26.33% GC)
  --------------
  samples:          13
  evals/sample:     1

In the REPL:

julia> bench()
BenchmarkTools.Trial:
  memory estimate:  305.20 MiB
  allocs estimate:  61
  --------------
  minimum time:     2.460 s (26.67% GC)
  median time:      2.979 s (27.18% GC)
  mean time:        2.990 s (27.62% GC)
  maximum time:     3.960 s (27.64% GC)
  --------------
  samples:          16
  evals/sample:     1

As far as I can tell there are no differences in the start-up options (-O2 and JULIA_NUM_THREADS=4).

I also can’t reproduce the Juno times by loading Juno.jl and Atom.jl in the REPL before benchmarking (which could have caused some kind of type instability or something like that).

Other observations:

  • Times vary by a lot when rerunning bench() in both Juno and the REPL.
  • With M=100_000, I consistently get a minimum time of 11ms in Juno and 10ms in the REPL.
  • @time sleep(1) does not show any differences between the REPL and Juno.
  • @code_**** radixsort!(v) is identical in the REPL and in Juno.

All in all this seems to be a very subtle issue – my best guess is that this is in some way caused by IPC…


#6

I have the impression that Juno is slower than REPL, Jupyter notebook or embedded Julia since one year ago (or longer), no matter with Julia v0.5 or v0.6, though I just did timing for some of my julia scripts and never did a benchmark thing. will be really happy to see the problem solved! :smiley:


#7

the timing difference could be due to svec1 being modified after radixsort! so need to regenerate svec1 after each run.


#8

No, that should be taken care of by the setup = (v = copy($sv)).


#9

I have a similar experience, but have not had time to try and see what’s going on.

When I run code in Juno it is usually 1.5x to 2x slower than running the same code as a script. This is unfortunate as I run things in Juno when I want to plot, or I want to make use of Juno’s very nice profiling visualization.

For what it’s worth, whenever I run something in Juno I notice that julia is running at 250% or so and there are two Atom Helper processes running at around 40% and 15% respectively and WindowServer running at around 40% as well. This is on osx with JULIA_NUM_THREADS=4 and Optimization Level 3, so that the script would run julia at close to 400%.


#10

Is that timed with @time or BenchmarkTools.jl? If so, can you share some code that exhibits that behaviour?

Juno does incur some overhead for displaying results etc, but it’s very unexpected that just evaluating code (as in the OP) takes longer.


#11

Previously I did it using @time. A few weeks ago I stopped timing anything in Juno and moved to scripts because of the discrepancy so I don’t have very recent experience.

Here are some results using BenchmarkTools.jl, which may not actually be as informative. This not a MWE, I would be happy to run something specific to see if there is any issue with my configuration. These are from running benchmarks for SequentialMonteCarlo.jl. The exact same code can be run by

Pkg.add("SequentialMonteCarlo"); Pkg.checkout("SequentialMonteCarlo")
Pkg.add("SMCExamples"); Pkg.checkout("SMCExamples")

and then running the equivalent of

JULIA_NUM_THREADS=4 julia -O3 test/runbenchmarks.jl

in Juno or as a script

in the SequentialMonteCarlo folder. The time differences are smaller than I remember them being but also quite variable; the more reliable results may be for the longer running functions where the factor is around 1.3. Perhaps this is because @btime reports the minimum recorded time and we may be more interested here in average time; we’re not really benchmarking the function when trying to compare execution environments. It also seems that running the file in Juno is better than running line-by-line, although usually when I’m using Juno to run code I am running line-by-line.

Running the script (as also listed https://awllee.github.io/SequentialMonteCarlo.jl/latest/bench.html):

x86_64-apple-darwin14.5.0
Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz (haswell) ; 2 Physical, 4 Logical
Julia 0.6.2 using 4 threads
Linear Gaussian Model, n = 10
log₂N  Benchmark
10     300.061 μs (58 allocations: 2.13 KiB)
15     7.858 ms (58 allocations: 2.13 KiB)
20     339.166 ms (58 allocations: 2.13 KiB)
Multivariate Linear Gaussian Model, d = 10, n = 10
log₂N  Benchmark
10     1.969 ms (58 allocations: 2.13 KiB)
15     65.497 ms (58 allocations: 2.13 KiB)
20     2.175 s (58 allocations: 2.13 KiB)
SMC Sampler Example, n = 12
log₂N  Benchmark
10     3.926 ms (70 allocations: 2.56 KiB)
15     128.009 ms (70 allocations: 2.56 KiB)
20     4.325 s (70 allocations: 2.56 KiB)
Lorenz96 Example, d = 8, n = 10
log₂N  Benchmark
10     13.381 ms (58 allocations: 2.13 KiB)
15     436.527 ms (58 allocations: 2.13 KiB)
20     14.559 s (58 allocations: 2.13 KiB)

Running the file in Juno:

x86_64-apple-darwin14.5.0
Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz (haswell) ; 2 Physical, 4 Logical
Julia 0.6.2 using 4 threads
Linear Gaussian Model, n = 10
log₂N  Benchmark
10     330.098 μs (58 allocations: 2.13 KiB)
15     8.011 ms (58 allocations: 2.13 KiB)
20     500.772 ms (58 allocations: 2.13 KiB)
Multivariate Linear Gaussian Model, d = 10, n = 10
log₂N  Benchmark
10     2.156 ms (58 allocations: 2.13 KiB)
15     102.722 ms (58 allocations: 2.13 KiB)
20     2.779 s (58 allocations: 2.13 KiB)
SMC Sampler Example, n = 12
log₂N  Benchmark
10     3.983 ms (70 allocations: 2.56 KiB)
15     189.158 ms (70 allocations: 2.56 KiB)
20     5.750 s (70 allocations: 2.56 KiB)
Lorenz96 Example, d = 8, n = 10
log₂N  Benchmark
10     18.312 ms (58 allocations: 2.13 KiB)
15     531.985 ms (58 allocations: 2.13 KiB)
20     18.683 s (58 allocations: 2.13 KiB)
Finished: Wed, 20 Dec 2017 16:39:27

Running line by line in Juno:

x86_64-apple-darwin14.5.0
Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz (haswell) ; 2 Physical, 4 Logical
Julia 0.6.2 using 4 threads
Linear Gaussian Model, n = 10
log₂N  Benchmark
10     300.117 μs (58 allocations: 2.13 KiB)
15     7.963 ms (58 allocations: 2.13 KiB)
20     569.123 ms (58 allocations: 2.13 KiB)
Multivariate Linear Gaussian Model, d = 10, n = 10
log₂N  Benchmark
10     1.996 ms (58 allocations: 2.13 KiB)
15     118.719 ms (58 allocations: 2.13 KiB)
20     3.255 s (58 allocations: 2.13 KiB)
SMC Sampler Example, n = 12
log₂N  Benchmark
10     3.835 ms (70 allocations: 2.56 KiB)
15     196.697 ms (70 allocations: 2.56 KiB)
20     5.953 s (70 allocations: 2.56 KiB)
Lorenz96 Example, d = 8, n = 10
log₂N  Benchmark
10     20.557 ms (58 allocations: 2.13 KiB)
15     602.407 ms (58 allocations: 2.13 KiB)
20     19.680 s (58 allocations: 2.13 KiB)

#12

I’m not sure this has anything to do with threads. For single-threaded benchmarks, the longer-running functions have a factor 1.1 to 1.2 difference, and perhaps the additional threads just amplify this a bit.

Running the script:

x86_64-apple-darwin14.5.0
Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz (haswell) ; 2 Physical, 4 Logical
Julia 0.6.2 using 1 threads
Linear Gaussian Model, n = 10
log₂N  Benchmark
10     489.593 μs (0 allocations: 0 bytes)
15     15.796 ms (0 allocations: 0 bytes)
20     581.948 ms (0 allocations: 0 bytes)
Multivariate Linear Gaussian Model, d = 10, n = 10
log₂N  Benchmark
10     3.445 ms (0 allocations: 0 bytes)
15     116.387 ms (0 allocations: 0 bytes)
20     3.836 s (0 allocations: 0 bytes)
SMC Sampler Example, n = 12
log₂N  Benchmark
10     8.501 ms (0 allocations: 0 bytes)
15     267.428 ms (0 allocations: 0 bytes)
20     8.842 s (0 allocations: 0 bytes)
Lorenz96 Example, d = 8, n = 10
log₂N  Benchmark
10     27.231 ms (0 allocations: 0 bytes)
15     897.206 ms (0 allocations: 0 bytes)
20     29.114 s (0 allocations: 0 bytes)

Running the file in Juno:

x86_64-apple-darwin14.5.0
Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz (haswell) ; 2 Physical, 4 Logical
Julia 0.6.2 using 1 threads
Linear Gaussian Model, n = 10
log₂N  Benchmark
10     482.722 μs (0 allocations: 0 bytes)
15     18.235 ms (0 allocations: 0 bytes)
20     662.377 ms (0 allocations: 0 bytes)
Multivariate Linear Gaussian Model, d = 10, n = 10
log₂N  Benchmark
10     3.487 ms (0 allocations: 0 bytes)
15     139.896 ms (0 allocations: 0 bytes)
20     4.602 s (0 allocations: 0 bytes)
SMC Sampler Example, n = 12
log₂N  Benchmark
10     8.228 ms (0 allocations: 0 bytes)
15     308.289 ms (0 allocations: 0 bytes)
20     10.025 s (0 allocations: 0 bytes)
Lorenz96 Example, d = 8, n = 10
log₂N  Benchmark
10     28.490 ms (0 allocations: 0 bytes)
15     993.012 ms (0 allocations: 0 bytes)
20     32.713 s (0 allocations: 0 bytes)

#13

Isn’t it simply a possibility that Juno takes CPU time (and L1, L2 etc. cache) from you? Because it’s continuously running something. It could just as well be your web browser taking CPU away.

Isn’t 10-20% within the margin of error you should expect? Even if Juno stopped taking CPU, just before running your code, and would freeze UI until your code gave control back, having taken CPU before would have made the CPU hotter and CPU throttling more likely.


#14

Thanks for the detailed observations, @alee!

My best guess right now is that there’s nothing fundamentally wrong with Juno’s evaluation model (phew) – benchmarking high performance code is just very sensitive to CPU load, and Atom isn’t very light regarding that.

With the original radixsort! benchmark I could reproduce the bad Juno timings in the REPL if I play around in Atom at the same time. Furthermore, turning off e.g. the progress bar animation brings significant performance gains in Juno… This is consistent with the fact that sleep(1) (which hardly requires any CPU power) doesn’t take longer in Juno; it is also consistent with @alee’s observation regarding inline evaluation being slower than file evaluation (the former has an additional animated element).

Conclusion: Let’s turn off animations and find a suitable replacement. Let’s also try to make the Juno UI lighter (though I’m not sure if there are any low hanging fruit). In hindsight it’s also not very surprising that benchmarking well optimized code is sensitive to other process taking CPU time.


#15

This makes a lot of sense now. In fact, now I can replicate very poor timing by doing something that I suppose one shouldn’t do because of the animated elements (the spinning gears).

Sometimes, I shift-enter lots of lines because I want some computations to finish while I do something else. It appears that the resulting long list of gears can make 2 Atom Helpers + WindowServer account for much larger CPU loads…and I get performance decreases that are far more substantial than what I listed above (where I stepped line-by-line one line at a time).


#16

Indeed.

There have been complaints about this before (usually in the context of working remotely), so there already is an issue here. I’ll make sure to think of something to optimise this.


#17

yeah I looked back at my blogpost that did the benchmarks vs R’s data.table. Even though the Julia version was faster I had noticed 2-3× speed up when running some benchmarks in VScode. So Julia was ahead by even more than I thought