Why Pluto is much slower than Jupyter

I am trying to migrate from Jupyter notebooks to Pluto. I tested a function that I originally wrote in Jupyter; however, the same code runs about 30 times slower in Pluto than in Jupyter.

In Jupyter:


BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.972 s …   2.995 s  ┊ GC (min … max): 17.40% … 16.39%
 Time  (median):     2.983 s              ┊ GC (median):    16.90%
 Time  (mean ± σ):   2.983 s ± 16.799 ms  ┊ GC (mean ± σ):  16.90% ±  0.71%

In Pluto:

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 125.710 s (0.47% GC) to evaluate,
 with a memory estimate of 4.17 GiB, over 33552608 allocations.

Also, in the REPL it is about 10 times slower than in Jupyter, though still faster than in Pluto. Why is Pluto so slow?

By the way, I rewrote this function with Numba, and it runs twice as fast as the Julia version. How can I improve my Julia code?

The tested function is a recursive function called Tor, as follows.

function computeL(Alist::AbstractArray{T, 3}, Llist::AbstractArray{T, 3}, Z::Vector{Int}, l::Int)::AbstractArray{T, 3} where T
    subLlist = @view Llist[Z, Z, :]
    for k in 1:size(Alist)[3]
        for j in l:size(Alist)[1]
            for i in l:j-1
                subLlist[i, j, k] = (Alist[i, j, k] -  subLlist[1:i-1, i, k] ⋅ subLlist[1:i-1, j, k]) / subLlist[i, i, k]
            end
            
            subLlist[j, j, k] = sqrt(Alist[j, j, k] - subLlist[1:j-1, j, k] ⋅ subLlist[1:j-1, j, k])
        end
    end
    return subLlist
end

function recursiveTor(Alist::AbstractArray{T, 3}, Llist::AbstractArray{T, 3}, modes::Vector{Int}, n::Int)::T where T
    nmodes = length(modes)
    start = 1
    if nmodes > 0
        start = modes[end] + 1
    end
    
    N = size(Alist)[1] ÷ 2

    
    tor = 0.
    s = (-1) ^(nmodes + 1)
    
    for i in start:n
        nextmodes = [modes;i]
        l = (i - nmodes) * 2
        Z = [1:l-2;l+1:N*2]
        subAlist = @view Alist[Z, Z, :]
        subLlist = computeL(subAlist, Llist, Z, l)
        
        det = 1.0
        for L in eachslice(subLlist, dims=3)
            det *= prod(diag(L))
        end
        
        tor += s / det +  recursiveTor(subAlist, subLlist,  nextmodes, n)
    end
    return tor
end

function Tor(Vlist::AbstractArray{T, 3})::T where T
    tor = T(1.0)
    Llist = zeros(T, size(Vlist))
    for i in 1:size(Vlist)[3]
        Llist[:,:,i] = cholesky(Vlist[:,:,i]).U
        tor /= prod(diag(Llist[:,:,i]))
    end
    modes = Int[]
    return abs(tor + recursiveTor(Vlist, Llist, modes, size(Vlist)[1] ÷ 2))
end

Please provide a MWE, ideally as a Pluto notebook.

There is no inherent reason why Pluto should be slower than Julia in Jupyter or the REPL.


My guess is that memory usage in Pluto is higher and it is hitting swap, but that's only a guess.

Use your OS monitoring tools to look at memory usage ("Task Manager", top, glances, or whatever you find useful).
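For example, on Linux a quick check from a terminal could look like this (a generic sketch; on macOS, Activity Monitor or vm_stat gives similar information):

```shell
# Show memory and swap usage in human-readable units (Linux).
# If the "Swap" used column grows while the notebook runs, you are swapping.
free -h
```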

Thanks for your reply. I checked it, but that is not the reason.


Here is a short example.

using LinearAlgebra
using BenchmarkTools
using Profile

# include functions above
include("Tor.jl") 

# generate a random 40 * 40 * 1 positive-definite matrix
n = 40
A = rand(n)
A = A * A' + 2 * I
A = reshape(A, n, n, 1)

# using BenchmarkTools
@benchmark Tor(A)

The code in Jupyter and in Pluto is identical, and it gives the results above.

For me, the timings in Pluto and Julia REPL are similar (using Julia 1.8 RC1, latest Pluto Version, Win11):

Pluto:

BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.629 s …   4.653 s  ┊ GC (min … max): 8.60% … 7.88%
 Time  (median):     4.641 s              ┊ GC (median):    8.24%
 Time  (mean ± σ):   4.641 s ± 17.361 ms  ┊ GC (mean ± σ):  8.24% ± 0.51%

  █                                                       █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.63 s         Histogram: frequency by time        4.65 s <

 Memory estimate: 4.17 GiB, allocs estimate: 33552608.

REPL:

BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.116 s …    4.453 s  ┊ GC (min … max): 9.07% … 9.10%
 Time  (median):     4.285 s               ┊ GC (median):    9.09%
 Time  (mean ± σ):   4.285 s ± 238.204 ms  ┊ GC (mean ± σ):  9.09% ± 0.02%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.12 s         Histogram: frequency by time         4.45 s <

 Memory estimate: 4.17 GiB, allocs estimate: 33552608.

Notebook file:

### A Pluto.jl notebook ###
# v0.19.8

using Markdown
using InteractiveUtils

# ╔═╡ b324fe6c-e335-475d-b3a0-930fba792c2c
using LinearAlgebra

# ╔═╡ c93ded31-0603-400f-9597-9a99457aa23e
using BenchmarkTools

# ╔═╡ 1b60dca4-0743-477d-ba8f-c3d4fd3408d9
begin
	# generate a random 40 * 40 * 1 positive-definite matrix
	n = 40
	A = rand(n)
	A = A * A' + 2 * I
	A = reshape(A, n, n, 1)
end

# ╔═╡ 9271b873-9839-4780-9bb9-abc0b165ef0c
function computeL(Alist::AbstractArray{T, 3}, Llist::AbstractArray{T, 3}, Z::Vector{Int}, l::Int)::AbstractArray{T, 3} where T
    subLlist = @view Llist[Z, Z, :]
    for k in 1:size(Alist)[3]
        for j in l:size(Alist)[1]
            for i in l:j-1
                subLlist[i, j, k] = (Alist[i, j, k] -  subLlist[1:i-1, i, k] ⋅ subLlist[1:i-1, j, k]) / subLlist[i, i, k]
            end
            
            subLlist[j, j, k] = sqrt(Alist[j, j, k] - subLlist[1:j-1, j, k] ⋅ subLlist[1:j-1, j, k])
        end
    end
    return subLlist
end

# ╔═╡ fc9d7112-5f52-416a-b07d-f93ce7828d2b
function recursiveTor(Alist::AbstractArray{T, 3}, Llist::AbstractArray{T, 3}, modes::Vector{Int}, n::Int)::T where T
    nmodes = length(modes)
    start = 1
    if nmodes > 0
        start = modes[end] + 1
    end
    
    N = size(Alist)[1] ÷ 2

    
    tor = 0.
    s = (-1) ^(nmodes + 1)
    
    for i in start:n
        nextmodes = [modes;i]
        l = (i - nmodes) * 2
        Z = [1:l-2;l+1:N*2]
        subAlist = @view Alist[Z, Z, :]
        subLlist = computeL(subAlist, Llist, Z, l)
        
        det = 1.0
        for L in eachslice(subLlist, dims=3)
            det *= prod(diag(L))
        end
        
        tor += s / det +  recursiveTor(subAlist, subLlist,  nextmodes, n)
    end
    return tor
end

# ╔═╡ 0dec84da-fd05-4df9-af1e-8ca7594d7db3
function Tor(Vlist::AbstractArray{T, 3})::T where T
    tor = T(1.0)
    Llist = zeros(T, size(Vlist))
    for i in 1:size(Vlist)[3]
        Llist[:,:,i] = cholesky(Vlist[:,:,i]).U
        tor /= prod(diag(Llist[:,:,i]))
    end
    modes = Int[]
    return abs(tor + recursiveTor(Vlist, Llist, modes, size(Vlist)[1] ÷ 2))
end

# ╔═╡ 8915b18b-db50-43a8-94e3-ba6dbc57aa96
# using BenchmarkTools
@benchmark Tor(A)

# ╔═╡ 2b7f055c-6791-45e7-8cce-3203a0356d09


# ╔═╡ 00000000-0000-0000-0000-000000000001
PLUTO_PROJECT_TOML_CONTENTS = """
[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[compat]
BenchmarkTools = "~1.3.1"
"""

# ╔═╡ 00000000-0000-0000-0000-000000000002
PLUTO_MANIFEST_TOML_CONTENTS = """
# This file is machine-generated - editing it directly is not advised

julia_version = "1.8.0-rc1"
manifest_format = "2.0"
project_hash = "220220a2f5e36248d400fc3772f84f5dc05c2f4f"

[[deps.Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"

[[deps.BenchmarkTools]]
deps = ["JSON", "Logging", "Printf", "Profile", "Statistics", "UUIDs"]
git-tree-sha1 = "4c10eee4af024676200bc7752e536f858c6b8f93"
uuid = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
version = "1.3.1"

[[deps.CompilerSupportLibraries_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae"
version = "0.5.2+0"

[[deps.Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[deps.JSON]]
deps = ["Dates", "Mmap", "Parsers", "Unicode"]
git-tree-sha1 = "3c837543ddb02250ef42f4738347454f95079d4e"
uuid = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
version = "0.21.3"

[[deps.Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[deps.LinearAlgebra]]
deps = ["Libdl", "libblastrampoline_jll"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[deps.Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[deps.Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"

[[deps.OpenBLAS_jll]]
deps = ["Artifacts", "CompilerSupportLibraries_jll", "Libdl"]
uuid = "4536629a-c528-5b80-bd46-f80d51c5b363"
version = "0.3.20+0"

[[deps.Parsers]]
deps = ["Dates"]
git-tree-sha1 = "1285416549ccfcdf0c50d4997a94331e88d68413"
uuid = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"
version = "2.3.1"

[[deps.Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[deps.Profile]]
deps = ["Printf"]
uuid = "9abbd945-dff8-562f-b5e8-e1ebf5ef1b79"

[[deps.Random]]
deps = ["SHA", "Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[deps.SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"
version = "0.7.0"

[[deps.Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[deps.SparseArrays]]
deps = ["LinearAlgebra", "Random"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[deps.Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[[deps.UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[deps.Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

[[deps.libblastrampoline_jll]]
deps = ["Artifacts", "Libdl", "OpenBLAS_jll"]
uuid = "8e850b90-86db-534c-a0d3-1478176c7d93"
version = "5.1.0+0"
"""

# ╔═╡ Cell order:
# ╠═b324fe6c-e335-475d-b3a0-930fba792c2c
# ╠═c93ded31-0603-400f-9597-9a99457aa23e
# ╠═1b60dca4-0743-477d-ba8f-c3d4fd3408d9
# ╠═9271b873-9839-4780-9bb9-abc0b165ef0c
# ╠═fc9d7112-5f52-416a-b07d-f93ce7828d2b
# ╠═0dec84da-fd05-4df9-af1e-8ca7594d7db3
# ╠═8915b18b-db50-43a8-94e3-ba6dbc57aa96
# ╠═2b7f055c-6791-45e7-8cce-3203a0356d09
# ╟─00000000-0000-0000-0000-000000000001
# ╟─00000000-0000-0000-0000-000000000002

I shut down my Jupyter kernel, and then Pluto became faster. It seems Jupyter was consuming too many resources, though my CPU and RAM were far from fully occupied. I don't know why.

I'm seeing:

(three benchmark screenshots)

But there's a lot of variation in these timings: I can get anything from 2 to 5 seconds in all three setups. If I were you, I'd probably be more worried about the large number of allocations than about the variation in runtime between different coding environments.
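To see why allocations matter here, a hypothetical micro-example (not the thread's function) comparing a copying slice against a view with @allocated:

```julia
# Compare the allocation cost of a copying slice vs. a view.
x = rand(1000)

slice_sum(v) = sum(v[1:500])         # v[1:500] allocates a copy
view_sum(v)  = sum(@view v[1:500])   # the view reuses v's memory

slice_sum(x); view_sum(x)            # warm up so compilation isn't measured
a_slice = @allocated slice_sum(x)
a_view  = @allocated view_sum(x)
println((a_slice, a_view))           # the slice version allocates far more
```

Each copying slice in a hot loop pays this cost on every iteration, which is where allocation counts in the tens of millions come from.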

Thanks for the reminder. I noticed this problem and measured the memory allocations using Profile, but I don't know how to improve it. Can you give some suggestions?

        - ### A Pluto.jl notebook ###
        - # v0.19.8
        - 
        - using Markdown
        - using InteractiveUtils
        - 
        - # ╔═╡ 63313008-e72b-11ec-0626-81fc3fb17195
        - begin
        - 	using LinearAlgebra
        - 	using LoopVectorization
        - 	using Distributed
        - 	using BenchmarkTools
        - 	using Profile
        - end
        - 
        - # ╔═╡ 0d541ca3-adcb-478b-8620-0e03798ccac7
        - function computeL(Alist::AbstractArray{T, 3}, Llist::AbstractArray{T, 3}, Z::Vector{Int}, l::Int)::AbstractArray{T, 3} where T
        - # function computeL(Alist, Llist, Z, l)
        - #     print(typeof(Alist))
469907680     subLlist = @view Llist[Z, Z, :]
        0     for k in 1:size(Alist)[3]
        0         for j in l:size(Alist)[1]
        0             for i in l:j-1
        0                 subLlist[i, j, k] = (Alist[i, j, k] -  subLlist[1:i-1, i, k] ⋅ subLlist[1:i-1, j, k]) / subLlist[i, i, k]
        0             end
        -             
        0             subLlist[j, j, k] = sqrt(Alist[j, j, k] - subLlist[1:j-1, j, k] ⋅ subLlist[1:j-1, j, k])
        0         end
        0     end
        0     return subLlist
        - end
        - 
        - # ╔═╡ 373ddc83-2ffb-4c97-b4a6-d25957d8f85a
        - function recursiveTor(Alist::AbstractArray{T, 3}, Llist::AbstractArray{T, 3}, modes::Vector{Int}, n::Int)::T where T
        - # function recursiveTor(Alist, Llist, modes, n)
        0     nmodes = length(modes)
        -     start = 1
        0     if nmodes > 0
        0         start = modes[end] + 1
        -     end
        -     
        0     N = size(Alist)[1] ÷ 2
        - 
        -     
        -     tor = 0.
        0     s = (-1) ^(nmodes + 1)
        -     
        - #     l = size(subLlist)[1] ÷ 2 - 1
        0     for i in start:n
 33554400         nextmodes = [modes;i]
        0         l = (i - nmodes) * 2
234961200         Z = [1:l-2;l+1:N*2]
469907680         subAlist = @view Alist[Z, Z, :]
        0         subLlist = computeL(subAlist, Llist, Z, l)
        -         
        -         det = 1.0
469922400         for L in eachslice(subLlist, dims=3)
        0             det *= prod(diag(L))
        0         end
        -         
        0         tor += s / det +  recursiveTor(subAlist, subLlist,  nextmodes, n)
        - #         l -= 1
        0     end
        0     return tor
        - end
        - 
        - # ╔═╡ 21922284-fe90-46be-b0a5-3abf4cd3250e
        - function Tor(Vlist::AbstractArray{T, 3})::T where T
        -     tor = T(1.0)
    12928     Llist = zeros(T, size(Vlist))
        0     for i in 1:size(Vlist)[3]
    12928         Llist[:,:,i] = cholesky(Vlist[:,:,i]).U
        0         tor /= prod(diag(Llist[:,:,i]))
        0     end
       64     modes = Int[]
        0     return abs(tor + recursiveTor(Vlist, Llist, modes, size(Vlist)[1] ÷ 2))
        - end
        - 
        - # ╔═╡ 81590e7a-ab7f-4b35-8d2e-0f841b833bde
        - begin
        - 	n = 10
        - 	A = rand(n)
        - 	A = A * A' + 2 * I
        - 	A = reshape(A, n, n , 1)
        - end
        - 
        - # ╔═╡ f2c77762-9d1d-4329-aabc-2c4b7ed94465
        - begin
        - 	Tor(A); 
        - 	
        - 	Profile.clear_malloc_data() 
        - 	
        - 	@profile Tor(A);
        - end

Also, speed is critical because the time cost grows exponentially, and I may use this function on inputs larger than 40 * 40 * 1. In fact, I have tried this function with Numba, C++ (Eigen), and Julia; so far C++ and Numba take about the same time, and Julia is a bit slower.

Probably worth its own thread. Have you also profiled with respect to runtime?

subLlist[1:i-1, i, k] will copy, so you probably want to use views there to reduce allocations.
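For instance, wrapping the loops in @views turns those slices into views (a sketch of the idea, not a tested drop-in replacement; the name computeL_views and the test setup are mine):

```julia
using LinearAlgebra

# Same algorithm as computeL in the thread, but with @views so that
# expressions like subLlist[1:i-1, i, k] index without copying.
function computeL_views(Alist::AbstractArray{T,3}, Llist::AbstractArray{T,3},
                        Z::Vector{Int}, l::Int) where T
    subLlist = @view Llist[Z, Z, :]
    @views for k in 1:size(Alist, 3)
        for j in l:size(Alist, 1)
            for i in l:j-1
                subLlist[i, j, k] = (Alist[i, j, k] -
                    subLlist[1:i-1, i, k] ⋅ subLlist[1:i-1, j, k]) / subLlist[i, i, k]
            end
            subLlist[j, j, k] = sqrt(Alist[j, j, k] -
                subLlist[1:j-1, j, k] ⋅ subLlist[1:j-1, j, k])
        end
    end
    return subLlist
end
```

With @views the per-iteration dot products no longer copy the columns; the remaining allocations come from building Z, nextmodes, and the view wrappers in each recursive call.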

I have specifically tested the vector dot function you mentioned. It only has a few memory allocations.

BenchmarkTools.Trial: 10000 samples with 185 evaluations.
 Range (min … max):  565.265 ns …  1.481 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     579.686 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   607.529 ns ± 66.937 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▆▅▄▄▄▅▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁▁                                     ▂
  ███████████████████████████████████▇█▇▇▆▇▆▆▇▆▅▆▆▆▆▅▅▅▆▆▅▅▄▄▅ █
  565 ns        Histogram: log(frequency) by time       883 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.