I use the same methods and data in both Python and Julia (except that the Python plot involves much more work as far as the plot attributes are concerned). I am fairly new to both Python Pandas and Julia DataFrames, and to plotting.
$ time julia juliareport.jl
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/js/.julia/v0.6/NullableArrays/src/operators.jl:99.
real 1m8.597s
user 1m7.857s
sys 0m0.816s
$ time python3 pythonreport.py
real 0m1.402s
user 0m1.391s
sys 0m0.455s
Here is my Julia code. Please tell me why it is so slow. (The dataset is small: 4 columns and 78 rows including the header.)
using DataFrames
#using FreqTables
#using Query
using DBI
using PostgreSQL
using StatPlots
using CSV
gr(size=(1200,800))
function db_conn(query)
    conn = connect(Postgres, "localhost", "js",
                   "513726.js", "wos", 63334)
    q = prepare(conn, query)
    result = execute(q)
    finish(q)
    disconnect(conn)
    fetchdf(result)
end
function get_world_share(query = """select broadfield, pubyear,
        za_publications::integer, world_publications from
        basic_sciences.outputs_world_share_mv
        order by pubyear;""")
    #df = db_conn(query)
    df = CSV.readtable("/home/js/db_docs/sql/projects/basic_sciences/worldshare.csv")
    df[:percentage] = map((x, y) -> x / y * 100,
                          df[:za_publications], df[:world_publications])
    pivot = unstack(df, :pubyear, :broadfield, :percentage)
    rename!(pivot, Symbol("Biological Sciences"), :Biological_Sciences)
    rename!(pivot, Symbol("Geological Sciences"), :Geological_Sciences)
    rename!(pivot, Symbol("Computer Science"), :Computer_Science)
end
function plot_world_share(df)
    x = df[:pubyear]
    @df df plot(Int16.(x), [:Biological_Sciences :Chemistry :Geological_Sciences :Computer_Science :Mathematics :Physics :Statistics], xticks = 2005:2:2015)
end
plot_world_share(get_world_share())
Plots.savefig("/home/js/db_docs/sql/projects/basic_sciences/worldshare.png")
I am not sure exactly what takes time in your example, but you should know that Julia functions are optimized and compiled the first time they are run. Therefore, if you use a large number of packages that call a large number of functions, there will be some overhead the first time those commands are executed in a Julia session.
The recommended workflow is therefore usually not to run julia script.jl, but to start a Julia session and run e.g. include("script.jl"). The first time, things are compiled; the second time will be fast.
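For example, the difference can be seen by timing the include call twice in the same session (a minimal sketch; script.jl stands for your script above):

julia> @time include("script.jl")   # first call: package loading and compilation dominate
julia> @time include("script.jl")   # second call: everything is already compiled, so it is fast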
More than a minute for the first plot is, of course, a very long time, and it would be interesting to analyze whether something specific takes up the majority of that time.
OK, now I did run it twice with include("script.jl"), and the second time it took 0.19 seconds.
But it is impractical to have a script compile for such a long time every time you want to use it, or to keep a REPL open permanently if you want to use a script regularly.
I have tried
@profile include("script.jl")
It prints a warning that the profile buffer is full. The output contains 13938 lines. It seems to recompile every package the script is "using". There should be a better way...
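(If the profile buffer fills up, it can be enlarged before profiling. A hedged sketch, assuming the standard Profile API; on Julia 1.x it is a standard library, on 0.6 it lives in Base:)

using Profile                         # not needed on Julia 0.6, where Profile is part of Base
Profile.init(n = 10^7, delay = 0.01)  # larger sample buffer, coarser sampling interval
@profile include("script.jl")
Profile.print(maxdepth = 12)          # keep the printed call tree manageable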
0.19 seconds seems like a more reasonable time. The issue here is that Plots.jl is not precompiled in the released versions, since there is some fear that it could cause issues. The unreleased version of Plots.jl has precompilation turned on, so you could probably speed up the first run by using it, which you can get by running
Pkg.checkout("Plots")
in the Julia REPL.
Still, a better solution would probably be to use an IDE (e.g. Juno or the VS Code Julia extension) which you could keep more or less always open (that's what I do anyway).
A small unrelated remark on coding style: the @df macro can also take a list of columns, which could simplify your life. For example, if [:Biological_Sciences :Chemistry :Geological_Sciences :Computer_Science :Mathematics :Physics :Statistics] are columns 4 to 10, you could simply type:
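(A minimal sketch, assuming the cols selector supported by the @df macro and the hypothetical 4:10 column range mentioned above:)

@df df plot(Int16.(x), cols(4:10), xticks = 2005:2:2015)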
That is because precompilation does not (and cannot) actually compile native code for all possible types which can be passed to functions. So AFAIK it only does the first steps of compilation (parsing, etc.) and compiles to native code only the methods for the particular types that are called during precompilation (which you can control, if needed, to speed up common use cases).
Oh, thanks, I get it. So there's a way to explicitly specify argument types for precompilation that would speed this up? (Is this what SnoopCompile is for?) That might have an effect on time-to-first-plot.
Yes, SnoopCompile automates this by looking at which functions are compiled while running a piece of code. You can also do it manually if you want. AFAIK the main limitation currently is that you cannot store precompiled functions from other modules, so this only works if most of the time is spent in the package's own code.
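(For the manual route, a minimal hedged sketch with hypothetical names: a package can add precompile directives for the concrete argument types it expects, so those methods are compiled during precompilation rather than on first use:)

__precompile__()        # opt in to precompilation (implicit for packages on Julia 1.x)
module MyPkg            # hypothetical package

share(za::Int, world::Int) = za / world * 100

# ask for this method to be compiled for these concrete argument types during precompilation
precompile(share, (Int, Int))

end # module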
I have timed sections of the script, and it seems that in most cases the "using" statements take too much time, especially with StatPlots.
Getting the data from the CSV file and forming a DataFrame then takes more than 4 seconds, and plotting it more than 22 seconds.
WARNING: Compat.UTF8String is deprecated, use String instead.
likely near /home/js/.atom/packages/julia-client/script/boot.jl:303
(this warning is repeated ten times)
@time using DataFrames
1.316751 seconds (763.69 k allocations: 43.116 MiB, 6.73% gc time)
@time using DBI
1.275604 seconds (560.80 k allocations: 29.948 MiB, 0.71% gc time)
@time using PostgreSQL
0.671187 seconds (115.84 k allocations: 6.716 MiB, 1.56% gc time)
@time using StatPlots
24.933359 seconds (16.47 M allocations: 1.079 GiB, 1.93% gc time)
@time using CSV
1.870682 seconds (3.28 M allocations: 188.478 MiB, 9.61% gc time)
@time gr(size=(1200,800))
1.549224 seconds (739.00 k allocations: 39.950 MiB, 1.65% gc time)
@time df = get_world_share()
3.856186 seconds (1.82 M allocations: 95.870 MiB, 1.51% gc time)
@time plot_world_share(df)
22.627836 seconds (11.13 M allocations: 599.857 MiB, 2.98% gc time)
Also, I'd suggest writing modules instead of scripts if you'll be reusing them a lot. You can precompile a module (provided everything you're using can be precompiled).
That would only go so far, but there isn’t much of a downside.
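(A minimal sketch of that suggestion, with hypothetical names; the module would wrap the definitions from the script and expose one entry point:)

__precompile__()          # opt the module in to precompilation (implicit on Julia 1.x)
module WorldShare         # hypothetical module name

using DataFrames, CSV, StatPlots

# hypothetical file containing the db_conn / get_world_share / plot_world_share
# definitions from the script above
include("worldshare_functions.jl")

# single entry point: using WorldShare; WorldShare.make_report()
make_report() = plot_world_share(get_world_share())

end # module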
I’ve also found the gr() backend of Plots to be much faster than R’s ggplot2, but I don’t have experience with StatsPlots or Python Pandas.
Slightly off topic: let's say I have a script ("script.jl", which reads a file and outputs another file) that I want to use frequently in a production environment. How should I call it so that it is fast from the start?
Put it in a module and precompile that module. I’m actually not sure if it needs to be a full package in your .julia folder for the precompilation to be persistent?
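(A hedged sketch of invoking it from the shell, reusing the hypothetical WorldShare module from the sketch above and assuming it is on the load path; the first load builds the precompile cache, and later julia processes reuse the cached file when loading the module:)

$ julia -e 'using WorldShare; WorldShare.make_report()'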