Julia DataFrames -> plot 48 times slower than Python Pandas


#1

I use the same methods and data in both Python and Julia (except that the Python plot does much more work as far as the plot attributes are concerned). I am fairly new to both Python Pandas and Julia DataFrames and plotting.

$ time julia juliareport.jl 
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/js/.julia/v0.6/NullableArrays/src/operators.jl:99.

real	1m8.597s
user	1m7.857s
sys	0m0.816s

$ time python3 pythonreport.py 

real	0m1.402s
user	0m1.391s
sys	0m0.455s

Here is my Julia code. Please tell me why it is so slow. The dataset is small (4 columns and 78 rows, including headers).

using DataFrames
#using FreqTables
#using Query
using DBI
using PostgreSQL
using StatPlots
using CSV
gr(size=(1200,800))

function db_conn(query)
    conn = connect(Postgres, "localhost", "js",
    "513726.js","wos", 63334 )
    q = prepare(conn, query)
    result = execute(q)
    finish(q)
    disconnect(conn)
    fetchdf(result)
end

function get_world_share(query = """select broadfield, pubyear,
    za_publications::integer, world_publications from
    basic_sciences.outputs_world_share_mv
    order by pubyear;""")
    #df = db_conn(query)
    df = CSV.readtable("/home/js/db_docs/sql/projects/basic_sciences/worldshare.csv")
    df[:percentage] = map((x,y) ->  x / y * 100,
    df[:za_publications], df[:world_publications])
    pivot = unstack(df, :pubyear,:broadfield ,:percentage)
    rename!(pivot, Symbol("Biological Sciences"), :Biological_Sciences)
    rename!(pivot, Symbol("Geological Sciences"), :Geological_Sciences)
    rename!(pivot, Symbol("Computer Science"), :Computer_Science)
end

function plot_world_share(df)
    x = df[:pubyear]
    @df df plot(Int16.(x), [:Biological_Sciences :Chemistry :Geological_Sciences  :Computer_Science :Mathematics :Physics :Statistics ], xticks = 2005:2:2015)
end

plot_world_share(get_world_share())
Plots.savefig("/home/js/db_docs/sql/projects/basic_sciences/worldshare.png")

#2

I am not sure exactly what takes time in your example but you should know that Julia functions get optimized and compiled the first time they are run. Therefore, if you use a large number of packages that make calls to a large number of functions, then the first time the commands are executed in the Julia session there will be some overhead.

The recommended workflow is therefore rarely to use julia script.jl but to start a julia session and run e.g. include("script.jl"). The first time, things are compiled, the second time will be fast.
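The difference between a first and a second call can be seen in a minimal sketch like the one below. The function `summarize` and the vector `data` are hypothetical stand-ins, not part of the original script, and the exact timings will vary by machine and Julia version.

```julia
# Hypothetical stand-in for the work a script does.
function summarize(xs)
    total = 0.0
    for x in xs
        total += x / length(xs) * 100
    end
    return total
end

data = collect(1.0:78.0)

# The first call includes the time Julia needs to compile `summarize`
# for a Vector{Float64} argument.
first_call = @elapsed summarize(data)

# The second call reuses the compiled code, so it mostly measures the work itself.
second_call = @elapsed summarize(data)

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

In a long-running session the compilation cost is paid once, which is why running the script a second time with include is so much faster.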

1 min+ for first plot is, of course, a very long time and it would be interesting to analyze if there is something special that takes up the majority of that time.


#3

Thanks for your answer @kristoffer.carlsson.

OK, I have now run it twice with include("script.jl"), and the second run took 0.19 secs.

But it is impractical to wait this long for compilation every time you run a script, or to keep a REPL open permanently, if you want to use the script regularly.

I have tried a

@profile include("script.jl")

It prints a warning that the profile buffer is full. The output contains 13938 lines. It seems to recompile every package the script is “using”. There should be a better way…


#4

That shouldn’t happen. Can you post the output you get when running the using commands?


#5

0.19 secs seems like a more reasonable time. The issue here is that Plots.jl is not precompiled in the released versions, as some fear that it could potentially cause some issues. The unreleased version of Plots.jl has precompilation turned on, so you could probably speed up the first run by using the unreleased version, which you can get by running

Pkg.checkout("Plots")

in the Julia REPL.

Still, a better solution would probably be to download an IDE (e.g. Juno or the VS Code Julia extension) which you could keep more or less always open (that’s what I do anyway).

A small unrelated remark in terms of coding style, the @df macro can also take a list of columns, which could ideally simplify your life. For example if [:Biological_Sciences :Chemistry :Geological_Sciences :Computer_Science :Mathematics :Physics :Statistics ] are columns from 4 to 10, you could simply type:

@df df plot(Int16.(x), cols(4:10), xticks = 2005:2:2015)


#6

Well, even if the package is using __precompile__() there is still a lot of compilation happening at first run time.


#7

I’ve actually never understood why that is?


#8

Just because precompilation does not (and cannot) actually compile native code for all possible types which can be passed to functions. So AFAIK it only does the first steps of compilation (parsing, etc.), and compiles to native code only the methods for the particular types which are called during the precompilation (which you can control if needed to speed up common use cases).
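As a rough illustration of compiling methods for particular types, the sketch below uses Base's `precompile` function to request specific specializations. The module name `ShareTools` and the function `percentage` are hypothetical, and the `using .ShareTools` form assumes Julia 1.x syntax rather than the 0.6 release used earlier in this thread.

```julia
module ShareTools   # hypothetical module name

export percentage

# A generic method: Julia compiles a separate specialization
# for each combination of argument types it is called with.
percentage(part, whole) = part / whole * 100

# Request compilation of these two specializations ahead of the first call.
# Other argument types are still compiled on first use.
precompile(percentage, (Int, Int))
precompile(percentage, (Float64, Float64))

end # module

using .ShareTools

println(percentage(5, 20))      # Int arguments
println(percentage(5.0, 20.0))  # Float64 arguments
```

Calls with argument types that were not listed (for example `Float32`) would still trigger compilation at run time, which is the limitation described above.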


#9

Oh, thanks, I get it. So there’s a way to explicitly specify argument types for precompilation that would speed this up? (Is this with SnoopCompile?) That might have an effect on time-to-first-plot.


#10

Yes, SnoopCompile automates this by looking at which functions are compiled when running a piece of code. You can also do that manually if you want. AFAIK the main limit currently is that you cannot store precompiled functions from other modules, so that only works if most of the time is spent in the package’s code.


#11

I have timed sections of the script, and "using" takes too much time in most cases, especially with StatPlots.

Reading the data from the CSV file and forming a DataFrame then takes about 4 seconds, and plotting it takes more than 22 seconds.

WARNING: Compat.UTF8String is deprecated, use String instead.
  likely near /home/js/.atom/packages/julia-client/script/boot.jl:303
(the warning above is printed 10 times in total)

  1.316751 seconds (763.69 k allocations: 43.116 MiB, 6.73% gc time)

@time using DBI

1.275604 seconds (560.80 k allocations: 29.948 MiB, 0.71% gc time)

@time using PostgreSQL

 0.671187 seconds (115.84 k allocations: 6.716 MiB, 1.56% gc time)

@time using StatPlots

24.933359 seconds (16.47 M allocations: 1.079 GiB, 1.93% gc time)

@time using CSV

1.870682 seconds (3.28 M allocations: 188.478 MiB, 9.61% gc time)

@time gr(size=(1200,800))

1.549224 seconds (739.00 k allocations: 39.950 MiB, 1.65% gc time)

@time df = get_world_share()
3.856186 seconds (1.82 M allocations: 95.870 MiB, 1.51% gc time)

@time plot_world_share(df)

22.627836 seconds (11.13 M allocations: 599.857 MiB, 2.98% gc time)

#12

Did you check out the (precompilable) master of Plots as suggested above?


#13

I thought that I did, yes. Anyhow I did it (again?) and it brought the time for

using StatPlots

down to 12+ seconds from 22+.

Still terribly slow.

Regards
Johann


#14

It’s still slow, yes. Bringing down time-to-first-plot is a priority, but less easy than you’d think.


#15

Also, I’d suggest writing modules instead of scripts yourself if you’ll be reusing them a lot. You can precompile a module (provided everything you’re using can be precompiled).
That would only go so far, but there isn’t much of a downside.

I’ve also found the gr() backend of Plots to be much faster than R’s ggplot2, but I don’t have experience with StatPlots or Python Pandas.


#16

Slightly off topic: let’s say I have some code ("script.jl", which reads a file and writes another file) that I want to use frequently in a production environment. How should I call the code so that it is fast from the start?


#17

Put it in a module and precompile that module. I’m actually not sure if it needs to be a full package in your .julia folder for the precompilation to be persistent?
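
A minimal sketch of the "put it in a module" idea, for a script that reads one file and writes another, might look like the following. The module name `ReportTool`, the function `run_report`, and the file format (one number per line) are all hypothetical and chosen only for illustration. The sketch assumes Julia 1.x, where, as far as I know, the precompile cache is only kept between sessions for modules loaded as packages with `using` or `import` from the load path, not for modules defined inline or brought in with include.

```julia
module ReportTool   # hypothetical module name

export run_report

# Read one number per line from `inpath` and write each number's
# percentage share of the total to `outpath`. Returns the row count.
function run_report(inpath::AbstractString, outpath::AbstractString)
    nums = [parse(Float64, line) for line in eachline(inpath) if !isempty(strip(line))]
    total = sum(nums)
    open(outpath, "w") do io
        for v in nums
            println(io, v / total * 100)
        end
    end
    return length(nums)
end

end # module

using .ReportTool

# Small demonstration on temporary files.
mktempdir() do dir
    inpath = joinpath(dir, "in.txt")
    outpath = joinpath(dir, "out.txt")
    write(inpath, "1\n3\n")
    println("rows written: ", run_report(inpath, outpath))
end
```

Keeping the work inside a function in a module, instead of at the top level of a script, is what allows it to be precompiled. Compilation of code in the packages it depends on may still happen on first use.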