I want to speed up a Julia script. Some of the code is executed rarely and requires heavy libraries.
The problem is:
ERROR: LoadError: syntax: "import" expression not at top level.
Eval doesn’t work either; it complains about some world counter.
Any solution?
using PyCall, DataFrames

function cached(get::Function, id::AbstractString)
    get() # In reality it will be serialised to disk
end

data = cached("data") do
    import Pandas # <== Error
    py"""
    import numpy as np
    import pandas as pd
    """
    df_py = pyeval("""pd.DataFrame({ "id": np.arange(1, 11), "value": np.random.randn(10), })""")
    DataFrame(Pandas.Pandas.DataFrame(df_py))
end

println(data)
P.S. In this specific case it’s possible to avoid using Pandas, e.g. going Python → CSV → Julia, etc. But it’s a common pattern, and it would be nice to know how to solve it.
eval(expr)
Evaluate an expression in the global scope of the containing module. Every
Module (except those defined with baremodule) has its own 1-argument
definition of eval, which evaluates expressions in that module.
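For instance, a minimal illustration of that 1-argument, module-scoped eval (M here is just a throwaway module for the demo):

module M end
M.eval(:(x = 41))   # evaluates in M’s global scope, defining the global M.x
@assert M.x + 1 == 42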
Thanks, I found an easier way to convert the data from Python:
df = data_py.load().reset_index(drop=true)
data = DataFrame(df.to_dict(orient="list"))
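For reference, a self-contained sketch of that conversion; the data_py.load() part is whatever produces the Python object, so the pandas frame is built inline here instead:

using PyCall, DataFrames

py"""
import numpy as np
import pandas as pd
"""
df = py"pd.DataFrame({'id': np.arange(1, 11), 'value': np.random.randn(10)})"
# to_dict(orient="list") returns a plain dict of columns, which PyCall
# converts to a Julia Dict of vectors; DataFrame accepts such a Dict.
data = DataFrame(df.to_dict(orient="list"))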
As for eval, the whole block had to be put in eval:
using PyCall, DataFrames

function cached(get::Function, id::AbstractString)
    get() # In reality it will be serialised to disk
end

data = cached("data") do
    # @eval import Pandas # <= Doesn’t work; the whole block has to be put in eval
    @eval begin
        import Pandas
        py"""
        import numpy as np
        import pandas as pd
        """
        df_py = pyeval("""pd.DataFrame({ "id": np.arange(1, 11), "value": np.random.randn(10), })""")
        DataFrame(Pandas.Pandas.DataFrame(df_py))
    end
end

println(data)
Imports are not just how you load packages; in fact, loading only happens the first time a package is imported in a session, so subsequent imports are much cheaper. Imports bind names into global scopes in Julia; eval worked to an extent because it evaluates code in the global scope. Unlike Python, names can’t be bound into local scopes this way.
The world age counter is a byproduct of optimizing JIT compilation. Dispatched methods are compiled before they are called, so your method was compiled before the @eval import Pandas could run. The compiled method had no idea what Pandas.Pandas.DataFrame meant, let alone what methods it had, so it compiled to throwing a MethodError. By the time it executes, the Pandas.Pandas.DataFrame object does exist thanks to the import, but the compiled method still doesn’t know about its methods. If you executed the method a 2nd time interactively, it would work, but the method would still lag behind in the previous world age’s methods and often be recompiled. To use the methods immediately and avoid that repeated recompilation, you can use invokelatest on the call, but that sacrifices optimizations, similar to how executing code in the global scope does. Depending on what you’re doing, that sacrifice might be acceptable.
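For example, a minimal sketch of the invokelatest escape hatch (Statistics here is just a hypothetical stand-in for any lazily imported package):

function lazy_mean(xs)
    @eval import Statistics   # runs in the global scope at call time
    # This method was compiled in a world age before Statistics was
    # loaded, so calling Statistics.mean directly would throw a
    # world-age MethodError on the first call; invokelatest dispatches
    # in the newest world age instead.
    Base.invokelatest(Statistics.mean, xs)
end

lazy_mean([1, 2, 3])  # works on the first call, at the cost of dispatch optimizations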
There could be improvements depending on how you’re conditionally evaluating imports and code. Your example only shows an unconditional higher-order function call, so there’s no opportunity to omit anything and your @eval import and calls are a needless sacrifice. Is it possible to make a slightly bigger example with a condition?
Your example only shows an unconditional higher-order function call
The code is conditional; the real code for cached skips the computation if a cache exists. I omitted it in the example to keep it short.
using Dates, Serialization

function cached(get::Function, id::AbstractString)
    date_s = Dates.format(Dates.now(), dateformat"yyyy-mm-dd")
    path = "./tmp/cache/$(id)-$(date_s).jls"
    if isfile(path)
        @info "cache loading" id
        return deserialize(path)
    end
    @info "cache calculating" id
    result = get()
    mkpath(dirname(path))
    serialize(path, result)
    result
end
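A usage sketch (the closure is a hypothetical stand-in for the slow work): the first call on a given day computes and serialises the result; repeated calls deserialise it from disk instead:

result = cached("expensive") do
    sleep(2)                  # stand-in for a slow download or computation
    sum(abs2, randn(10^6))
end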
About the @eval: is such usage, to avoid compilation of a conditional code block, undesirable? Like making the optimisation of other code worse? The code in the eval block doesn’t have to be fast, but, say, would it affect the performance of some unrelated fit_model() function?
data = cached("data") do
    # @eval import Pandas # <= Doesn’t work; the whole block has to be put in eval
    @eval begin
        import Pandas
        py"""
        import numpy as np
        import pandas as pd
        """
        df_py = pyeval("""pd.DataFrame({ "id": np.arange(1, 11), "value": np.random.randn(10), })""")
        DataFrame(Pandas.Pandas.DataFrame(df_py))
    end
end
The way cached is written there, the get input function itself cannot be conditional (in your call, that’s the part in the do block); it’s only the get() call that is. That’s why you had to put the actual work of the do block inside @eval: to make it entirely conditional on the call instead of being compiled inside the get input function, with the associated drawbacks.
Also worth mentioning that, with what is written so far, you still only define the cached function to call it once, so all of this is far easier in the global scope without a function at all (for example, you can do conditional imports in top-level if blocks, as sketched below). But I assume you might want to call cached many times in the same script (maybe a loop over many ids), and eval (and related things like include) in a function is pretty much how you’d rerun a script without dealing with method compilation.
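For instance, a minimal sketch of a top-level conditional import (cache_path is hypothetical); this is legal because the if block itself sits at the top level, not inside a function:

cache_path = "./tmp/cache/data.jls"  # hypothetical cache location
if !isfile(cache_path)
    import Pandas   # the heavy package is only loaded when the cache is missing
end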
As in Python, you should avoid conditional import schemes in a longer-lived process with enough repeated calls, because they seriously hamper maintaining more complex code for increasingly negligible or unlikely savings. For example, if you’re iterating through many ids with scattered caches, you should load everything needed to deal with the uncached ones from the start. On the other hand, if you’re iterating over ids that either all have caches (maybe a whole directory of caches) or have none at all (filling an empty directory), you could use different scripts for the two cases to begin with, instead of dealing with the unnecessary overhead of repeated conditional checks.
It really depends on how much Julia code there is. In your example as stated, the @eval block doesn’t directly do much in Julia, so there isn’t much to optimize anyway. Your cached function call won’t know much about the input get method, but there isn’t much code afterward that needs to know. Again, I’d only be concerned about repeated calls in longer-lived processes; short scripts can afford to be a little rougher.
Maybe I’m missing something; the way cached is written is a standard and well-known pattern: a slow_get function that could be called at any time, plus some sort of cache. The cache doesn’t know about slow_get, slow_get doesn’t know about the cache, and the main script doesn’t know about the internals of those two either; it’s supposed to be that way.
Suppose that slow_get is some slow call downloading data from the network, or a slow computation (as it is in this case, done by some legacy Python code). I just wrap slow_get in the cache, and it solves the problem and makes the slow operation instant.