[solved] How to load modules on demand, to speed up a script?

I want to speed up running a Julia script. Some of the code is executed rarely but requires heavy libraries.

The problem is

ERROR: LoadError: syntax: "import" expression not at top level.

`eval` doesn’t work either; it complains about some world age counter.

Any solution?

using PyCall, DataFrames

function cached(get::Function, id::AbstractString)
  get() # In reality it will be serialised to disk
end

data = cached("data") do
  import Pandas # <== Error
  py"""
  import numpy as np
  import pandas as pd
  """
  df_py = pyeval("""pd.DataFrame({ "id": np.arange(1, 11), "value": np.random.randn(10), })""")
  DataFrame(Pandas.Pandas.DataFrame(df_py))
end

println(data)

P.S. In this specific case it’s possible to avoid using Pandas, e.g. by going Python → CSV → Julia. But it’s a common pattern, and it would be nice to know how to solve it.

You can’t have `using` or `import` in a function’s local scope; it has to be at the top level, in global scope.
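For instance, a minimal sketch using the stdlib Statistics in place of a heavy package: a top-level `if` block is still global scope, so a conditional import is legal there, while the same line inside a function body is a parse error.

```julia
# Top level is global scope, so `import` works even inside an `if`:
if true
    import Statistics
end
println(Statistics.mean([1, 2, 3]))  # 2.0

# Inside a function body the same line fails to parse:
#   function f()
#       import Statistics  # ERROR: syntax: "import" expression not at top level
#   end
```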

Is there a compelling reason to serialize? Because

using CSV, DataFrames
data_store = "/path/to/csv_file.csv"

function fetch(csv_file)
    return CSV.read(csv_file, DataFrame)
end

is pretty simple. And if you’ve got big N, an SQL backend has got to be a more robust way to go.

You can use

  eval(expr)

  Evaluate an expression in the global scope of the containing module. Every
  Module (except those defined with baremodule) has its own 1-argument
  definition of eval, which evaluates expressions in that module.

Thanks, I found an easier way to convert the data from Python:

df = data_py.load().reset_index(drop=true)
data = DataFrame(df.to_dict(orient="list"))

As for `eval`, the whole block had to be put inside `eval`:

using PyCall, DataFrames

function cached(get::Function, id::AbstractString)
  get() # In reality it will be serialised to disk
end

data = cached("data") do
  # @eval import Pandas # <= Doesn't work; the whole block has to be put in @eval

  @eval begin
    import Pandas
    py"""
    import numpy as np
    import pandas as pd
    """
    df_py = pyeval("""pd.DataFrame({ "id": np.arange(1, 11), "value": np.random.randn(10), })""")
    DataFrame(Pandas.Pandas.DataFrame(df_py))
  end
end

println(data)
  1. Imports are not just how you load packages; in fact, loading only occurs for packages when they haven’t been loaded for a session, so subsequent imports are much cheaper. Imports trade names among global scopes in Julia; eval worked to an extent because it evaluates code in the global scope. Unlike Python, names can’t be traded with local scopes.

  2. The world age counter is a byproduct of optimizing JIT compilation. Dispatched methods are compiled before they are called, so your method was compiled before the @eval import Pandas could run. The compiled method had no idea what Pandas.Pandas.DataFrame meant, let alone what methods it had, so it compiled to throwing a MethodError. When it executes, the Pandas.Pandas.DataFrame object does exist after the import, despite the compiled method not knowing about the methods. If you had executed the method a 2nd time interactively, it would work, but the method would still lag in the previous world age’s methods and often be recompiled. In order to use the methods immediately and avoid that repeated recompilation, you can use invokelatest on the call, but that sacrifices optimizations, similar to how executing code in the global scope does. Depending on what you’re doing, that sacrifice might be acceptable.
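A minimal runnable sketch of that world-age point, again with the stdlib Statistics standing in for Pandas (`demo` is a hypothetical name):

```julia
# demo() is compiled before Statistics is loaded, so a direct call to
# its methods would be compiled against the old world and throw.
function demo()
    S = @eval begin
        import Statistics
        Statistics        # @eval returns the module object
    end
    v = [1.0, 2.0, 3.0]
    # invokelatest dispatches in the newest world, so the freshly
    # loaded methods are visible (at the cost of some optimization):
    return Base.invokelatest(S.mean, v)
end

println(demo())  # 2.0
```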

There could be improvements depending on how you’re conditionally evaluating imports and code. Your example only shows an unconditional higher-order function call, so there’s no opportunity to omit anything and your @eval import and calls are a needless sacrifice. Is it possible to make a slightly bigger example with a condition?

Your example only shows an unconditional higher-order function call

The code is conditional; the real cached skips the computation if the cache exists. I omitted that in the example to keep it short.

using Dates, Serialization

function cached(get::Function, id::AbstractString)
  date_s = Dates.format(Dates.now(), dateformat"yyyy-mm-dd")
  path = "./tmp/cache/$(id)-$(date_s).jls"

  if isfile(path)
    @info "cache loading" id
    return deserialize(path)
  end

  @info "cache calculating" id
  result = get()
  mkpath(dirname(path))
  serialize(path, result)
  result
end
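For illustration, a self-contained run of this cached with a cheap `get` (the id "squares" is made up): the first call computes and serializes, and a second call the same day deserializes instead of recomputing.

```julia
using Dates, Serialization

function cached(get::Function, id::AbstractString)
  date_s = Dates.format(Dates.now(), dateformat"yyyy-mm-dd")
  path = "./tmp/cache/$(id)-$(date_s).jls"

  if isfile(path)
    @info "cache loading" id
    return deserialize(path)
  end

  @info "cache calculating" id
  result = get()
  mkpath(dirname(path))
  serialize(path, result)
  result
end

squares = cached("squares") do
  [i^2 for i in 1:5]
end
println(squares)  # [1, 4, 9, 16, 25]
```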

About the @eval: is such usage, to avoid compilation of a conditional code block, undesirable? E.g. does it make optimizing other code worse? The code in the eval block doesn’t have to be fast, but would it affect the performance of some unrelated fit_model() function?

data = cached("data") do
  # @eval import Pandas # <= Doesn't work; the whole block has to be put in @eval

  @eval begin
    import Pandas
    py"""
    import numpy as np
    import pandas as pd
    """
    df_py = pyeval("""pd.DataFrame({ "id": np.arange(1, 11), "value": np.random.randn(10), })""")
    DataFrame(Pandas.Pandas.DataFrame(df_py))
  end
end

The way cached is written there, the get input function cannot be conditional (in your call, that’s the part in the do block); it’s the get() call that is. That’s why you had to put the actual work of the do block inside @eval: to make it entirely conditioned on the call, instead of being compiled inside the get input function with the associated drawbacks.
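A minimal sketch of that distinction (`cached_stub` and the `hit` flag are hypothetical, standing in for the real cache check):

```julia
# The do-block function is passed in unconditionally, but its body
# only runs when get() is actually called.
function cached_stub(get::Function, id::AbstractString; hit::Bool)
    hit && return "from cache"   # cache hit: get() is never called
    return get()
end

val = cached_stub("data"; hit = true) do
    @eval import Statistics      # deferred: skipped entirely on a hit
    "computed"
end

println(val)  # from cache
```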

Also worth mentioning that with what is written so far, you still only define the cached function to call once, so all of this is far easier in the global scope without a function at all (for example, you can do conditional imports in top-level if blocks). But I assume you might want to call cached many times in the same script (maybe a loop over many id), and eval (and related things like include) in a function is pretty much how you’d rerun a script without dealing with method compilation.

As in Python, you should avoid conditional import schemes in a longer-lived process with enough repeated calls, because they seriously hamper maintaining more complex code for increasingly negligible or unlikely savings. For example, if you’re iterating through many id with scattered caches, you should load everything you need to deal with the uncached ones from the start. On the other hand, if you’re iterating id that either all have caches (maybe a whole directory of caches) or have none at all (filling an empty directory), you could use different scripts for the two cases to begin with, instead of paying the unnecessary overhead of repeated conditional checks.

Really depends on how much Julia code there is. In your example as stated, the @eval block doesn’t directly do much in Julia so there isn’t much to optimize anyway. Your cached function call won’t know much about the input get method, but there isn’t much code afterward that needs to know. Again, I’d only be concerned for repeated calls in longer-lived processes; short scripts can afford to be a little rougher.


Maybe I’m missing something; the way cached is written is a standard and well-known pattern: a slow_get function that could be called at any time, and some sort of cache. The cache doesn’t know about slow_get, slow_get doesn’t know about the cache, and the main script doesn’t know about the internals of either; it’s supposed to be that way.

Suppose that slow_get is some slow call that downloads data from the network, or a slow computation (as it is in this case, done by some legacy Python code). I just wrap slow_get in cached, and that solves the problem, making the slow operation instant.

Those are all true statements, so I don’t know what you might be missing or are trying to ask.