Efficiency of calling Julia from Python versus running Julia directly

I have the following Julia code:

using BenchmarkTools, LoopVectorization, TensorOperations


function test7(A)
    @tensor B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[j,i,k,l]
    B
end    

@btime test7(A) setup=(n=30; A=rand(Float64,(n,n,n,n)))

If I save it as test.jl and run it, the result is
2.769 ms (21 allocations: 6.18 MiB).

I can call the above code from Python with the following script, saved

import numpy as np
import time

#from julia.api import Julia
#jl = Julia(compiled_modules=False)
#import julia


na = 30
A = np.random.random((na,na,na,na))

from julia import Main
Main.include("test.jl")

start = time.time()
B = Main.test7(A)
end = time.time()
print(end-start)

n_max = 20
start = time.time()
for i in range(n_max):
    B = Main.test7(A)
end = time.time()
print((end-start)/n_max)

start = time.time()
for i in range(n_max):
    B = Main.test7(A)
end = time.time()
print((end-start)/n_max)

start = time.time()
for i in range(n_max):
    B = 0.1*A + 0.2*np.transpose(A, (1,0,2,3))
end = time.time()
print('python', (end-start)/n_max)


as call_tensor.py, run as python-jl call_tensor.py. (Plain python gave “Your Python interpreter … is statically linked to libpython. Currently, PyJulia does not fully support such Python interpreters”, so I ran it this way.)

and I got

  2.718 ms (21 allocations: 6.18 MiB)
1.042999029159546
0.3651303887367249
0.49322196245193484
python 0.18051644563674926

My question concerns real-life calculations. The time measured from print((end-start)/n_max) is one order of magnitude higher than the @btime result, and also slower than the numpy transpose. I assumed Julia would be faster.

I have heard that Julia compiles just-in-time, compiling and running at the same time, so a function called a second time is faster than the first call. But I measured the timing several times in my Python code, and all measurements are one order of magnitude slower. Is there any solution?

Could it be that I should convert the numpy array into Julia's format? If so, I would also need to convert Julia arrays back to numpy many times for further use. How fast is this type of conversion?
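One crude way I can think of to measure this (a sketch based on the setup above; Julia's identity function is used so that only the numpy → Julia → numpy conversion is timed, not any computation):

import time
import numpy as np
from julia import Main

A = np.random.random((30, 30, 30, 30))

Main.identity(A)                  # warm-up call, so compilation/dispatch overhead is excluded
start = time.time()
B = Main.identity(A)              # Julia does no work here; only conversion is timed
print("round trip:", time.time() - start)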

Milliseconds are small, and you are including the Python-managed for loop in the timing. Have you tried a single measurement, perhaps on a bigger matrix?

1 Like

you are including the Python-managed for loop in the timing.

I added a comparison with numpy transpose; both the Julia call and the numpy version are timed inside loops. The numpy transpose is still faster :frowning: For na = 50 (modified in both the .py and .jl files), I got

  84.856 ms (21 allocations: 47.69 MiB)
1.1065576076507568
0.33703491687774656
0.34638277292251585
python 0.17203961610794066

If I remove the loops and set n_max = 1, I get

  82.875 ms (21 allocations: 47.69 MiB)
1.3454747200012207
0.4381554126739502
0.4645373821258545
python 0.23416924476623535

The main issue is that I plan to call Julia from Python to accelerate my Python code (a rewrite in Julia is less practical; the code has lots of array calculations with not very large dimensions). If the call itself takes longer, it is not that worthwhile.

There is another mechanism: use PackageCompiler.jl and export a C function pointer to a Julia function via @cfunction. You can then call this C function pointer from Python.

@mlxd put together a presentation regarding this approach:

3 Likes

That’s fantastic!

Though, I would like to avoid C/C++ as much as possible and just interact via Python-Julia… In any case, PackageCompiler.jl looks formidable to me. Perhaps I need a more elementary tutorial/example than Home · PackageCompiler.

You won’t get a large speedup (or any) if you are using Julia to replace vectorized numpy operations.

To use Julia effectively, you need to see whether what you need to compute is not natural to code with numpy, and whether a custom kernel for your problem may be faster. This may not be obvious at first sight.

4 Likes

Thank you so much for pointing that out.

The motivation, from what I have heard and some tests, is:

TensorOperations.jl is often the fastest way to do permutedims on larger arrays. It has a smarter cache-friendly blocking algorithm than the one in Base. But how much this matters of course depends on size & permutation.

My goal involves a series of array additions and permutations; it is complicated to optimize memory-cache behavior to improve performance in numpy/Fortran. TensorOperations.jl has features for this.

In one example in the above link, the timing from Python is 4.956309795379639 ms and from Julia is 2.585 ms (21 allocations: 6.18 MiB), about a factor of 2 better. I observed a similar pattern in other cases.

Try this one:

It basically boils down to:

using PackageCompiler
create_sysimage(["Plots"], sysimage_path="sys_plots.so", precompile_execution_file="precompile_plots.jl")

This may be unavoidable given your requirements. The bottleneck you are facing is the latency for Python to dynamically call native code, in this case Julia’s C API. To decrease this latency, you essentially need to build a C based extension for Python to call the Julia code directly.

You might try to use ctypes — A foreign function library for Python — Python 3.12.0 documentation to call the C function pointer exported for your Julia code, but this is essentially what PyCall is trying to do already. You might be able to optimize this though.
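To sketch the idea (the libjulia path below is an assumption for your system; jl_init and jl_eval_string are part of Julia's documented embedding API, and with a custom system image you would initialize via jl_init_with_image instead):

import ctypes

# Load the Julia runtime library; adjust the path for your installation.
jl = ctypes.CDLL("/usr/lib/libjulia.so", mode=ctypes.RTLD_GLOBAL)
jl.jl_init()                                       # start the embedded Julia runtime
jl.jl_eval_string(b'println("hello from Julia")')  # evaluate arbitrary Julia code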

Another approach would be to use Cython, dlopen a system image made with PackageCompiler.jl, dlsym to get the C function pointer, and then invoke it.

1 Like

Thanks. I tried to follow Creating a sysimage for fast plotting with Plots.jl · PackageCompiler and failed at create_sysimage(["Plots"], sysimage_path="sys_plots.so", precompile_execution_file="precompile_plots.jl"). (I had created precompile_plots.jl.)

I got

connect: Connection refused
GKS: can't connect to GKS socket application

GKS: Open failed in routine OPEN_WS

and leads to

⠸ [01m:02s] PackageCompiler: compiling incremental system image

which keeps running seemingly forever.

I tried to fix the GKS issue following the last reply of GR can’t connect on Windows - #3 by jbreeden, then reran
create_sysimage(["Plots"], sysimage_path="sys_plots.so", precompile_execution_file="precompile_plots.jl")
I got
[ Info: PackageCompiler: Executing /mnt/c/.../precompile_plots.jl => /tmp/jl_packagecompiler_n2iAWn/jl_Y2asBo
which again ran seemingly forever.

If I close Windows Subsystem for Linux, reopen it, and rerun Julia, it is still the same :frowning:

I’m not sure if Plots runs within WSL. Try a different package, perhaps your own.

Thanks… I replaced the contents of precompile_plots.jl with

using BenchmarkTools, LoopVectorization, TensorOperations


function test7(A)
    @tensor B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[j,i,k,l]
    B
end    

@btime test7(A) setup=(n=30; A=rand(Float64,(n,n,n,n)))

and using

using PackageCompiler
create_sysimage(["Plots"], sysimage_path="sys_plots.so", precompile_execution_file="precompile_plots.jl")

I got

[ Info: PackageCompiler: Executing /mnt/c/Users/user/Documents/GitHub/contraction_benchmark/contraction_benchmark/precompile_plots.jl => /tmp/jl_packagecompiler_Qw93PP/jl_RAp5sM
  2.779 ms (21 allocations: 6.18 MiB)
[ Info: PackageCompiler: Done
⠙ [03m:19s] PackageCompiler: compiling incremental system image

I am not sure if I did the right thing, namely whether I need create_sysimage. In any case, Creating a sysimage for fast plotting with Plots.jl · PackageCompiler
is about loading Julia code from Julia faster. It seems there is a long way to go towards efficient importing in Python :frowning:

A few thoughts:

  1. Precompilation recipe - my understanding is that you are meant to use a recorded precompilation recipe (precompile_plots.jl in your example). You can see the details here: OhMyRepl precompilation tutorial
    In short, just run your code as
julia --trace-compile=my_precompile_recipe.jl precompile_plots.jl

It will execute your .jl script and record all the methods that were compiled into my_precompile_recipe.jl. You can then precompile the sysimage as:

using PackageCompiler
create_sysimage(["LoopVectorization, TensorOperations"], sysimage_path="sys_image.so", precompile_execution_file="my_precompile_recipe.jl")

You will notice that I also changed the first array to exclude Plots package (you don’t seem to need it).
In addition, you can remove all references to benchmarking (e.g., BenchmarkTools, @btime, etc.) for the use from Python.

  2. Linking pyjulia to the precompiled image - you want to make sure your precompiled image is used by pyjulia (see more HERE)
    In short, import pyjulia as
from julia import Julia
jl = Julia(sysimage="sys_image.so")
  3. Interface python-julia - 2-3ms is incredibly fast, so I could see that the movements of data become more expensive than the computation itself. (IMO, it’s so fast that you wouldn’t even use multi-threading within Julia itself. I usually start exploring it with functions that run >10ms-100ms).
    There are (at least) two ways how to interface between Python and Julia: A) PyCall.jl and PyJulia and B) PythonCall.jl and JuliaCall. The latter is slightly newer and might be better in this case.
    This is a guess but maybe JuliaCall.jl could reduce the overhead in your example. From their README:
  • Fast non-copying conversion of numeric arrays in either direction: modify Python arrays (e.g. bytes, array.array, numpy.ndarray) from Julia or Julia arrays from Python.

If you can use the same piece of memory/data from Python and Julia without copying, it should get you closer to the pure Julia runtime (see the sketch after this list).

Another general strategy is to move slightly more code to Julia - more on that below.

  4. Other thoughts
  • Have you profiled your Python code? Maybe the code you're optimizing is not the biggest bottleneck (especially if you can use numpy - it's hard to beat pure numpy; the benefits usually come when numpy is only a fraction of your overall code)
  • What’s the block of code around the above example? Are there any loops, IO operations, array creations/mutations? If you move that to Julia as well, you could start to see a lot more benefit (you reduce the Python overhead)
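To make the JuliaCall option in point 3 concrete, here is a rough sketch (assuming the juliacall Python package is installed and TensorOperations is available in the active Julia environment; whether @tensor accepts the no-copy array wrapper directly, or needs a collect, is worth verifying):

from juliacall import Main as jl
import numpy as np

jl.seval("using TensorOperations")
jl.seval("""
function test7(A)
    @tensor B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[j,i,k,l]
    B
end
""")

A = np.random.random((30, 30, 30, 30))
B = jl.test7(A)   # the numpy buffer is wrapped, not copied, on the Julia side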

In terms of prioritization, I would skip the precompilation step and instead focus on profiling your python code & moving a little bit more code into Julia.

Hope it helps.

The immediate goal here is to create a system image, “sys_plots.so” or “sys_image.so” that contains the native machine code for the Julia packages that you indicated. In particular, this should decrease the initial latency of using Julia for each process since Julia does not need to recompile the Julia function each time. This is particularly relevant if you tend to start a new Python process every time or if you run multiple Python processes, which is the main mechanism for parallelization in Python.

Once you can do that, then we will want to create a C function pointer that you can use directly from Python or via a Python extension in C or Cython. We may even be able to embed and export the C function pointer within the system image itself. Calling that C function pointer directly should remove a few steps in the process, reducing the latency of the call from Python to Julia.

To summarize:

  1. Create a system image to compile Julia code to native code ahead of time.
  2. Obtain a C function pointer to the Julia function, perhaps embedded in the system image.
  3. Invoke the C function pointer from Python or an extension of Python in C or Cython.

The objective is to make calling a Julia function as cheap (low latency) as calling a C function.
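For illustration only, the Python side of steps 2-3 might eventually look like the sketch below. init_julia/shutdown_julia are what PackageCompiler generates when building a library; test7_c and its signature are purely hypothetical placeholders for a pointer you would export via @cfunction:

import ctypes
import numpy as np

lib = ctypes.CDLL("./sys_image.so", mode=ctypes.RTLD_GLOBAL)
lib.init_julia(0, None)   # start the Julia runtime embedded in the image

# Hypothetical exported entry point: a Float64 buffer plus its side length.
lib.test7_c.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int]
lib.test7_c.restype = None

n = 30
A = np.random.random((n, n, n, n))
lib.test7_c(A.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), n)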

Thank you so much.

julia --trace-compile=my_precompile_recipe.jl precompile_plots.jl

works. Then,

using PackageCompiler
create_sysimage(["LoopVectorization", "TensorOperations"], sysimage_path="sys_image.so", precompile_execution_file="my_precompile_recipe.jl")

after I entered julia, the second step led to

ERROR: package(s) LoopVectorization, TensorOperations not in project
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] check_packages_in_project(ctx::Pkg.Types.Context, packages::Vector{String})
   @ PackageCompiler ~/.julia/packages/PackageCompiler/wpsGv/src/PackageCompiler.jl:108
 [3] create_sysimage(packages::Vector{String}; sysimage_path::String, project::String, precompile_execution_file::String, precompile_statements_file::Vector{String}, incremental::Bool, filter_stdlibs::Bool, cpu_target::String, script::Nothing, sysimage_build_args::Cmd, include_transitive_dependencies::Bool, base_sysimage::Nothing, julia_init_c_file::Nothing, version::Nothing, soname::Nothing, compat_level::String, extra_precompiles::String)
   @ PackageCompiler ~/.julia/packages/PackageCompiler/wpsGv/src/PackageCompiler.jl:445
 [4] top-level scope
   @ REPL[2]:1

It’s weird, since using LoopVectorization, TensorOperations is at the top of precompile_plots.jl :frowning: (I have commented out BenchmarkTools and the @btime test7(A) setup=(n=30; A=rand(Float64,(n,n,n,n))) line.)

  1. I profiled. Using snakeviz gives me a lot of information, and I also used time.time() to check my Python code. The conclusion is that the series of permutations and additions of arrays is the bottleneck (besides some other possible improvements). I used transpose in numpy, but Julia’s @tensor has memory-cache management that outperforms numpy by a factor of 2~3; see Is there any way to optimize array additions and multiplications with transposes?, where some of the discussion is summarized in reply #6.

It would be great to start from Julia, but the main Python code has many other calculations. I think migrating to Julia is a bit complicated, and I prefer to improve the bottleneck by interfacing with Julia.

Thanks a lot. I thought a sysimage was something about plotting… so I was confused about why a tensor calculation would require an image. Thankfully, sysimage has a more abstract meaning.

I followed svilupp’s suggestion and got stuck at
create_sysimage(["LoopVectorization", "TensorOperations"], sysimage_path="sys_image.so", precompile_execution_file="my_precompile_recipe.jl"), when using my own precompile_recipe.jl, which contains:

using LoopVectorization, TensorOperations #BenchmarkTools, 
 

function test7(A)
    @tensor B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[j,i,k,l]
    B
end    

#@btime test7(A) setup=(n=30; A=rand(Float64,(n,n,n,n)))

:frowning:

I’ll reiterate again that I suspect you might be focusing your efforts in the wrong direction. Precompilation will speed up the start time of the Julia image the first time; any subsequent execution will not be any faster. I.e., if you call that function many times, you won’t see much difference. You might be better off pursuing some alternatives - more on that at the bottom of this post.

Re. your error,

ERROR: package(s) LoopVectorization, TensorOperations not in project

This error probably relates to your environment management, i.e., which packages you load and from where (more HERE).

Assuming you want to have a project-specific environment (healthier approach):

  • You either need to start your Julia REPL with the --project=. flag to activate the environment defined in your folder (you’d see a file Project.toml in your folder that mentions the packages you use),
  • or you need to start your code with import Pkg; Pkg.activate(".");
  • or you need to use ]activate . in the REPL.

The best way is to test that you can run your script without a problem; then you want to compile your project the same way, e.g.:

Assume I have the environment with all packages (i.e., a Project.toml file is present). If not, run julia --project=. and then ] add XYZ.

I would have my script, let’s call it mwe.jl:

using LoopVectorization, TensorOperations


function test7(A)
    @tensor B[i,j,k,l] := 0.1*A[i,j,k,l] + 0.2*A[j,i,k,l]
    B
end  

# you need to also execute your function for it to be compiled for the right types
A=ones(Float64,(30,30,30,30))
test7(A);

You would start Julia REPL with the compile tracking and execute the above script
julia --project=. --trace-compile=my_precompile.jl mwe.jl
As a check, you can open the my_precompile.jl and see that many function calls were recorded that should be compiled.

Then I would start julia REPL and run the PackageCompiler (remember that you need to add it as a package too!):

julia --project=.

using PackageCompiler
PackageCompiler.create_sysimage(["LoopVectorization","TensorOperations"]; sysimage_path="my-sysimage.so", 
                                precompile_statements_file="my_precompile.jl")

Everything should be done, so you can test your system image now:
julia --project=. --sysimage=my-sysimage.so mwe.jl
It should start much faster, since it’s all precompiled.

To use it in Python, you’d pass your my-sysimage.so file via the sysimage keyword:

from julia import Julia
jl = Julia(sysimage="PATH/TO/my-sysimage.so")

I’ve had problems with it in the past, in which case you can use the low-level API (see examples HERE). It’s also on SO (see StackOverflow issue).
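In case it helps, the low-level route from those examples looks roughly like this (a sketch; LibJulia ships with PyJulia, and the file names are the ones used in this thread):

from julia.api import LibJulia

api = LibJulia.load()              # locate libjulia via the julia executable on PATH
api.sysimage = "my-sysimage.so"    # point PyJulia at the custom system image
api.init_julia()                   # must run before `from julia import Main`

from julia import Main
Main.include("mwe.jl")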

Re. your overall strategy, I’ve benchmarked the code on my PC (M1 Pro running on Rosetta):

@btime test7(A) setup=(n=30; A=rand(Float64,(n,n,n,n))); 
# 367.375 μs (289 allocations: 6.21 MiB) 

@btime A=rand(Float64,($n,$n,$n,$n));
# 722.583 μs (3 allocations: 6.18 MiB)
# the time it takes to create a random array of that size is 2x as long

@btime A=ones(Float64,($n,$n,$n,$n));
# 106.000 μs (3 allocations: 6.18 MiB)
# even creating a plain array of 1s takes 1/3 of the time (no computation!)

This tells me that your code is too fast compared to the memory operations (data movements). That’s probably why moving data to and from Python hides all of Julia's benefits.

Therefore, I’d propose:
A) Move more of your code to Julia
B) Run your Python code from Julia (via PyCall) and leverage the Julia speed for your kernel

More on option B): I’ve added the PyCall package to my Julia project, which allows me to test the numpy code on an equal footing and execute arbitrary Python code from the Julia side:

using BenchmarkTools
using PyCall 

# bring your script
include("mwe.jl")

# define the python code (you can also import all your code as one script/one function)
py"""
import numpy as np

# this is example python object
x=np.ones((30,30,30,30))

# this is example python function
def my_func(A):
    return 0.1*A + 0.2*np.transpose(A, (1,0,2,3))
"""

# benchmark with a Julia array `A`
# notice the `$A`, you need to interpolate your Julia variables as arguments into Python functions
@btime py"my_func($A)" setup=(n=30; A=rand(Float64,(n,n,n,n)));
# 2.318 ms (41 allocations: 6.18 MiB)

# benchmark with a Python array `x`
# I had some doubts whether it would memoize the outputs, but judging by the results it runs properly (i.e., no setup=... needed)
@btime py"my_func(x)";
# 2.517 ms (35 allocations: 6.18 MiB)

You can probably see the huge difference between running the Python function and the Julia function on equally-sized arrays (370 microseconds vs 2300 microseconds).

Hope it helps.

5 Likes

Thank you for your reply. I apologize that I overlooked your point about 2-3 ms; I suppose it refers to the timing 2.769 ms (21 allocations: 6.18 MiB). I should have expressed my question more clearly. I thought a simple example would help, but in the actual calculations, the timing for one set of permutations with additions is about 5 seconds (and there are many), which is more worthwhile to exploit than the millisecond scale.

If I call Julia from Python directly, for a longer calculation (larger dimension / higher rank of array / more permutations, etc.), the delay may be less important. For example, if I set na = 80 in the Python example at the top of the question and n_max = 2 to save time, I get

7.643103837966919
2.206459641456604
1.2286523580551147
python 1.7510950565338135

the third Julia timing is faster than Python.

This implies I may not need a system image / C or C++ function pointers / calling Python from Julia (the main Python code has 1000+ lines; I think it would be a bit complicated to call from Julia, but I may try anyway).

But in this approach (calling Julia from Python), performance somehow varies from run to run. Let me explore the direct-call approach and the more advanced approaches for a while.

EDITED:

julia --project=. --sysimage=my-sysimage.so mwe.jl

I can get past this step.

When calling from Python, I replaced PATH/TO with my current directory and got
subprocess.CalledProcessError: Command '['/mnt/c/Users/.../my-sysimage.so', '-e', '...']' died with <Signals.SIGSEGV: 11>.

Using the lower-level API from Python (Compile and use custom system image for PyJulia - Stack Overflow),
after replacing ‘PATH’ with my current directory, I passed the first 4 steps, but at from julia import Main,
I got

Error processing line 1 of /home/.../python3.8/site-packages/google_auth-1.33.0-py3.9-nspkg.pth:

Fatal Python error: init_import_size: Failed to import the site module
Python runtime state: initialized
Traceback (most recent call last):
  File "/home/.../python3.8/site.py", line 169, in addpackage
    exec(line)
  File "<string>", line 1, in <module>
  File "/home/.../python3.8/importlib/util.py", line 14, in <module>
    from contextlib import contextmanager
  File "/home/.../python3.8/contextlib.py", line 5, in <module>
    from collections import deque
  File "/home/.../python3.8/collections/__init__.py", line 24, in <module>
    import heapq as _heapq
  File "/home/.../python3.8/heapq.py", line 581, in <module>
    from _heapq import *
SystemError: initialization of _heapq did not return an extension module

I vaguely recall that previously the system asked about Project.toml etc., and I copied those files to my current directory. I got it to work once, but not anymore.
Sorry for bothering you from time to time :frowning:

I’m sorry, I haven’t followed the complete thread, but with GitHub - cjdoris/PythonCall.jl: Python and Julia in harmony I’ve been able to call Julia from Python with very small overhead.

1 Like

I wouldn’t at all recommend it for really any practical use yet, but if anyone wants to try out something very very very (very very very!) experimental, there is also this approach calling (some) compiled Julia from Python just for fun.

3 Likes

Thanks for your link!

you need to avoid Multithreading

Hence, StaticTools is so far limited to single-core use. Anyway, I may give it a try as well!