Julia vs. Python Performance on package catch22: Why is Julia slower in this case and how can I improve it?

Hello,

I’m comparing the performance of Julia and Python for a specific task involving time series data and feature extraction. I’ve implemented a parallel computation in both languages and noticed that Python is significantly faster than Julia in this scenario. I am trying to understand why this might be happening and suggestions for improving the performance of the Julia implementation.
In terms of speed, python has been on average 2x faster.

# JULIA
using Catch22
using Random
using Base.Threads

# Function to compute all features for a given time series
function compute_features(x::AbstractVector)
    res = catch22(x)
    return res
end

# Generate the dataset
dataset = [randn(2000) for _ in 1:5000]  # 5000 time series, each with 2000 samples

# Print number of threads available
println("Number of threads available: ", Threads.nthreads())

# Measure the start time
start_time = time()

# Function to compute features in parallel
function compute_all_features_parallel(dataset)
    results = Vector{Any}(undef, length(dataset))
    @threads for i in 1:length(dataset)
        results[i] = compute_features(dataset[i])
    end
    return results
end

# Compute features in parallel
results_list = compute_all_features_parallel(dataset)

# Measure the end time
end_time = time() - start_time
println("Multithreaded method time: $(end_time) seconds")
# PYTHON
import pycatch22
import os
import time
import numpy as np
from joblib import Parallel, delayed

dataset = [np.random.randn(2000) for _ in range(5000)]

# Compute all with np array is the fastest.
def compute_features(x):
    res = pycatch22.catch22_all(x)
    return res  # just return the values
    
print(f"Number of cores available: {os.cpu_count()}")

start_time = time.time()
threads_to_use = os.cpu_count()
results_list = Parallel(n_jobs=threads_to_use)(
    delayed(compute_features)(dataset[i]) for i in range(len(dataset))
)
joblib_time = time.time() - start_time
print(f"Joblib method time: {joblib_time:.2f} seconds")

Could you share your timings?

Also have you considered the impact of compilation time?

For example, you could do a warmup run before entering the time trial or compared successive time trials.

The above line stands out to me as concerning. Is the output of compute_features really so variable that the result should be Any?

The data is more or less the same, I tried placing Float64 but got an error in this code.

I got 20s~ on python and 42s~ on Julia.

It appears that Catch22.jl just calls out to C (which presumably is also the case for Python). As such, you shouldn’t expect Julia to be faster here, but it also shouldn’t be slower. Is the multi-threading actually making anything faster here? I wouldn’t be surprised if the C library is also spawning multiple threads leading to over-subscription.

1 Like

Did you setup PrecompileTools.jl on the package so it all is precompiled? Also, did you check that the compute cost is high enough that multithreading is actually beneficial? Did you try @threads :static?

On my 8-core i7 machine I can confirm that using @threads provides a speedup over the non-threaded loop of 6.6x.

Wait, I think the difference here is just Float32 vs Float64. How is the performance if you have
dataset = [randn(Float32, 2000) for _ in 1:5000]?

1 Like

This package is C a wrapper, there is a weird behavior in python that makes calling all functions faster than a single one.
I will create a PR in pycatch22 instead of trying to understand how catch22 runs on Julia but I appreciate the help and it was a good exercise to try Julia.

On my machine, the Python version takes 6.85 seconds and the Julia version takes 14.41 seconds. If I rerun and use @time in Julia I get

julia> @time results_list = compute_all_features_parallel(dataset);
 14.351482 seconds (4.09 M allocations: 2.822 GiB, 0.76% gc time)

Changing the inputs to Float32 made no difference for the time needed by the Julia version. The returned result was still computed in Float64. Also, correctly typing the output array and even preallocating it and mutating its elements in the loop made only a small difference in the timing.

I took a look at how long a single call to catch22 takes in both languages. First Python:

>>> timeit.timeit(lambda: 'compute_features(dataset[0])', number=10000)
0.0012677829945459962

and then Julia:

Julia> using BenchmarkTools

julia> @btime compute_features($dataset[1]);
  19.572 ms (818 allocations: 591.80 KiB)

If I’m understanding the timeit outputs correctly, then a single call in Julia takes about 15 times as long as in Python! Could it be because of the complicated type that is returned in Julia?:

julia> typeof(compute_features(dataset[1]))
FeatureArray{Float64, 1, Tuple{DimensionalData.Dimensions.Dim{:feature, DimensionalData.Dimensions.LookupArrays.Categorical{Symbol, Vector{Symbol}, DimensionalData.Dimensions.LookupArrays.Unordered, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, Vector{Float64}, String, DimensionalData.Dimensions.LookupArrays.NoMetadata}

The result returned by a single call in Python is just a dict:

>>> type(compute_features(dataset[0]))
<class 'dict'>

A complicated type is not directly correlated with long runtime more with longer compile times.

But you might be right that there are underlying issues in the package Catch22.jl. I had a quick look at the library and saw many type instabilities (e.g. by using a Dict to map symbols to functions, generally wrapping functions in non-parametric structs). Now I don’t know anything about this domain so I can’t really comment on the tradeoffs/design goals but it might be that this is the cause of the performance difference.

Here is a short trip through the codebase:

  1. Calling catch22 leads us to its definition
catch22 = SuperFeatureSet([(x -> _catch22(x, f)) for f in featurenames], 
     featurenames, featuredescriptions, featurekeywords, zᶠ)

Here we already see an array of closures, which itself is actually concretely typed, but it will be stored as features::AbstractVector in SuperFeatureSet
2. When you call a SuperFeatureSet from TimeseriesFeatures.jl

function (𝒇::SuperFeatureSet)(x::AbstractArray)
    ℱ = getsuper.(𝒇) |> unique |> SuperFeatureSet
    supervals = Dict(getname(f) => f(x) for f in ℱ)
    FeatureArray(vcat([superloop(𝑓, supervals, x) for 𝑓 in 𝒇]...), 𝒇; refdims = refdims(x),
                 name = name(x), metadata = metadata(x))
end

First of all, everytime all contained things(?) are uniqued to construct a new SuperFeatureSet everytime (would probably be better to do this once on construction?). And then we essentially loop over all the contained functions (which is type unstable), apply them and vcat the results (that should be reduce(vcat, (<superloop code>)) instead of the vcat([<superloop code>]...) ).
3. Back to Catch22.jl where we call the functions (which are the closures defined in 1.)

function _catch22(𝐱::AbstractVector, fName::Symbol)::Float64
    nancheck(𝐱) && return NaN
    𝐱 = 𝐱 |> Vector{Float64}
    fType = featuretypes[fName]
    return _ccall(fName, fType)(𝐱)
end

This is again quite type unstable. Mostly because we need to lookup the ftype in a global Dict{Symbol, Datatype} and this ftype governs the dispatch to _ccall.
4. Catch22 has 22 features, so we go through 3. 22 times which is not that often. However all these inefficiencies likely sum up especially if there is not much time spent doing the computation in the C-code.

1 Like

Perhaps create an issue in the wrapper package?

Hi there! Sorry to jump on this so late

I’m the author of Catch22.jl; thanks for bringing this to my attention, I had no idea the performance difference was so dramatic
The main hit to performance didn’t come from any of the Julia code, but from the compiled binaries (see here); even directly calling the shared library was twice as slow. Adding gcc -O2 solved the issue, cutting the run time for each feature by ~1/2.

I’ve just released a new version, v0.7.0, of Catch22.jl (which uses new optimized binaries), along with v0.6.0 of TimeseriesFeatures.jl (which cleans up many of the type instabilities related to feature set calculation).

Performance should now be equivalent to (or better than) the pycatch22 package, especially if you use FeatureSets [i.e. Catch22.catch22(x)]:

Method (1000 time series of 10000 samples) Time Memory
Python loop (with z-score) 8.83 s 324 MiB (final), 644 MiB (peak)
Manual Julia loop (manual method, no z-score) 8.75 s 1.29 MiB
Julia FeatureSet (with z-score) 8.23 s 78.8 MiB

Python loop

Similar to the original post, but with multiple runs and memory profiling. The pycatch22 package applies a z-score to the input time series for each feature

Manual Julia loop

Like the original post, this test calls the method of each catch22 feature individually, and operates on the raw time series rather than the z-score

Julia FeatureSet

This uses the exported functionality of TimeseriesFeatures, automatically parallelizing feature calculations and preprocessing the input data with a z-score

14 Likes