NeuralOperators.jl Performance (compared with Python)

Hi, I am training a Neural Operator to learn a map between functions on a two-dimensional domain. Here is an MWE of the problem I am considering in Julia:

# Imports.
using NeuralOperators
using Flux
using FluxTraining
using MLUtils

# Generate data.
N_samples = 5_000
xdata = Float32.(rand(1, 16, 16, N_samples))
ydata = Float32.(rand(1, 16, 16, N_samples))

# Optimiser.
optimiser = Flux.Optimiser(WeightDecay(1.0f-4), Flux.Adam(5.0f-4))

# DataLoaders.
data_train, data_test = splitobs((xdata, ydata), at = 0.9)
loader_train, loader_test = DataLoader(data_train, batchsize = 100), DataLoader(data_test, batchsize = 100)
data = collect.((loader_train, loader_test))

chs = (1, 32, 32, 32, 32, 32, 64, 1)
model = FourierNeuralOperator(ch = chs, modes = (8, 8), σ = gelu)
learner = Learner(model, data, optimiser, l₂loss)

for _ in 1:100
  epoch!(learner, TrainingPhase(), learner.data.training)
end

It takes ~30–40 s to train one epoch (even when using julia --threads=auto). In Python, the equivalent problem takes almost an order of magnitude less time while reaching the same error per epoch.

How can I get closer to Python’s performance in Julia?

Any help will be appreciated. Thanks!

Try putting the code into a function.
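
For reference, a minimal sketch of that suggestion applied to the MWE above (train! is just an illustrative name):

function train!(learner, n_epochs)
  # Wrapping the loop in a function lets Julia compile the whole body once.
  for _ in 1:n_epochs
    epoch!(learner, TrainingPhase(), learner.data.training)
  end
end

train!(learner, 100)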

In this case I think most of the time is spent inside the epoch! call, so I'm not sure it will help much.

Already tried that, and as @adienes said, there is not really any change.

I'd say this is a perfect use case for a profiler. Just put @profview in front of the loop and see where most time is spent. Have you checked how much of this time is compilation time?
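
For example, assuming ProfileView.jl is installed (the VS Code Julia extension ships its own @profview as well):

using ProfileView

# Warm up once so compilation does not dominate the profile.
epoch!(learner, TrainingPhase(), learner.data.training)

# Profile a single epoch and open the flame graph.
@profview epoch!(learner, TrainingPhase(), learner.data.training)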

Yes, I already did, thanks to a suggestion Chris gave on Slack a few hours ago. It turned out to be Tullio, a package used to perform tensor operations in Einstein notation. I changed it to a standard matrix multiplication, and now a significant amount of time is spent there instead.

Here is the new profiling result.

PS: regarding compilation time, I ran the function once beforehand to exclude it.

You could try increasing the number of OpenBLAS threads or using MKL. You have not shown us which Python code you are comparing against, so it is difficult to compare apples to apples.
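
As a sketch of both options (MKL.jl swaps out the OpenBLAS backend when it is loaded):

using LinearAlgebra

BLAS.get_num_threads()                 # check the current BLAS thread count
BLAS.set_num_threads(Sys.CPU_THREADS)  # let BLAS use all cores

# Alternatively, replace OpenBLAS with MKL (load it before the heavy work):
# using MKL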


The longer discussion is on Slack. Basically, it comes down to the fact that Tullio.jl's reverse passes are very slow and allocate a lot, so it's much faster when replaced with matrix multiplications. However, this algorithm really shouldn't be using matmuls for this operation. Quoting from Slack:

In theory it could be changed to a conv; the problem is that the X I mentioned before is the truncated Fourier transform of the series. One would need some additional processing, like applying the FFT, truncating, applying the inverse FFT, and then the conv.

So using conv isn’t great either.

The best thing here would be to have a good einsum operation handle this. Since Tullio.jl is well optimized in the forward pass and @mcabbott works on automatic differentiation, I presume this unoptimized reverse pass was simply overlooked and is fixable. Making Tullio.jl better instead of dumping it is probably the best option IMO.
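
To make that concrete, here is an illustrative sketch of the kind of per-mode contraction involved. The shapes and names are assumptions for illustration, not the package's actual internals (the real layer also works on complex Fourier coefficients):

using Tullio, NNlib

W = randn(Float32, 32, 32, 64)   # (out_ch, in_ch, modes): one weight matrix per kept mode
X = randn(Float32, 32, 64, 100)  # (in_ch, modes, batch): truncated Fourier coefficients

# Einsum form of the spectral-layer contraction, as Tullio.jl expresses it:
@tullio Y1[o, m, b] := W[o, i, m] * X[i, m, b]

# Equivalent batched matmul (batching over the mode dimension), which is
# what the matmul replacement discussed above amounts to:
Y2 = permutedims(batched_mul(W, permutedims(X, (1, 3, 2))), (1, 3, 2))

Y1 ≈ Y2  # true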


(I wouldn't have necro-posted, but this post shows up quite frequently when searching for neural operators in Julia.)

With our in-progress rewrite of NeuralOperators on top of Lux (GitHub - LuxDL/NeuralOperators.jl, part of a GSoC project via SciML), performance improves significantly. There are still some bottlenecks we are looking into, but currently DeepONets outperform the PyTorch version using deepxde, and FNOs are much faster than the older Flux-based NeuralOperators:

  1. CPU version: the forward pass is ~3x faster; the backward pass improves by ~2x.
  2. CUDA version: time per gradient call goes down by 2x.

CPU Code

using NeuralOperators, Lux, Random, Zygote
# using NeuralOperators: NeuralOperators, Flux # <-- Flux version
using BenchmarkTools

N_samples = 128
xdata = rand(Float32, 1, 16, 16, N_samples);

lux_fno = FourierNeuralOperator(; σ=gelu, chs=(1, 32, 32, 32, 32, 32, 64, 1),
    modes=(8, 8))
ps, st = Lux.setup(Xoshiro(), lux_fno)

# flux_fno = NeuralOperators.FourierNeuralOperator(;
#     ch=(1, 32, 32, 32, 32, 32, 64, 1), modes=(8, 8), σ=gelu)

# @benchmark $flux_fno($xdata)
# BenchmarkTools.Trial: 14 samples with 1 evaluation.
#  Range (min … max):  309.223 ms … 577.339 ms  ┊ GC (min … max): 2.81% … 29.22%
#  Time  (median):     341.670 ms               ┊ GC (median):    2.04%
#  Time  (mean ± σ):   376.467 ms ±  85.218 ms  ┊ GC (mean ± σ):  4.56% ±  7.43%
#  Memory estimate: 216.32 MiB, allocs estimate: 972.

@benchmark Lux.apply($lux_fno, $xdata, $ps, $st)
# BenchmarkTools.Trial: 32 samples with 1 evaluation.
#  Range (min … max):  102.107 ms … 316.077 ms  ┊ GC (min … max):  1.12% … 52.78%
#  Time  (median):     129.961 ms               ┊ GC (median):     3.90%
#  Time  (mean ± σ):   156.749 ms ±  57.966 ms  ┊ GC (mean ± σ):  20.78% ± 19.24%
#  Memory estimate: 188.18 MiB, allocs estimate: 644.

loss(m, x) = sum(abs2, m(x))
loss(m, x, ps, st) = sum(abs2, first(m(x, ps, st)))

# @benchmark Zygote.gradient($loss, $flux_fno, $xdata)
# BenchmarkTools.Trial: 5 samples with 1 evaluation.
#  Range (min … max):  866.505 ms …    1.541 s  ┊ GC (min … max):  0.00% … 12.45%
#  Time  (median):        1.154 s               ┊ GC (median):    16.62%
#  Time  (mean ± σ):      1.207 s ± 253.181 ms  ┊ GC (mean ± σ):  13.15% ± 13.80%
#  Memory estimate: 467.05 MiB, allocs estimate: 3982.

@benchmark Zygote.gradient($loss, $lux_fno, $xdata, $ps, $st)
# BenchmarkTools.Trial: 11 samples with 1 evaluation.
#  Range (min … max):  358.702 ms … 543.770 ms  ┊ GC (min … max):  1.55% … 29.35%
#  Time  (median):     468.138 ms               ┊ GC (median):    31.51%
#  Time  (mean ± σ):   457.426 ms ±  66.325 ms  ┊ GC (mean ± σ):  25.74% ± 13.01%
#  Memory estimate: 478.67 MiB, allocs estimate: 2637.

GPU Code (CUDA, GTX 1650 Ti)

using NeuralOperators, Lux, Random, Zygote
# using NeuralOperators: NeuralOperators, Flux # <-- Flux version
using LuxCUDA, BenchmarkTools

CUDA.allowscalar(false)

gdev = gpu_device()

N_samples = 128
xdata = rand(Float32, 1, 16, 16, N_samples) |> gdev;

# flux_fno = NeuralOperators.FourierNeuralOperator(;
#     ch=(1, 32, 32, 32, 32, 32, 64, 1), modes=(8, 8), σ=gelu) |> Flux.gpu

lux_fno = FourierNeuralOperator(; σ=gelu, chs=(1, 32, 32, 32, 32, 32, 64, 1),
    modes=(8, 8))
ps, st = Lux.setup(Xoshiro(), lux_fno) |> gdev

# @benchmark CUDA.@sync $flux_fno($xdata)
# BenchmarkTools.Trial: 348 samples with 1 evaluation.
#  Range (min … max):  10.444 ms … 191.160 ms  ┊ GC (min … max): 26.16% … 88.66%
#  Time  (median):     12.158 ms               ┊ GC (median):     0.00%
#  Time  (mean ± σ):   14.303 ms ±  12.171 ms  ┊ GC (mean ± σ):   6.34% ±  9.47%
#  Memory estimate: 99.48 KiB, allocs estimate: 3402.

@benchmark CUDA.@sync Lux.apply($lux_fno, $xdata, $ps, $st)
# BenchmarkTools.Trial: 604 samples with 1 evaluation.
#  Range (min … max):  6.271 ms … 24.575 ms  ┊ GC (min … max): 0.00% … 14.43%
#  Time  (median):     6.486 ms              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   8.255 ms ±  3.474 ms  ┊ GC (mean ± σ):  2.85% ±  4.21%
#  Memory estimate: 60.91 KiB, allocs estimate: 2539.

loss(m, x) = sum(abs2, m(x))
loss(m, x, ps, st) = sum(abs2, first(m(x, ps, st)))

# @benchmark CUDA.@sync Zygote.gradient($loss, $flux_fno, $xdata)
# BenchmarkTools.Trial: 109 samples with 1 evaluation.
#  Range (min … max):  37.156 ms … 174.007 ms  ┊ GC (min … max): 0.00% … 82.40%
#  Time  (median):     44.307 ms               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   45.864 ms ±  12.516 ms  ┊ GC (mean ± σ):  5.23% ±  8.41%
#  Memory estimate: 640.98 KiB, allocs estimate: 11110.

@benchmark CUDA.@sync Zygote.gradient($loss, $lux_fno, $xdata, $ps, $st)
# BenchmarkTools.Trial: 303 samples with 1 evaluation.
#  Range (min … max):  15.836 ms …  24.553 ms  ┊ GC (min … max): 0.00% … 11.17%
#  Time  (median):     16.253 ms               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   16.511 ms ± 792.804 μs  ┊ GC (mean ± σ):  4.25% ±  7.07%
#  Memory estimate: 302.09 KiB, allocs estimate: 8704.