A demo runs about 1.5x faster in Flux than in TensorFlow when both use the CPU, but about 3.0x slower when using CUDA.

using Flux
using CUDA

data = randn(Float32, 2, 100000) |> gpu
y = reshape(sin.(data[1,:] .* data[2,:]), (1, size(data)[2])) |> gpu

model = Chain(
  Dense(2, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 1),
) |> gpu

opt = ADAM(0.001, (0.9, 0.999))
loss(x, y) = Flux.Losses.mse(model(x), y)
ps = Flux.params(model)

dl = Flux.DataLoader((data, y), batchsize=500, shuffle=true) |> gpu
Flux.@epochs 100 Flux.train!(loss, ps, dl, opt; cb = Flux.throttle(() -> @show(loss(data, y)), 10))
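
(The post doesn't show how the Flux run was timed; a harness mirroring the Python script below, reusing the definitions above, might look like this.)

# Hypothetical timing wrapper around the training call above.
t0 = time()
Flux.@epochs 100 Flux.train!(loss, ps, dl, opt; cb = Flux.throttle(() -> @show(loss(data, y)), 10))
println("average time per epoch is ", (time() - t0) / 100)

For comparison, the TensorFlow script:
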
def test_tf():
    import tensorflow as tf
    import numpy as np
    from tensorflow import keras
    # tf.config.experimental.set_visible_devices(gpu[0], 'GPU')
    with tf.device("/gpu:0"):
        model = tf.keras.Sequential([
            keras.layers.Dense(units=10, activation='relu', input_shape=[2]),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=1),
        ])
        model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mean_squared_error")
        xs = np.random.randn(100000, 2).astype(np.float32)
        ys = np.sin(xs[:, 0] * xs[:, 1]).astype(np.float32)
        model.fit(xs, ys, epochs=100, batch_size=500)

if __name__ == "__main__":
    import time
    t0 = time.time()
    test_tf()
    print("average time per epoch is {}".format((time.time() - t0) / 100))

Thanks for posting here! This turned out to be a bit of a rabbit hole :slight_smile:

I modified your code above to use some of CUDA.jl’s profiling tools. I also added a couple of lines to warm up the forward and backwards passes before running any epochs. That doesn’t help with the overall time (which is dominated by compilation), but it will give a better picture of per-epoch timings.

Modified model:
@time using Zygote
@time using CUDA
@time using Flux

let xpu = gpu

data = randn(Float32, 2, 100000) |> xpu
y = reshape(sin.(data[1,:] .* data[2,:]), (1, size(data)[2])) |> collect |> xpu

model = Chain(
  Dense(2, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 10, relu),
  Dense(10, 1),
) |> xpu

opt = ADAM(0.001, (0.9, 0.999))
loss(x, y) = Flux.Losses.mse(model(x), y)
ps = Flux.params(model)

dl = Flux.DataLoader((data, y), batchsize=500, shuffle=false)

# Warm up the forward and backward passes so later per-batch timings are not
# dominated by compilation.
dummy_x, dummy_y = rand(Float32, 2, 500) |> xpu, rand(Float32, 1, 500) |> xpu
@time loss(dummy_x, dummy_y)
@time gradient(() -> loss(dummy_x, dummy_y), ps)

# Profile 10 epochs, with NVTX ranges marking each epoch and batch so they
# show up in the Nsight Systems timeline.
CUDA.@profile begin
  for i in 1:10
    NVTX.@range "epoch $i" begin
      for (j, (xs, ys)) in enumerate(dl)
        NVTX.@range "batch $j" begin
          gs = gradient(() -> loss(xs, ys), ps)
          NVTX.mark("gradient end")
          Flux.Optimise.update!(opt, ps, gs)
          NVTX.mark("update! end")
        end
      end
    end
  end
end

end

Running with a reduced epoch count of 10, the overall time was 62s. This is in contrast to ~4s for the TensorFlow version.

Here’s the profile loaded up in Nsight:

A few observations:

  1. Runtime is dominated by epoch 1, and epoch 1 is dominated by batch 1.
  2. Because gradient was warmed up beforehand, most of the batch 1 time is spent elsewhere. That other time is the highlighted range in Nsight.
  3. Almost all of the remaining time in batch 1 of epoch 1 is spent compiling the broadcast kernels at https://github.com/FluxML/Flux.jl/blob/v0.12.6/src/optimise/optimisers.jl#L181-L184. The actual execution and memory operations associated with launching those kernels are the vertical blue lines; the rest of the time is spent in LLVM and host-side Julia (the black bars above).
  4. Pre-warming Flux.Optimise.apply! on 1D and 2D arrays (i.e. the dense weights and biases; see the sketch after this list) further reduces the first batch time by 2.5s. As expected, though, it does not change the overall run time.
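
Observation 4 corresponds to something along these lines (a hypothetical warm-up, reusing opt and xpu from the modified script above; Flux.Optimise.apply! is the internal per-array update that update! calls):

# Warm up ADAM's broadcast kernels for 1D (bias-shaped) and 2D (weight-shaped)
# arrays so the first real batch doesn't pay for compiling them.
for dummy in (rand(Float32, 10) |> xpu, rand(Float32, 10, 10) |> xpu)
  Flux.Optimise.apply!(opt, dummy, copy(dummy))
end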

So the good news is that if you exclude the initial compilation time for CPU and GPU, per-batch and per-epoch times should be similar between Flux and TensorFlow. The bad news is that you probably care about that compilation time too. We have issues tracking some parts of it (e.g. Slow kernel compilation · Issue #65 · JuliaGPU/GPUCompiler.jl · GitHub), but for non-REPL workflows I’m afraid you’re stuck with the latency for now.
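
For concreteness, one way to check the per-epoch numbers after warm-up (a sketch, reusing dl, loss, ps and opt from the modified script above):

# Time one post-warm-up epoch; CUDA.@time synchronizes the device so the
# asynchronous kernel launches are fully accounted for.
CUDA.@time for (xs, ys) in dl
  gs = gradient(() -> loss(xs, ys), ps)
  Flux.Optimise.update!(opt, ps, gs)
end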

CC @maleadt and @dhairyagandhi96 for their thoughts as well.


Interesting that the runtimes are similar between Flux and TensorFlow; we should still see whether there are gains we can squeeze out anyway. What do the numbers look like without the compilation? What exactly are the per-batch numbers counting here?

For the compilation, I believe the best way to get rid of the compilation cost in the current scenario would be a PackageCompiler.jl-based workflow, which is probably not what we want. Could we run this in interpreted mode to see if that helps our case? Is there a way for GPUCompiler.jl to be made aware of that (running interpreted)? Would it even help if that were the case?
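
For reference, the PackageCompiler.jl workflow mentioned above would look roughly like this (the file names are placeholders):

using PackageCompiler

# Bake Flux and CUDA into a custom system image; "warmup.jl" stands in for a
# script that exercises the model so its methods are compiled ahead of time.
create_sysimage([:Flux, :CUDA];
                sysimage_path = "flux_sys.so",
                precompile_execution_file = "warmup.jl")

# Afterwards start Julia with: julia --sysimage=flux_sys.so

Note that this only caches host-side (CPU) code; as discussed below, kernels compiled through GPUCompiler.jl can’t currently be cached.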

The screenshot above shows the timeline with as much of the compilation removed as possible. If you ignore the very first batch, the 10-epoch runtime is ~4s. That’s still ~30% slower than the TF runtime (~3s), but it’s at least within the same order of magnitude. The rest of the difference probably comes down to memory management, scheduling and other smaller optimizations that TF does and we don’t.

I’m not sure how much it would help, given GPUCompiler just hands its work off to LLVM. I assume TF uses a precompiled ADAM kernel here, but whether we could do the same is beyond my expertise.


Caching compiled code is currently not possible, let alone in GPUCompiler.jl. We do use different compilation pipelines though, and some GPU-specific passes we wrote ourselves, so it’s possible there are some inefficiencies there. You can always try running the initial compilation under Profile.@profile to see if anything sticks out.
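
A minimal sketch of that suggestion, reusing loss, ps and the dummy inputs from the modified script above (run in a fresh session so the gradient call actually triggers compilation):

using Profile

Profile.clear()
# Capture where time goes during the first, compiling gradient call.
Profile.@profile gradient(() -> loss(dummy_x, dummy_y), ps)
Profile.print(mincount = 100)  # hide rare frames to keep the tree readable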

Thanks!
Flux has improved a lot in the past year, and we’re looking forward to seeing it take the lead.
