Why is flux model slower than python?

mdsa3d · January 29, 2021, 1:11am

I ran a benchmark to measure the performance gain from using Flux. Instead, I observed that the Flux runtime is longer than the Python equivalent, even though Julia is supposed to be faster.
What am I doing wrong in the following code that makes it slow?

#model
using Flux
vgg19() = Chain(
    Conv((3, 3), 3 => 64, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 64 => 64, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 64 => 128, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 128 => 128, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 128 => 256, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 256 => 256, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 256 => 256, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 256 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    BatchNorm(512),
    MaxPool((2,2)),
    flatten,
    Dense(512, 4096, relu),
    Dropout(0.5),
    Dense(4096, 4096, relu),
    Dropout(0.5),
    Dense(4096, 10),
    softmax
)

#Data processing
using MLDatasets: CIFAR10
using Flux: onehotbatch
# Data comes pre-normalized in Julia
trainX, trainY = CIFAR10.traindata(Float32)
testX, testY = CIFAR10.testdata(Float32)
# One hot encode labels
trainY = onehotbatch(trainY, 0:9)
testY = onehotbatch(testY, 0:9)

#training
using Flux: crossentropy, @epochs
using Flux.Data: DataLoader
model = vgg19()
opt = Momentum(.001, .9)
loss(x, y) = crossentropy(model(x), y)
data = DataLoader(trainX, trainY, batchsize=64)
@epochs 2 Flux.train!(loss, params(model), data, opt)

This example was taken from the model-zoo.
The Python alternative finishes in almost half the time. The Python code is given below:

#Import Libraries
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPool2D , Flatten
from keras.optimizers import SGD
import time

#model details
vgg19 = Sequential()
vgg19.add(Conv2D(input_shape=(32,32,3),filters=64,kernel_size=(3,3),padding="same", activation="relu"))
vgg19.add(Conv2D(filters=64,kernel_size=(3,3),padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Flatten())
vgg19.add(Dense(units=4096,activation="relu"))
vgg19.add(Dense(units=4096,activation="relu"))
vgg19.add(Dense(units=10, activation="softmax"))

#Data preparation
from keras.datasets import cifar10
from keras.utils import to_categorical

(X, Y), (tsX, tsY) = cifar10.load_data() 
# Use a one-hot-encoding
Y = to_categorical(Y)
tsY = to_categorical(tsY)
# Change datatype to float
X = X.astype('float64')
tsX = tsX.astype('float64')
 
# Scale X and tsX so each entry is between 0 and 1
X = X / 255.0
tsX = tsX / 255.0

#train
optimizer = SGD(lr=0.001, momentum=0.9)
vgg19.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
history = vgg19.fit(X, Y, epochs=2, batch_size=64, validation_data=(tsX, tsY), verbose=0)

I was wondering why this is happening. Please suggest ways to make the Julia code faster than the Python code.

Thanks in advance.
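
Neither listing shows how the runtimes were measured (the Python script imports `time` but never uses it). One way to make the two numbers comparable is to time every epoch separately, so that any one-off cost in the first epoch shows up on its own. The sketch below is plain Python; `time_epochs` and `fake_epoch` are hypothetical names used only for illustration, and the stand-in workload would be replaced by a real call such as `vgg19.fit(X, Y, epochs=1, batch_size=64, verbose=0)`.

```python
import time


def time_epochs(run_epoch, n_epochs):
    """Call run_epoch() n_epochs times and return each call's wall-clock seconds."""
    durations = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        run_epoch()
        durations.append(time.perf_counter() - start)
    return durations


def fake_epoch():
    # Hypothetical stand-in workload; swap in one real training epoch.
    sum(i * i for i in range(100_000))


durations = time_epochs(fake_epoch, n_epochs=2)
for epoch, seconds in enumerate(durations, start=1):
    print(f"epoch {epoch}: {seconds:.3f} s")
```

On the Julia side, wrapping each `Flux.train!` call in `@time` should give the equivalent per-epoch numbers.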

Tomas_Pevny · January 29, 2021, 5:37am

Well, I think that in this case you are comparing the speed of TensorFlow and Julia, not Python and Julia, and this is the use case for which TensorFlow was optimized. I doubt Julia would be faster here.

carstenbauer · January 29, 2021, 7:07am

x-ref: performance - Flux.jl doesn't utilize all the available threads for machine learning in julia - Stack Overflow

lungben · January 29, 2021, 8:31am

A good way to see this point is to look at the GitHub page of TensorFlow.

It has 61% C++ code and 26% Python code, i.e. it is practically a C++ library with a Python frontend, not a Python library.
So essentially you are comparing Julia to C++, not Julia to Python. In principle both languages have similar run-time performance; which library is faster is mainly a matter of optimization effort (and trade-offs, e.g. for flexibility).
And I am pretty sure that much more money and raw manpower was invested into the Google-backed TensorFlow than into Flux.

mcabbott · January 29, 2021, 9:56am

I presume this is https://github.com/FluxML/NNlib.jl/issues/234. Essentially all the time here will be spent in conv, and the CPU implementation called by Julia is slower than the one called by PyTorch.
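
The claim that nearly all the time goes into convolution can be sanity-checked with a rough count of multiply-accumulate operations (MACs) per input image for the model in the first post. The sketch below is plain Python and takes the layer shapes from that model (32x32 CIFAR-10 input, 3x3 kernels with padding 1, five 2x2 poolings). It ignores activations, pooling, batch norm, memory traffic and the backward pass, so it only indicates where the arithmetic is, not the measured runtime.

```python
# Rough multiply-accumulate (MAC) count per image for the VGG-style model
# in the first post. Assumes 3x3 convolutions that preserve spatial size
# (pad 1, stride 1) and 2x2 max-pooling that halves it.

# Each inner list is one block of (in_channels, out_channels) conv layers,
# followed by a 2x2 max-pool.
CONV_BLOCKS = [
    [(3, 64), (64, 64)],
    [(64, 128), (128, 128)],
    [(128, 256), (256, 256), (256, 256)],
    [(256, 512), (512, 512), (512, 512)],
    [(512, 512), (512, 512), (512, 512)],
]
DENSE_LAYERS = [(512, 4096), (4096, 4096), (4096, 10)]


def count_macs(side=32, kernel=3):
    """Return (conv_macs, dense_macs) for one input image of side x side pixels."""
    conv_macs = 0
    for block in CONV_BLOCKS:
        for c_in, c_out in block:
            # One MAC per output pixel, per kernel element, per channel pair.
            conv_macs += side * side * kernel * kernel * c_in * c_out
        side //= 2  # 2x2 max-pool halves the spatial size
    dense_macs = sum(n_in * n_out for n_in, n_out in DENSE_LAYERS)
    return conv_macs, dense_macs


conv_macs, dense_macs = count_macs()
share = conv_macs / (conv_macs + dense_macs)
print(f"conv: {conv_macs:,} MACs, dense: {dense_macs:,} MACs, conv share: {share:.1%}")
```

Under these assumptions roughly 94% of the forward-pass MACs sit in the convolution layers, which is consistent with looking at the convolution kernel first.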

cirobr · May 11, 2024, 11:07am

I just completed a comparison between the training speed of the U-Net implementation (ResNet-50 backbone) from Metalhead.jl and its equivalent from Python’s Segmentation Models. The training set is about 900 augmented images from PascalVOC, resized to 512x512.

Both models were trained with IoU_loss and reached almost the same train/test curves and number of epochs. The Julia version took slightly over 2 h to train on a well-resourced machine, versus about 30 min with tf/keras on the very same machine.

Not sure if I should continue my project with Julia/Flux… may I ask what the current development status is?

Thanks.

mkitti · May 11, 2024, 2:50pm

Our base expectation is that they should take about the same time. There is no strong fundamental reason that a basic implementation in Julia should be 4x slower. That said, there are opportunities with Julia to optimize things to be faster.

If the algorithms differ, there could be differences in time.

Without taking a look at your specific implementation of the model, it is hard to say what could be the cause.
