Why is a Flux model slower than Python?

I ran a benchmark to measure the performance gain from using Flux, but instead I observed that the Flux runtime is longer than that of the Python equivalent, even though Julia is supposed to be faster.
May I know what I am doing wrong in the following code that makes it slow:

#model
using Flux
vgg19() = Chain(
    Conv((3, 3), 3 => 64, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 64 => 64, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 64 => 128, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 128 => 128, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 128 => 256, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 256 => 256, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 256 => 256, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 256 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    BatchNorm(512),
    MaxPool((2,2)),
    flatten,
    Dense(512, 4096, relu),
    Dropout(0.5),
    Dense(4096, 4096, relu),
    Dropout(0.5),
    Dense(4096, 10),
    softmax
)

#Data processing
using MLDatasets: CIFAR10
using Flux: onehotbatch
# Data comes pre-normalized in Julia
trainX, trainY = CIFAR10.traindata(Float32)
testX, testY = CIFAR10.testdata(Float32)
# One hot encode labels
trainY = onehotbatch(trainY, 0:9)
testY = onehotbatch(testY, 0:9)

#training
using Flux: crossentropy, @epochs
using Flux.Data: DataLoader
model = vgg19()
opt = Momentum(.001, .9)
loss(x, y) = crossentropy(model(x), y)
data = DataLoader(trainX, trainY, batchsize=64)
@epochs 2 Flux.train!(loss, params(model), data, opt)

This example was taken from the model-zoo.
The Python alternative finishes in almost half the time. The Python code used is given below:

#Import Libraries
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPool2D, Flatten
from keras.optimizers import SGD
import time

#model details
vgg19 = Sequential()
vgg19.add(Conv2D(input_shape=(32,32,3),filters=64,kernel_size=(3,3),padding="same", activation="relu"))
vgg19.add(Conv2D(filters=64,kernel_size=(3,3),padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Flatten())
vgg19.add(Dense(units=4096,activation="relu"))
vgg19.add(Dense(units=4096,activation="relu"))
vgg19.add(Dense(units=10, activation="softmax"))

#Data preparation
from keras.datasets import cifar10
from keras.utils import to_categorical

(X, Y), (tsX, tsY) = cifar10.load_data() 
# Use a one-hot-encoding
Y = to_categorical(Y)
tsY = to_categorical(tsY)
# Change datatype to float
X = X.astype('float64')
tsX = tsX.astype('float64')
 
# Scale X and tsX so each entry is between 0 and 1
X = X / 255.0
tsX = tsX / 255.0

#train
optimizer = SGD(lr=0.001, momentum=0.9)
vgg19.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
history = vgg19.fit(X, Y, epochs=2, batch_size=64, validation_data=(tsX, tsY), verbose=0)

I was wondering why this is happening. Please suggest ways to make the Julia code faster than the Python one.

Thanks in advance.

Well, I think that in this case you are comparing the speed of TensorFlow and Julia, not Python and Julia, and this is the use case TensorFlow was optimized for. I doubt Julia would be faster here.

x-ref: performance - Flux.jl doesn't utilize all the available threads for machine learning in julia - Stack Overflow

A good way to see this is to look at the GitHub page of TensorFlow: it is 61% C++ code and 26% Python code, i.e. it is practically a C++ library with a Python frontend, not a Python library.
So essentially you are comparing Julia to C++, not Julia to Python. In principle both languages have similar run-time performance; which library is faster is mainly a matter of optimization effort (and trade-offs, e.g. for flexibility).
And I am pretty sure that much more money and raw manpower has been invested in the Google-backed TensorFlow than in Flux.
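
To make the "frontend over compiled code" point concrete, here is a rough, standard-library-only sketch (my own illustration, not anything taken from TensorFlow or Flux). It assumes CPython, where the built-in `sum` is implemented in C, and compares it against the same reduction written as an interpreted loop. The helper name `python_loop_sum` is hypothetical, and the timings will differ from machine to machine.

```python
import timeit


def python_loop_sum(values):
    """Add the numbers one by one in interpreted Python bytecode."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(1_000_000))

# Both paths compute the same number; only where the loop runs differs.
loop_result = python_loop_sum(values)
builtin_result = sum(values)  # in CPython this loop runs in compiled C code

loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=3)
builtin_seconds = timeit.timeit(lambda: sum(values), number=3)

print(f"interpreted loop: {loop_seconds:.3f} s for 3 runs")
print(f"built-in sum:     {builtin_seconds:.3f} s for 3 runs")
```

The Keras script in the first post is in the same position as the `sum(values)` call: Python only describes the work, and the heavy lifting happens in the compiled backend.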

I presume this is https://github.com/FluxML/NNlib.jl/issues/234. I expect essentially all the time to be spent in conv here, and the CPU implementation called by Julia is slower than the one called by PyTorch.
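
As a back-of-the-envelope check on "all the time will be in conv", the sketch below counts forward-pass multiply-accumulates (MACs) per image for the layer sizes in the first post. It assumes a 32x32 CIFAR-10 input, 3x3 kernels with "same" padding, and 2x2 pooling that halves each side; bias, activation, BatchNorm and Dropout costs are ignored. The names `conv_plan`, `dense_plan`, `conv_macs` and `dense_macs` are made up for this illustration. MAC counts are only a proxy, not a measurement of wall-clock time.

```python
# Conv layers as (in_channels, out_channels); "pool" marks a 2x2 MaxPool.
conv_plan = [
    (3, 64), (64, 64), "pool",
    (64, 128), (128, 128), "pool",
    (128, 256), (256, 256), (256, 256), "pool",
    (256, 512), (512, 512), (512, 512), "pool",
    (512, 512), (512, 512), (512, 512), "pool",
]
# Dense layers as (in_features, out_features).
dense_plan = [(512, 4096), (4096, 4096), (4096, 10)]


def conv_macs(plan, side=32, kernel=3):
    """MACs for the conv stack on one side x side image."""
    total = 0
    for step in plan:
        if step == "pool":
            side //= 2  # 2x2 pooling halves height and width
        else:
            c_in, c_out = step
            # one kernel*kernel*c_in dot product per output pixel and channel
            total += side * side * kernel * kernel * c_in * c_out
    return total


def dense_macs(plan):
    """MACs for the fully connected layers on one sample."""
    return sum(n_in * n_out for n_in, n_out in plan)


conv_total = conv_macs(conv_plan)
dense_total = dense_macs(dense_plan)
conv_share = conv_total / (conv_total + dense_total)

print(f"conv MACs per image:  {conv_total:,}")
print(f"dense MACs per image: {dense_total:,}")
print(f"conv share of MACs:   {conv_share:.1%}")
```

Under these assumptions the conv layers account for roughly 94% of the forward-pass MACs, which is consistent with the conv kernel being the place to look first.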

I just completed a comparison between the training speed of the U-Net implementation (ResNet-50 backbone) from Metalhead.jl and its equivalent from Python’s Segmentation Models. The training set is about 900 augmented images from PascalVOC, resized to 512x512.

Both models were trained with IoU_loss and reached almost the same train/test curves in almost the same number of epochs. The Julia version took slightly over 2 h to train on a well-resourced machine, versus about 30 min with tf/keras on the very same machine.

I am not sure whether I should continue my project with Julia/Flux… may I ask what the current development status is?

Thanks.

Our base expectation is that they should take about the same time. There is not a strong fundamental reason that a basic implementation in Julia should be 4x slower. That said there are opportunities with Julia to optimize things to be faster.

If algorithms differ then there could be differences in time.

Without taking a look at your specific implementation of the model, it is hard to say what could be the cause.
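
As a small illustration of "algorithms differ", using the two VGG definitions from the first post rather than the U-Net (whose code is not shown in this thread): the sketch below transcribes the layer kinds of both models by hand and compares the counts. The list names and the normalised layer labels are my own; the final softmax is left out of both lists because Flux writes it as a separate step while Keras folds it into the last Dense layer.

```python
from collections import Counter

# Layer kinds transcribed from the Flux model in the first post.
flux_layers = (
    ["Conv"] * 2 + ["MaxPool"]
    + ["Conv"] * 2 + ["MaxPool"]
    + ["Conv"] * 3 + ["MaxPool"]
    + ["Conv"] * 3 + ["MaxPool"]
    + ["Conv"] * 3 + ["BatchNorm", "MaxPool"]
    + ["Flatten", "Dense", "Dropout", "Dense", "Dropout", "Dense"]
)

# Layer kinds transcribed from the Keras model in the first post.
keras_layers = (
    ["Conv"] * 2 + ["MaxPool"]
    + ["Conv"] * 2 + ["MaxPool"]
    + ["Conv"] * 3 + ["MaxPool"]
    + ["Conv"] * 3 + ["MaxPool"]
    + ["Conv"] * 3 + ["MaxPool"]
    + ["Flatten", "Dense", "Dense", "Dense"]
)

# Counter subtraction keeps only the positive differences.
only_in_flux = dict(Counter(flux_layers) - Counter(keras_layers))
only_in_keras = dict(Counter(keras_layers) - Counter(flux_layers))

print("extra layers in the Flux model: ", only_in_flux)
print("extra layers in the Keras model:", only_in_keras)
```

The Flux version carries one BatchNorm and two Dropout layers that the Keras version does not have. That is unlikely to explain a 2x gap on its own, but it does mean the two scripts are not training exactly the same network.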
