Why is my Flux model slower than Python?

I ran a benchmark to measure the performance gain from using Flux, but instead I observed that the Flux runtime is longer than that of the Python equivalent, even though Julia is supposed to be faster.
What am I doing wrong in the following code that makes it so slow?

#model
using Flux
vgg19() = Chain(
    Conv((3, 3), 3 => 64, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 64 => 64, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 64 => 128, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 128 => 128, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 128 => 256, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 256 => 256, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 256 => 256, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 256 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    MaxPool((2,2)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    Conv((3, 3), 512 => 512, relu, pad=(1, 1), stride=(1, 1)),
    BatchNorm(512),
    MaxPool((2,2)),
    flatten,
    Dense(512, 4096, relu),
    Dropout(0.5),
    Dense(4096, 4096, relu),
    Dropout(0.5),
    Dense(4096, 10),
    softmax
)

#Data processing
using MLDatasets: CIFAR10
using Flux: onehotbatch
# MLDatasets returns the images already scaled to [0, 1] (matching the /255 step in the Python code below)
trainX, trainY = CIFAR10.traindata(Float32)
testX, testY = CIFAR10.testdata(Float32)
# One hot encode labels
trainY = onehotbatch(trainY, 0:9)
testY = onehotbatch(testY, 0:9)

#training
using Flux: crossentropy, @epochs
using Flux.Data: DataLoader
model = vgg19()
opt = Momentum(.001, .9)
loss(x, y) = crossentropy(model(x), y)
data = DataLoader(trainX, trainY, batchsize=64)
@epochs 2 Flux.train!(loss, params(model), data, opt)

This example was taken from the Flux model-zoo.
The Python alternative finishes in almost half the time. The Python code I used is below:

#Import Libraries
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPool2D , Flatten
from keras.optimizers import SGD
import time

#model details
vgg19 = Sequential()
vgg19.add(Conv2D(input_shape=(32,32,3),filters=64,kernel_size=(3,3),padding="same", activation="relu"))
vgg19.add(Conv2D(filters=64,kernel_size=(3,3),padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
vgg19.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
vgg19.add(Flatten())
vgg19.add(Dense(units=4096,activation="relu"))
vgg19.add(Dense(units=4096,activation="relu"))
vgg19.add(Dense(units=10, activation="softmax"))

#Data preparation
from keras.datasets import cifar10
from keras.utils import to_categorical

(X, Y), (tsX, tsY) = cifar10.load_data() 
# Use a one-hot-encoding
Y = to_categorical(Y)
tsY = to_categorical(tsY)
# Change datatype to float
X = X.astype('float64')
tsX = tsX.astype('float64')
 
# Scale X and tsX so each entry is between 0 and 1
X = X / 255.0
tsX = tsX / 255.0

#train
optimizer = SGD(lr=0.001, momentum=0.9)
vgg19.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
history = vgg19.fit(X, Y, epochs=2, batch_size=64, validation_data=(tsX, tsY), verbose=0)

I was wondering why this is happening; please suggest how to make the Julia code faster than the Python one.

Thanks in advance.

Well, I think that in this case you are comparing the speed of TensorFlow and Julia, not Python and Julia, and this is exactly the use case TensorFlow was optimized for. I would not expect Julia to be faster here.
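
To separate the language from the library, it may help to time a single forward pass in isolation. A minimal sketch, reusing the model and trainX from your post (BenchmarkTools is just my choice of timing tool, any will do):

using Flux, BenchmarkTools
x = trainX[:, :, :, 1:64]   # one 64-image batch in Flux's WHCN layout
model(x)                    # first call pays Julia's one-off JIT compilation cost
@btime $model($x)           # steady-state forward-pass time, dominated by the conv layers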


x-ref: performance - Flux.jl doesn't utilize all the available threads for machine learning in julia - Stack Overflow
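
For what it's worth, a quick way to inspect the thread settings that issue is about (standard Julia, nothing Flux-specific):

using LinearAlgebra
Threads.nthreads()                     # Julia threads; start with `julia -t auto` to use all cores
BLAS.set_num_threads(Sys.CPU_THREADS)  # let BLAS use all cores for the Dense layers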


A good way to see the point that you are really benchmarking TensorFlow's backend is to look at TensorFlow's GitHub page:

It is 61% C++ code and 26% Python code, i.e. it is practically a C++ library with a Python frontend, not a Python library.
So essentially you are comparing Julia to C++, not Julia to Python. In principle both languages have similar run-time performance; which library is faster is mainly a matter of optimization effort (and trade-offs, e.g. for flexibility).
And I am pretty sure that much more money and raw man-power has been invested into the Google-backed TensorFlow than into Flux.


I presume this is https://github.com/FluxML/NNlib.jl/issues/234. Essentially all the time here will be spent in conv, and the CPU implementation called by Julia is slower than the one called by PyTorch.
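
If you want to confirm that, you could time NNlib's conv directly, which is what Flux's Conv layers call under the hood. A minimal sketch with hypothetical sizes matching the first layer of the model above:

using NNlib, BenchmarkTools
x = randn(Float32, 32, 32, 3, 64)   # WHCN: 32x32 RGB images, batch of 64
w = randn(Float32, 3, 3, 3, 64)     # 3x3 kernel, 3 => 64 channels
@btime NNlib.conv($x, $w; pad=1)    # compare against torch.nn.functional.conv2d on the same sizes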
