Please note that the model used for benchmarking is quite tiny:
It is a multilayer perceptron (MLP) with 4 neurons in the input layer, 10 neurons in the hidden layer, and 3 neurons in the output layer.
For a model this small, the TF timing is most likely dominated by initialization overhead.
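
To give a sense of how small this network is, here is a minimal pure-Python sketch of a 4-10-3 MLP. It is not the benchmarked code: the helper names (`count_parameters`, `init_params`, `forward`), the use of bias terms, the `tanh` hidden activation, and the random initialization are all assumptions made for illustration.

```python
import math
import random

# Layer sizes taken from the text: 4 inputs, 10 hidden units, 3 outputs.
LAYER_SIZES = [4, 10, 3]


def count_parameters(sizes):
    """Count weights plus biases of a fully connected network.

    Assumes every layer has a bias vector (not stated in the text).
    """
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def init_params(sizes, seed=0):
    """Create small random weights and zero biases for each layer."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, x):
    """Run one forward pass; tanh on the hidden layer is an assumption."""
    last = len(params) - 1
    for i, (weights, biases) in enumerate(params):
        # Affine transform: one dot product per output neuron.
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < last:
            x = [math.tanh(v) for v in x]
    return x


if __name__ == "__main__":
    print(count_parameters(LAYER_SIZES))  # 4*10 + 10 + 10*3 + 3 = 83
    print(forward(init_params(LAYER_SIZES), [0.1, 0.2, 0.3, 0.4]))
```

Under these assumptions the whole model has 83 trainable parameters, so a single forward pass is only a few dozen multiply-adds. That is consistent with fixed per-run costs, rather than the arithmetic itself, making up most of the measured time.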