← Back to homepage

How to evaluate a TensorFlow 2.0 Keras model with model.evaluate

November 3, 2020 by Chris

Training a supervised machine learning model means that you want to achieve two things: firstly, a model that performs - in other words, that it can successfully predict what class a sample should belong to, or what value should be output for some input. Secondly, while predictive power is important, your model should also be able to generalize well. In other words, it should also be able to predict relatively correctly for input samples that it hasn't seen before.

This often comes at a trade-off: the trade-off between underfitting and overfitting. You don't want your model to lose too much of its predictive power, i.e. being overfit. However, you neither want it to be too good for the data it is trained on - causing it to be overfit, and losing its ability to generalize to data that it hasn't seen before.

And although it may sound strange, this can actually cause problems, because the training dataset and inference samples should not necessarily come from a sample with an approximately equal distribution!

Measuring the balance between underfitting and overfitting can be done by splitting the dataset into three subsets: training data, validation data and testing data. The first two ensure that the model is trained (training data) and steered away from overfitting (validation data), while the latter can be used to test the model after it has been trained. In this article, we'll focus on the latter.

First, we will look at the balance between underfitting and overfitting in more detail. Subsequently, we will use the tensorflow.keras functionality for evaluating your machine learning model, called model.evaluate. This includes a full Keras example, where we train a model and subsequently evaluate it.

Let's take a look! 😎

Why evaluate Keras models?

Great question - why do we need to evaluate TensorFlow/Keras models in the first place?

To answer it, we must take a look at how a supervised machine learning model is trained. Following the supervised learning process linked before, we note that samples from a training set are fed forward, after which an average error value is computed and subsequently used for model optimization.

The samples in a training set are often derived from some kind of population. For example, if we want to measure voting behavior in a population, we often take a representative sample. We therefore don't measure the behavior of the entire population - which would be really inefficient - but instead assume that if our sample is large enough, its distribution approaches the distribution of the entire population.

In other words, we generalize the smaller sample to the population.

While this often leads to good results, it can also be really problematic.

This emerges from the fact that we don't know whether our sample distribution is equal to the population distribution. While exact equality is hard to achieve, we should do our best to make them as equal as possible. And we know that neither without thorough analysis, and even then, because we can only compare to bigger samples.

Now, if you would train a supervised machine learning model with the training set, you would train until it is no longer underfit. This means that the model is capable of correctly generating predictions for the samples in your generalized population. However, we must also ensure that it is not overfit - meaning that it was trained too closely for the distribution of your training set. If the distributions don't match, the model will show worse performance when it is used in practice.

Model evaluation helps us to avoid falling into the underfitting/overfitting trap. Before training the model, we split off and set apart some data from the training set, called a testing dataset, Preferably, we split off randomly - in order to ensure that the distributions of the testing set and remaining training set samples are relatively equal. After training the model, we then feed the test samples to the model. When it performs well for those samples, we can be more confident that our model can work in practice.

Especially models with high variance are sensitive to overfitting.

Working with model.evaluate

If you look at the TensorFlow API, the model.evaluate functionality for model evaluation is part of the tf.keras.Model functionality class, which "groups layers into an object with training and inference features" (Tf.kerasa.Model, n.d.).

It looks like this:

    x=None, y=None, batch_size=None, verbose=1, sample_weight=None, steps=None,
    callbacks=None, max_queue_size=10, workers=1, use_multiprocessing=False,

With these attributes:

A full Keras example

Let's now take a look at creating a TensorFlow/Keras model that uses model.evaluate for model evaluation.

We first create the following TensorFlow model.

Click here if you wish to understand creating a Convolutional Neural Network in more detail.

from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam
from extra_keras_datasets import emnist

# Model configuration
img_width, img_height = 28, 28
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1

# Load EMNIST dataset
(input_train, target_train), (input_test, target_test) = emnist.load_data(type='digits')

# Reshape data
input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
input_shape = (img_width, img_height, 1)

# Cast numbers to float32
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Scale data
input_train = input_train / 255
input_test = input_test / 255

# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dense(256, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))

# Compile the model

# Fit data to model
model.fit(input_train, target_train,

As we saw, training a model is only one step - your other task as a ML engineer is to see whether your model generalizes well.

For this, during the loading operation, we loaded both training data and testing data.

You can now use model.evaluate in order to generate evaluation scores and print them in your console.

# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

Running the model will first train our model and subsequently print the evaluation metrics:

Test loss: 0.0175113923806377 / Test accuracy: 0.9951000213623047

Keras model.evaluate if you're using a generator

In the example above, we used load_data() to load the dataset into variables. This is easy, and that's precisely the goal of my Keras extensions library. However, many times, practice is a bit less ideal. In those cases, many approaches to importing your training dataset are out there. Three of them are, for example:

With the former two, you likely still end up with lists of training samples - i.e., having to load them into variables and thus in memory. For these cases, the example above can be used. But did you know that it is also possible to flow data from your system into the model. In other words, did you know that you can use a generator to train your machine learning model?

And it is also possible to evaluate a model using model.evaluate if you are using a generator. Say, for example, that you are using the following generator:

# Load data
def generate_arrays_from_file(path, batchsize):
    inputs = []
    targets = []
    batchcount = 0
    while True:
        with open(path) as f:
            for line in f:
                x,y = line.split(',')
                batchcount += 1
                if batchcount > batchsize:
                  X = np.array(inputs, dtype='float32')
                  y = np.array(targets, dtype='float32')
                  yield (X, y)
                  inputs = []
                  targets = []
                  batchcount = 0

Then you can evaluate the model by passing the generator to the evaluation function. Make sure to use a different path compared to your training dataset, since these need to be strictly separated.

# Generate generalization metrics
score = model.evaluate(generate_arrays_from_file('./five_hundred_evaluation_samples.csv', batch_size), verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

Here, we would have a CSV file with five hundred evaluation samples - and we feed them forward with batch_size sized sample batches. In our cases, that would be 2 steps for each evaluation round, as we configured batch_size to be 250.

Note that you don't have to pass targets here, as they are obtained from the generator (Tf.keras.Model, n.d.).


In this article, we looked at model evaluation, and most specifically the usage of model.evaluate in TensorFlow and Keras. Firstly, we looked at the need for evaluating your machine learning model. We saw that it is necessary to do that because of the fact that models must work in practice, and that it is easy to overfit them in some cases.

We then moved forward to practice, and demonstrated how model.evaluate can be used to evaluate TensorFlow/Keras models based on the loss function and other metrics specified in the training process. This included an example. Another example was also provided for people who train their Keras models by means of a generator and want to evaluate them.

I hope that you have learnt something from today's article! If you did, please feel free to leave a comment in the comments section 💬 I'd love to hear from you. Please do the same if you have questions or other comments. Where possible, I'd love to help you out. Thank you for reading MachineCurve today and happy engineering! 😎


Tf.keras.Model. (n.d.). TensorFlow. https://www.tensorflow.org/api_docs/python/tf/keras/Model#evaluate

Hi, I'm Chris!

I know a thing or two about AI and machine learning. Welcome to MachineCurve.com, where machine learning is explained in gentle terms.