Training a supervised machine learning model means that you want to achieve two things: firstly, a model that performs - in other words, one that can successfully predict what class a sample belongs to, or what value should be output for some input. Secondly, while predictive power is important, your model should also generalize well. In other words, it should be able to predict relatively correctly for input samples that it hasn't seen before.
This involves a trade-off: the trade-off between underfitting and overfitting. You don't want your model to lose too much of its predictive power, i.e. to be underfit. However, you also don't want it to be tailored too closely to the data it is trained on - causing it to be overfit, and losing its ability to generalize to data that it hasn't seen before.
And although it may sound strange, this can actually cause problems, because the training dataset and the samples you'll see during inference do not necessarily come from approximately equal distributions!
Measuring the balance between underfitting and overfitting can be done by splitting the dataset into three subsets: training data, validation data and testing data. The first two ensure that the model is trained (training data) and steered away from overfitting (validation data), while the latter can be used to test the model after it has been trained. In this article, we'll focus on the latter.
First, we will look at the balance between underfitting and overfitting in more detail. Subsequently, we will use the tensorflow.keras functionality for evaluating your machine learning model, called model.evaluate. This includes a full Keras example, where we train a model and subsequently evaluate it.
Let's take a look! 😎
Great question - why do we need to evaluate TensorFlow/Keras models in the first place?
To answer it, we must take a look at how a supervised machine learning model is trained. Following the supervised learning process linked before, we note that samples from a training set are fed forward, after which an average error value is computed and subsequently used for model optimization.
The samples in a training set are often derived from some kind of population. For example, if we want to measure voting behavior in a population, we often take a representative sample. We therefore don't measure the behavior of the entire population - which would be really inefficient - but instead assume that if our sample is large enough, its distribution approaches the distribution of the entire population.
In other words, we generalize the smaller sample to the population.
While this often leads to good results, it can also be really problematic.
This emerges from the fact that we don't know whether our sample distribution is equal to the population distribution. While exact equality is hard to achieve, we should do our best to make them as equal as possible. And we cannot know how equal they are without thorough analysis - and even then only approximately, because we can only compare against bigger samples, never the entire population.
Now, if you would train a supervised machine learning model with the training set, you would train it until it is no longer underfit. This means that the model is capable of correctly generating predictions for the samples in your generalized population. However, we must also ensure that it is not overfit - meaning that it was trained too closely to the distribution of your training set. If the distributions don't match, the model will show worse performance when it is used in practice.
Model evaluation helps us to avoid falling into the underfitting/overfitting trap. Before training the model, we split off and set apart some data from the training set, called a testing dataset. Preferably, we split it off randomly, in order to ensure that the distributions of the testing set and the remaining training set samples are relatively equal. After training the model, we then feed the test samples to the model. When it performs well for those samples, we can be more confident that our model can work in practice.
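As a quick illustration of such a random split, here is a minimal sketch using scikit-learn's train_test_split; scikit-learn is not used elsewhere in this article, and the dataset and variable names are placeholders:
import numpy as np
from sklearn.model_selection import train_test_split
# Hypothetical dataset: 1000 samples with 10 features each, and binary targets
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
# Randomly split off 20% of the data as a testing set, so that the
# distributions of the training and testing sets stay approximately equal
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)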
Models with high variance are especially sensitive to overfitting.
If you look at the TensorFlow API, the model.evaluate functionality for model evaluation is part of the tf.keras.Model class, which "groups layers into an object with training and inference features" (Tf.keras.Model, n.d.). It looks like this:
evaluate(
x=None, y=None, batch_size=None, verbose=1, sample_weight=None, steps=None,
callbacks=None, max_queue_size=10, workers=1, use_multiprocessing=False,
return_dict=False
)
With these attributes:
- x and y representing the samples and targets of your testing data, respectively.
- batch_size representing the number of samples fed through evaluate at once. By default, it is None, in which case it equals 32.
- With verbose, it is possible to show a progress bar (1) or nothing (0).
- sample_weight to specify a 1D or 2D array with weights, in order to weigh individual samples (the 1D case) or timesteps (the 2D case).
- steps represents the total number of batches before evaluating is declared finished. If the number of batches available based on the batch size (i.e. int(len(test_data) / batch_size)) is higher than steps, only steps batches will be fed forward. If set to None, it will continue until exhausted (i.e. until all batches have been fed forward).
- With callbacks, it is possible to attach callbacks to the evaluation process.
- max_queue_size, workers and use_multiprocessing, which control how input is loaded when it comes from a generator or keras.utils.Sequence.
- Set return_dict to True in order to let the evaluate function return a dictionary.
Let's now take a look at creating a TensorFlow/Keras model that uses model.evaluate for model evaluation.
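Before we build a full model, here is a minimal sketch of how these arguments fit together. The compiled model and the x_test / y_test NumPy arrays are assumptions for illustration only:
# Minimal sketch: evaluate a compiled model on held-out test data
results = model.evaluate(
    x=x_test,            # test samples
    y=y_test,            # test targets
    batch_size=64,       # number of samples fed through the model at once
    verbose=1,           # show a progress bar
    return_dict=True     # return {'loss': ..., 'accuracy': ...} instead of a list
)
print(results)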
We first create the following TensorFlow model. After importing the dependencies and setting the model configuration, we load the EMNIST digits dataset, reshape it, cast the numbers to float32 format, and scale the data to the \([0, 1]\) range. We then define, compile and fit a simple Convolutional Neural Network.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam
from extra_keras_datasets import emnist
# Model configuration
img_width, img_height = 28, 28
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1
# Load EMNIST dataset
(input_train, target_train), (input_test, target_test) = emnist.load_data(type='digits')
# Reshape data
input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
input_shape = (img_width, img_height, 1)
# Cast numbers to float32
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')
# Scale data
input_train = input_train / 255
input_test = input_test / 255
# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))
# Compile the model
model.compile(loss=sparse_categorical_crossentropy,
optimizer=Adam(),
metrics=['accuracy'])
# Fit data to model
model.fit(input_train, target_train,
batch_size=batch_size,
epochs=no_epochs,
verbose=verbosity,
validation_split=validation_split)
As we saw, training a model is only one step - your other task as an ML engineer is to see whether your model generalizes well.
For this, during the loading operation, we loaded both training data and testing data.
You can now use model.evaluate
in order to generate evaluation scores and print them in your console.
- We call evaluate on the model with the testing data - verbosity off, so we don't see output on the screen.
- The score variable contains the scores in a fixed order: the loss first, followed by the metrics specified when compiling the model. Hence, score[0] represents crossentropy, and score[1] represents accuracy.
- We finally call print() to output the scores on screen.
# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
Running the model will first train our model and subsequently print the evaluation metrics:
Test loss: 0.0175113923806377 / Test accuracy: 0.9951000213623047
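If you prefer named outputs over positional indexing, evaluate can also return a dictionary. A small sketch, assuming the same model and test data as above; the keys follow the loss and metric names set at compile time:
# Alternative: let evaluate return a dictionary keyed by metric name
score = model.evaluate(input_test, target_test, verbose=0, return_dict=True)
print(f"Test loss: {score['loss']} / Test accuracy: {score['accuracy']}")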
In the example above, we used load_data() to load the dataset into variables. This is easy, and that's precisely the goal of my Keras extensions library. However, many times, practice is a bit less ideal, and there are several other approaches to importing your training dataset. With most of them, you likely still end up with lists of training samples - i.e., having to load them into variables and thus into memory. For these cases, the example above can be used. But did you know that it is also possible to flow data from your file system into the model? In other words, did you know that you can use a generator to train your machine learning model?
And it is also possible to evaluate a model using model.evaluate
if you are using a generator. Say, for example, that you are using the following generator:
import numpy as np

# Load data: a generator that reads a CSV file and yields batches of (inputs, targets)
def generate_arrays_from_file(path, batchsize):
    inputs = []
    targets = []
    batchcount = 0
    while True:
        with open(path) as f:
            for line in f:
                # Each line holds one sample: an input value and a target value
                x, y = line.split(',')
                inputs.append(x)
                targets.append(y)
                batchcount += 1
                if batchcount >= batchsize:
                    # Convert the accumulated samples to NumPy arrays and yield one batch
                    X = np.array(inputs, dtype='float32')
                    y = np.array(targets, dtype='float32')
                    yield (X, y)
                    inputs = []
                    targets = []
                    batchcount = 0
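For completeness, here is a hedged sketch of how such a generator could also drive training, assuming a model whose inputs match the data yielded by the generator; the CSV path and step count are hypothetical:
# Hypothetical training call using the generator defined above;
# steps_per_epoch tells Keras how many batches make up one epoch.
model.fit(generate_arrays_from_file('./training_samples.csv', batch_size),
          steps_per_epoch=100,
          epochs=no_epochs,
          verbose=verbosity)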
Then you can evaluate the model by passing the generator to the evaluation function. Make sure to use a different path
compared to your training dataset, since these need to be strictly separated.
# Generate generalization metrics
score = model.evaluate(generate_arrays_from_file('./five_hundred_evaluation_samples.csv', batch_size), verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
Here, we would have a CSV file with five hundred evaluation samples, which we feed forward in batches of batch_size samples. In our case, that would be 2 steps for each evaluation round, as we configured batch_size
to be 250.
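If you want to make this explicit, you can pass the number of steps yourself. A small sketch, reusing the hypothetical CSV file from above:
# Explicitly evaluate on 2 batches of 250 samples each (500 samples in total)
score = model.evaluate(generate_arrays_from_file('./five_hundred_evaluation_samples.csv', batch_size),
                       steps=2,
                       verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')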
Note that you don't have to pass targets here, as they are obtained from the generator (Tf.keras.Model, n.d.).
In this article, we looked at model evaluation, and more specifically at the usage of model.evaluate in TensorFlow and Keras. Firstly, we looked at the need for evaluating your machine learning model: a model must work in practice, and in some cases it is easy to overfit it to its training data.
We then moved forward to practice, and demonstrated how model.evaluate
can be used to evaluate TensorFlow/Keras models based on the loss function and other metrics specified in the training process. This included an example. Another example was also provided for people who train their Keras models by means of a generator and want to evaluate them.
I hope that you have learnt something from today's article! If you did, please feel free to leave a comment in the comments section 💬 I'd love to hear from you. Please do the same if you have questions or other comments. Where possible, I'd love to help you out. Thank you for reading MachineCurve today and happy engineering! 😎
Tf.keras.Model. (n.d.). TensorFlow. https://www.tensorflow.org/api_docs/python/tf/keras/Model#evaluate