The inputs to individual layers in a neural network can be normalized to speed up training. This process, called Batch Normalization, attempts to resolve an issue in neural networks called internal covariate shift.
But how does it work? And how does it work in terms of code - with the Keras deep learning framework? We'll take a look at these questions in this blog.
Firstly, we'll provide a recap on Batch Normalization, so that you gain some conceptual understanding or refresh what you already knew. This includes a discussion of the problem, why it occurs during training, and how Batch Normalization may resolve it.
Then, we move on to the actual Keras part - by providing you with an example neural network using Batch Normalization to learn classification on the KMNIST dataset. Each step of the code which creates the neural network is explained so that you understand how it works.
Are you ready? Let's go! :)
Before we start coding, let's take a brief look at Batch Normalization again. We start off with a discussion about internal covariate shift and how this affects the learning process. Subsequently, as the need for Batch Normalization will then be clear, we'll provide a recap on Batch Normalization itself to understand what it does.
Suppose that you have this neural network, which is composed of Dropout neurons:
Following the high-level supervised machine learning process, training such a neural network is a multi-step process:
Now take a look at the neural network from a per-layer point of view. Each layer takes some input, performs a linear operation using the input vector and the weights vector, feeds the data into a nonlinear activation function, and passes the data to the next layer or the output.
Neural networks train fast if the distribution of the data remains the same, and especially if it is normalized to zero mean and unit variance \((\mu = 0, \sigma = 1)\). This is not the case when no Batch Normalization is applied: by training the network (i.e. changing the weights of the individual neurons), the outputs of every layer change, which means that the distribution of the input data for every subsequent layer changes with every iteration.
We call this internal covariate shift (Ioffe & Szegedy, 2015). It is bad, because it can slow down learning. Fortunately, it can be avoided - and Batch Normalization is a way of doing so.
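To make this more tangible, here is a minimal NumPy sketch of the effect. It is not part of the Keras model in this post: the data, the layer sizes and the (exaggerated) weight update are made-up assumptions, chosen only to show that changing the weights of one layer changes the distribution of what the next layer receives.

```python
import numpy as np

rng = np.random.default_rng(42)

# A fixed minibatch of 250 samples with 20 features each (made-up data)
x = rng.normal(size=(250, 20))

# Weights of a hypothetical first Dense layer with 10 units
w1 = rng.normal(scale=0.5, size=(20, 10))

def layer_output(inputs, weights):
    """Linear operation followed by ReLU, as in a plain Dense layer."""
    return np.maximum(inputs @ weights, 0.0)

# What the second layer receives before the weight update
before = layer_output(x, w1)

# Simulate one (exaggerated) optimization step on the first layer
w1_updated = w1 + 0.5

# What the second layer receives after the weight update
after = layer_output(x, w1_updated)

print(f'Before update: mean={before.mean():.3f}, std={before.std():.3f}')
print(f'After update:  mean={after.mean():.3f}, std={after.std():.3f}')
```

The input data is identical in both cases, yet the statistics seen by the second layer differ. In a real network, every layer has to keep adapting to such shifts while it is learning itself.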
As we saw before, neural networks train fast if the distribution of the input data remains similar over time. Batch Normalization helps you do this by doing two things: normalizing the input value and scaling and shifting it.
Normalizing the value: \(\hat{x}_B^{(k)} \leftarrow \frac{x_B^{(k)} - \mu_B^{(k)}}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}}\)
Every input \(x_B^{(k)}\) is normalized by first subtracting the input sample mean \(\mu_B^{(k)}\) and then dividing by \(\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}\), which is the square root of the sum of the input sample variance and some small \(\epsilon\). Do note:
Scaling and shifting: \(y_B^{(k)} \leftarrow \gamma \hat{x}_B^{(k)} + \beta\).
With some activation functions (such as the Sigmoid activation function), normalizing inputs to zero mean and unit variance, i.e. \((\mu = 0, \sigma = 1)\), may result in a different issue: the neurons will behave almost linearly, as their inputs primarily fall in the linear segment of the activation function.
I explain this in more detail, including why it needs to be avoided, in a separate post on Batch Normalization (MachineCurve, 2020).
By scaling the value with some \(\gamma\) and shifting the value with some \(\beta\), this problem can be avoided. The values for these are learnt during training.
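The two steps above can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the Keras implementation: the function name `batch_norm_forward` is our own, the input is a 2D minibatch of shape (samples, features), and \(\gamma\) and \(\beta\) are fixed by hand here instead of being learnt.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, epsilon=1e-3):
    """Normalize each feature over the minibatch, then scale and shift."""
    mu = x.mean(axis=0)                        # per-feature minibatch mean
    var = x.var(axis=0)                        # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + epsilon)  # normalizing the value
    return gamma * x_hat + beta                # scaling and shifting

# Made-up minibatch: 250 samples, 4 features, clearly not normalized
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(250, 4))

# With gamma = 1 and beta = 0, the output is simply the normalized input
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

# Other values for gamma and beta move the output to a different distribution
y_shifted = batch_norm_forward(x, gamma=2.0 * np.ones(4), beta=0.5 * np.ones(4))

print(y.mean(axis=0), y.std(axis=0))
print(y_shifted.mean(axis=0), y_shifted.std(axis=0))
```

With \(\gamma = 1\) and \(\beta = 0\), every feature ends up with a mean of approximately 0 and a standard deviation of approximately 1. With \(\gamma = 2\) and \(\beta = 0.5\), the mean moves to approximately 0.5 and the standard deviation to approximately 2, which is the freedom the network gets by learning these two values.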
In the Keras API (TensorFlow, n.d.), Batch Normalization is defined as follows:
keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)
Put simply, Batch Normalization can be added as easily as adding a BatchNormalization() layer to your model, e.g. with model.add. However, if you wish, local parameters can be tuned to steer the way in which Batch Normalization works. These parameters are as follows:
center: if True, the value for \(\beta\) is used; if False, it's ignored.
scale: if True, the value for \(\gamma\) is used; if False, it's ignored.
Why the moving mean and variance, you say?
This has to do with how Batch Normalization works during training time versus inference time.
During training time, there's a larger minibatch available which you can use to compute sample mean and sample variance.
However, during inference, the sample size is often just one. You cannot compute a meaningful mean and variance in that case, because you have one value only, which may be an outlier. By keeping the moving mean and moving variance from the training process available, you can use these values to normalize during inference. Simple, yet a great fix for this issue :)
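The difference between training time and inference time can be sketched as follows. `SimpleBatchNorm` is a hypothetical toy class of our own, not the Keras layer. It assumes 2D inputs and updates the moving statistics as `moving = momentum * moving + (1 - momentum) * batch_statistic`. We also use a momentum of 0.9 instead of the Keras default of 0.99, purely so that the moving averages settle within a few hundred made-up minibatches.

```python
import numpy as np

class SimpleBatchNorm:
    """Toy batch normalization for 2D inputs of shape (samples, features)."""

    def __init__(self, num_features, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        self.gamma = np.ones(num_features)   # fixed here; learnt in practice
        self.beta = np.zeros(num_features)   # fixed here; learnt in practice
        self.moving_mean = np.zeros(num_features)
        self.moving_variance = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            # Training: use the statistics of the current minibatch ...
            mean, var = x.mean(axis=0), x.var(axis=0)
            # ... and keep track of them in the moving averages
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_variance = m * self.moving_variance + (1 - m) * var
        else:
            # Inference: fall back on what was collected during training
            mean, var = self.moving_mean, self.moving_variance
        x_hat = (x - mean) / np.sqrt(var + self.epsilon)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(1)
bn = SimpleBatchNorm(num_features=3, momentum=0.9)

# "Training": 200 made-up minibatches with mean 4 and standard deviation 2
for _ in range(200):
    batch = rng.normal(loc=4.0, scale=2.0, size=(250, 3))
    bn(batch, training=True)

# "Inference" on a single sample: no batch statistics are available
single_sample = np.array([[4.0, 6.0, 2.0]])
out = bn(single_sample, training=False)
print(bn.moving_mean, bn.moving_variance)
print(out)
```

Because the moving mean ends up close to 4 and the moving variance close to 4, the single sample is mapped to roughly 0, 1 and -1, even though a mean and variance could never have been estimated from that one sample alone.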
Let's take a look at the model we're going to create today :) First, we'll see what dataset we're going to use - being the KMNIST dataset. This is followed by a discussion about the model we'll be creating in this tutorial.
For the dataset, we're using the KMNIST dataset today:
It is a drop-in replacement for the MNIST dataset:
The extra-keras-datasets module
We use the extra-keras-datasets module to load our dataset. This module, which we created and discussed in a different blog post, attempts to replicate the keras.datasets way of loading data into your machine learning projects, albeit with different datasets. As we believe that making more datasets easily available boosts adoption of a framework, especially by people who are just starting out, we've been making available additional datasets for Keras through this module. It may be worthwhile to check it out separately!
Installing this module is required if you wish to run the model (if you don't, you may also replace it with keras.datasets.mnist), and can be done very easily:
pip install extra-keras-datasets
With one line, a dataset can be imported into your model:
from extra_keras_datasets import kmnist
And subsequently loading the data into the particular variables is also easy:
(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist')
This is the architecture of today's model, which we generated with Net2Vis (Bäuerle & Ropinski, 2019):
Our model has two convolutional blocks followed by two dense layers:
Let's now see how we can implement this model with Keras :) We'll be using the TensorFlow 2.0 approach to Keras, which is the currently preferred way of using the library. This unfortunately means that it's no longer possible to use Keras with Theano or CNTK. However, if you still wish to use one of those backends, it may still work to replace tensorflow.keras with keras, i.e. the original library.
Creating the model is a multi-step process:
Let's go! Open your Explorer or Finder, navigate to some folder, and create a Python file, e.g. model_batchnorm.py. Next, open this file in your code editor - so that we can start coding :)
These are our model imports:
from extra_keras_datasets import kmnist
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization
We can describe them as follows:
We import the tensorflow library so that we can import Keras stuff next.
From models, we import the Sequential API - which allows us to stack individual layers nicely and easily.
From layers, we import Dense, Flatten, Conv2D, MaxPooling2D and BatchNormalization - i.e., the layers from the architecture that we specified.
Finally, we import the kmnist dataset from the extra_keras_datasets library.
We can then set the configuration for our model:
# Model configuration
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1
The batch size of our minibatches is set to 250, which balances well between gradient precision and memory requirements. We'll train for 25 epochs (which could be higher if you wish, just configure it to a different number :)) and tell the model that we have 10 classes that it can classify into - i.e., the 10 KMNIST classes. 20% of the training data will be used for validation purposes, and with verbosity mode set to 1, all output will be displayed on screen.
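To get a feeling for what this configuration means in practice, the small calculation below works out how many samples are used for training and validation, and how many minibatches make up one epoch. It assumes that the KMNIST training set contains 60,000 images, like the MNIST dataset it replaces.

```python
import math

batch_size = 250
validation_split = 0.2
num_train_samples = 60000  # assumption: size of the KMNIST training set

# The validation split is taken from the training data before training starts
num_val_samples = round(num_train_samples * validation_split)
num_fit_samples = num_train_samples - num_val_samples

# Number of minibatches (weight updates) per epoch
steps_per_epoch = math.ceil(num_fit_samples / batch_size)

print(f'Training samples:   {num_fit_samples}')
print(f'Validation samples: {num_val_samples}')
print(f'Steps per epoch:    {steps_per_epoch}')
```

Under these assumptions, 48,000 samples remain for training and 12,000 are held out for validation, so the progress bar in your terminal should count 192 steps per epoch.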
The next step is loading the data. Because extra-keras-datasets mirrors the way in which the original keras.datasets module is used, this is definitely easy:
# Load KMNIST dataset
(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist')
Subsequent processing of the data so that it is prepared for training is a bit more complex, but it is not very difficult either:
# Shape of the input sets
input_train_shape = input_train.shape
input_test_shape = input_test.shape 
With this step, we obtain the shape of our input_train and input_test datasets, i.e. our features. We'll use the first to set the shape of our Keras input data next - which are image height (shape dim 1), image width (shape dim 2) and the number of channels (just one):
# Keras layer input shape
input_shape = (input_train_shape[1], input_train_shape[2], 1)
Channels have to be included because Keras expects them during training.
Next, because the data does not yet have the channels dimension, we'll have to reshape our data to include it as well:
# Reshape the training data to include channels
input_train = input_train.reshape(input_train_shape[0], input_train_shape[1], input_train_shape[2], 1)
input_test = input_test.reshape(input_test_shape[0], input_test_shape[1], input_test_shape[2], 1)
Now, the bulk of the work is done. We next convert the data to float32 format which presumably speeds up training:
# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')
And finally normalize the data:
# Normalize input data
input_train = input_train / 255
input_test = input_test / 255
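If you want to check what these preprocessing steps do without downloading the dataset first, you can try them on a few random images. The sketch below is our own: the helper `prepare_images` and the fake data are assumptions for illustration, standing in for the real KMNIST arrays.

```python
import numpy as np

def prepare_images(images):
    """Add a channels dimension, cast to float32 and scale to [0, 1]."""
    num_samples, height, width = images.shape
    images = images.reshape(num_samples, height, width, 1)
    images = images.astype('float32')
    return images / 255

# Stand-in for the KMNIST images: 5 random 28x28 grayscale images (uint8)
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

prepared = prepare_images(fake_images)
print(fake_images.shape, fake_images.dtype)  # shape and type before
print(prepared.shape, prepared.dtype)        # shape and type after
print(prepared.min(), prepared.max())        # values now lie within [0, 1]
```

The shape changes from (5, 28, 28) to (5, 28, 28, 1), the data type from uint8 to float32, and all pixel values end up between 0 and 1.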
We're now ready to define the architecture.
Since we already discussed the architecture of our model above, its components won't be too surprising. However, what is still lacking is the actual code for our architecture - so let's write it now and explain it afterwards:
# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(no_classes, activation='softmax'))
What this code does is create an instance of a model based on the Sequential API. Subsequently, the convolutional, pooling, batch normalization and Dense layers are stacked with model.add.
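It can help to trace how the shape of the data changes while it flows through these layers. The small calculation below does so by hand. It assumes 28x28 pixel input images and relies on the Keras defaults that our code leaves untouched: 'valid' padding and a stride of 1 for Conv2D, and a stride equal to the pool size for MaxPooling2D. The helper functions are our own and not part of Keras.

```python
def conv_output_size(size, kernel_size=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel_size + 1

def pool_output_size(size, pool_size=2):
    """Spatial size after max pooling with stride equal to the pool size."""
    return size // pool_size

size = 28                       # assumption: 28x28 pixel input images
size = conv_output_size(size)   # Conv2D(32, 3x3)   -> 26
size = pool_output_size(size)   # MaxPooling2D(2x2) -> 13
size = conv_output_size(size)   # Conv2D(64, 3x3)   -> 11
size = pool_output_size(size)   # MaxPooling2D(2x2) -> 5
flattened = size * size * 64    # Flatten           -> 1600 values per sample

# Each BatchNormalization layer keeps four values per channel or feature:
# gamma, beta, the moving mean and the moving variance
bn_channels = [32, 32, 64, 64, 256]
bn_params = sum(4 * channels for channels in bn_channels)

print(f'Final feature map size: {size}x{size}')
print(f'Flattened size: {flattened}')
print(f'BatchNormalization parameters: {bn_params}')
```

Batch Normalization itself never changes the shape of the data: it only adds a handful of parameters per layer, half of which (the moving mean and moving variance) are not trained by the optimizer. Running model.summary() on the real model should let you verify these numbers.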
Some things we haven't included in the architectural discussion before:
The next step is model compilation:
# Compile the model
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])
As you can see, model compilation is essentially instantiating the model architecture we defined before with the model configuration we set before. We use sparse categorical crossentropy loss, which combines nicely with our integer target values - so that we don't have to convert these into categorical format before we start training. To optimize the model, we use the Adam optimizer, and add accuracy as an additional metric.
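To see why the sparse variant saves us the conversion step, here is a small NumPy sketch of both loss variants. The two functions are simplified re-implementations of our own for illustration (not the Keras code), and the softmax outputs are made up.

```python
import numpy as np

def sparse_categorical_crossentropy(targets, probabilities):
    """Mean negative log-probability of the true class (integer targets)."""
    picked = probabilities[np.arange(len(targets)), targets]
    return -np.mean(np.log(picked))

def categorical_crossentropy(one_hot_targets, probabilities):
    """The same loss, but for one-hot encoded targets."""
    return -np.mean(np.sum(one_hot_targets * np.log(probabilities), axis=1))

# Made-up softmax outputs for 3 samples and 4 classes
probabilities = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
])

targets = np.array([0, 1, 3])   # integer targets, like the KMNIST targets
one_hot = np.eye(4)[targets]    # the categorical (one-hot) format

sparse_loss = sparse_categorical_crossentropy(targets, probabilities)
dense_loss = categorical_crossentropy(one_hot, probabilities)
print(sparse_loss, dense_loss)
```

Both functions produce the same value. The sparse variant simply looks up the predicted probability of the true class directly, instead of multiplying with a one-hot vector first.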
Then, we fit the data to our model, a.k.a. starting the training process:
# Fit data to model
history = model.fit(input_train, target_train,
            batch_size=batch_size,
            epochs=no_epochs,
            verbose=verbosity,
            validation_split=validation_split)
We fit the input training set with its corresponding targets, and train according to the preconfigured batch_size and no_epochs, with verbosity mode set to on and the validation_split set as before (i.e., to 20%).
Note that the history object can be used for visualizing the training process / the improvements over epochs later.
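A possible way of doing so with Matplotlib is sketched below. The function `plot_history` and the file name are our own choices, and the numbers in `example_history` are made up so that the snippet runs on its own. In your model file, you would pass history.history instead, which, with the accuracy metric configured as above in TensorFlow 2.0+, should contain the same four keys.

```python
import matplotlib
matplotlib.use('Agg')  # write to a file, so that no display is required
import matplotlib.pyplot as plt

def plot_history(history_dict, filename='training_history.png'):
    """Plot training/validation loss and accuracy per epoch and save as PNG."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))

    ax_loss.plot(history_dict['loss'], label='Training loss')
    ax_loss.plot(history_dict['val_loss'], label='Validation loss')
    ax_loss.set_xlabel('Epoch')
    ax_loss.set_ylabel('Loss')
    ax_loss.legend()

    ax_acc.plot(history_dict['accuracy'], label='Training accuracy')
    ax_acc.plot(history_dict['val_accuracy'], label='Validation accuracy')
    ax_acc.set_xlabel('Epoch')
    ax_acc.set_ylabel('Accuracy')
    ax_acc.legend()

    fig.tight_layout()
    fig.savefig(filename)
    plt.close(fig)
    return filename

# Made-up values, standing in for history.history after model.fit
example_history = {
    'loss': [0.60, 0.30, 0.20],
    'val_loss': [0.50, 0.35, 0.30],
    'accuracy': [0.82, 0.91, 0.94],
    'val_accuracy': [0.85, 0.90, 0.91],
}

saved_path = plot_history(example_history)
print(f'Plot saved to {saved_path}')
```

In the actual model file, calling plot_history(history.history) after model.fit would then write the plot for your own training run.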
The final step is generating evaluation metrics with our test set, to see whether our model generalizes to unseen data:
# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
Now, we can start training! :)
Should you wish instead to obtain the full code for the model at once, here you go :)
from extra_keras_datasets import kmnist
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization
# Model configuration
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1
# Load KMNIST dataset
(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist')
# Shape of the input sets
input_train_shape = input_train.shape
input_test_shape = input_test.shape 
# Keras layer input shape
input_shape = (input_train_shape[1], input_train_shape[2], 1)
# Reshape the training data to include channels
input_train = input_train.reshape(input_train_shape[0], input_train_shape[1], input_train_shape[2], 1)
input_test = input_test.reshape(input_test_shape[0], input_test_shape[1], input_test_shape[2], 1)
# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')
# Normalize input data
input_train = input_train / 255
input_test = input_test / 255
# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(no_classes, activation='softmax'))
# Compile the model
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])
# Fit data to model
history = model.fit(input_train, target_train,
            batch_size=batch_size,
            epochs=no_epochs,
            verbose=verbosity,
            validation_split=validation_split)
# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
To start training, open up a terminal which has the required software dependencies installed (i.e. tensorflow 2.0+ and the extra-keras-datasets module), cd to the folder where your Python file is located, and run it with e.g. python model_batchnorm.py.
Most likely, the training process will then begin, and you should see the test results once it finishes. Here are the results over the epochs shown visually. They were generated by means of the history object (note that you must add extra code to make this work):
As you can see, the model performs well. Obviously, for practical settings, this will be different as your data set is likely much more complex, but I'm curious whether Batch Normalization will help ensure faster convergence in your models! Please let me know in the comments section below :)
In this blog post, we've looked at how to apply Batch Normalization in your Keras models. This included a discussion about the concept of internal covariate shift and why this may slow down the learning process. Additionally, we provided a recap on the concept of Batch Normalization and how it works, and why it may reduce these issues.
This was followed by a Keras implementation using the TensorFlow 2.0 way of working. The full code was split into small blocks which contained an explanation. This way, I hope that you understood well why I coded what I coded.
A long story short: I hope you've learnt something today! If you did, I'd love to know what, and you can leave a comment below. Please do the same if you have questions left or remarks that you wish to express. Thank you for reading MachineCurve today and happy engineering! 😊
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Bäuerle, A., & Ropinski, T. (2019). Net2Vis: Transforming Deep Convolutional Networks into Publication-Ready Visualizations. arXiv preprint arXiv:1902.04394.
MachineCurve. (2020, January 14). What is Batch Normalization for training neural networks? Retrieved from https://www.machinecurve.com/index.php/2020/01/14/what-is-batch-normalization-for-training-neural-networks/
Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., & Ha, D. (2018). Deep learning for classical Japanese literature. arXiv preprint arXiv:1812.01718. Retrieved from https://arxiv.org/abs/1812.01718
TensorFlow. (n.d.). tf.keras.layers.BatchNormalization. Retrieved from https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization
 