In this article, we are going to see how to standardize data using TensorFlow in Python.
What is Data Standardization?
The process of converting the organizational structure of various datasets into a single, standard data format is known as data standardization. It involves transforming datasets after they are collected from various sources and before they are loaded into target systems. Although it takes a significant amount of time and iteration to complete, it results in highly accurate and efficient integration and development.
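As a toy illustration of the value-scaling side of standardization (the array below is made up for this example and is not part of the flower dataset used later), numeric values collected from different sources can be rescaled onto a common [0, 1] range:

Python3

import numpy as np

# made-up raw values, e.g. collected from different sources
raw_values = np.array([12.0, 47.5, 180.0, 255.0])

# min-max rescaling onto a common [0, 1] range
standardized = (raw_values - raw_values.min()) / (raw_values.max() - raw_values.min())
print(standardized)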
How can TensorFlow be used to standardize the data?
We are using the flower dataset to understand how TensorFlow can be used to standardize data in Python. The flower dataset contains several thousand images of flowers with proper naming. It has five sub-directories, one for each class. After being downloaded with the ‘get_file’ method, the flower dataset will be loaded into the environment for use.
Now, let’s see how to download the flower dataset. Before downloading, we need to import some Python libraries; to run the code below, we use Google Colab.
Import libraries
In the first step, we import some of the important TensorFlow and Python libraries that we are going to use in the following steps.
Python
import matplotlib.pyplot as plt
import numpy as np
import os
import PIL
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import pathlib as pt
Download the Dataset
We are using a flower dataset that contains five sub-directories, one for each class. To use the dataset, we first need to download it, which we do with the get_file() method.
Python3
dataset_url = ("https://storage.googleapis.com/download.tensorflow.org/"
               "example_images/flower_photos.tgz")
data_dir = tf.keras.utils.get_file('flower_photos',
                                   origin=dataset_url,
                                   untar=True)
data_dir = pt.Path(data_dir)
You should now have a copy of the dataset after downloading. There are 3,670 images in total, and you can count the images in the dataset using the code below:
Python3
img_count = len(list(data_dir.glob('*/*.jpg')))
print(img_count)
Output:
3670
The dataset contains 5 categories of flowers: roses, tulips, daisy, dandelion, and sunflowers. You can inspect images by their category name using the code below:
Python3
roses = list(data_dir.glob('roses/*'))
PIL.Image.open(str(roses[0]))
Load the Dataset
To load the dataset, you need to define some parameters for the loader. We also split the dataset: here we use 60% of the flower dataset for training and 40% for validation (the validation split is shown after the training output below).
Python3
batch_size = 32
img_height = 180
img_width = 180

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.4,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
Output:
Found 3670 files belonging to 5 classes. Using 2202 files for training.
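The remaining 40% of the images can be loaded as a validation set in the same way; a minimal sketch, reusing the parameters defined above and the same seed so the two subsets do not overlap:

Python3

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.4,
    subset="validation",
    seed=123,  # same seed as the training split
    image_size=(img_height, img_width),
    batch_size=batch_size)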
Standardize the dataset
The RGB channel values are in the [0, 255] range. This is not ideal for a neural network; in general, you should try to keep your input values small.
We can standardize values to fall within the [0, 1] range by using a Rescaling layer (tf.keras.layers.Rescaling).
Python3
# create the normalization (rescaling) layer
normalization_layer = layers.Rescaling(1. / 255)

print("The map function is used to apply this layer to the dataset.")

# apply the rescaling layer to every image in the training dataset
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))

image_batch, labels_batch = next(iter(normalized_ds))
first_image = image_batch[0]

# pixel values are now in the range [0, 1]
print("minimum pixel value:", np.min(first_image),
      " maximum pixel value:", np.max(first_image))
Output:
The map function is used to apply this layer to the dataset.
minimum pixel value: 0.0  maximum pixel value: 0.87026095
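Alternatively, the Rescaling layer can be placed inside the model itself so that standardization happens on the fly during training. The small architecture below is only an illustrative sketch, not part of the original example:

Python3

model = Sequential([
    # rescale pixel values from [0, 255] to [0, 1] inside the model
    layers.Rescaling(1. / 255, input_shape=(img_height, img_width, 3)),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(5)  # five flower classes
])
model.summary()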