TensorFlow is a free, open-source machine learning and artificial intelligence library that is widely used for training and deploying neural networks. It is developed by the Google Brain team and supports a wide range of platforms. In this tutorial, we will learn to download, load, and explore the famous Iliad dataset.
The Iliad dataset consists of several different English translations of the same Homer's Iliad text. TensorFlow has preprocessed the documents so that each file focuses on the text of the translation itself. The dataset is available at the following URL.
https://storage.googleapis.com/download.tensorflow.org/data/illiad/
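If you want a quick look at the raw text before loading it with TensorFlow, you can fetch one of the files directly. Below is a minimal sketch using only the Python standard library; the file name cowper.txt comes from the example that follows.

Python3

import urllib.request

# Fetch one translation file and print the first few lines of raw text
url = ("https://storage.googleapis.com/download.tensorflow.org"
       "/data/illiad/cowper.txt")
with urllib.request.urlopen(url) as response:
    lines = response.read().decode("utf-8").splitlines()

for line in lines[:3]:
    print(line)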
Example: In the following example, we will take the works of three translators: William Cowper; Edward, Earl of Derby; and Samuel Butler. Then, with the help of TensorFlow, we will load their works and label each line of text with its translator.
Install the TensorFlow text package:
pip install "tensorflow-text==2.8.*"
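To confirm the package installed cleanly, you can import it alongside TensorFlow and print the versions. This is just a quick sanity check; the exact version strings will depend on your environment.

Python3

import tensorflow as tf
import tensorflow_text as tf_text

# The two packages should have matching minor versions, e.g. both 2.8.x
print(tf.__version__)
print(tf_text.__version__)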
Download and load the Iliad dataset
We need to label each dataset individually, so we use the Dataset.map function to attach a label to every line. This returns example-label pairs.
Python3
import pathlib

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

print("Welcome to neveropen")
print("Loading the Illiad dataset")

DIRECTORY_URL = ('https://storage.googleapis.com/'
                 'download.tensorflow.org/data/illiad/')
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

# Download each translation and cache it locally
for name in FILE_NAMES:
    text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent

# Attach an integer label (the translator's index) to each line
def labeler(example, index):
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(str(parent_dir / file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

labeled_data_sets
Output:
[<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>]
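Each MapDataset above yields (text, label) pairs, as the element_spec shows. To see the concrete values behind that spec, you can pull a single element out of one of the datasets; the exact line printed depends on the file contents.

Python3

# Inspect one (example, label) pair from the first labeled dataset
for text, label in labeled_data_sets[0].take(1):
    print(text.numpy())   # a line from cowper.txt, as a byte string
    print(label.numpy())  # 0, the index of cowper.txt in FILE_NAMES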
Concatenate and shuffle the datasets. They are concatenated using the Dataset.concatenate function, and the shuffle function is used to shuffle the data. We then print out a few examples.
Python3
BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

# Combine the three labeled datasets into one
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

# Shuffle once so lines from the three translators are mixed together
all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

# Print a few example-label pairs
for text, label in all_labeled_data.take(5):
    print("Sentence: ", text.numpy())
    print("Label:", label.numpy())
Output:
Sentence: b"Of brass, and color'd with a ring of gold."
Label: 0
Sentence: b'drove the horses in among the others.'
Label: 2
Sentence: b'Into the boundless ether. Reaching soon'
Label: 0
Sentence: b"Drive to the ships, for pain weigh'd down his soul."
Label: 1
Sentence: b"Not one is station'd to protect the camp."
Label: 1
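Note that BATCH_SIZE and VALIDATION_SIZE are defined above but not yet used. A minimal sketch of how they are typically applied follows: hold out a validation split and batch both sets. Because we shuffled with reshuffle_each_iteration=False, the split stays stable across epochs.

Python3

# Hold out the first VALIDATION_SIZE examples for validation,
# train on the rest; batch both splits.
train_data = all_labeled_data.skip(VALIDATION_SIZE).batch(BATCH_SIZE)
validation_data = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)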