This article looks at ways to load CSV data in the Python programming language using TensorFlow.
The TensorFlow library provides the tf.data.experimental.make_csv_dataset() function, which reads CSV data into a tf.data.Dataset that we can use in our programs.
Loading a Single CSV File
To download a single CSV data file from a URL, we use the Keras get_file function. Here we will use the Titanic dataset.
To follow along, add the following lines to your code:
Python3
import tensorflow as tf
import pandas as pd

# Download the Titanic training data (the CSV used in the official
# TensorFlow tutorials) and cache it locally.
data_path = tf.keras.utils.get_file(
    "data_train.csv",
    "https://storage.googleapis.com/tf-datasets/titanic/train.csv")

# Build a tf.data.Dataset that yields (features, label) batches.
data_train_tf = tf.data.experimental.make_csv_dataset(
    data_path,
    batch_size=10,
    label_name='survived',
    num_epochs=1,
    ignore_errors=True,
)
The data can now be used like a dict, where each key is a column name and the values are the batched data records. Because we passed label_name, each element of the dataset is a pair: the first item holds the feature columns and the second holds the label data. Within a batch, each column/feature name acts as a key, and all of that column's values in the batch form its value.
Python3
# Take one batch and print each feature column with its values.
for batch, label in data_train_tf.take(1):
    for key, value in batch.items():
        print(f"{key:10s}: {value}")
Output:
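Since pandas is imported above, one convenient way to inspect a batch is to convert it to a DataFrame. The snippet below is a minimal sketch (not part of the original code): it takes one batch, calls .numpy() on each column tensor, and builds a table with the label attached. Note that string columns appear as byte strings (e.g. b'male'), since TensorFlow stores text as bytes.
Python3
import pandas as pd

# A minimal sketch: view one batch as a table for easier inspection.
for batch, label in data_train_tf.take(1):
    # Each value is a tensor with 10 entries (one per row in the batch).
    df = pd.DataFrame({key: value.numpy() for key, value in batch.items()})
    df['survived'] = label.numpy()
    print(df)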
Loading Multiple CSV Files
The real strength of the make_csv_dataset method shows when we have to import multiple CSV files into a single dataset. We will use the fonts dataset, which contains a separate CSV file for each of many different fonts.
Example: In this example, we use the Keras get_file function to download the dataset archive to disk; the cache_dir and cache_subdir arguments define where to store it.
Once the files are saved, the file_pattern argument of make_csv_dataset lets us pass a glob pattern matching all the files to be imported. Create a new file and execute the following code:
Python3
# Download and extract the fonts archive (the UCI "Character Font Images"
# dataset used in the TensorFlow CSV tutorial).
fonts = tf.keras.utils.get_file(
    'fonts.zip',
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00417/fonts.zip",
    cache_dir='.', cache_subdir='fonts',
    extract=True)

# Read every CSV matched by the glob pattern into a single dataset.
fonts_data = tf.data.experimental.make_csv_dataset(
    file_pattern="fonts/*.csv",
    batch_size=10,
    num_epochs=1,
    num_parallel_reads=4,
    shuffle_buffer_size=10000)

# Print the first 16 features of one batch, then the total feature count.
for features in fonts_data.take(1):
    for i, (name, value) in enumerate(features.items()):
        if i > 15:
            break
        print(f"{name:20s}: {value}")

print(f"[total: {len(features)} features]")
We display the first 16 features of the batch (indices 0 through 15) along with their values, and then the total feature count using the len() function. In this example, there are 412 features in total.
Output:
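The feature count is so large because, besides a handful of metadata columns, each font CSV stores the character bitmap as one column per pixel. As a minimal sketch, assuming the r0c0 … r19c19 pixel-column naming used by this fonts dataset, those columns can be stacked back into [batch, 20, 20] image tensors:
Python3
import re
import tensorflow as tf

for features in fonts_data.take(1):
    # Keep only the pixel columns, ordered by (row, column).
    pixel_keys = sorted(
        (key for key in features if re.fullmatch(r'r\d+c\d+', key)),
        key=lambda k: tuple(int(n) for n in re.findall(r'\d+', k)))
    # Stack the 400 per-pixel columns into shape [batch, 400], then reshape.
    pixels = tf.stack([tf.cast(features[key], tf.float32)
                       for key in pixel_keys], axis=-1)
    images = tf.reshape(pixels, [-1, 20, 20])
    print(images.shape)  # expect (10, 20, 20) with batch_size=10
Packing the per-pixel columns into a single tensor like this is the usual first step before feeding the images to a model.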