Datasets provide convenient access to training, test, and validation data. Instead of having to wrangle raw arrays, PyBrain gives you a more sophisticated data structure that makes working with your data easier.
Datasets in PyBrain
The most commonly used datasets that PyBrain supports are SupervisedDataSet and ClassificationDataSet.
SupervisedDataSet: As the name suggests, this is the simplest form of a dataset and is meant to be used with supervised learning tasks. It consists of the fields ‘input’ and ‘target’, whose pattern sizes must be set upon creation:
Python3
>>> from pybrain.datasets import SupervisedDataSet
>>> DS = SupervisedDataSet(3, 2)
>>> DS.appendLinked([1, 2, 3], [4, 5])
>>> len(DS)
1
>>> DS['input']
array([[ 1.,  2.,  3.]])
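Besides appendLinked, a SupervisedDataSet also offers addSample for adding patterns, supports iterating over the linked (input, target) pairs, and can be split into two datasets with splitWithProportion. Here is a minimal sketch of these calls, assuming a standard PyBrain install (the sample values are arbitrary):
Python3
from pybrain.datasets import SupervisedDataSet

# 3-dimensional input, 2-dimensional target
DS = SupervisedDataSet(3, 2)
DS.addSample([1, 2, 3], [4, 5])    # addSample works like appendLinked here
DS.addSample([6, 7, 8], [9, 10])

# Iterate over the linked (input, target) pairs
for inp, target in DS:
    print(inp, target)

# Split the dataset in two, here 50% / 50%
first_half, second_half = DS.splitWithProportion(0.5)
print(len(first_half), len(second_half))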
ClassificationDataSet: It is mainly used for classification problems. In addition to the ‘input’ and ‘target’ fields, it carries an extra field called “class”, which is an automated backup of the targets. The targets here are class labels, so each sample’s output indicates which class it belongs to; in a binary problem, for example, the target is either 0 or 1.
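To make the automatic “class” backup concrete before the full iris example below, here is a minimal sketch, assuming a standard PyBrain install (the sample values and class labels are arbitrary):
Python3
from pybrain.datasets import ClassificationDataSet

# 2-dimensional input, 1 target column, two possible classes
ds = ClassificationDataSet(2, 1, nb_classes=2)
ds.addSample([0.1, 0.2], [0])
ds.addSample([0.8, 0.9], [1])

# Convert the integer targets to one-of-many (one-hot) vectors;
# the original labels are backed up in the 'class' field
ds._convertToOneOfMany()
print(ds['target'])   # one-hot targets, e.g. [[1 0], [0 1]]
print(ds['class'])    # original labels, e.g. [[0], [1]]

The full example below trains a classifier on the iris dataset using ClassificationDataSet: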
Python3
# Importing all the necessary libraries
from sklearn import datasets
from pybrain.datasets import ClassificationDataSet
from pybrain.utilities import percentError
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules import SoftmaxLayer
from numpy import ravel

# Loading the iris dataset from sklearn datasets
iris = datasets.load_iris()

# Defining the feature variables and the target variable
X_data = iris.data
y_data = iris.target

# Defining the classification dataset:
# 4 input features, 1 target column, 3 classes
classification_dataset = ClassificationDataSet(4, 1, nb_classes=3)

# Adding samples to the classification dataset
for i in range(len(X_data)):
    classification_dataset.addSample(ravel(X_data[i]), y_data[i])

# Splitting the data into testing and training data
# (30% test, 70% train)
testing_data, training_data = classification_dataset.splitWithProportion(0.3)

# Classification dataset for the test data
# (splitWithProportion returns plain SupervisedDataSets,
# so the samples are copied into fresh ClassificationDataSets)
test_data = ClassificationDataSet(4, 1, nb_classes=3)

# Adding samples to the testing classification dataset
for n in range(0, testing_data.getLength()):
    test_data.addSample(testing_data.getSample(n)[0],
                        testing_data.getSample(n)[1])

# Classification dataset for the train data
train_data = ClassificationDataSet(4, 1, nb_classes=3)

# Adding samples to the training classification dataset
for n in range(0, training_data.getLength()):
    train_data.addSample(training_data.getSample(n)[0],
                         training_data.getSample(n)[1])

# Encoding the targets as one-of-many (one-hot) vectors
test_data._convertToOneOfMany()
train_data._convertToOneOfMany()

# Building a network with a SoftmaxLayer output
# on the training data
build_network = buildNetwork(train_data.indim, 4,
                             train_data.outdim, outclass=SoftmaxLayer)

# Building a BackpropTrainer on the training data
trainer = BackpropTrainer(build_network, dataset=train_data,
                          learningrate=0.01, verbose=True)

# 20 training epochs on the training data
trainer.trainEpochs(20)

# Error percentage on the testing data
print('Error percentage on testing data=>',
      percentError(trainer.testOnClassData(dataset=test_data),
                   test_data['class']))
Output:
Total error: 0.0892390931641
Total error: 0.0821479733597
Total error: 0.0759327938967
Total error: 0.0722385583142
Total error: 0.0690818068826
Total error: 0.0667645311923
Total error: 0.0647079622731
Total error: 0.0630345245312
Total error: 0.0608030839912
Total error: 0.0595356750412
Total error: 0.0586635639408
Total error: 0.0573043661487
Total error: 0.0559188704413
Total error: 0.0548155819544
Total error: 0.0535537679931
Total error: 0.0527051106108
Total error: 0.0515783629912
Total error: 0.0501025301423
Total error: 0.0499123823243
Total error: 0.0482250742606
Error percentage on testing data=> 20.0
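Once training has finished, the trained network can classify individual samples with activate; with a SoftmaxLayer output, the activations approximate class probabilities. A short follow-up sketch, assuming build_network from the example above has already been trained (the measurement values are made up for illustration):
Python3
from numpy import argmax

# Feed one 4-dimensional iris-style measurement through the trained
# network; the softmax output is a distribution over the 3 classes
probabilities = build_network.activate([5.1, 3.5, 1.4, 0.2])
print('Class probabilities:', probabilities)
print('Predicted class:', argmax(probabilities))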