mldata.org does not have an enforced convention for storing data or naming the columns in a data set. The default behavior of this function works well with most of the common cases mentioned below:
- Data values are stored in the column ‘data’, and target values are stored in the column ‘label’.
- The first column stores the target, and the second column stores the data.
- The data array is stored as features × samples and needs to be transposed to match the sklearn standard of samples × features.
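The conventions above can be illustrated with a minimal pure-Python sketch. It is not scikit-learn's implementation: `pick_column` and `split_columns` are hypothetical helpers, and the toy column names are made up for the example.

```python
def pick_column(columns, key):
    """Return a column's values by name (str) or by position (int)."""
    if isinstance(key, int):
        return columns[key][1]
    for name, values in columns:
        if name == key:
            return values
    raise KeyError(key)


def split_columns(columns, target_name="label", data_name="data",
                  transpose_data=True):
    """Hypothetical helper: split (name, values) pairs into data and target."""
    names = [name for name, _ in columns]
    # Positional fallback: first column is the target, second is the data.
    if isinstance(target_name, str) and target_name not in names:
        target_name = 0
    if isinstance(data_name, str) and data_name not in names:
        data_name = 1
    target = pick_column(columns, target_name)
    data = pick_column(columns, data_name)
    if transpose_data:
        # (n_features, n_samples) -> (n_samples, n_features)
        data = [list(row) for row in zip(*data)]
    return data, target


# Toy data set: 2 features x 3 samples, with non-default column names.
columns = [("class", [0, 1, 0]), ("measurements", [[1, 2, 3], [4, 5, 6]])]
data, target = split_columns(columns)
print(data)    # [[1, 4], [2, 5], [3, 6]]
print(target)  # [0, 1, 0]
```

Because neither ‘label’ nor ‘data’ appears among the column names here, the sketch falls back to column positions, and the transpose turns the 2 × 3 array into 3 samples with 2 features each.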
The sklearn.datasets package fetches a machine learning data set from mldata.org with the function sklearn.datasets.fetch_mldata().
If the file does not already exist locally, it is downloaded automatically from mldata.org.
Syntax: sklearn.datasets.fetch_mldata(dataname, target_name='label', data_name='data', transpose_data=True, data_home=None)
Parameters:
- dataname: (str) The name of the dataset on mldata.org, e.g. “Iris”, “mnist”, “leukemia”.
- target_name: (optional, default: ‘label’) The name or index of the column containing the target values.
- data_name: (optional, default: ‘data’) The name or index of the column containing the data.
- transpose_data: (optional, default: True) If True, the loaded data array is transposed.
- data_home: (optional, default: None) The download and cache folder for the datasets. By default, all sklearn data is stored in subfolders of ‘~/scikit_learn_data’.
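As a rough sketch of how the data_home default could resolve to a concrete folder, the snippet below expands ‘~/scikit_learn_data’ to an absolute path. `resolve_data_home` is a hypothetical name used only for illustration. Consulting the `SCIKIT_LEARN_DATA` environment variable is an assumption modeled on scikit-learn's `get_data_home()`, which also creates the folder if it is missing (this sketch does not).

```python
import os


def resolve_data_home(data_home=None):
    """Hypothetical helper: return the cache folder as an expanded path."""
    if data_home is None:
        # Assumption: fall back to the environment variable, then to the
        # documented default location in the user's home directory.
        data_home = os.environ.get(
            "SCIKIT_LEARN_DATA", os.path.join("~", "scikit_learn_data")
        )
    return os.path.expanduser(data_home)


print(resolve_data_home())                 # default cache location
print(resolve_data_home("/tmp/my_cache"))  # explicit folder is used as given
```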
Returns: data (Bunch), a dictionary-like object. Its most useful attributes are ‘data’ (the data to learn), ‘target’ (the classification labels), ‘DESCR’ (a description of the dataset), and ‘COL_NAMES’ (the original names of the dataset columns).
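To show what "dictionary-like with attributes" means in practice, here is a minimal stand-in. `MiniBunch` is a hypothetical class written for this example, not scikit-learn's own Bunch, and the values are toy placeholders rather than downloaded data.

```python
class MiniBunch(dict):
    """Hypothetical stand-in for Bunch: a dict that also allows attribute access."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


# Toy values only; a real result would hold the downloaded arrays.
dataset = MiniBunch(
    data=[[5.1, 3.5], [4.9, 3.0]],
    target=[0, 0],
    DESCR="toy example",
    COL_NAMES=["label", "data"],
)

print(dataset.data is dataset["data"])  # True
print(dataset.COL_NAMES)                # ['label', 'data']
```

Both access styles reach the same object, which is why the examples can write iris.data instead of iris['data'].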
Let’s see the examples:
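Example 1: Load the ‘iris’ dataset from mldata without transposing it. With transpose_data=False, the array keeps its stored features × samples layout; with the default transpose_data=True, the shape would be (150, 4).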
Python3
# import the fetch_mldata function
from sklearn.datasets.mldata import fetch_mldata

# load the data without transposing it
iris = fetch_mldata('iris', transpose_data=False)

# print only the shape of the data array
# rather than the whole dataset
# print(iris)
print(iris.data.shape)
Output:
(4, 150)
Example 2: Load the MNIST digit recognition dataset from mldata.
Python3
# import the fetch_mldata function
from sklearn.datasets.mldata import fetch_mldata

# load the data
mnist = fetch_mldata('MNIST original')

# the mnist data is very large,
# so print only the shape of the data
print(mnist.data.shape)
Output:
(70000, 784)
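Note: This post is written for Scikit-learn version 0.19. fetch_mldata() was deprecated in version 0.20 and removed in version 0.22, where sklearn.datasets.fetch_openml() is the replacement.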