The workflow of a machine learning project covers all the steps required to build it. A proper ML project consists of four main parts, given as follows:
- Gathering data:
How data is gathered depends on the project. It can be real-time data or data collected from various sources such as files, databases and surveys.
- Data pre-processing:
The collected data usually contains missing values, extremely large values, unorganized text or noise, so it cannot be used directly by the model. The data therefore requires some pre-processing before it is passed to the model.
- Training and testing the model:
Once the data is ready, it can be fed to the machine learning model. Before that, it is important to have an idea of which model is likely to perform well. The data set is divided into 3 basic sections, i.e. the training set, the validation set and the test set. The model is trained on the training set, its parameters are tuned using the validation set, and its performance is then measured on the test set.
- Evaluation:
Evaluation is part of the model development process. It helps to find the model that best represents the data and indicates how well the chosen model is likely to work on future data. It is carried out after models have been trained with different algorithms. The goal is to compare the results and choose the final model accordingly.
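The split into training, validation and test sets described above can be sketched as follows. This is only an illustration: the iris data, the 60/20/20 proportions, the candidate max_depth values and the random_state settings are assumptions made for the example, not part of the workflow itself.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as the test set, then take 25% of the
# remainder as the validation set (60/20/20 overall; assumed proportions).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Tune one hyperparameter (max_depth) using the validation set only.
best_depth, best_val_acc = None, -1.0
for depth in (1, 2, 3, 4, 5):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# Refit with the chosen depth and score once on the untouched test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print(best_depth, best_val_acc, test_acc)
```

Keeping the test set out of the tuning loop means the final score is not biased by the choice of max_depth.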
ML Workflow in Python
The workflow is executed in a pipe-like manner, i.e. the output of one step becomes the input of the next step. Scikit-learn, a powerful tool for machine learning, provides a class for handling such pipes, called Pipeline, in the sklearn.pipeline module.
It takes 2 important parameters, stated as follows:
- The steps list:
A list of (name, transform) tuples (implementing fit/transform) that are chained in the order in which they are listed, with the last object being an estimator.
- verbose:
If True, the time elapsed while fitting each step is printed as the step completes.
Code:
python3
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
# importing Pipeline for making the pipe flow
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# import some data within sklearn for iris classification
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Splitting data into train and testing part
# 25% of the data is used as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# pipe flow is:
# PCA (dimension reduction to two) -> scaling the data -> DecisionTreeClassifier
pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('std', StandardScaler()),
                 ('decision_tree', DecisionTreeClassifier())],
                verbose=True)

# fitting the data in the pipe
pipe.fit(X_train, y_train)

# scoring data
print(accuracy_score(y_test, pipe.predict(X_test)))
Output:
[Pipeline] ............... (step 1 of 3) Processing pca, total= 0.0s
[Pipeline] ............... (step 2 of 3) Processing std, total= 0.0s
[Pipeline] ..... (step 3 of 3) Processing decision_tree, total= 0.0s
0.9736842105263158
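To make the pipe-like execution concrete, the sketch below runs the same three steps by hand and compares the predictions with those of a Pipeline. It is an illustration rather than a description of scikit-learn's internals; the random_state values are assumptions added so that both runs see the same split and build the same tree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Manual chain: each transformer is fitted on the output of the previous one.
pca = PCA(n_components=2)
std = StandardScaler()
tree = DecisionTreeClassifier(random_state=0)
Z_train = std.fit_transform(pca.fit_transform(X_train))
tree.fit(Z_train, y_train)

# At prediction time only transform() is called, never fit().
manual_pred = tree.predict(std.transform(pca.transform(X_test)))

# The same chain expressed as a Pipeline.
pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('std', StandardScaler()),
                 ('decision_tree', DecisionTreeClassifier(random_state=0))])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

print(np.array_equal(manual_pred, pipe_pred))
```

Besides being shorter, the Pipeline version guarantees that the transformers are fitted on the training data only and merely applied to the test data.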
Important property:
- pipe.named_steps: pipe.named_steps is a dictionary-like object that maps each step name to the corresponding object in the pipe. For example:
pipe.named_steps['decision_tree'] # returns a decision tree classifier object
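A short sketch of how named_steps might be used after fitting is given here. The pipeline mirrors the one above; fitting on the full iris data and setting random_state are assumptions made to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('std', StandardScaler()),
                 ('decision_tree', DecisionTreeClassifier(random_state=0))])
pipe.fit(X, y)

# Step names, in the order they were chained.
print(list(pipe.named_steps.keys()))

# Each entry is the fitted object itself, so its learned attributes
# and methods are available.
fitted_pca = pipe.named_steps['pca']
fitted_tree = pipe.named_steps['decision_tree']
print(fitted_pca.explained_variance_ratio_)
print(fitted_tree.get_depth())
```

This is handy for inspecting what an intermediate step learned without rebuilding it outside the pipe.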
Hyperparameters:
Each class passed into a pipeline has its own set of hyperparameters. To view them, the pipe.get_params() method is used. This method returns a dictionary that maps every parameter name to its current value. The parameters of an individual step are keyed as the step name, two underscores and the parameter name, for example pca__n_components.
Example:
python3
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

# import some data within sklearn for iris classification
iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('std', StandardScaler()),
                 ('Decision_tree', DecisionTreeClassifier())],
                verbose=True)

pipe.fit(X_train, y_train)

# to see all the hyper parameters
pipe.get_params()
Output:
{'memory': None,
 'steps': [('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
                       svd_solver='auto', tol=0.0, whiten=False)),
           ('std', StandardScaler(copy=True, with_mean=True, with_std=True)),
           ('Decision_tree', DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                                                    max_depth=None, max_features=None, max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0, min_impurity_split=None,
                                                    min_samples_leaf=1, min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0, presort='deprecated',
                                                    random_state=None, splitter='best'))],
 'verbose': True,
 'pca': PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
            svd_solver='auto', tol=0.0, whiten=False),
 'std': StandardScaler(copy=True, with_mean=True, with_std=True),
 'Decision_tree': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                                         max_depth=None, max_features=None, max_leaf_nodes=None,
                                         min_impurity_decrease=0.0, min_impurity_split=None,
                                         min_samples_leaf=1, min_samples_split=2,
                                         min_weight_fraction_leaf=0.0, presort='deprecated',
                                         random_state=None, splitter='best'),
 'pca__copy': True,
 'pca__iterated_power': 'auto',
 'pca__n_components': 2,
 'pca__random_state': None,
 'pca__svd_solver': 'auto',
 'pca__tol': 0.0,
 'pca__whiten': False,
 'std__copy': True,
 'std__with_mean': True,
 'std__with_std': True,
 'Decision_tree__ccp_alpha': 0.0,
 'Decision_tree__class_weight': None,
 'Decision_tree__criterion': 'gini',
 'Decision_tree__max_depth': None,
 'Decision_tree__max_features': None,
 'Decision_tree__max_leaf_nodes': None,
 'Decision_tree__min_impurity_decrease': 0.0,
 'Decision_tree__min_impurity_split': None,
 'Decision_tree__min_samples_leaf': 1,
 'Decision_tree__min_samples_split': 2,
 'Decision_tree__min_weight_fraction_leaf': 0.0,
 'Decision_tree__presort': 'deprecated',
 'Decision_tree__random_state': None,
 'Decision_tree__splitter': 'best'}
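The keys with a double underscore in the output above can also be used to change hyperparameters. The sketch below shows one possible way, using set_params for a single change and GridSearchCV for a small search. The grid values, cv=5 and random_state are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('std', StandardScaler()),
                 ('Decision_tree', DecisionTreeClassifier(random_state=0))])

# Set hyperparameters of individual steps through the pipeline,
# using the <step name>__<parameter name> keys.
pipe.set_params(pca__n_components=3, Decision_tree__max_depth=3)
print(pipe.get_params()['Decision_tree__max_depth'])

# The same keys name the hyperparameters to search over (assumed grid).
param_grid = {'pca__n_components': [2, 3],
              'Decision_tree__max_depth': [2, 3, 4]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Because the whole pipe is searched as one estimator, the PCA and scaling steps are refitted inside every cross-validation fold.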