Machine learning is the driving force behind modern technology and smart applications. While highly efficient methods and implementations are broadly available, applying them successfully is hard: a myriad of design decisions must be made correctly before an ML pipeline achieves peak performance.
Such decisions include how to preprocess features (e.g. how to replace missing values), which model class to use (e.g. neural networks or boosted trees), and finally, how to set the hyperparameters of this model class (e.g. the learning rate and the number of epochs). Manually searching this vast design space requires a lot of experience, a lot of computing resources, or both. AutoML is here to help!
AutoML automatically finds well-performing machine learning pipelines and thus frees the human expert from this tedious task. This reduces the barrier to broadly apply machine learning and makes it available for everyone. In this post, we’ll have a look at the AutoML tool Auto-sklearn.
Auto-sklearn is an open-source tool, so we are happy to receive stars, pull requests, and issues: www.github.com/automl/auto-sklearn.
What you’ll get out of this post and what you’ll need to run the code
You’ll learn how to replace a manually designed scikit-learn pipeline with an Auto-sklearn estimator. We provide all code in this Colab Notebook.
Step 1: Load data
As a first step, we’ll use the built-in data loading method from scikit-learn to load the credit-g dataset and split it into train and test data.
import sklearn.datasets
import sklearn.model_selection

# We fetch the data using openml.org
X, y = sklearn.datasets.fetch_openml(data_id=31, return_X_y=True, as_frame=True)

# Split the data into train and test
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_train.info()
This dataset describes bank customers who apply for credit. It has 1000 data points and 20 features and is a good example dataset as it contains both numerical and categorical features. The objective is to classify, for each request, whether the credit will default or not.
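To verify the mix of feature types and check how the classes are distributed, a quick look at the training data helps (a small addition for illustration, not part of the original walkthrough):

# Count how many columns are categorical vs. numerical
print(X_train.dtypes.value_counts())

# Class distribution of the target ('good' vs. 'bad' credit)
print(y_train.value_counts())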
Step 2: Manually build a pipeline
Now, we turn to building our pipeline. We'll use a Support Vector Machine (SVM). However, to get good performance with an SVM, the data needs to be preprocessed: in particular, we need to one-hot encode the categorical features and scale the numerical ones (e.g. the feature credit_amount goes up to 20,000, while the feature duration does not go above 80).
Note: For demonstration, we use the default hyperparameters set by scikit-learn for this pipeline; however, in practice, these need to be tuned to achieve top performance.
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Create the estimator using the default parameters from the library
estimator_svc = SVC(
    C=1.0,
    kernel='rbf',
    gamma='scale',
    shrinking=True,
    tol=1e-3,
    cache_size=200,
    verbose=False,
    max_iter=-1,
    random_state=42,
)

# Build and fit the pipeline
categorical_columns = [
    col for col in X_train.columns if X_train[col].dtype.name == 'category'
]
encoder = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
    ],
    remainder='passthrough',
)
pipeline_svc = Pipeline([
    ('encoder', encoder),
    ('scaler', StandardScaler()),
    ('svc', estimator_svc),
])
pipeline_svc.fit(X_train, y_train)
After constructing the pipeline and training it on the training data, we measure the performance on the test set and obtain an accuracy of 76.75%.
# Score the model
prediction = pipeline_svc.predict(X_test)
performance_svc = accuracy_score(y_test, prediction)
print(f"SVC performance is {performance_svc}")
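As the note above says, the default hyperparameters are not tuned. A minimal sketch of how one might tune the SVM manually with grid search follows; the parameter ranges are assumptions chosen for illustration, not taken from the original post:

from sklearn.model_selection import GridSearchCV

# Illustrative grid over the SVC's C and gamma; the values are assumptions
param_grid = {
    'svc__C': [0.1, 1.0, 10.0],
    'svc__gamma': ['scale', 0.01, 0.1],
}
search = GridSearchCV(pipeline_svc, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

Even this small grid already requires 45 model fits, which hints at why searching the full design space by hand quickly becomes infeasible.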
We also tried other classifiers, such as a Gradient Boosting Classifier and a Decision Tree, which achieved accuracies of 73.5% and 70.75%, respectively.
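The post does not include the code for this comparison; a minimal sketch of how it could look, reusing the preprocessing from above (our reconstruction, not the original code):

from sklearn.base import clone
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Evaluate alternative classifiers inside the same preprocessing pipeline
for name, clf in [
    ('Gradient Boosting', GradientBoostingClassifier(random_state=42)),
    ('Decision Tree', DecisionTreeClassifier(random_state=42)),
]:
    pipeline = Pipeline([
        ('encoder', clone(encoder)),  # fresh copy of the ColumnTransformer
        ('scaler', StandardScaler()),
        ('clf', clf),
    ])
    pipeline.fit(X_train, y_train)
    print(f"{name} accuracy: {accuracy_score(y_test, pipeline.predict(X_test))}")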
Step 3: Use Auto-sklearn as a drop-in replacement
Finally, we’ll demonstrate how easy it is to use Auto-sklearn as a drop-in replacement for the manually constructed estimator pipelines discussed above.
Instead of manually specifying a pipeline, we can simply use the Auto-sklearn estimator object; all that’s left is to decide how many resources to spend on searching for the best pipeline. We set this limit to 5 minutes and 1 CPU core. As we have a small dataset at hand, we also turn on cross-validation.
Note: Large datasets require more computational resources to achieve good results.
We then have an estimator object that can be handled like any scikit-learn estimator or pipeline and can predict labels for new data; in this case, it achieves a test accuracy of 77.5%, better than the manually designed pipeline and without any manual work.
import autosklearn.classification

# Create and train the estimator
estimator_askl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,
    seed=42,
    resampling_strategy='cv',
    n_jobs=1,
)
estimator_askl.fit(X_train, y_train)

# Score the model
prediction = estimator_askl.predict(X_test)
performance_askl = accuracy_score(y_test, prediction)
print(f"Auto-Sklearn Classifier performance is {performance_askl}")
Wrapping up on Auto-sklearn
You might wonder, what does Auto-sklearn do internally? Well, the short answer is: it searches a huge space with more than 100 dimensions for pipelines that do well on your dataset and then automatically ensembles the best-performing pipelines for prediction.
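If you are curious about what the search found, the fitted estimator can be inspected; a quick sketch, assuming a reasonably recent auto-sklearn version for leaderboard():

# Summary of the search: number of evaluated pipelines, best score, etc.
print(estimator_askl.sprint_statistics())

# Ranked table of the pipelines in the final ensemble
# (leaderboard() is available in recent auto-sklearn versions)
print(estimator_askl.leaderboard())

# Full description of the ensemble members
print(estimator_askl.show_models())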
If this sounds interesting to you and you want to take a deep dive into the methodology behind Auto-sklearn and other up-to-date AutoML systems, and learn how to apply Auto-sklearn to your machine learning problem, we have two events for you at the upcoming ODSC Europe on Wednesday, June 9th:
Frank Hutter will present the methods behind Auto-sklearn and other recent AutoML systems in his presentation (10:50-11:35):
- Automated Machine Learning with Python – from scikit-learn to auto-sklearn
Afterward, Matthias Feurer and Katharina Eggensperger will do a deep dive into how to apply Auto-sklearn to your machine learning problem (11:55-13:25).
Also, if you like Auto-sklearn, give us a star at www.github.com/automl/auto-sklearn!
About the authors/ODSC Europe speakers:
Matthias Feurer is a doctoral candidate at the Machine Learning Lab at the University of Freiburg, Germany. His research focuses on automated machine learning, hyperparameter optimization, and meta-learning. He is actively involved in developing open-source software for AutoML and is the maintainer and founder of Auto-sklearn and OpenML-Python. Matthias is a founding member of the Open Machine Learning Foundation, gave AutoML tutorials at GCPR and the ECMLPKDD summer school, and co-organized the AutoML workshop in 2019 and 2020. Furthermore, he was part of the winning team of the 1st and 2nd AutoML challenges and the BBO challenge at NeurIPS 2020.
Katharina Eggensperger is a doctoral candidate at the Machine Learning Lab at the University of Freiburg, Germany. Her research focuses on empirical performance modeling, automated machine learning, and hyperparameter optimization. She was an invited speaker at the BayesOpt workshop at NeurIPS 2016 and co-organized the AutoML workshop in 2019, 2020, and 2021. Furthermore, she was part of the winning team of the 1st and 2nd AutoML challenges and the BBO challenge at NeurIPS 2020.
Frank Hutter is a Full Professor for Machine Learning at the Computer Science Department of the University of Freiburg (Germany), as well as Chief Expert AutoML at the Bosch Center for Artificial Intelligence.
Frank holds a PhD from the University of British Columbia (UBC, 2009) and a Diplom (eq. MSc) from TU Darmstadt (2004). He received the 2010 CAIAC doctoral dissertation award for the best thesis in AI in Canada, and with his coauthors, several best paper awards and prizes in international competitions on machine learning, SAT solving, and AI planning.