Applying machine learning to a data-related task can be an enormous time saver, but the design decisions involved in building a model can be daunting. One must weigh a variety of architectural and algorithmic choices, and the risk of settling on an inaccurate setup makes finding the best option a weighty, time-consuming task. Yet it is important to choose your modeling techniques carefully, as the wrong method can leave you with a far higher error rate than necessary.
Dr. Marius Lindauer of the University of Freiburg speaks to these common issues. He notes that the typical methods for choosing a machine learning pipeline have been the following:
- Human Optimizer: Trying the different pipelines yourself. This method is great for learning about machine learning, but it is highly inefficient in terms of time and susceptible to the biases and failures of judgment inherent in human activity.
- Grid Search: Another simple approach, useful for small search spaces and as a tool for study; because it exhaustively evaluates every combination of parameter values, its cost explodes as the space grows, making it impractical at larger scale (see the sketch after this list).
- Random Search: Simple, and given enough evaluations it will eventually find near-optimal parameters. It handles large search spaces better than grid search, but it is sample-inefficient and therefore expensive.
- Bayesian Optimization: State of the art and highly sample-efficient, but not easily parallelized and not easily scaled to very large data sets.
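The trade-off between grid search and random search is easy to demonstrate with scikit-learn, which ships both strategies. The following is a minimal sketch, not from the talk; the random-forest search space and evaluation budgets are illustrative assumptions:

```python
# Grid search vs. random search over comparable hyperparameter spaces.
# The search spaces and budgets here are illustrative, not from the talk.
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_digits(return_X_y=True)

# Grid search: exhaustively evaluates every combination (3 x 3 = 9 candidates).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [4, 8, None]},
    cv=3,
).fit(X, y)

# Random search: samples a fixed number of configurations from distributions,
# so the budget stays constant even as the search space grows.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(2, 16)},
    n_iter=9,  # same evaluation budget as the grid above
    cv=3,
    random_state=0,
).fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```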
[Related Article: Automated Machine Learning: Myth Versus Reality]
Dr. Lindauer reports that, like many teams, his group of researchers had been relying on Bayesian optimization, but its scalability problems led them to develop new approaches.
They began experimenting with “warm starting,” or initializing the search from configurations that had already worked well on similar data sets. A related idea was “ensembling,” or combining existing ML pipelines and weighting each according to its measured performance. By layering these two techniques on top of their continued use of Bayesian optimization, Lindauer and his team eventually created a comprehensive system that combines all three, called auto-sklearn.
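All three ideas surface as knobs in the released package. Here is a minimal sketch, assuming the auto-sklearn v1 API; the time budgets and ensemble size are illustrative values, not settings from the talk:

```python
# A sketch of the auto-sklearn v1 interface; budgets are illustrative.
import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # total search budget, in seconds
    per_run_time_limit=30,        # cap on any single pipeline evaluation
    # Warm starting: seed the search with configurations that performed
    # well on similar data sets (meta-learning).
    initial_configurations_via_metalearning=25,
    # Ensembling: build a weighted ensemble from the evaluated pipelines.
    ensemble_size=50,
)
automl.fit(X_train, y_train)
print(accuracy_score(y_test, automl.predict(X_test)))
```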
Auto-sklearn, per Dr. Lindauer, outperforms almost all competing approaches and takes a comprehensive approach to comparing the vast assortment of existing ML pipelines. The system won the inaugural AutoML challenge in 2015 and has provided the industry with a valuable new tool. Like any tool, however, auto-sklearn has its caveats. Dr. Lindauer described some of the problems that arose in its first iteration:
- Overfitting can be an issue (see the mitigation sketch after this list)
- Ensembling can fail
- Meta-Feature computation can be time-consuming
- Training poorly performing models can be time-consuming and often fails
- Only a small subset of ML algorithms turns out to matter in practice
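Some of these issues can be mitigated from the user's side. For the overfitting point, for example, auto-sklearn v1 lets you replace its default holdout evaluation with cross-validation. A minimal sketch, with an illustrative fold count:

```python
# Evaluating candidate pipelines with 5-fold CV instead of a single
# holdout split, which makes the internal model selection less prone
# to overfitting. Budgets and fold count are illustrative.
import autosklearn.classification
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 5},
)
automl.fit(X, y)
automl.refit(X, y)  # with CV, refit the chosen models on the full training set
```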
Successive iterations of auto-sklearn have made progress in resolving these issues. One area of focus has been narrowing the pipeline portfolio, restricting it to the models most likely to succeed. Other major upgrades in automated machine learning have been hybridizations of existing methods, such as BOHB, a combination of Bayesian optimization and the Hyperband bandit strategy, or PoSH Auto-sklearn, which pairs the existing auto-sklearn system with successive halving and a portfolio of pipelines built in the style of Hydra.
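BOHB itself is released by the same group in the HpBandSter package, but the scheduling idea it borrows from Hyperband, successive halving, can be tried directly in scikit-learn. The sketch below assumes scikit-learn 0.24 or newer; the search space and halving factor are illustrative:

```python
# Successive halving: start many configurations on a small budget and
# repeatedly promote only the best fraction to larger budgets.
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

X, y = load_digits(return_X_y=True)

search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": randint(2, 16), "n_estimators": randint(20, 200)},
    factor=3,              # keep the top 1/3 of candidates at each rung
    resource="n_samples",  # survivors are re-evaluated on more data
    random_state=0,
).fit(X, y)

print(search.best_params_)
```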
[Related Article: Should You Build or Buy Your Data Science Platform?]
Dr. Lindauer emphasizes that auto-sklearn is open source and available to all. He hopes that making the research publicly available will attract further improvement. The development of automated machine learning has been driven by open discussion and collaboration, and the future of the technology relies on the continued widespread availability of such work.
Watch Dr. Lindauer’s full talk here: