It is important to make AI accessible to everyone for the sake of social and economic stability.
Kaggle Days is a two-day event where data science enthusiasts can talk to each other face to face, exchange knowledge, and compete together. Kaggle Days San Francisco just concluded and, as is customary, Kaggle also organized a hackathon for the participants. I had been following Kaggle Days on Twitter, and the following tweet from Erin LeDell (Chief Machine Learning Scientist at H2O.ai) caught my eye.
I have been experimenting with H2O for quite some time and have found it seamless and intuitive for solving ML problems. Seeing it perform so well on the leaderboard, I thought it was time to write an article about it to help others make the transition into the world of H2O.
H2O.ai: The company behind H2O
H2O.ai is based in Mountain View, California and offers a suite of machine learning platforms. H2O’s core strength is its high-performing ML components, which are tightly integrated. H2O.ai was named a Visionary in the Gartner Magic Quadrant for Data Science Platforms report released in January 2019.
Let’s take a brief look at the offerings of H2O.ai:
H2O
H2O is an open-source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical and machine learning algorithms and also has AutoML functionality. H2O’s core code is written in Java, and its REST API allows access to all the capabilities of H2O from an external program or script. The platform includes interfaces for R, Python, Scala, Java, JSON and CoffeeScript/JavaScript, along with a built-in web interface, Flow.
Since H2O is the main focus of this article, we shall get to know more about it later on.
H2O Sparkling Water
Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark. Sparkling Water is ideal for H2O users who need to manage large clusters for their data processing needs and want to transfer data from Spark to H2O (or vice versa).
H2O4GPU
H2O4GPU is an open-source, GPU-accelerated machine learning package with APIs in Python and R that allows anyone to take advantage of GPUs to build advanced machine learning models.
H2O Driverless AI
H2O Driverless AI is H2O.ai’s flagship product for automatic machine learning. It fully automates some of the most challenging and productive tasks in applied data science such as feature engineering, model tuning, model ensembling and model deployment. With Driverless AI, data scientists of all proficiency levels can train and deploy modelling pipelines with just a few clicks from the GUI. Driverless AI is a commercially licensed product with a 21-day free trial version.
What is H2O
The latest version, H2O-3, is the third incarnation of H2O. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark. This lets you derive insights from data quickly through fast, distributed predictive modelling.
High-Level Architecture
H2O can import data from multiple sources and has a fast, scalable, distributed compute engine written in Java. Here is a high-level overview of the platform.
Supported Algorithms
H2O supports many commonly used machine learning algorithms.
Installation
H2O offers an R package that can be installed from CRAN and a Python package that can be installed from PyPI. In this article, I shall be working only with the Python implementation. You may also want to look at the documentation for complete details.
Pre-requisites
- Python
- Java 7 or later, which you can get at the Java download page. To build H2O or run H2O tests, the 64-bit JDK is required. To run the H2O binary using either the command line, R or Python packages, only 64-bit JRE is required.
Dependencies :
pip install requests
pip install tabulate
pip install "colorama>=0.3.8"
pip install future
- pip install
pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
- conda
conda install -c h2oai h2o=3.22.1.2
Note: When installing H2O from pip in OS X El Capitan, users must include the --user flag. For example:
pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o --user
For R installation please refer to the official documentation here.
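Before moving on, it can help to confirm that the prerequisites listed above are actually visible from Python. The following is a minimal sketch using only the standard library; the helper name check_prerequisites is my own and is not part of H2O.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("h2o", "requests", "tabulate", "colorama", "future")):
    """Report whether Java is on the PATH and which Python packages are importable.

    `check_prerequisites` is a hypothetical helper written for this article,
    not an H2O function.
    """
    # shutil.which returns None when no `java` executable is found on the PATH
    report = {"java": shutil.which("java") is not None}
    for name in modules:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(f"{item:10s} {'found' if ok else 'MISSING'}")
```

Anything reported as MISSING can be installed with the pip or conda commands above. Note that the check only looks for a java executable; it does not verify the Java version.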
Testing installation
Every new Python session begins by initializing a connection between the Python client and the H2O cluster. A cluster is a group of H2O nodes that work together; when a job is submitted to a cluster, all the nodes in the cluster work on a portion of the job.
To check if everything is in place, open your Jupyter Notebooks and type in the following:
import h2o
h2o.init()
This starts a local H2O cluster. On executing the cell, some information is printed in a table showing, among other things, the number of nodes, total memory, and Python version. If you need to report a bug, make sure you include all this information. Note that h2o.init() first tries to connect to an H2O instance that is already running and starts a new one only if it finds none.
By default, the H2O instance uses all the cores and about 25% of the system’s memory. However, if you wish to allocate a fixed chunk of memory, you can specify it in the init function. Let’s say we want to give the H2O instance 4GB of memory and have it use only 2 cores.
# Allocate resources (an integer max_mem_size is interpreted as gigabytes)
h2o.init(nthreads=2, max_mem_size=4)
Now our H2O instance is using only 2 cores and around 4GB of memory. However, we will go with the default method.
Importing Data with H2O in Python
After the installation is successful, it’s time to get our hands dirty by working on a real-world dataset. We will be working on a Regression problem using the famous wine dataset. The task here is to predict the quality of white wine on a scale of 0–10 given a set of features as inputs.
Here is a link to the GitHub repository in case you want to follow along, or you can view it on Binder by clicking the image below.
Data
The data belongs to the white variants of the Portuguese “Vinho Verde” wine.
- Source: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
- CSV file: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
Data Import
Here we import data from a local CSV file. The command is very similar to pandas.read_csv, and the data is stored in memory as an H2OFrame.
wine_data = h2o.import_file("winequality-white.csv")
wine_data.head(5)  # Without an argument, head() displays the first 10 rows.
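One detail worth knowing is that the UCI file separates columns with semicolons rather than commas. H2O's parser normally detects this on its own, but the separator can also be passed explicitly through the sep argument of h2o.import_file. The sketch below shows one way to do that; guess_separator and load_wine are hypothetical helpers of mine, and the sample rows are shortened and illustrative rather than copied from the dataset.

```python
import csv
import io

# Shortened, illustrative sample in the same layout as the UCI file
# (only four of the twelve columns; the values are made up for the example).
SAMPLE = (
    '"fixed acidity";"volatile acidity";"citric acid";"quality"\n'
    "7.0;0.27;0.36;6\n"
    "6.3;0.30;0.34;6\n"
)


def guess_separator(header_line, candidates=(";", ",", "\t")):
    """Return the candidate separator that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


def load_wine(path="winequality-white.csv"):
    """Import the file into an H2OFrame, passing the separator explicitly.

    Requires the h2o package and a running cluster (h2o.init()).
    """
    import h2o  # imported lazily so the pure-Python part runs without H2O

    with open(path, encoding="utf-8") as handle:
        sep = guess_separator(handle.readline())
    return h2o.import_file(path, sep=sep)


# Pure-Python demonstration on the inline sample
sep = guess_separator(SAMPLE.splitlines()[0])
rows = list(csv.reader(io.StringIO(SAMPLE), delimiter=sep))
print(rows[0])
```

The counting heuristic is deliberately simple. For files with unusual quoting it is safer to inspect the first few lines by hand.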
EDA
Let us explore the dataset to get some insights.
wine_data.describe()
All the features here are numbers and there aren’t any categorical variables. Now let us also look at the correlation of the individual features.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10,10))
corr = wine_data.cor().as_data_frame()
corr.index = wine_data.columns
sns.heatmap(corr, annot = True, cmap='RdYlGn', vmin=-1, vmax=1)
plt.title("Correlation Heatmap", fontsize=16)
plt.show()
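For readers who want to see what each cell of the heatmap represents, here is a small pure-Python sketch of the Pearson correlation coefficient, which is the default method of cor() as far as I know. The function name pearson is my own; it illustrates the formula and is not how H2O computes it internally.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the centred values
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Perfectly linear relationships give +1 or -1
print(pearson([1, 2, 3], [2, 4, 6]))
print(pearson([1, 2, 3], [6, 4, 2]))
```

Values close to +1 or -1 in the heatmap indicate a strong linear relationship between two features, while values near 0 indicate little linear relationship.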
Modeling with H2O
We shall build a regression model to predict the quality of the wine. The H2O module offers many algorithms for both classification and regression problems.
Splitting data into Test and Training sets
Since we have only one dataset, let’s split it into training and testing parts so that we can evaluate the model’s performance. We shall use the split_frame() function.
wine_split = wine_data.split_frame(ratios = [0.8], seed = 1234)
wine_train = wine_split[0] # using 80% for training
wine_test = wine_split[1] # rest 20% for testing
print(wine_train.shape, wine_test.shape)
(3932, 12) (966, 12)
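Notice that 3932 of 4898 rows is roughly 80.3%, not exactly 80%. As I understand the H2O documentation, split_frame() assigns rows probabilistically instead of cutting at an exact row count, which is cheaper on large distributed data. The sketch below illustrates the idea in plain Python; probabilistic_split is a hypothetical helper and does not reproduce H2O's actual implementation or its exact row assignments.

```python
import random


def probabilistic_split(n_rows, ratio=0.8, seed=1234):
    """Assign each row index to train with probability `ratio`, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in range(n_rows):
        (train if rng.random() < ratio else test).append(row)
    return train, test


train_idx, test_idx = probabilistic_split(4898)
# The share is close to 80/20 but usually not exact; counts depend on the seed
print(len(train_idx), len(test_idx), round(len(train_idx) / 4898, 3))
```

If you need the same split across runs, keep the seed fixed, as the split_frame() call above does.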
Defining Predictor Variables
predictors = list(wine_data.columns)
predictors.remove('quality') # Since we need to predict quality
predictors
Generalized Linear Model
We shall build a Generalized Linear Model (GLM) with default settings. GLMs estimate regression models for outcomes following exponential-family distributions. In addition to the Gaussian (i.e. normal) distribution, these include Poisson, binomial, and gamma distributions. You can read more about GLM in the documentation.
# Import the function for GLM
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
# Set up GLM for regression
glm = H2OGeneralizedLinearEstimator(family = 'gaussian', model_id = 'glm_default')
# Use .train() to build the model
glm.train(x = predictors, y = 'quality', training_frame = wine_train)
print(glm)
Now, let’s check the model’s performance on the test dataset.
glm.model_performance(wine_test)
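For a regression model, the performance report includes error metrics such as MSE, RMSE and MAE. To make those numbers less abstract, here is a plain-Python sketch of how they are defined. The function regression_metrics and the toy values are mine, purely for illustration.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE and MAE for two equally long numeric sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)   # mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


# Toy quality scores, not taken from the wine dataset
print(regression_metrics(actual=[5, 6, 7], predicted=[5, 7, 5]))
```

RMSE is expressed in the same units as the target, so here it can be read as a typical error in quality points.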
Making Predictions
We now use the GLM model to make predictions on the test dataset.
predictions = glm.predict(wine_test)
predictions.head(5)
Similarly, you could use other supervised algorithms like Distributed Random Forest, Gradient Boosting Machines, and even Deep Learning. You could also tune the hyperparameters.
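As a rough sketch of what hyperparameter tuning could look like, the example below defines a small search space for a Gradient Boosting Machine and shows which combinations a Cartesian grid search would try. The search space values and the helpers expand_grid and run_gbm_grid are my own assumptions for illustration; run_gbm_grid uses H2O's H2OGridSearch and needs a running cluster, while the rest is plain Python.

```python
from itertools import product

# Hypothetical search space; the values are illustrative, not tuned.
HYPER_PARAMS = {
    "max_depth": [3, 5, 7],
    "learn_rate": [0.05, 0.1],
}


def expand_grid(hyper_params):
    """List every parameter combination a Cartesian grid search would train."""
    names = list(hyper_params)
    value_lists = (hyper_params[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


def run_gbm_grid(train, predictors, response="quality"):
    """Train one GBM per combination and return the grid sorted by RMSE.

    Requires the h2o package and a running cluster (h2o.init()).
    """
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.grid.grid_search import H2OGridSearch

    grid = H2OGridSearch(
        # nfolds=5 so models can be compared on cross-validated error
        model=H2OGradientBoostingEstimator(nfolds=5, seed=1234),
        hyper_params=HYPER_PARAMS,
    )
    grid.train(x=predictors, y=response, training_frame=train)
    return grid.get_grid(sort_by="rmse", decreasing=False)


for combo in expand_grid(HYPER_PARAMS):
    print(combo)
```

With the variables from this article, the grid could be run as run_gbm_grid(wine_train, predictors). The number of models grows multiplicatively with each added parameter, so it pays to keep the search space small at first.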
H2OAutoML: Automatic Machine Learning
Automated machine learning (AutoML) is the process of automating the end-to-end application of machine learning to real-world problems. AutoML makes machine learning truly accessible, even to people without major expertise in the field. H2O’s AutoML automates the training and tuning of models.
In this section, we shall be using the AutoML capabilities of H2O to work on the same regression problem of predicting wine quality.
Importing the AutoML Module
from h2o.automl import H2OAutoML
aml = H2OAutoML(max_models = 20, max_runtime_secs=100, seed = 1)
Here AutoML stops as soon as either limit is reached: 20 base models or 100 seconds of runtime. If no limit is specified, the default runtime is 1 hour.
Training
aml.train(x=predictors, y='quality', training_frame=wine_train, validation_frame=wine_test)
Leaderboard
Now let us look at the AutoML leaderboard.
print(aml.leaderboard)
The leaderboard displays the top 10 models built by AutoML, ranked by their performance metrics. The best model, placed at the top, is a Stacked Ensemble.
The leader model is stored as aml.leader.
Contribution of Individual Models
Let us look at the contribution of the individual models for this meta-learner.
metalearner = h2o.get_model(aml.leader.metalearner()['name'])
metalearner.std_coef_plot()
XRT (Extremely Randomized Trees) has the maximum contribution, followed by Distributed Random Forest.
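To give some intuition for what a contribution means here: a stacked ensemble feeds the predictions of its base models into a metalearner (a GLM by default in H2O, to my knowledge), which combines them as a weighted sum. The sketch below uses made-up weights and predictions; ensemble_prediction is a hypothetical helper, not an H2O function, and the real coefficients come from the trained metalearner.

```python
def ensemble_prediction(base_predictions, weights, intercept=0.0):
    """Combine base-model predictions as intercept + weighted sum."""
    missing = set(weights) - set(base_predictions)
    if missing:
        raise KeyError(f"no prediction supplied for: {sorted(missing)}")
    return intercept + sum(weights[name] * base_predictions[name] for name in weights)


# Hypothetical predictions for one wine and hypothetical metalearner weights
base = {"XRT": 6.2, "DRF": 5.8, "GBM": 6.0}
weights = {"XRT": 0.5, "DRF": 0.3, "GBM": 0.2}
print(ensemble_prediction(base, weights))
```

A base model with a larger coefficient pulls the final prediction more strongly toward its own output, which is what the bars in the coefficient plot reflect.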
Predictions
preds = aml.leader.predict(wine_test)
The code above is the quickest way to get started. To learn more about H2O AutoML, it is worth taking a look at the in-depth AutoML tutorial (available in R and Python).
Shutting Down
h2o.shutdown()
Using Flow—H2O’s Web UI
In the final leg of this article, let us have a quick overview of H2O’s open-source web UI called Flow. Flow is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks.
Launching Flow
Once H2O is up and running all you need to do is point your browser to http://localhost:54321 and you’ll see our very nice user interface called Flow.
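If the page does not load, a quick check from Python can tell you whether anything is listening on that port at all. This is a minimal standard-library sketch; flow_is_reachable is a name I made up, and the check only confirms that a server accepts connections on the port, not that the server is H2O.

```python
import socket


def flow_is_reachable(host="localhost", port=54321, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or the host could not be resolved
        return False


if __name__ == "__main__":
    if flow_is_reachable():
        print("Something is listening on http://localhost:54321")
    else:
        print("Nothing is listening on port 54321; has h2o.init() been run?")
```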
Flow Interface
Here is a quick glance at the Flow interface. You can read more about using and working with it here.
Flow is designed to help data scientists rapidly and easily create models, import files, split data frames and do all the things that would normally require quite a bit of typing in other environments.
Working
Let’s work through the same wine example, but this time with Flow. The following video walks through model building and prediction using Flow and is largely self-explanatory.
Conclusion
H2O is a powerful tool and, given its capabilities, it can really transform the data science process for good. The capabilities and advantages of AI should be made available to everybody and not just a select few. This is the real essence of democratisation, and democratising data science is essential for resolving the real problems threatening our planet.