Introduction
Google’s BigQuery is a powerful cloud-based data warehouse that provides fast, flexible, and cost-effective data storage and analysis capabilities. One of its unique features is the ability to build and run machine learning models directly inside the database without extracting the data and moving it to another platform.
BigQuery was created to analyse data with billions of rows using SQL-like syntax. It is hosted on the Google Cloud Storage infrastructure and is accessible via a REST-oriented application programming interface (API).
Learning Objectives
In this article, we will:
- Understand what BigQuery is and what advantages it holds.
- The process of building a machine learning model in BigQuery.
- Learn the key steps of ETL, feature selection and preprocessing, model creation, performance evaluation, and prediction.
This article was published as a part of the Data Science Blogathon.
Table of Contents
Advantages of BigQuery
- Scalability: BigQuery is a fully managed, cloud-native data warehouse that can easily handle petabyte-scale datasets. This makes it an ideal platform for machine learning, as it can handle large amounts of data and provide fast, interactive results.
- Cost-effectiveness: BigQuery is designed to be cost-effective, with a flexible pricing model that allows you to only pay for what you use. This makes it an affordable option for machine learning, even for large and complex projects.
- Integration with other Google Cloud services: BigQuery integrates seamlessly with other Google Cloud services, such as Google Cloud Storage, Google Cloud AI Platform, and Google Cloud Data Studio, providing a complete machine learning solution that is easy to use and scalable.
- SQL Support: BigQuery supports standard SQL, which makes it easy for data analysts and developers to work with data and build machine learning models, even if they do not have a background in machine learning.
- Security and Privacy: BigQuery implements the highest levels of security and privacy, with support for encryption at rest and in transit and strict access controls to ensure your data is secure and protected.
- Real-time Analytics: BigQuery provides real-time analytics capabilities, allowing you to run interactive queries on large datasets and get results in seconds. This makes it an ideal platform for machine learning, enabling you to test and iterate your models quickly and easily.
- Open-source IGntegrations: BigQuery supports several open-source integrations, such as TensorFlow, Pandas, and scikit-learn, which makes it easy to use existing machine learning libraries and models with BigQuery.
Step-by-step Tutorial to Build a Machine Learning Model in BigQuery
This guide will provide a step-by-step tutorial on how to build a machine-learning model in BigQuery, covering the following five main stages:
- Extract, Transform, and Load (ETL) Data into BigQuery
- Select and Preprocess Features
- Create the Model Inside BigQuery
- Evaluate the Performance of the Trained Model
- Use the Model to Make Prediction
Step 1. Extract, Transform, and Load (ETL) Data into BigQuery
The first step in building a machine learning model in BigQuery is to get the data into the database. BigQuery supports loading data from various sources, including cloud storage (such as Google Cloud Storage or AWS S3), local files, or even other databases (such as Google Cloud SQL).
For the purposes of this tutorial, we will assume that we have a dataset in Google Cloud Storage that we want to load into BigQuery. The data in this example will be a simple CSV file with two columns: ‘age’ and ‘income’. Our goal is to build a model that predicts income based on age.
To load the data into BigQuery, we first need to create a new table in the database. This can be done using the BigQuery web interface or the command line tool.
Once the table is created, we can use the `bq` command line tool to load the data from our Cloud Storage bucket:
bq load --source_format=CSV mydataset.mytable gs://analyticsvidya/myfile.csv age:INTEGER,income:FLOAT
This command specifies that the data is in CSV format and that the columns ‘age’ and ‘income’ should be treated as an integer and float values, respectively.
Step 2. Select and Preprocess Features
The next step is to select and preprocess the features we want to use in our model. In this case, we only have two columns in our dataset, ‘age’ and ‘income’, so there’s not much to do here. However, in real-world datasets, there may be many columns with various data types, missing values, or other issues that need to be addressed.
One common preprocessing step is to normalize the data. Normalization is the process of scaling the values of a column to a specific range, such as [0,1]. This is useful for avoiding biases in the model due to differences in the scales of different columns.
In BigQuery, we can normalize the data using the `NORMALIZE` function:
WITH data AS (
SELECT age, income
FROM mydataset.mytable
)
SELECT
age,
NORMALIZE(CAST(income AS STRING)) as income_norm
FROM data
This query creates a new table with the same data as the original table but with normalized income values.
Step 3. Create the Model Inside BigQuery
Once the data is preprocessed, we can create the machine learning model. BigQuery supports a variety of machine learning models, including linear regression, logistic regression, decision trees, and more.
For this example, we will use a simple linear regression model, which predicts a continuous target variable based on one or more independent variables. In our case, the target variable is income, and the independent variable is age.
To create the linear regression model in BigQuery, we use the `CREATE MODEL` statement:
CREATE MODEL mydataset.mymodel
OPTIONS
(model_type='linear_reg',
input_label_cols=['income']) AS
SELECT
age,
income
FROM `mydataset.mytable`
This statement creates a new model called `mymodel` in the `mydataset` dataset. The `OPTIONS` clause specifies that the model type is a linear regression model, and the input label column is `income_norm.` The `AS` clause specifies the query that returns the data that will be used to train the model.
Following are the different models supported by BigQuery
- Linear Regression: This is a simple and widely used model for predicting a continuous target variable based on one or more independent variables.
- Logistic Regression: This type of regression model predicts a binary target variable based on one or more independent variables, such as yes/no or 0/1.
- K-Means Clustering: This is an unsupervised learning algorithm that is used to divide a set of data points into K clusters, where each cluster is represented by its centroid.
- Time-series Models: These models are used to forecast future values based on past values of time-series data, such as sales data or stock prices.
- Random Forest: This ensemble learning method combines the predictions of multiple decision trees to create a more accurate and robust prediction.
- Gradient Boosted Trees: This is another ensemble learning method that combines the predictions of multiple decision trees but uses a gradient-based optimization approach to create a more accurate and robust prediction.
- Neural Networks: This is a machine learning model inspired by the structure and function of the human brain. Neural networks are used for various tasks, such as image classification, natural language processing, and speech recognition.
In addition to these models, BigQuery also provides several pre-trained models and APIs that can be used for common machine learning tasks, such as sentiment analysis, entity recognition, and image labeling.
Step 4. Evaluate the Performance of the Trained Model
Once the model is created, we need to evaluate its performance to see how well it can predict the target variable based on the independent variable. This can be done by splitting the data into a training set and a test set and using the training set to train the model and the test set to evaluate its performance.
In BigQuery, we can use the `ML.EVALUATE` function to evaluate the performance of the model:
SELECT
*
FROM
ML.EVALUATE(MODEL mydataset.mymodel,
(
SELECT
age,
income
FROM
`mydataset.mytable`))
This query returns various metrics, including mean squared error, root mean squared error, mean absolute error, and R-squared, all of which measure how well the model can predict the target variable.
Step 5. Use the Model to Make Predictions
Finally, once we have evaluated the model’s performance, we can use it to predict new data. In BigQuery, we can use the `ML.PREDICT` function to make predictions:
SELECT
*
FROM
ML.PREDICT(MODEL `mydataset.mymodel`,
(
SELECT
age,
FROM
`mydataset.mytable`))
This query returns the original data along with the predicted income values.
Conclusion
In this tutorial, we showed how to build a machine learning model in BigQuery, from loading the data to making predictions. We covered the key steps of ETL, feature selection and preprocessing, model creation, performance evaluation, and prediction. By following these steps, you can build and run your own machine learning models directly inside BigQuery, taking advantage of its fast and cost-effective data processing capabilities.
Key Takwaways from this article:
- BigQuery supports several open-source integrations, such as TensorFlow, Pandas, and scikit-learn, which makes it easy to use existing machine learning libraries and models with BigQuery.
- BigQuery also provides several pre-trained models and APIs that can be used for common machine learning tasks, such as sentiment analysis, entity recognition, and image labeling.
- In the dataset we studied, query returns the original data along with the predicted income values.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.