In this article, we will build a gold price prediction project from scratch. Every data science project follows a set of broad steps, which need not always come in the same order; in this project we will go through them sequentially.
- Problem Formulation
- Data preprocessing
- Data wrangling
- Model Development
- Model Explainability
- Model Deployment
Problem Formulation
Problem formulation is one of the most important steps before starting any project: there has to be a clear idea about the goal of the data science project. In our case, the goal is to analyze the price of gold. Gold prices are volatile and change rapidly with time, so the main aim of this project is to predict the price of gold per unit.
Importing Libraries
We will import all the libraries used throughout this article in one place, so that we do not have to import them every time we need them. This saves both time and effort.
- Pandas – A Python library built on top of NumPy for dataframe manipulation; it is also used for data cleaning, data merging, data reshaping, and data aggregation
- NumPy – A Python library used for numerical computation and for handling multidimensional ndarrays; it also provides a very large collection of mathematical functions that operate on these arrays
- Matplotlib – Used for plotting 2D and 3D visualizations; it also supports a variety of output formats for saving figures
- Seaborn – A library built on top of Matplotlib, used for plotting attractive statistical plots.
Python3
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
Loading the Dataset
We will read the dataset using the pandas read_csv() function. We can also pass the parse_dates argument, which converts the Date column to datetime dtype. One thing to notice: initially the Date column's dtype is object, but converting it to datetime is useful for plotting and other computations. You can download the dataset used in this article for the model development from here.
Python3
# read the dataset using the pandas read_csv function
# use the parse_dates argument to convert the Date column to datetime dtype
dataset = pd.read_csv("gold_price_data.csv", parse_dates=["Date"])
We will use the pandas info() function to see the data type of each column and to check whether any column contains null values.
Python3
# information about the dataset
dataset.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2290 entries, 0 to 2289
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Date     2290 non-null   datetime64[ns]
 1   SPX      2290 non-null   float64
 2   GLD      2290 non-null   float64
 3   USO      2290 non-null   float64
 4   SLV      2290 non-null   float64
 5   EUR/USD  2290 non-null   float64
dtypes: datetime64[ns](1), float64(5)
memory usage: 107.5 KB
Data preprocessing – Missing Values/Null Values
Missing values have a drastic effect on model training. Some models, like LinearRegression, cannot fit a dataset that has missing values, while others, like RandomForest, work reasonably well even with missing data. Still, it is always good practice to handle missing values first when working with a dataset. Also note that when we load the data using pandas, it automatically detects missing values and replaces them with NaN.
Python3
# Missing Values/Null Values Count
dataset.isna().sum().sort_values(ascending=False)
Output:
Date       0
SPX        0
GLD        0
USO        0
SLV        0
EUR/USD    0
dtype: int64
It will count the number of null values in each column of the dataset and display it in the notebook.
Correlation Between Columns
We should always check whether any two columns of our dataset are correlated. If two or more columns are correlated with each other and none of them is the target variable, we should use a method to remove this correlation. One popular method is PCA (Principal Component Analysis); we can also drop one of the two columns or combine them into a new one (a PCA sketch is shown after the heatmap below).
Python3
# Calculate correlation matrix
correlation = dataset.corr()

# Create heatmap
sns.heatmap(correlation, cmap='coolwarm', center=0, annot=True)

# Set title and axis labels
plt.title('Correlation Matrix Heatmap')
plt.xlabel('Features')
plt.ylabel('Features')

# Show plot
plt.show()
Output:
Here the columns SLV and GLD are strongly correlated with each other compared to the rest, so we will drop SLV, since the GLD column also has a large correlation with our target column. We use the pandas drop function to drop the column along axis=1.
Python3
# drop the SLV column
dataset.drop("SLV", axis=1, inplace=True)
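As mentioned above, PCA is an alternative to dropping one of the correlated columns. The sketch below is only an illustration and not part of the original pipeline: it would be run instead of the drop above (i.e. while SLV still exists), and the new column name gld_slv_pc1 is a hypothetical choice.
Python3
# Hypothetical alternative (illustration only): instead of dropping SLV,
# replace the correlated GLD/SLV pair with a single principal component.
# This would be run *instead of* the drop above, while SLV still exists.
from sklearn.decomposition import PCA

pca = PCA(n_components=1)

# fit PCA on the two correlated columns and keep one combined component;
# 'gld_slv_pc1' is a hypothetical name for the new feature
dataset["gld_slv_pc1"] = pca.fit_transform(dataset[["GLD", "SLV"]])[:, 0]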
Data Wrangling
Data wrangling is one of the main steps of a data science project: it is where we gain insight and knowledge from the data. We look at the data from every angle and try to extract as much information from the dataframe as possible.
We will first set the Date column as the index of the dataframe; using the date as the index is an advantage when plotting the data.
Python3
# set the Date column as the index
dataset.set_index("Date", inplace=True)
Next, we will observe how the gold price changes across consecutive days.
Python3
# plot price of gold for each increasing day
dataset["EUR/USD"].plot()
plt.title("Change in price of gold through date")
plt.xlabel("date")
plt.ylabel("price")
plt.show()
Output:
From this graph we cannot extract much insight into how the price of gold changes; it looks very noisy. To see the trend in the data, we have to smooth the graph.
Trend in Gold Prices Using Moving Averages
To visualize the trend, we have to apply a smoothing process to this noisy-looking line. There are several ways to apply smoothing; in this project we will take the average of the 20 previous data points using the pandas rolling function. This is also known as a moving average.
Python3
# apply rolling mean with a window size of 20
dataset["price_trend"] = dataset["EUR/USD"]\
    .rolling(window=20).mean()

# move the Date index back into a regular column
dataset.reset_index("Date", inplace=True)

# since we have used a rolling window of 20,
# the first 19 rows will be NaN
dataset["price_trend"].loc[20:].plot()

# set title of the chart
plt.title("Trend in price of gold through date")

# set x and y labels of the plot
plt.xlabel("date")
plt.ylabel("price")
plt.show()
Output:
Now the graph looks less noisy, and we can analyze the trend in the gold price.
Distribution of Columns
To see the distribution of the numerical columns, we will plot the histogram of each column in a single figure using Matplotlib's subplot function.
Python3
fig = plt.figure(figsize=(8, 8))

# suptitle of the graph
fig.suptitle('Distribution of data across column')

temp = dataset.drop("Date", axis=1).columns.tolist()
for i, item in enumerate(temp):
    plt.subplot(2, 3, i + 1)
    sns.histplot(data=dataset, x=item, kde=True)

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=2.0)
plt.show()
Output:
Here we have used plt.figure to initialize a figure and set its size (whenever we create a plot with matplotlib, this function is called automatically with the default figsize), and sns.histplot to create the histogram plots with kde=True. The data distribution looks reasonable; however, we will also check the skewness of each column using the pandas skew function.
Python3
# skewness along the index axis
print(dataset.skew(axis=0, skipna=True))
Output:
SPX        0.263937
GLD        0.375042
USO        0.981611
EUR/USD   -0.170009
dtype: float64
The column USO has the highest skewness at 0.98, so we will apply a square root transformation to this column to reduce its skewness. Different transformation functions can be used to lower skewness, such as the logarithmic transformation and the inverse transformation.
Python3
# apply square root transformation
# on the skewed column
dataset["USO"] = dataset["USO"]\
    .apply(lambda x: np.sqrt(x))
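To confirm that the transformation helped, we can recompute the skewness of the USO column; this quick check is not in the original walkthrough but uses only what is already defined above.
Python3
# re-check the skewness of USO after the square root transformation;
# a logarithmic transform would be applied the same way, e.g. np.log1p(column)
print("USO skewness after sqrt:", dataset["USO"].skew())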
Handling Outliers
Outliers can have a very bad effect on our model; in linear regression, for example, a single outlier can add a very large squared error. Removing outliers is a useful part of EDA. Some models, such as decision trees and ensemble methods like random forests, are not affected as much by outliers, but it is still good practice to handle them.
Plotting Boxplot to Visualize the Outliers
Boxplots are very useful for showing the spread and skewness of the data, and they also plot individual outlier points. The box represents the range from the 25th to the 75th percentile, the line in the middle of the box represents the median, and the whiskers at the ends of the box show the range of the data below the 25th and above the 75th percentiles, excluding outliers.
Python3
fig = plt.figure(figsize=(8, 8))
temp = dataset.drop("Date", axis=1).columns.tolist()
for i, item in enumerate(temp):
    plt.subplot(2, 3, i + 1)
    sns.boxplot(data=dataset, x=item, color='violet')
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=2.0)
plt.show()
Output:
It can be seen clearly that the column ‘USO’ has outliers, so we create a function to cap the outliers present in a column.
Python3
def outlier_removal(column):
    # Cap the outlier rows with percentiles
    upper_limit = column.quantile(.95)  # set upper limit to the 95th percentile
    lower_limit = column.quantile(.05)  # set lower limit to the 5th percentile
    column.loc[(column > upper_limit)] = upper_limit
    column.loc[(column < lower_limit)] = lower_limit
    return column
Here we have set the upper limit of the column to the 95th percentile of the data and the lower limit to the 5th percentile. This means that values greater than the 95th percentile are capped at the 95th-percentile value, and likewise values lower than the 5th percentile are capped at the 5th-percentile value.
Python3
# Normalize outliers in columns except Date
dataset[['SPX', 'GLD', 'USO', 'EUR/USD']] = \
    dataset[['SPX', 'GLD', 'USO', 'EUR/USD']].apply(outlier_removal)
Here, using the pandas apply function, we have applied the outlier_removal function to each of the selected columns.
Modeling the Data
Before we start modeling, we must divide the data into train and test sets so that, after training, we can see how well our model has learned the pattern and how well it generalizes to new data points. It is also a way to check that the model is not learning the noise in the data, i.e. not overfitting the dataset.
Python3
# select the features and target variable
X = dataset.drop(['Date', 'EUR/USD'], axis=1)
y = dataset['EUR/USD']

# divide the dataset into train and test sets
x_train, x_test, \
    y_train, y_test = train_test_split(X, y, test_size=0.2)
Here we first drop the Date column and our target variable and store the rest in the variable X, which holds our independent variables; we store the target variable in y. We split the data in an 80:20 ratio, although this can be changed as needed.
Scaling the Data
Before we train the model, we should scale the data to normalize it. After scaling, the mean of each column becomes zero and its standard deviation becomes 1. This is also called z-score normalization, since we subtract the column mean from each element and divide by the column's standard deviation. It brings all the columns to the same scale and makes them directly comparable with one another.
Python3
# Create an instance of the StandardScaler
scaler = StandardScaler()

# Fit the StandardScaler on the training dataset
scaler.fit(x_train)

# Transform the training and test datasets
# using the fitted StandardScaler
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
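As a quick sanity check (not part of the original walkthrough), we can verify that each scaled training column now has a mean close to 0 and a standard deviation close to 1.
Python3
# after scaling, each training column should have mean ~0 and std ~1
print("column means:", x_train_scaled.mean(axis=0).round(6))
print("column stds: ", x_train_scaled.std(axis=0).round(6))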
It is always advisable to start fitting the data with a simple model and then move to a more complex one. A simple model takes less time and storage to train, many simple models work far better than complex ones, and they are also more interpretable.
Lasso Regression
In this model we use linear regression with L1 regularization (lasso). With the help of make_pipeline, we apply lasso regression on polynomial features of degree 2. We will also use GridSearchCV with every model to find the best-performing hyperparameters and lower the variance.
Python3
# Create a PolynomialFeatures object of degree 2
poly = PolynomialFeatures(degree=2)

# Create a Lasso object
lasso = Lasso()

# Define a dictionary of parameter
# values to search over
param_grid = {'lasso__alpha': [1e-4, 1e-3, 1e-2, 1e-1,
                               1, 5, 10, 20, 30, 40]}

# Create a pipeline that first applies
# polynomial features and then applies Lasso regression
pipeline = make_pipeline(poly, lasso)

# Create a GridSearchCV object with
# the pipeline and parameter grid
lasso_grid_search = GridSearchCV(pipeline, param_grid,
                                 scoring='r2', cv=3)

# Fit the GridSearchCV object to the training data
lasso_grid_search.fit(x_train_scaled, y_train)

# Predict the target variable using
# the fitted model and the training data
y_pred = lasso_grid_search.predict(x_train_scaled)

# Compute the R-squared of the fitted model on the train data
r2 = r2_score(y_train, y_pred)

# Print the R-squared
print("R-squared: ", r2)

# Print the best parameter values and score
print('Best parameter values: ', lasso_grid_search.best_params_)
print('Best score: ', lasso_grid_search.best_score_)
Output:
R-squared:  0.8387982103677535
Best parameter values:  {'lasso__alpha': 0.0001}
Best score:  0.8355941438243204
Here we fit a polynomial regression of degree 2; to combine it with lasso regression we use the pipeline utility from sklearn. We also use grid search for cross-validation and for selecting the best-performing hyperparameters on the training data. Grid search is one of the best ways to find a model that does not overfit the training data.
We use the R-squared evaluation metric throughout this article, since we want to compare our models and choose the best-performing one. R-squared compares the model's squared error against a baseline that always predicts the mean of the target: a value of 1 means a perfect fit and a value of 0 means the model is no better than that baseline.
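For reference, the same quantity that r2_score returns can be computed by hand from the training predictions above; this small sketch is only an illustration of the metric.
Python3
# manual R-squared on the training data, equivalent to r2_score(y_train, y_pred)
ss_res = np.sum((y_train - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_train - y_train.mean()) ** 2)  # total sum of squares
print("manual R-squared:", 1 - ss_res / ss_tot)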
RandomForestRegressor for Regression
In the second model we use an ensemble method to fit the training data. A random forest fits several decision trees to the data; each tree is trained on a bootstrap sample of the rows, and at each split only a random subset of m features is considered out of the n features available in the training dataset (m < n).
Python3
# Instantiate the parameter grid to search over
param_grid = {'n_estimators': [50, 80, 100],
              'max_depth': [3, 5, 7]}

# create an instance of the RandomForestRegressor
rf = RandomForestRegressor()

# Define GridSearchCV with the random forest
# object, parameter grid, scoring and cv
rf_grid_search = GridSearchCV(rf, param_grid,
                              scoring='r2', cv=2)

# Fit the GridSearchCV object to the training data
rf_grid_search.fit(x_train_scaled, y_train)

# Print the best parameter values and score
print('Best parameter values: ', rf_grid_search.best_params_)
print('Best score: ', rf_grid_search.best_score_)
Output:
Best parameter values:  {'max_depth': 7, 'n_estimators': 50}
Best score:  0.9463515663955014
Here we have used both RandomForestRegressor and GridSearchCV. The grid search helps select the best number of decision trees from 50, 80, and 100; we have also specified the maximum depth of the trees as a parameter, which can be 3, 5, or 7.
The best parameter values show that the model gives the best result when it averages the predictions of fifty decision trees with a maximum depth of 7.
Python3
# Compute the R-squared of the
# fitted model on the test data
r2 = r2_score(y_test, rf_grid_search.predict(x_test_scaled))

# Print the R-squared
print("R-squared:", r2)
Output:
R-squared: 0.9519989288862035
These models are called black-box models, since we cannot directly see what is happening under the hood. However, we can plot a bar chart of the feature importances learned from the dataset.
Python3
# use the model's input features so the labels line up with the importances
features = X.columns

# store the importance of each feature
importances = rf_grid_search.best_estimator_.\
    feature_importances_
indices = np.argsort(importances)

# title of the graph
plt.title('Feature Importance')

# plot bar chart
plt.barh(range(len(indices)), importances[indices],
         color='red', align='center')
plt.yticks(range(len(indices)),
           [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Output:
This feature importance graph shows that the USO column has a major effect (more than 2x the others) on the predicted price.
XGBoost Model for Regression
In boosting, the data is fitted by a sequence of weak learners, models that are only slightly better than random guessing. Each subsequent model gives more weight to the points that were poorly predicted by the previous models.
In our project, we will use the XGBoost model to fit the training dataset.
Python3
# Create an instance of the XGBRegressor model
model_xgb = XGBRegressor()

# Fit the model to the training data
model_xgb.fit(x_train_scaled, y_train)

# Print the R-squared score on the training data
print("Xgboost Accuracy =", r2_score(
    y_train, model_xgb.predict(x_train_scaled)))
Output:
Xgboost Accuracy = 0.99813247897056
Now let’s evaluate this model as well using the testing data.
Python3
# Print the R-squared score on the test data
print("Xgboost Accuracy on test data =",
      r2_score(y_test, model_xgb.predict(x_test_scaled)))
Output:
Xgboost Accuracy on test data = 0.9728930078121006
Through the graph, we can again see that the USO column plays a major role in deciding the predicted value.
Model Explainability
In black-box models such as boosting and bagging ensembles, we cannot see the actual weights given to each column. However, there are libraries that show the fraction of weight (out of 1) given to a particular column when predicting a single vector. We will use the eli5 package to demonstrate model explainability. You can install it by running the following command in the terminal.
pip install eli5
The name eli5 stands for “Explain Like I’m 5”. It is a popular Python library used for debugging and explaining machine learning models. We will use eli5 to see the weights of our best-performing model, which is XGBoost, based on its train and test accuracy.
Python3
import eli5 as eli

# weights of variables in the xgboost model
# Get the names of the features
feature_names = x_train.columns.tolist()

# Explain the weights of the features using ELI5
eli.explain_weights(model_xgb, feature_names=feature_names)
Output:
Model Deployment using Pickle
To deploy the model we will use the pickle library from Python. We will deploy our best-performing model, which is XGBoost. Pickle is a Python module used for serializing and deserializing models, i.e. saving and loading them. It stores Python objects that can be written to disk (serialized) and then loaded back from disk into memory (deserialized).
Python3
# dump model using the pickle library
import pickle

# dump model in file model.pkl
pickle.dump(model_xgb, open('model.pkl', 'wb'))
Now we can use pickle's load function to bring model_xgb back into memory and predict on new feature vectors.
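A minimal sketch of loading the saved model and predicting with it is shown below; it is not part of the original code and simply reuses the scaled test features as stand-in "new" data.
Python3
# load the saved model back into memory
loaded_model = pickle.load(open('model.pkl', 'rb'))

# predict on new data; here we simply reuse the scaled test features,
# but any array with the same column layout would work
print(loaded_model.predict(x_test_scaled)[:5])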
Conclusions
We have built our data science project from scratch through to deployment, saving the model to disk using pickle.dump, which stores it as a byte file. We can load the model.pkl file again using pickle.load, and after loading it we can use Flask or another framework to serve the model and predict on live incoming data (a rough sketch follows).
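As a rough illustration of that last step, a minimal Flask endpoint wrapping the pickled model could look like the sketch below. The route name, input format, and the assumption that incoming features are already scaled are all choices made for this example, not part of the original article.
Python3
# app.py -- minimal, illustrative Flask wrapper around the pickled model
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open('model.pkl', 'rb'))

@app.route('/predict', methods=['POST'])
def predict():
    # expects JSON like {"features": [...]} with values already scaled
    # in the same way as the training data
    features = np.array(request.json["features"]).reshape(1, -1)
    prediction = model.predict(features)
    return jsonify({"prediction": float(prediction[0])})

if __name__ == '__main__':
    app.run(debug=True)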