Every dataset has its own characteristics, and we use those characteristics as features to gain insight into the data. In this article, we will discuss an important kind of dataset: time series data.
What Is Time Series Data
Time series data is a series of data points listed in consecutive time order. Most commonly, it is a sequence recorded at successive, equally spaced points in time. Time series analysis consists of methods for analyzing such data in order to extract meaningful insights and other valuable characteristics of the data.
Time series analysis is becoming very important in many industries, such as finance, pharmaceuticals, social media, web services, and research. To understand time series data, visualization is essential. In fact, no data analysis is complete without visualization, because one good plot can provide meaningful and interesting insights into the data.
Time Series Data Visualization using Python
We will use Python libraries for visualizing the data. The link for the dataset can be found here. We will perform the visualization step by step, as we would in any time series data project.
Importing the Libraries
We will import all the libraries used throughout this article in one place, so that we do not have to import them each time we use them. This saves both time and effort.
- NumPy – A Python library for numerical computation on multidimensional arrays (ndarray). It also has a very large collection of mathematical functions that operate on these arrays.
- Pandas – A Python library built on top of NumPy for working with labeled, tabular data in DataFrames. It is used for data cleaning, merging, reshaping, and aggregation.
- Matplotlib – A Python library for creating 2D and 3D plots. It supports a variety of output formats, such as PNG, PDF, and SVG.
Python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Loading The Dataset
To load the dataset into a dataframe, we will use the pandas read_csv() function, and we will use the head() function to print the first five rows. By default, read_csv loads dates as strings, which is not the right format for time series analysis. Here we pass index_col = "Date" together with the ‘parse_dates’ parameter so that the ‘Date’ column becomes the index and is converted to a DatetimeIndex.
Python3
# reading the dataset using read_csv
df = pd.read_csv("stock_data.csv",
                 parse_dates=True,
                 index_col="Date")

# displaying the first five rows of dataset
df.head()
Output:
Unnamed: 0 Open High Low Close Volume Name
Date
2006-01-03 NaN 39.69 41.22 38.79 40.91 24232729 AABA
2006-01-04 NaN 41.22 41.90 40.77 40.97 20553479 AABA
2006-01-05 NaN 40.93 41.73 40.85 41.53 12829610 AABA
2006-01-06 NaN 42.88 43.57 42.80 43.21 29422828 AABA
2006-01-09 NaN 43.10 43.66 42.82 43.42 16268338 AABA
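To make the effect of ‘parse_dates’ concrete, here is a minimal sketch that reads a tiny in-memory CSV twice, once without and once with date parsing. The three rows and the names csv_text, raw, and parsed are illustrative assumptions, not part of the dataset used in this article.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for stock_data.csv
csv_text = """Date,Open,Close
2006-01-03,39.69,40.91
2006-01-04,41.22,40.97
2006-01-05,40.93,41.53
"""

# Without parse_dates, the index holds the dates as plain text
raw = pd.read_csv(io.StringIO(csv_text), index_col="Date")

# With parse_dates=True, the index column is parsed into a DatetimeIndex
parsed = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(type(raw.index).__name__)
print(type(parsed.index).__name__)

# Date components are only available on the parsed index
print(parsed.index.year.tolist())
```

Only the parsed version supports date-aware operations such as resampling or selecting rows by year, which the later steps rely on.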
Dropping Unwanted Columns
We will drop columns from the dataset that are not important for our visualization.
Python3
# deleting column; drop() returns a new dataframe, so assign it back to df
df = df.drop(columns='Unnamed: 0')
df.head()
Output:
Open High Low Close Volume Name
Date
2006-01-03 39.69 41.22 38.79 40.91 24232729 AABA
2006-01-04 41.22 41.90 40.77 40.97 20553479 AABA
2006-01-05 40.93 41.73 40.85 41.53 12829610 AABA
2006-01-06 42.88 43.57 42.80 43.21 29422828 AABA
2006-01-09 43.10 43.66 42.82 43.42 16268338 AABA
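One detail worth keeping in mind is that drop() returns a new dataframe and leaves the original unchanged unless the result is assigned back or inplace = True is passed. The short sketch below illustrates this on a made-up two-row frame; the names demo and dropped are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a leftover, empty index column
demo = pd.DataFrame({"Unnamed: 0": [None, None], "Open": [39.69, 41.22]})

# drop() builds and returns a new DataFrame without the column
dropped = demo.drop(columns="Unnamed: 0")

print(list(demo.columns))     # the original frame still has both columns
print(list(dropped.columns))  # the returned frame has only "Open"
```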
Plotting a Line Plot for Time Series Data
Python3
df['Volume'].plot()
Output:
Here, we have plotted the ‘Volume’ column data.
Now let’s plot all other columns using a subplot.
Python3
df.plot(subplots=True, figsize=(4, 4))
Output:
The line plots used above are good for showing seasonality.
Seasonality: In time-series data, seasonality is the presence of variations that occur at specific regular time intervals less than a year, such as weekly, monthly, or quarterly.
Resampling: In time series analysis, resampling means changing the frequency of the observations, for example aggregating daily data into weekly or monthly values. Resampling by month or week and making bar plots is another very simple and widely used method of finding seasonality. Here we are going to make a bar plot of the monthly data for 2016 and 2017.
Resample and Plot The Data
Python3
# Resampling the time series data to monthly 'M' (month-end) frequency;
# numeric_only=True skips the text column 'Name' when averaging
df_month = df.resample("M").mean(numeric_only=True)

# using subplot
fig, ax = plt.subplots(figsize=(6, 6))

# plotting bar graph of the monthly mean volume from 2016 onward
ax.bar(df_month.loc['2016':].index,
       df_month.loc['2016':, "Volume"],
       width=25, align='center')
Output:
There are 24 bars in the graph and each bar represents a month.
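To see what resampling does to the numbers themselves, here is a minimal sketch on synthetic daily data (not the stock dataset). It assumes a made-up series named volume covering January and February 2016 and uses the "MS" (month start) frequency alias, which labels each month by its first day. The article's code uses the month-end alias "M", for which recent pandas versions prefer the spelling "ME".

```python
import numpy as np
import pandas as pd

# Synthetic daily series for January and February 2016 (60 days, values 0..59)
days = pd.date_range("2016-01-01", "2016-02-29", freq="D")
volume = pd.Series(np.arange(len(days), dtype=float), index=days, name="Volume")

# Group the daily values by calendar month and average each group
monthly_mean = volume.resample("MS").mean()

print(monthly_mean)
```

Each row of the result corresponds to one bar in a monthly bar plot: 60 daily values collapse into two monthly averages.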
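Differencing: Differencing computes the difference between each value and the value a specified number of periods earlier. By default the interval is one period, but we can specify a different interval, as in the plots below, which use an interval of two. It is one of the most popular methods for removing trends from the data.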
Python3
df.Low.diff(2).plot(figsize=(6, 6))
Output:
Python3
df.High.diff(2).plot(figsize=(10, 6))
Output:
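As a small numeric illustration of differencing, the sketch below applies diff() with intervals of one and two to a made-up price series and checks that diff(2) matches subtracting a copy shifted by two rows. The values and the names prices, lag1, lag2, and manual are assumptions for illustration only.

```python
import pandas as pd

# Made-up prices, not taken from the stock dataset
prices = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0])

lag1 = prices.diff()    # each value minus the previous value
lag2 = prices.diff(2)   # each value minus the value two rows earlier

# diff(2) is equivalent to subtracting a copy shifted down by two rows
manual = prices - prices.shift(2)

print(lag1.tolist())
print(lag2.tolist())
print(lag2.equals(manual))
```

The first one or two entries are missing (NaN) because there is no earlier value to subtract.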
Trend In The Dataset
We can also look at the trend in our dataset. The trend shows whether the values we are considering are moving upward or downward in the long run. One common way to estimate it is a moving average, which smooths out short-term fluctuations.
Python code for Trend
Python3
# Finding the trend in the "Open" column
# using the moving average method
window_size = 50
rolling_mean = df['Open'].rolling(window_size).mean()
rolling_mean.plot()
Output:
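To show how a moving average is computed, here is a minimal sketch with a window of three on a short made-up series. The names values and smoothed are hypothetical, and the small window is chosen only so the arithmetic is easy to follow.

```python
import pandas as pd

# Made-up values; the article's code uses the "Open" column with a window of 50
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Each output value is the mean of the current value and the two before it
smoothed = values.rolling(3).mean()

print(smoothed.tolist())
```

The first two entries are NaN because a full window of three values is not yet available. With a window of 50, the first 49 entries are empty in the same way, which is why the trend line starts later than the raw data.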
Plotting the Changes in Data
We can also plot the changes that occurred in data over time. There are a few ways to plot changes in data.
Shift: The shift function moves the data forward or backward by a specified number of periods. By default it shifts by one period, so each row receives the value of the previous row, which for daily data is the previous trading day's value. This is helpful for comparing the previous day's data and today's data side by side.
In this code, the .div() function performs element-wise division. For example, df.div(6) divides each element in df by 6. Note that it does not fill in missing values: shift() leaves the first row empty (NaN) because there is no previous day, so the first value of the result is also NaN.
Here, df.Close.div(df.Close.shift()) divides each closing price by the previous day's closing price. A value above 1 means the price rose compared with the previous day, and a value below 1 means it fell.
Python3
df['Change'] = df.Close.div(df.Close.shift())
df['Change'].plot(figsize=(10, 8), fontsize=16)
Output:
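The sketch below walks through shift() and div() on three made-up closing prices, so the intermediate values are visible. The dates, prices, and the names close, previous, and change are illustrative assumptions.

```python
import pandas as pd

# Three made-up closing prices on consecutive days
close = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.to_datetime(["2017-01-02", "2017-01-03", "2017-01-04"]),
)

previous = close.shift()       # previous row's value; the first row becomes NaN
change = close.div(previous)   # ratio of each close to the previous close

print(pd.DataFrame({"close": close, "previous": previous, "change": change}))
```

The first ratio is NaN because there is no earlier price. The second is about 1.1 (a 10% rise) and the third is about 0.9 (a 10% fall).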
We can also take a specific interval of time and plot to have a clearer look. Here we are plotting the data of only 2017.
Python3
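# selecting the rows of 2017 with .loc; selecting rows with df['2017']
# no longer works in recent pandas versions
df.loc['2017', 'Change'].plot(figsize=(10, 6))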
Output:
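Selecting a single year relies on partial string indexing, which is available when the index is a DatetimeIndex. Here is a minimal sketch using .loc on a made-up five-day frame that spans the end of 2016 and the start of 2017; the name demo and the values are hypothetical.

```python
import pandas as pd

# Five consecutive days: two in 2016 and three in 2017
idx = pd.date_range("2016-12-30", periods=5, freq="D")
demo = pd.DataFrame({"Change": [1.00, 1.01, 0.99, 1.02, 0.98]}, index=idx)

# The string "2017" selects every row whose date falls in that year
only_2017 = demo.loc["2017", "Change"]

print(only_2017)
```

The same pattern works for narrower periods, for example demo.loc["2017-01"] for a single month.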
Box Plot in Time Series Dataset
We can also use a box plot to see the distribution of values in a specific column. Let's take an example. Here we create a new column named Year from the ‘Date’ column using the datetime accessor, and then plot the ‘Open’ column on the y-axis, grouped by year.
Python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# reading the dataset using read_csv
df = pd.read_csv("stock_data.csv", parse_dates=True)
df.drop(columns='Unnamed: 0', inplace=True)
df['Date'] = pd.to_datetime(df['Date'])

# extract year from date column
df["Year"] = df["Date"].dt.year

# box plot grouped by year
sns.boxplot(data=df, x="Year", y="Open")
Output:
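A box plot is drawn from the quartiles of each group, so the same information can be inspected as numbers. Here is a minimal sketch that computes per-year summary statistics with groupby() and describe() on a made-up six-row frame; the data and the names demo and summary are assumptions for illustration.

```python
import pandas as pd

# Made-up opening prices, three per year
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2016-03-01", "2016-06-01", "2016-09-01",
                            "2017-03-01", "2017-06-01", "2017-09-01"]),
    "Open": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Extract the year, then summarize "Open" within each year
demo["Year"] = demo["Date"].dt.year
summary = demo.groupby("Year")["Open"].describe()

# The 25%, 50% and 75% columns correspond to the box edges and the median line
print(summary[["min", "25%", "50%", "75%", "max"]])
```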