Time series data are characterized by their temporal ordering. This temporal structure often shows up as a trend or seasonality, which is what time series analysis and forecasting methods try to model. A time series is said to be stationary if its statistical properties, such as the mean and variance, do not change over time, in other words if it has no temporal structure such as a trend or seasonality. Checking whether the data is stationary is therefore an essential first step: many classical forecasting methods assume a stationary series, so a non-stationary series usually has to be transformed before it is modeled.
Example plot of stationary data:
Types of stationarity:
Checking whether data is stationary means checking for one of several more fine-grained notions of stationarity. The types of stationarity observed in time series data include:
- Trend Stationary – A time series that does not show a trend.
- Seasonal Stationary – A time series that does not show seasonal changes.
- Strictly Stationary – The joint distribution of observations is invariant to time shift.
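To make these definitions concrete, the following minimal sketch compares two synthetic series: Gaussian white noise, which is stationary, and a random walk built from that same noise, which is not. The helper names `white_noise` and `random_walk`, the seed, and the series length are assumptions chosen for illustration and are not part of the original tutorial.

```python
import random
import statistics


def white_noise(n, seed=0):
    """Independent Gaussian draws: constant mean and variance (stationary)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def random_walk(noise):
    """Running sum of the noise: its variance grows over time (non-stationary)."""
    walk, total = [], 0.0
    for step in noise:
        total += step
        walk.append(total)
    return walk


noise = white_noise(2000)
walk = random_walk(noise)

# Compare the mean and variance of the first and second half of each series.
half = len(noise) // 2
for name, series in (("white noise", noise), ("random walk", walk)):
    first, second = series[:half], series[half:]
    print(name)
    print("  means:     %.3f  %.3f" % (statistics.mean(first), statistics.mean(second)))
    print("  variances: %.3f  %.3f" % (statistics.pvariance(first), statistics.pvariance(second)))
```

For the white noise, both halves should report similar statistics. For the random walk, the two halves will usually differ noticeably, although the exact numbers depend on the seed.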
Stepwise Implementation
The following steps show how to check whether a given time series is stationary.
Step 1: Plotting the time series data
This step uses the practice dataset daily-female-births-IN.csv.
Python3
# import the pandas library
import pandas as pd

# import matplotlib for plotting
import matplotlib.pyplot as plt

# read the dataset using the pandas read_csv() function
data = pd.read_csv("daily-total-female-births-IN.csv", header=0, index_col=0)

# use a simple line plot to see the distribution of the data
plt.plot(data)

# display the figure when running as a script
plt.show()
Output:
Step 2: Evaluating the descriptive statistics
A simple check is to split the data into two or more partitions and calculate the mean and variance of each one. If these summary statistics are consistent across the partitions, we can assume that the data is stationary. Let’s use the airline passenger count dataset, which covers 1949 to 1960.
This step uses the practice dataset AirPassengers.csv.
Python3
# import the pandas library
import pandas as pd

# import matplotlib for plotting
import matplotlib.pyplot as plt

# read the dataset using the pandas read_csv() function
data = pd.read_csv("AirPassengers.csv", header=0, index_col=0)

# print the first 10 rows of data
print(data.head(10))

# use a simple line plot to understand the data distribution
plt.plot(data)

# display the figure when running as a script
plt.show()
Output:
Now, let’s partition this data into different groups and calculate the mean and variance of different groups and check for consistency.
Python3
# import the pandas library
import pandas as pd

# use the pandas read_csv() function to read the dataset
data = pd.read_csv("AirPassengers.csv", header=0, index_col=0)

# extract only the air passengers count from the dataset
values = data.values

# size of each partition when splitting the dataset into 3
parts = len(values) // 3

# split the data into three parts
part_1 = values[0:parts]
part_2 = values[parts:parts * 2]
part_3 = values[parts * 2:parts * 3]

# calculate the mean of each part
mean_1, mean_2, mean_3 = part_1.mean(), part_2.mean(), part_3.mean()

# calculate the variance of each part
var_1, var_2, var_3 = part_1.var(), part_2.var(), part_3.var()

# print the mean of the three groups
print('mean1=%f, mean2=%f, mean3=%f' % (mean_1, mean_2, mean_3))

# print the variance of the three groups
print('variance1=%f, variance2=%f, variance3=%f' % (var_1, var_2, var_3))
Output:
The output shows that the mean and variance of the three groups differ considerably from each other, which indicates that the data is non-stationary. If, for example, the means were mean_1 = 150, mean_2 = 160, mean_3 = 155 and the variances were variance_1 = 33, variance_2 = 35, variance_3 = 37, we could conclude that the data is stationary. This method can fail for some distributions, such as log-normal distributions.
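One way to make "consistent" concrete is to measure how far the partition statistics spread relative to their average. The sketch below applies this to the example numbers from the paragraph above. The helper name `relative_spread` and the 0.25 tolerance are assumptions for illustration, not an established rule.

```python
def relative_spread(stats):
    """Range of the partition statistics divided by their average."""
    average = sum(stats) / len(stats)
    return (max(stats) - min(stats)) / average


# Example values quoted in the text above.
means = [150, 160, 155]
variances = [33, 35, 37]

# Hypothetical tolerance: treat a spread under 25% as "consistent".
TOLERANCE = 0.25

mean_spread = relative_spread(means)          # 10 / 155, about 0.065
variance_spread = relative_spread(variances)  # 4 / 35, about 0.114

print("mean spread: %.3f" % mean_spread)
print("variance spread: %.3f" % variance_spread)
print("looks stationary:", mean_spread < TOLERANCE and variance_spread < TOLERANCE)
```

With these numbers both spreads fall well under the tolerance, which matches the conclusion in the text. A suitable tolerance depends on the data and should be chosen with care.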
Let’s try the same example as above but take the log of the passengers’ count using NumPy’s log() function and check the results.
Python3
# import the pandas library
import pandas as pd

# import matplotlib for plotting
import matplotlib.pyplot as plt

# import the numpy library
import numpy as np

# read the dataset using the pandas read_csv() function
data = pd.read_csv("AirPassengers.csv", header=0, index_col=0)

# extract the air passengers count and apply the log transform
values = np.log(data.values)

# print the first 15 log-transformed passenger counts
print(values[0:15])

# use a simple line plot to understand the data distribution
plt.plot(values)

# display the figure when running as a script
plt.show()
Output:
The output still shows a trend, though a less steep one than in the previous case. Now let’s compute the partition means and variances.
Python3
# continues from the previous block: values holds the log-transformed counts

# size of each partition when splitting the dataset into 3
parts = len(values) // 3

# split the data into three parts
part_1 = values[0:parts]
part_2 = values[parts:parts * 2]
part_3 = values[parts * 2:parts * 3]

# calculate the mean of each part
mean_1, mean_2, mean_3 = part_1.mean(), part_2.mean(), part_3.mean()

# calculate the variance of each part
var_1, var_2, var_3 = part_1.var(), part_2.var(), part_3.var()

# print the mean of the three groups
print('mean1=%f, mean2=%f, mean3=%f' % (mean_1, mean_2, mean_3))

# print the variance of the three groups
print('variance1=%f, variance2=%f, variance3=%f' % (var_1, var_2, var_3))
Output:
We would have expected the means and variances to be very different, but they are nearly the same. In such cases this method can fail badly. To avoid this, we can use a dedicated statistical test, which is discussed in the next step.
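The following sketch gives some intuition for why partition statistics can look deceptively stable after a log transform. It uses a made-up series with steady 1% growth per step, not the airline data. The helper name `partition_stats`, the growth rate, and the series length are assumptions for illustration.

```python
import math
import statistics


def partition_stats(values, k=3):
    """Return (mean, variance) for k equal-length consecutive partitions."""
    size = len(values) // k
    parts = [values[i * size:(i + 1) * size] for i in range(k)]
    return [(statistics.mean(p), statistics.pvariance(p)) for p in parts]


# Hypothetical series growing by about 1% per step (illustration only).
raw = [100.0 * math.exp(0.01 * t) for t in range(144)]
logged = [math.log(v) for v in raw]

for name, series in (("raw", raw), ("log", logged)):
    for mean, var in partition_stats(series):
        print("%s: mean=%.3f variance=%.4f" % (name, mean, var))
```

On the raw scale the partition variances grow quickly. On the log scale the series becomes a straight line, so every partition has the same variance and the means differ only by a small constant step, even though the trend is still present.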
Step 3: Augmented Dickey-Fuller test
This is a statistical test built specifically to check whether univariate time series data is stationary. It is a hypothesis test, and its result tells us how strong the evidence against the null hypothesis is. It belongs to the family of unit root tests, which determine how strongly a univariate time series is defined by a trend. Let’s define the null and alternative hypotheses:
- H0 (Null Hypothesis): The time series data is non-stationary
- H1 (Alternative Hypothesis): The time series data is stationary
Assume alpha = 0.05, that is, a 95% confidence level. The test result is interpreted through its p-value: if p > 0.05 we fail to reject the null hypothesis, and if p <= 0.05 we reject the null hypothesis. Now, let’s use the same air passengers dataset and test it with the adfuller() function provided by the statsmodels package to check whether the data is stationary.
Python3
# import the pandas package
import pandas as pd

# import the adfuller function from the statsmodels package
# to perform the ADF test
from statsmodels.tsa.stattools import adfuller

# read the dataset using the pandas read_csv() function
data = pd.read_csv("AirPassengers.csv", header=0, index_col=0)

# extract only the passengers count
values = data.values

# pass the passengers count to the adfuller function
# and store the result in res
res = adfuller(values)

# print the test statistic and p-value of the ADF test
print('Augmented Dickey-Fuller Statistic: %f' % res[0])
print('p-value: %f' % res[1])

# print the critical values at different alpha levels
print('critical values at different levels:')
for k, v in res[4].items():
    print('\t%s: %.3f' % (k, v))
Output:
The ADF statistic is much greater than the critical values at all levels, and the p-value is also greater than 0.05. We therefore fail to reject the null hypothesis at the 90%, 95%, and 99% confidence levels, which means the test finds no evidence of stationarity and the time series should be treated as non-stationary.
Now, let’s run the ADF test on the log-transformed values and cross-check our results.
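The decision rule described above can be wrapped in a small helper. This is a minimal sketch: the function name `interpret_adf` is hypothetical, and the numbers passed to it are made-up inputs for illustration, not the output of the test run above. With the adfuller() call shown earlier, the three arguments would correspond to res[0], res[1], and res[4].

```python
def interpret_adf(statistic, p_value, critical_values, alpha=0.05):
    """Apply the rule: reject H0 (non-stationary) only when p <= alpha."""
    reject = p_value <= alpha
    return {
        "reject_null": reject,
        "conclusion": "stationary" if reject else "non-stationary (failed to reject H0)",
        # True where the statistic is more negative than the critical value.
        "below_critical": {
            level: statistic < value for level, value in critical_values.items()
        },
    }


# Made-up inputs for illustration only.
example = interpret_adf(
    statistic=0.8,
    p_value=0.99,
    critical_values={"1%": -3.5, "5%": -2.9, "10%": -2.6},
)
print(example["conclusion"])
print(example["below_critical"])
```

Returning both the p-value decision and the per-level comparison keeps the two ways of reading the result side by side, so it is easy to see when they agree.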
Python3
# import the pandas package
import pandas as pd

# import the adfuller function from the statsmodels package
# to perform the ADF test
from statsmodels.tsa.stattools import adfuller

# import the numpy package
import numpy as np

# read the dataset using the pandas read_csv() function
data = pd.read_csv("AirPassengers.csv", header=0, index_col=0)

# extract only the passengers count and apply the log transform
values = np.log(data.values)

# pass the log-transformed passengers count to the adfuller function
# and store the result in res
res = adfuller(values)

# print the test statistic and p-value of the ADF test
print('Augmented Dickey-Fuller Statistic: %f' % res[0])
print('p-value: %f' % res[1])

# print the critical values at different alpha levels
print('critical values at different levels:')
for k, v in res[4].items():
    print('\t%s: %.3f' % (k, v))
Output:
As you can see, the ADF test once again shows that the ADF statistic is much greater than the critical values at all levels, and the p-value is much greater than 0.05. We again fail to reject the null hypothesis at the 90%, 95%, and 99% confidence levels, so the log-transformed series should also be treated as non-stationary.
Hence, the ADF unit root test is a more robust way to check whether time series data is stationary than comparing partition statistics.
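For intuition about what such a test computes, here is a minimal sketch of the simpler, non-augmented Dickey-Fuller regression written in plain Python. It is not the statsmodels implementation and it omits the lagged difference terms that make the test "augmented". The function name `dickey_fuller_stat`, the synthetic series, and the seed are assumptions for illustration. The value -2.86 is the approximate asymptotic 5% critical value for the constant-only case.

```python
import math
import random


def dickey_fuller_stat(series):
    """t-statistic of gamma in: diff(y)_t = alpha + gamma * y_(t-1) + error."""
    lagged = series[:-1]                                   # y_(t-1)
    diffs = [b - a for a, b in zip(series, series[1:])]    # y_t - y_(t-1)
    n = len(lagged)
    x_bar = sum(lagged) / n
    d_bar = sum(diffs) / n
    sxx = sum((x - x_bar) ** 2 for x in lagged)
    sxd = sum((x - x_bar) * (d - d_bar) for x, d in zip(lagged, diffs))
    gamma = sxd / sxx                       # OLS slope
    alpha = d_bar - gamma * x_bar           # OLS intercept
    ssr = sum((d - alpha - gamma * x) ** 2 for x, d in zip(lagged, diffs))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    return gamma / std_err


# Synthetic data: white noise (stationary) and its running sum (a random walk).
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(500)]
walk, total = [], 0.0
for step in noise:
    total += step
    walk.append(total)

CRITICAL_5_PERCENT = -2.86  # approximate, constant-only case
for name, series in (("white noise", noise), ("random walk", walk)):
    stat = dickey_fuller_stat(series)
    print("%s: statistic=%.3f, reject unit root at 5%%: %s"
          % (name, stat, stat < CRITICAL_5_PERCENT))
```

A strongly negative statistic, as expected for the white noise, is evidence against a unit root. For real data, adfuller() is the better choice because it also handles lag selection and reports p-values.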