Sales Forecasting Project

Sales Forecasting for XYZ Cement Company


Introduction

The Project

In an era marked by fast-moving markets, dynamic consumer preferences, and unpredictable economic conditions, the ability to foresee sales and demand patterns is pivotal to a company’s sustainable growth and strategic decision-making. In this project, we embark on a comprehensive journey, drawing upon the potential of data-driven insights. Our focal point is a company from India, XYZ Cement, a cornerstone of the construction and infrastructure sectors, where the stakes of accurate forecasting are exceptionally high.

The domain of cement production and sales, by its very nature, demands astute foresight. With infrastructural projects spanning vast landscapes, real estate dynamics fluctuating, and economic conditions evolving, the need for granular insights is paramount. We will leverage an extensive public dataset acquired from Kaggle, curated to encompass critical operational variables, all observed on a monthly cadence. Our journey will span from January 2010 to November 2022, allowing us to unearth historical trends, uncover underlying patterns, and, most crucially, set the stage for proactive decision-making for the future.

Our foremost mission is to equip the XYZ Cement Company with forecasting capabilities that extend well into the future. Armed with this data-rich arsenal, we set our sights on predicting key variables, such as cement production and sales, for the forthcoming 12 months. From December 2023 to November 2024, our forecasts will serve as the compass guiding operational strategies, shaping inventory management, and orchestrating marketing endeavors.

The Objectives

  1. Develop robust predictive models for forecasting monthly cement production and sales.
  2. Illuminate seasonality and cyclical patterns in cement sales, enabling proactive planning.
  3. Provide actionable insights that empower inventory management, production planning, and marketing strategies, elevating the XYZ Cement Company’s competitive edge.

The journey ahead is one of analysis, experimentation, and innovation. We’ll harness advanced data analysis, statistical modeling, and machine learning techniques to empower the XYZ Cement Company with unparalleled foresight. Our mission: to furnish them with the strategic clarity needed to make informed decisions, optimize resources, and maintain an unwavering competitive edge in an industry marked by dynamism and transformation.

Data Exploration & Pre-Processing

Importing Python Libraries

Before we begin our Sales and Demand Forecasting Project, we want to make sure to have the Python libraries we’ll need to accomplish our goals. To accomplish all of our data loading, data cleaning, analysis, and modeling, we will import the following libraries:

  1. Pandas: Pandas is a fundamental library for data manipulation, allowing me to work with structured data efficiently.
  2. Numpy: NumPy provides support for numerical operations and array manipulation, crucial for data preprocessing and mathematical computations.
  3. seasonal_decompose from statsmodels.tsa.seasonal: This library aids in time series decomposition, helping me to understand underlying patterns in data.
  4. ARIMA model from statsmodels.tsa.arima.model: ARIMA (AutoRegressive Integrated Moving Average) models are essential for time series forecasting and analysis.
  5. SARIMAX from statsmodels.tsa.statespace.sarimax: SARIMAX extends ARIMA by incorporating seasonality, offering more sophisticated time series modeling capabilities.
  6. api from statsmodels.formula: The formula API in Statsmodels enables me to build statistical models using a formula notation.
  7. matplotlib.pyplot as plt: Matplotlib is a versatile plotting library that allows me to create insightful data visualizations and graphs.
    1. We also include %matplotlib inline in order to ensure use of the matplotlib figures in our Google Colab Python environment
  8. auto_arima from pmdarima: AutoARIMA simplifies the process of selecting optimal ARIMA model parameters, enhancing forecasting accuracy.
  9. Seaborn: Seaborn is an aesthetically pleasing data visualization library that complements Matplotlib.
  10. mean_absolute_error, mean_squared_error, mean_absolute_percentage_error from sklearn.metrics: These metrics are essential for evaluating the performance of regression and forecasting models.
  11. Warnings Module: to warn about changes in language or library features for potential backward incompatibilities with Python 3.0
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as api
from sklearn.metrics import mean_absolute_percentage_error
import warnings
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
warnings.filterwarnings("ignore")

%matplotlib inline

Data Import

The dataset for our XYZ Cement Company was provided in a csv file. After importing the file into our Python environment, we need to load it into a dataframe that we’re calling ‘cement.’

cement = pd.read_csv('89cementdata.csv')
cement.head()

Within our dataset, we will utilize the following fields:

  1. Year-Month: The time period of measurement for a given record (i.e. January of 2010)
  2. Production: Cement production, in tons, per month
  3. Sales: Cement sales in tons, not dollars, per month
  4. Demand: Cement demand, in tons, per month
  5. Population: Demographic data, encapsulated in population per month, measured in billions
  6. GDP: Gross Domestic Product (GDP), in billions of US Dollars, per month
  7. Home Loans Given: Information about home loans granted per month
  8. Interest Rates: Interest rates, measured as a percentage, per month

Our eight fields, measured over a time period of 155 months, give us a total of 1,240 data points to start our forecasting analysis.

Data Cleaning & Exploration

To begin, we first need to get to know our data, making sure we have what we need, nothing is missing, and if we need to add to our data. To do this, a few simple lines of code give us a better look at what the XYZ Cement Company’s data looks like.

# Get some base information on our dataset
print ("Rows     : " , cement.shape[0])
print ("Columns  : " , cement.shape[1])
print ("\nFeatures : \n" ,cement.columns.tolist())
print ("\nMissing values :  ", cement.isnull().sum().values.sum())
print ("\nUnique values :  \n", cement.nunique())

There are no missing values, so that puts us at a good starting point. However, there’s a field we need to add at this point that will aid us with both the temporal nature of our study and forecasting, as well as give us extra ammo for our feature engineering for the machine learning elements of our project.

The time-series element of our provided data is in Year-Month, meaning the data is collected for an entire month in a given year. This format of the data cannot be easily converted to datetime, which we could utilize as an index to use in Time-Series modeling. In order to account for this, we’ll add the field ‘t,’ representing a time period (i.e. the year and month for a record), and give us the feature ‘t’ for future engineering.

# Add t for time-series analyses
cement.insert(1, 't', range(1, len(cement) + 1))
cement.head()

Next, we’re gonna take a visual look at each field as it relates to our t – time periods.

# Create a list of columns to plot
columns_to_plot = ['production', 'sales', 'demand', 'population', 'gdp', 'home_loans_given', 'int_rate']

# Loop through each column and create line plots
for column in columns_to_plot:
    plt.figure(figsize=(8,4))
    plt.plot(cement['t'], cement[column], marker='o', linestyle='-', markersize=5)
    plt.xlabel('t')
    plt.ylabel(column)
    plt.title(f'{column} vs. t')
    plt.grid(True)
    plt.show()

These plots give us valuable insights into the dataset. One noticeable pattern is the similarity among the production, demand, and sales charts, which is logical given their interdependence. Furthermore, with the exception of the interest rate chart, all other plots exhibit a consistent positive trend throughout the entire time period.

Given the consistent directional movement in most visuals and the contrasting trend of the interest rate, we’re going to create a correlation plot to quantify the degree of association (positive or negative) between these variables.

# Calculate the correlation matrix
correlation_matrix = cement.corr()

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='icefire', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

The correlation plot reinforces many of our observations from the time-series charts. It quantifies the strong statistical relationships between production, sales, and demand within the cement company’s dataset. Even without examining the numerical correlations, visual distinctions in color highlight the contrast and confirm our initial impressions of the relationship with interest rates.

Furthermore, the plot reveals that the most significant negative correlation exists between interest rates and the variable “home loans given,” which aligns with typical real estate dynamics.

To further visualize this data, we’ll look at a Seaborn pairplot of our cement dataframe.

sns.pairplot(cement)

The plotted data visually reinforces the statistical insights from our heatmap.

Before we proceed to our analyses, there are a few additional data processing steps we should undertake. While our dataset is substantial, it’s not exceptionally large. Consequently, each data point carries more weight in influencing the dataset compared to a larger dataset. With this in mind, outliers need to be carefully considered. We’ll start with some boxplots of our data fields, re-using our columns_to_plot list from our time-series charts.

# ID Outliers

# Define box and whisker properties for coloring IQR
boxprops = dict(linestyle='-', linewidth=2, color='blue')
whiskerprops = dict(linestyle='-', linewidth=2, color='blue')

# Loop through each column and create box plots
for column in columns_to_plot:
    plt.figure(figsize=(8, 4))
    plt.boxplot(cement[column], boxprops=boxprops, whiskerprops=whiskerprops, patch_artist=True, vert=False)
    plt.xlabel(column)
    plt.ylabel('Values')
    plt.title(f'{column} Box Plot')
    plt.grid(True)
    plt.show()

Sales and GDP are the only fields that contain outliers. Given the dataset’s overall size, we will employ the Winsorization technique to address these outliers. Winsorization is a method of transforming extreme data values to mitigate the effect of outliers on data distribution. We will use quartiles and the interquartile range to Winsorize the data in sales and gdp.

# Calculate the first quartile (Q1) for 'sales' and 'gdp'
Q1_sales = cement['sales'].quantile(0.25)
Q1_gdp = cement['gdp'].quantile(0.25)

# Calculate the third quartile (Q3) for 'sales' and 'gdp'
Q3_sales = cement['sales'].quantile(0.75)
Q3_gdp = cement['gdp'].quantile(0.75)

# Calculate the interquartile range (IQR) for 'sales' and 'gdp'
IQR_sales = Q3_sales - Q1_sales
IQR_gdp = Q3_gdp - Q1_gdp

# Define the lower and upper bounds for identifying 'sales' and 'gdp' outliers
lower_bound_sales = Q1_sales - 1.5 * IQR_sales
upper_bound_sales = Q3_sales + 1.5 * IQR_sales

lower_bound_gdp = Q1_gdp - 1.5 * IQR_gdp
upper_bound_gdp = Q3_gdp + 1.5 * IQR_gdp

# Find 'sales' and 'gdp' outliers based on the IQR method
sales_outliers = cement[(cement['sales'] < lower_bound_sales) | (cement['sales'] > upper_bound_sales)]

# Find 'gdp' outliers based on the IQR method
gdp_outliers = cement[(cement['gdp'] < lower_bound_gdp) | (cement['gdp'] > upper_bound_gdp)]

# Print the corresponding 'year_month' and 'sales' values for the 'sales' and 'gdp' outliers
print("Sales Outliers:")
print(sales_outliers[['year_month', 'sales']])

print("\nGDP Outliers:")
print(gdp_outliers[['year_month', 'gdp']])

We have identified just four outliers, which suggests that censoring these records will be a more beneficial approach than trimming them. Now, we’ll winsorize these four data points.

# Winsorize the 'sales' and 'gdp' columns
# List of row indices you want to Winsorize
indices_to_winsorize_s = [142, 143, 154]  # Outlier Indices
indices_to_winsorize_g = [144]

# Apply Winsorization to the specified rows in 'sales' and 'gdp' columns
for idx in indices_to_winsorize_s:
    if lower_bound_sales <= cement.loc[idx, 'sales'] <= upper_bound_sales:
        continue  # Skip if already within bounds
    elif cement.loc[idx, 'sales'] < lower_bound_sales:
        cement.loc[idx, 'sales'] = lower_bound_sales
    elif cement.loc[idx, 'sales'] > upper_bound_sales:
        cement.loc[idx, 'sales'] = upper_bound_sales

for idx1 in indices_to_winsorize_g:
    if lower_bound_gdp <= cement.loc[idx, 'sales'] <= upper_bound_gdp:
        continue  # Skip if already within bounds
    elif cement.loc[idx1, 'gdp'] < lower_bound_gdp:
        cement.loc[idx1, 'gdp'] = lower_bound_gdp
    elif cement.loc[idx1, 'gdp'] > upper_bound_gdp:
        cement.loc[idx1, 'gdp'] = upper_bound_gdp

# Print the corresponding 'year_month' appropriate values for the 'sales' and 'gdp' outliers
print("Sales Outliers (Winsorized):")
print(cement.loc[indices_to_winsorize_s, ['year_month', 'sales']])

print("\nGDP Outliers (Winsorized):")
print(cement.loc[indices_to_winsorize_g, ['year_month', 'gdp']])

Looking at our updated values, it’s evident that our Winsorization process effectively substituted the extreme values with calculated upper or lower bound values, as determined earlier. While the code may appear somewhat intricate, involving two loops, it’s worth noting that this approach remains efficient when dealing with a larger dataset than we have for this project, as it allows us to handle any number of multiple replacements seamlessly.

Model Preparations

Time-Series Assessment

With our data cleaned and processed, it’s time to expand our data to prepare for our modeling and forecasting. Before we select the best model for our forecast, we want to take a closer look at the patterns of our time series. To begin, we’re going to break our time-series into different components. We’ll be looking at Trend, Seasonal, and Residual Components. We will also be plotting our observed data for comparison.

# breaking up a time series into components: trend, seasonal, and residual

decompose_ts_add = seasonal_decompose(cement['sales'], model = "additive", period = 4)
print(decompose_ts_add.trend)
print(decompose_ts_add.seasonal)
print(decompose_ts_add.resid)
print(decompose_ts_add.observed)
decompose_ts_add.plot()

When examining our Trend plot, which aims to uncover underlying data patterns or trends, we observe a smoother and consistent change in the data over time (generally rising). The trend calculations start with null (NaN) values for the initial data points, indicating potential model gaps during that period. Though this will not statistically affect our analysis, it is worth noting.

Moving on to the Seasonal plot, our goal is to identify any regular fluctuations that occur at fixed intervals, such as the typical holiday sales increases seen in retail businesses. Interestingly, the seasonal data doesn’t display clear regular fluctuations, suggesting that our sales patterns do not exhibit consistent seasonal changes. If we were looking at a cement company further from the Equator, we would likely see a seasonal trend, with outdoor construction significantly decreasing in winter.

Next, the Residual component, often referred to as the error component, is implemented to reveal unexplained or irregular variations in the time series after removing the trend and seasonal components. Given the presence of null values in the calculations and an overall sense of noise in the data, we can conclude that there are no discernible underlying patterns in the seasonality or error components of our data.

Despite the oscillations observed in these plots when compared to the original data, we can confidently state that these fluctuating elements will not significantly impact our modeling efforts. In other words, we can proceed with our analysis, assured that our time series data will not pose a substantial hindrance to our work.

Time-Series Breakout

Now that we have enhanced our confidence in our time-series data, we can extract essential time elements for our analysis. While we don’t have the luxury of datetime format (as mentioned earlier), we can employ some creative techniques to achieve our goals. One of our strategies involves utilizing ‘t’ as an engineered feature, representing time periods for both our observed and forecasted sales figures.

However, our analysis benefits from further refinement by isolating the month component. This will serve two valuable purposes: first, it allows us to update our dataframe seamlessly after modeling, and second, it reduces friction when facilitating comparisons between our existing data and the data we’ll be forecasting.

To initiate this process, we will extract the month component from our ‘year_month’ field and insert this into a new column named ‘months.’

# Extract the last three letters of the 'year_month' column
cement['months'] = cement['year_month'].str[-3:]

cement.head()

With the month element broken out, we’ll now filter our data down to what we need for our forecasting. We’ll take the data fields we need and use them to create a dataframe called ‘cementsales.’

# filtering required features for sales forecasting

cementsales = cement.loc[:, ['year_month','sales', 't', 'months']]
cementsales.head()

In addition to our previous enhancements, we will introduce two new columns: ‘tsquare’ and ‘log_cement.’ These additions serve specific purposes in our analysis. Firstly, we’ll calculate ‘tsquare’ by squaring our ‘t’ values. This enables us to explore quadratic polynomial trends when building our models. Secondly, we’ll take the logarithm of our ‘sales’ data and store it in the ‘log_cement’ column. This transformation offers valuable insights and assists us in capturing non-linear patterns within our modeling process.

# squaring t value as it will further required for model
cementsales['tsquare'] = cementsales['t'] * cementsales['t']

# taking the log of sales values as it will be required for the exponential model
cementsales['log_cement'] = np.log(cementsales['sales'])

cementsales.head()

To further assist our models, we’re going to break out our months columns, allowing us to use our categorical data from a numeric angle. This will allow us to follow a full chronology of the data (t), the categorical month (Jan), and a numerical month for model calculations.

# converting months catagorical features to numeric columns by using pandas get_dummies function

month_dummies = pd.DataFrame(pd.get_dummies(cement['months']))
month_dummies.head()

Now, let’s take this breakout and join it with our cementsales data into a single dataframe that we’ll be able to use for our test and train data in our analysis.

# combining data

cementsales = pd.concat([cementsales, month_dummies], axis= 1)
cementsales.head()

Before beginning our analysis, we’re going to set our train and test dataframes. To do this, we need to determine where to split our dataset in order to populate our model. The most common split would be 80:20 or 70:30, meaning the initial figure will be our test, and the latter will be our test. We want to forecast for the twelve months following our dataset, which will require a few additional considerations when making our split decision. Given that we want our project to reflect as close to a real world situation as possible, our forecasting goal, and the length of time encapsulated within our dataset, we will want to make sure we focus more on recent data. Our split will be based on the Trailing 12 months of data, meaning our test dataframe will contain the most recent 12 records in our dataset, leaving the remaining 143 records for our training set. After considering our goals, data, and that there is no significant seasonality or residual components to our time-series, a trailing 12 split will give us the best results within our forecasting model.

# splitting data into train & test

train = cementsales.head(143)
test = cementsales.tail(12)

print('Train Dataframe Shape: ', train.shape)
print('Test Dataframe Shape: ', test.shape)

Train Dataframe Shape: (143,18)
Test Dataframe Shape: (12,18)

Data Analysis

Selecting the Regression Analysis

To determine which Regression Model will best fit our data, we will run an Ordinary Least Squares regression analysis, utilizing the api.ols function in the statsmodel Python library. When we compare the Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE) for the following types of Regression Models, we’ll be able to identify which fits the relationship between the variables best. We’ll be testing the following regression models and choose which fits our data best:

  1. Linear: The relationship of variables are assumed to be linear, aiming to bit a straight line to best describe the data
  2. Exponential: Non-linear model assuming that variable relationship grows or decays exponentially
  3. Quadratic: Non-linear regression model that introduces quadratic and linear terms, applied when there is a U-shape relationship between variables.
  4. Additive Seasonality: Adds a seasonal component to linear regression to account for fluctuation in seasonal variable relationships.
  5. Multiplicative Seasonality: Assumes a multiplicative seasonal component to overall data trend, where seasonality has a proportional impact on variable relationship.
  6. Additive Seasonality Quadratic Trend: A model combining quadratic trends and additive seasonality when there are trend and seasonal fluctuations in the data.
  7. Multiplicative Seasonality Linear Trend: Combining multiplicative and seasonality models with linear regression, when there are linear and seasonal relationships in the data.

Using our earlier plots, we’ve assumed that there is no seasonality in our data, however, testing all of these models will statistically confirm, or refute this assumption. This extra look will ensure we’re using the best model to predict the XYZ Cement Company’s future sales.

We’ll use the following code to gather our MAPE and RSME values for each model:

#Linear Model

linear_model = api.ols('sales ~ t', data = train).fit()
pred_linear =  pd.Series(linear_model.predict(pd.DataFrame(test['t'])))
rmse_linear = np.sqrt(np.mean((np.array(test['sales']) - np.array(pred_linear))**2))
mape_linear = mean_absolute_percentage_error(test['sales'], pd.DataFrame(pred_linear))

print('mape : ', mape_linear)
print('rmse : ',rmse_linear)

mape : 0.22204550033394643
rmse : 186.9057162235209

In order to have a better look at how these models compare, we’ll combine their results into a single table:

# chart of tests to select appropriate model

data = {"MODEL":pd.Series(["rmse_linear","rmse_Exp","rmse_Quad","rmse_add_sea",
                           "rmse_add_sea_quad","rmse_Mult_sea","rmse_Mult_add_sea"]),
       "RMSE_Values":pd.Series([rmse_linear,rmse_Exp,rmse_Quad,rmse_add_sea,
                                rmse_add_sea_quad,rmse_Mult_sea,rmse_Mult_add_sea]),
       "MAPE_Values": pd.Series([mape_linear,mape_Exp,mape_Quad,mape_add_sea,
                                 mape_add_sea_quad,mape_Mult_sea,mape_Mult_add_sea])}
table_rmse_mape = pd.DataFrame(data)
table_rmse_mape

Examining the chart above, it becomes evident that the Multiplicative Seasonality Linear Trend model (#6) stands out as the most suitable choice for our dataset. It not only boasts the lowest RMSE and MAPE values but also aligns well with the underlying patterns and trends in our data that we missed in our earlier visual analyses. This makes it the ideal candidate to serve as the foundation for our Regression Model, ensuring more robust and reliable forecasting results.

Regression Model

To execute our Regression Model effectively, we must create a dataframe to store our predictive data. Furthermore, we have a clear understanding of both the year_month and ‘t’ elements that will be involved in our model predictions, spanning from December ’22 to November ’23. With this knowledge in mind, we will construct our model’s dataframe, incorporating these elements and any associated fields we can generate. To streamline this process, we will import a previously crafted CSV file named ‘predictivedataproject89.’

predict_data = pd.read_csv('predictivedataproject89.csv')
predict_data.head()

We will also quickly run this dataframe through the same extraction of ‘months,’ dummying those months and then concatenating this data back into our predict_data dataframe, giving us the following:

Before proceeding with the model execution and the generation of predicted sales data, it’s crucial to re-fit our data to the model. To achieve this, we’ll employ the api.ols function, coupled with Pandas and the pd.Series function. Then, it’s time to execute our comprehensive model and evaluate the resulting output.

# Final Model - *Multiplicative seasonality linear trend*

model_full = api.ols('log_cement ~ t+Jan+Feb+Mar+Apr+May+Jun+Jul+Aug+Sep+Oct+Nov', data = 
                     cementsales).fit()
pd.Series(model_full)

0     <statsmodels.regression.linear_model.Regressio...
dtype: object
pred_new = pd.Series(model_full.predict(predict_data))
pred_new

0     6.700296
1     6.711620
2     6.711862
3     6.312997
4     6.238840
5     6.225442
6     6.218358
7     6.049030
8     6.021520
9     6.082080
10    6.111428
11    6.735828
12    6.749318
dtype: float64

Our model produces output in a logarithmic scale, meaning we need to call on Numpy’s exp function to get back to predicted cement sales in tons.

pred_new_exp = np.exp(pred_new)
pred_new_exp

0     812.645965
1     821.900729
2     822.099799
3     551.696159
4     512.263835
5     505.446526
6     501.878386
7     423.701979
8     412.204851
9     437.939280
10    450.982241
11    842.040730
12    853.476250
dtype: float64

We will incorporate the next twelve months of predicted data generated from our Multiplicative Seasonality Linear Trend Model into our predict_data dataframe. To facilitate a comprehensive comparison between our predicted values and observed data, we will visualize both datasets over time through plotting.

# Forecasted values in predicted dataframe

predict_data['prediction_values'] = pred_new_exp
predict_data

Analyzing this plot, it’s evident that the outputs from our predictive model align closely with the historical data. This confirms the reliability of the model’s output, which will be a valuable tool for guiding the strategic decision-making process at XYZ Cement Company for the upcoming year. Notably, the model anticipates a decline in sales during the middle of the summer, followed by a subsequent upward trend towards the end of the year. In the end, there was a clearer trend of seasonality than predicted, and our model was able to identify that.

ARIMA Model

In addition to our previous model comparisons, we’re conducting another forecasting analysis using an AutoRegressive Integrated Moving Average (ARIMA) model. This will provide us with an additional set of predicted values to address the business needs of XYZ Cement Company. Employing multiple models allows us to account for variations in the data that might not have been captured by a single model. It helps us leverage the strengths of each model while mitigating their individual weaknesses. Moreover, given recent economic uncertainties, combining the outputs from multiple models provides us and our client with valuable insights to enhance the business and navigate future challenges.

ARIMA models consist of several components designed to capture various patterns and trends in time-series data. The AutoRegressive (AR) component, which represents the relationship between current and past values, is the first of these components. Within the AR component, we specify the ‘p’ parameter, known as the autoregression order, indicating how many past values to consider in the model. The Integration (I) component of ARIMA deals with differencing, ensuring that statistical properties such as mean and variance remain constant over time. Here, we need to determine the ‘d’ parameter, which signifies the number of differences required for our model. Lastly, the Moving Average (MA) component incorporates the ‘q’ parameter, representing past errors in the moving average, another critical element in constructing our model.

Since we’ve already divided our cement sales data into training and testing sets, we can use them again for this ARIMA model. However, we still need to identify the appropriate ‘p,’ ‘q,’ and ‘d’ parameters. To achieve this, we’ll leverage the auto_arima function to find the optimal values that best suit our model’s needs.

# Run combinations of ARIMA to fit best p, d, and q parameters
model_fit = auto_arima(train['sales'],
                       m=12,
                       d=0,
                       D=0,
                       max_order=None,
                       max_p=7,
                       max_q=7,
                       max_d=2,
                       max_P=4,
                       max_Q=4,
                       max_D=2,
                       maxiter = 50,
                       alpha = 0.05,
                       n_jobs = -1,
                       seasonal=True,
                       trace=True,
                       error_action='ignore',
                       suppress_warnings=True,
                       stepwise=True
                      )
model_fit.summary()

In under two minutes of computations, our auto_arima code gives us the best parameters to construct and execute the model. Now let’s fit our model.

# Fit the ARIMA model
model_ARIMA = ARIMA(train['sales'],
              order=(2,0,0),
              seasonal_order=(1,0,2, 12))

# Fit our data to the ARIMA model
model_ARIMA = model_ARIMA.fit()

train_forecast = train.copy()
test_forecast = test.copy()

train_forecast['forecast_ARIMA'] = model_ARIMA.predict()
train_forecast[['sales','forecast_ARIMA']].plot(figsize=(20,8))

Upon comparing our sales data with a plot of our fitted data, we observe a close alignment between the two. This alignment provides us with the confidence to proceed with running our model. However, before proceeding, it’s essential to conduct a statistical assessment of our training data. This assessment will involve calculating metrics such as MAPE (Mean Absolute Percentage Error), RMSE (Root Mean Square Error), and the Mean Absolute Error (which measures the average magnitude of discrepancies between predicted and actual values).

# Look at RMSE, MAE, MAPA values for train data

print("RMSE of Auto ARIMA:", np.sqrt(mean_squared_error(train_forecast['sales'],                              
                                                        train_forecast['forecast_ARIMA'])))
print("MAE of Auto ARIMA:", mean_absolute_error(train_forecast['sales'],
                                                train_forecast['forecast_ARIMA']))
print("MAPE of Auto ARIMA:", mean_absolute_percentage_error(train_forecast['sales'],
                                                            train_forecast['forecast_ARIMA']))

RMSE of Auto ARIMA: 62.63325613330588
MAE of Auto ARIMA: 48.12179468611912
MAPE of Auto ARIMA: 0.12732020938536742

Let’s take a similar look at the test data before we address the model as a whole.

# Forecast and compare against test data

# start date
start = len(train)

# End date
end = len(train)+len(test)-1

test_forecast['forecast_ARIMA'] = model_ARIMA.predict(start=start, end=end)
test_forecast[['sales','forecast_ARIMA']].plot(figsize=(20,8))

# Look at RMSE, MAE, MAPA values for test data
print("RMSE of Auto ARIMA:", np.sqrt(mean_squared_error(test_forecast['sales'], test_forecast['forecast_ARIMA'])))
print("MAE of Auto ARIMA:", mean_absolute_error(test_forecast['sales'], test_forecast['forecast_ARIMA']))
print("MAPE of Auto ARIMA:", mean_absolute_percentage_error(test_forecast['sales'], test_forecast['forecast_ARIMA']))

RMSE of Auto ARIMA: 68.95796075079929
MAE of Auto ARIMA: 56.95131180064937
MAPE of Auto ARIMA: 0.09650013203550152

Upon examining the plots, we notice that the predicted test data doesn’t exhibit as strong an overlap as the predicted training data. Such discrepancies are common in models when comparing predicted versus observed data. However, it’s worth noting that for both datasets, the RMSE and MAE metrics remain relatively low, indicating the model’s accuracy. Furthermore, the MAPE metric is lower in our predicted test dataset compared to the predicted training dataset, providing additional confirmation of the model’s accuracy.

Having completed both of our sales forecasting model strategies, we now possess the confidence to proceed with our planning and strategic decision-making for our client, The XYZ Cement Company.