Introduction
Transport and logistics, food and shopping, payments, daily needs, business, news and entertainment, Gojek, an Indonesian firm does it all through a mobile app contributing more than $7 billion to the economy. It has 900K registered merchants, more than 190M app downloads, and more than 2M drivers delivering more than 180K orders within a span of 120 minutes. It’s a behemoth! Using business analytics, we will be solving a case study. Below is the last of the 20+ services it offers:
- Transport and Logistics
- Go-ride – Your two-wheeler taxi, the indigenous Ojek
- Go-car – Comfort on wheels. Sit back. Sleep. Snore.
- Go-send – Send or get packages delivered within hours.
- Go-box – Moving out? We’ll do the weights.
- Go-bluebird – Ride exclusive with the Bluebird.
- Go-transit -Your commute assistant, with or without Gojek
- Food & Shopping
- Go-mall – Shop from an online marketplace
- Go-mart – Home delivery from nearby stores
- Go-med – Buy medicines, vitamins, etc from licensed pharmacies.
- Payments
- Go-pay – Drop the wallet and go cashless
- Go-bills – Pay bills, quick and simply
- Paylater – Order now pay later.
- Go-pulsa – Data or talk time, top-up on the go.
- Go-sure – Insure things you value.
- Go-give – Donate for what matters, touch lives.
- Go-investasi – Invest smart, save better.
- Daily needs
- GoFitness allows users to access exercises such as yoga, pilates, pound fit, barre, muay thai and Zumba.
- Business
- Go-biz – A merchant #SuperApp to run and grow business.
- News & Entertainment
- Go-tix – Book your show, Skip the queue.
- Go-play – App for movies and series.
- Go-games – Gaming tips trends etc
- Go-news – Top news from top aggregators.
Data generated through these services is enormous and GO team has engineering solutions to tackle with day to day data engineering issues. Central Analytics and Science Team(CAST) enables multiple products within the Gojek ecosystem to efficiently use the abundance of data involved in the working of the app. The team has analysts, data scientists, data engineers, business analysts, and decision scientists working on developing in-house deep analytics solutions and other ML systems.
The analysts’ role is concentrated on solving day-to-day business problems, having good business knowledge, creating impact, deriving insights, RCA’s(root cause analysis), and keeping top management informed on micro as well as macro metrics, and product decisions to address business problems.
Learning Objectives
- RCA on growth drivers and headwinds faced by the organizations.
- Using Pandas for EDA, slicing, and dicing.
- Marketing budget optimization
- Profits as the north star metric(L0 metric)
- Using Pulp solver to solve LP.
- Writing LP problems using Pulps with clear and crisp instructions.
- Linear regression and cross-validation
- Simple regression exercise using the steps provided in the questionnaire.
This article was published as a part of the Data Science Blogathon.
Table of contents
- Introduction
- Problem Statement
- Dataset
- The Solution to Part One
- RCA on Growth Drivers and Headwinds Faced by the Organizations
- Completed Rides
- Canceled Rides(Lost Opportunity)
- Analysis of Orders
- Summary of Findings and Recommendations for Business Analytics
- Maximize Profits By Optimizing Budget Spends
- Understanding the Optimization Dataset
- Understanding How to Write An LP Problem is Key to Solving it
- The Solution to Part Two
- The Solution to Part Three
- Useful Resources and References
- Conclusion
Problem Statement
Part I
GOJEK directors have asked BI analysts to look at the data to understand what happened during Q1 2016 and what they should do to maximize the revenue for Q2 2016.
- Given the data in Problem A, what are the main problems that we need to focus on?
- Given the data in Table B, how will you maximize the profit if we only have a budget of IDR 40,000,000,000?
- Present your findings and concrete solutions for a management meeting.
Part II
- Problem Using multiple linear regression, predict the total_cbv.
- Create 1 model for each service.
- Forecast period = 2016-03-30, 2016-03-31, and 2016-04-01
- Train period = the rest List of predictors to use:
- Day of month
- Month
- Day of week
- Weekend/weekday flag (weekend = Saturday & Sunday)
- Pre-processing (do it in this order):
- Remove GO-TIX
- Keep only `Cancelled` order_status
- Ensure the complete combinations (cartesian product) of date and service are present
- Impute missing values with 0
- Create is_weekend flag predictor (1 if Saturday/Sunday, 0 if other days)
- One-hot encode month and day of week predictors
- Standardize all predictors into z-scores using the mean and standard deviation from train-period data only
- Evaluation metric: MAPE Validation: 3-fold scheme. Each validation fold has the same length as the forecast period.
- Question 1 – After all the pre-processing steps, what is the value of all the predictors for service = GO-FOOD, date = 2016-02-28?
- Question 2 – Show the first 6 rows of one-hot encoded variables (month and day of the week)
- Question 3 – Print the first 6 rows of the data after pre-processing for service = GO-KILAT. Sort ascendingly by date
- Question 4 – Compute the forecast-period MAPE for each service. Display in ascending order based on the MAPE
- Question 5 – Create graphs to show the performance of each validation fold. One graph one service. x = date, y = total_cbv. Color: black = actual total_cbv, other colors = the fold predictions (there should be 3 other colors). Only show the validation period. For example, if rows 11, 12, and 13 were used for validations, then do not show the other rows in the graphs. Clearly show the month and date on the x-axis
Part III
Our GO-FOOD service in Surabaya performed very well last month – they had 20% more completed orders last month than the month before. The manager of GO-FOOD in Surabaya needs to see what is happening in order to constantly maintain this success for the next month onwards.
- What quantitative methods would you use to evaluate the sudden growth? How would you evaluate the customers’ behavior?
Dataset
The Solution to Part One
Before beginning to solve, start researching blogs and whitepapers that are present on the company website(links are added below). Company archives provide useful resources that act as guides and help understand what the company stands for or what the company is expecting out of this role. Questions one and three can be considered open-ended problems. Question two is a simple exercise on regression, not necessarily focusing on the best model, but the focus is on the processes involved in building a model.
RCA on Growth Drivers and Headwinds Faced by the Organizations
Import data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
#import csv
sales_df =pd.read_csv('https://raw.githubusercontent.com/chrisdmell/Project_DataScience/working_branch/09_gojek/sales_data_all.csv')
print("Shape of the df")
display(sales_df.shape)
print("HEAD")
display(sales_df.head())
print("NULL CHECK")
display(sales_df.isnull().any().sum())
print("NULL CHECK")
display(sales_df.isnull().sum())
print("df INFO")
display(sales_df.info())
print("DESCRIBE")
display(sales_df.describe())
Create pandas datetime from object format. Pandas datetimes is an easy format to work with and manipulate dates. Derive the month column from datetime. Filter out month 4(April) as well. Rename months as Jan, Feb, March.
## convert to date time
# convert order_status to strinf
##
time_to_pandas_time = ["date"]
for cols in time_to_pandas_time:
sales_df[cols] = pd.to_datetime(sales_df[cols])
sales_df.dtypes
sales_df['Month'] = sales_df['date'].dt.month
sales_df.head()
sales_df['Month'].drop_duplicates()
sales_df[sales_df['Month'] !=4]
Q1_2016_df = sales_df[sales_df['Month'] !=4]
Q1_2016_df['Month'] = np.where(Q1_2016_df['Month'] == 1,"Jan",np.where(Q1_2016_df['Month'] == 2,"Feb",np.where(Q1_2016_df['Month'] == 3,"Mar","Apr")))
print(Q1_2016_df.head(1))
display(Q1_2016_df.order_status.unique())
display(Q1_2016_df.service.unique())
#import csv
At the group level, overall revenue has grown by 14%. This is a positive outcome. Let’s break this down by various services and identify services that are performing well.
revenue_total.sort_values(["Jan"], ascending=[False],inplace=True)
revenue_total.head()
revenue_total['cummul1'] = revenue_total["Jan"].cumsum()
revenue_total['cummul2'] = revenue_total["Feb"].cumsum()
revenue_total['cummul3'] = revenue_total["Mar"].cumsum()
top_95_revenue = revenue_total[revenue_total["cummul3"]<=95 ]
display(top_95_revenue)
ninety_five_perc_gmv = list(top_95_revenue.service.unique())
print(ninety_five_perc_gmv)
top_95_revenue_plot = top_95_revenue[["Jan", "Feb", "Mar"]]
top_95_revenue_plot.index = top_95_revenue.service
top_95_revenue_plot.T.plot.line(figsize=(5,3))
## share of revenue is changed but has the overall revenue changed for these top 4 services#import csv
- For all three months, Ride, Food, Shop, and Send contribute to more than 90% net revenue share.(In Jan Ride contributed to 51% of net revenue.)
- Hence following the 80:20 rule for the most recent month, we can restrict this analysis to the top 3 services, namely – Ride, Food, Send.
- Out of the 11 available services, only 3 contribute to more than 90% of revenue. This is a cause of concern and there is immense opportunity for the rest of the services to grow.
Completed Rides
## NET - completed rides
Q1_2016_df_pivot_cbv_4 = Q1_2016_df[Q1_2016_df["order_status"] == "Completed"]
Q1_2016_df_pivot_cbv_4 = Q1_2016_df_pivot_cbv_4[Q1_2016_df_pivot_cbv_4.service.isin(ninety_five_perc_gmv)]
Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv_4.pivot_table(index='service', columns=['Month' ], values='total_cbv', aggfunc= 'sum')
# display(Q1_2016_df_pivot_cbv.head())
Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv[["Jan", "Feb", "Mar"]]
for cols in Q1_2016_df_pivot_cbv.columns:
Q1_2016_df_pivot_cbv[cols]=(Q1_2016_df_pivot_cbv[cols]/1000000000)
display(Q1_2016_df_pivot_cbv)
display(Q1_2016_df_pivot_cbv.T.plot())
## We see that go shop as reduced its revenue but others the revenue is constant.
Q1_2016_df_pivot_cbv_4 = Q1_2016_df_pivot_cbv
Q1_2016_df_pivot_cbv_4.reset_index(inplace = True)
Q1_2016_df_pivot_cbv_4["Feb_jan_growth"] = (Q1_2016_df_pivot_cbv_4.Feb / Q1_2016_df_pivot_cbv_4.Jan -1)*100
Q1_2016_df_pivot_cbv_4["Mar_Feb_growth"] = (Q1_2016_df_pivot_cbv_4.Mar / Q1_2016_df_pivot_cbv_4.Feb -1)*100
display(Q1_2016_df_pivot_cbv_4)#import csv
- Ride – which is the revenue-driving engine has grown by 19%(Jan to March) compared to Send which has grown by 25%.
- Food has degrown by 7%, given food delivery as a business is growing around the globe, and this is a major cause of concern.
Canceled Rides(Lost Opportunity)
Q1_2016_df_pivot_cbv = Q1_2016_df[Q1_2016_df["order_status"] != "Completed"]
Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv.pivot_table(index='service', columns=['Month' ], values='total_cbv', aggfunc= 'sum')
Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv[["Jan", "Feb", "Mar"]]
revenue_total = pd.DataFrame()
for cols in Q1_2016_df_pivot_cbv.columns:
revenue_total[cols]=(Q1_2016_df_pivot_cbv[cols]/Q1_2016_df_pivot_cbv[cols].sum())*100
revenue_total.reset_index(inplace = True)
display(revenue_total.head())
overall_cbv = Q1_2016_df_pivot_cbv.sum()
print(overall_cbv)
overall_cbv.plot()
plt.show()
overall_cbv = Q1_2016_df_pivot_cbv.sum()
overall_cbv_df = pd.DataFrame(data = overall_cbv).T
display(overall_cbv_df)
overall_cbv_df["Feb_jan_growth"] = (overall_cbv_df.Feb / overall_cbv_df.Jan -1)*100
overall_cbv_df["Mar_Feb_growth"] = (overall_cbv_df.Mar / overall_cbv_df.Feb -1)*100
display(overall_cbv_df)
revenue_total.sort_values(["Jan"], ascending=[False],inplace=True)
revenue_total.head()
revenue_total['cummul1'] = revenue_total["Jan"].cumsum()
revenue_total['cummul2'] = revenue_total["Feb"].cumsum()
revenue_total['cummul3'] = revenue_total["Mar"].cumsum()
top_95_revenue = revenue_total[revenue_total["cummul3"]<=95 ]
display(top_95_revenue)
ninety_five_perc_gmv = list(top_95_revenue.service.unique())
print(ninety_five_perc_gmv)
- Lost revenue has grown by 6%.
- Directors can increase their efforts to reduce this to less than 5%.
Analysis of Orders
Q1_2016_df_can_com = Q1_2016_df[Q1_2016_df.order_status.isin(["Cancelled", "Completed"])]
Q1_2016_df_can_com = Q1_2016_df_can_com[Q1_2016_df_can_com.service.isin(ninety_five_perc_gmv)]
Q1_2016_df_pivot = Q1_2016_df_can_com.pivot_table(index='service', columns=['order_status','Month' ], values='num_orders', aggfunc= 'sum')
Q1_2016_df_pivot.fillna(0, inplace = True)
multi_tuples =[
('Cancelled', 'Jan'),
('Cancelled', 'Feb'),
('Cancelled', 'Mar'),
('Completed', 'Jan'),
('Completed', 'Feb'),
('Completed', 'Mar')]
multi_cols = pd.MultiIndex.from_tuples(multi_tuples, names=['Experiment', 'Lead Time'])
Q1_2016_df_pivot = pd.DataFrame(Q1_2016_df_pivot, columns=multi_cols)
display(Q1_2016_df_pivot.columns)
display(Q1_2016_df_pivot.head(3))
Q1_2016_df_pivot.columns = ['_'.join(col) for col in Q1_2016_df_pivot.columns.values]
display(Q1_2016_df_pivot)
#import csv
Q1_2016_df_pivot["jan_total"] = Q1_2016_df_pivot.Cancelled_Jan + Q1_2016_df_pivot.Completed_Jan
Q1_2016_df_pivot["feb_total"] = Q1_2016_df_pivot.Cancelled_Feb + Q1_2016_df_pivot.Completed_Feb
Q1_2016_df_pivot["mar_total"] = Q1_2016_df_pivot.Cancelled_Mar + Q1_2016_df_pivot.Completed_Mar
Q1_2016_df_pivot[ "Cancelled_Jan_ratio" ] =Q1_2016_df_pivot.Cancelled_Jan/Q1_2016_df_pivot.jan_total
Q1_2016_df_pivot[ "Cancelled_Feb_ratio" ]=Q1_2016_df_pivot.Cancelled_Feb/Q1_2016_df_pivot.feb_total
Q1_2016_df_pivot[ "Cancelled_Mar_ratio" ]=Q1_2016_df_pivot.Cancelled_Mar/Q1_2016_df_pivot.mar_total
Q1_2016_df_pivot[ "Completed_Jan_ratio" ]=Q1_2016_df_pivot.Completed_Jan/Q1_2016_df_pivot.jan_total
Q1_2016_df_pivot[ "Completed_Feb_ratio" ]=Q1_2016_df_pivot.Completed_Feb/Q1_2016_df_pivot.feb_total
Q1_2016_df_pivot[ "Completed_Mar_ratio" ] =Q1_2016_df_pivot.Completed_Mar/Q1_2016_df_pivot.mar_total
Q1_2016_df_pivot_1 = Q1_2016_df_pivot[["Cancelled_Jan_ratio"
,"Cancelled_Feb_ratio"
,"Cancelled_Mar_ratio"
,"Completed_Jan_ratio"
,"Completed_Feb_ratio"
,"Completed_Mar_ratio"]]
Q1_2016_df_pivot_1
- In March, Food, Ride, Send had 17%,15%, and 13% of total orders canceled respectively.
- Food has increased its order completion rate, from 69% in January to 83% in March. This is a significant improvement.
## column wise cancellation check if increased
perc_of_cols_orders = pd.DataFrame()
for cols in Q1_2016_df_pivot.columns:
perc_of_cols_orders[cols]=(Q1_2016_df_pivot[cols]/Q1_2016_df_pivot[cols].sum())*100
perc_of_cols_orders
perc_of_cols_cbv.T.plot(kind='bar', stacked=True)
perc_of_cols_orders.T.plot(kind='bar', stacked=True)
- In March, of all the rides canceled, Ride has 72% share of orders, followed by Food(17%) and send(6%).
Summary of Findings and Recommendations for Business Analytics
- Ride –
- The top contributor to revenue.
- Cancellation(GMV) in March has grown by 42%
- Reduce cancelations through product intervention and new product features.
- Food –
- Canceled orders have increased, but due to cost optimization, GMV loss has been successfully arrested.
- Increase net revenue by reducing costs and cancellations.
- Drive higher customer acquisition.
- Send –
- Canceled GMV and orders, both have taken a hit and are a major cause of concern.
- Good ride completion experience, thus, increasing retention and powering revenue growth through retention.
Maximize Profits By Optimizing Budget Spends
The Business team has a budget of 40 Billon for Q2 and it has set growth targets for each service. For each service, the cost of incremental 100 rides and the maximum growth target in Q2 is given below. For Go-Box, to get 100 more bookings, it costs 40M, and the maximum growth target in Q2 is 7%.
Import budget data and use sales data from the above analysis.
budget_df =pd.read_csv('https://raw.githubusercontent.com/chrisdmell/Project_DataScience/working_branch/09_gojek/optimization_budge.csv')
print("Shape of the df")
display(budget_df.shape)
print("HEAD")
display(budget_df.head())
print("NULL CHECK")
display(budget_df.isnull().any().sum())
print("NULL CHECK")
display(budget_df.isnull().sum())
print("df INFO")
display(budget_df.info())
print("DESCRIBE")
display(budget_df.describe())
## convert to date time
# convert order_status to string
##
time_to_pandas_time = ["date"]
for cols in time_to_pandas_time:
sales_df[cols] = pd.to_datetime(sales_df[cols])
sales_df.dtypes
sales_df['Month'] = sales_df['date'].dt.month
sales_df.head()
sales_df['Month'].drop_duplicates()
sales_df_q1 = sales_df[sales_df['Month'] !=4]
## Assumptions
sales_df_q1 = sales_df_q1[sales_df_q1["order_status"] == "Completed"]
# Q1_2016_df_pivot = Q1_2016_df.pivot_table(index='service', columns=['order_status','Month' ], values='num_orders', aggfunc= 'sum')
sales_df_q1_pivot = sales_df_q1.pivot_table(index='service', columns=['order_status'], values='total_cbv', aggfunc= 'sum')
sales_df_q1_pivot_orders = sales_df_q1.pivot_table(index='service', columns=['order_status'], values='num_orders', aggfunc= 'sum')
sales_df_q1_pivot.reset_index(inplace = True)
sales_df_q1_pivot.columns = ["Service","Q1_revenue_completed"]
sales_df_q1_pivot
sales_df_q1_pivot_orders.reset_index(inplace = True)
sales_df_q1_pivot_orders.columns = ["Service","Q1_order_completed"]
optimization_Df = pd.merge(
sales_df_q1_pivot,
budget_df,
how="left",
on="Service",
)
optimization_Df = pd.merge(
optimization_Df,
sales_df_q1_pivot_orders,
how="left",
on="Service",
)
optimization_Df.columns = ["Service", "Q1_revenue_completed", "Cost_per_100_inc_booking", "max_q2_growth_rate","Q1_order_completed"]
optimization_Df.head(5)
#import csv
- For Box, Q1 revenue is 23B, the cost for incremental 100 rides is 40M, its maximum expected growth rate is 7% and 63K total rides were completed @ 370K per order.
Is it possible to achieve the maximum growth rate for all the services with an available budget of 40B?
## If all service max growth is to be achived what is the budget needed? and whats the deficiet?
optimization_Df["max_q2_growth_rate_upd"] = optimization_Df['max_q2_growth_rate'].str.extract('(\d+)').astype(int) ## extract int from string
optimization_Df["max_growth_q2_cbv"] = (optimization_Df.Q1_order_completed *(1+ optimization_Df.max_q2_growth_rate_upd/100)) ## Q2 max orders based on Q1 orders
optimization_Df["abs_inc_orders"] = optimization_Df.max_growth_q2_cbv-optimization_Df.Q1_order_completed ## Total increase in orders
optimization_Df["cost_of_max_inc_q2_order"] = optimization_Df.abs_inc_orders * optimization_Df.Cost_per_100_inc_booking /100 ## Total Cost to get maximum growth for each serivce
display(optimization_Df)
display(budget_df[budget_df["Service"] == "Budget:"].reset_index())
budget_max = budget_df[budget_df["Service"] == "Budget:"].reset_index()
budget_max = budget_max.iloc[:,2:3].values[0][0]
print("Budget difference by")
display(budget_max-optimization_Df.cost_of_max_inc_q2_order.sum() )
## Therefore max of the everything cannot be achieved#import csv
The answer is No. 247B(247,244,617,204) more budget is required to achieve growth targets for all services.
Is it possible to achieve at least 10% of the maximum growth rate for all the services with an available budget of 40B?
## Then what is the budget needed and what will the extra budget at hand??
optimization_Df["min_10_max_growth_q2_cbv"] = (optimization_Df.Q1_order_completed *(1+ optimization_Df.max_q2_growth_rate_upd/1000)) ## atleast 10% of max if achieved, this is orders
optimization_Df["min_10_abs_inc_orders"] = optimization_Df.min_10_max_growth_q2_cbv-optimization_Df.Q1_order_completed ## what is the increase in orders needed to achieve 10% orders growth
optimization_Df["min_10_cost_of_max_inc_q2_order"] = optimization_Df.min_10_abs_inc_orders * optimization_Df.Cost_per_100_inc_booking /100 ## Cost associatedfor 10% increase in orders
display(budget_max-optimization_Df.min_10_cost_of_max_inc_q2_order.sum() ) ## Total budget remaining
display((budget_max-optimization_Df.min_10_cost_of_max_inc_q2_order.sum())/budget_max) ## Budget utilization percentage
optimization_Df["perc_min_10_max_growth_q2_cbv"] =( ( optimization_Df.max_q2_growth_rate_upd/1000)) ## atleast 10% of max if achieved, 7 to percent divide by 100, 10% of this number. divide by 10, so 1000
optimization_Df["perc_max_growth_q2_cbv"] =( ( optimization_Df.max_q2_growth_rate_upd/100)) ## Max growth to be achieved
optimization_Df["q1_aov"] = optimization_Df.Q1_revenue_completed/optimization_Df.Q1_order_completed ## Q1 average order value
optimization_Df["order_profitability"] = 0.1 ## this is assumption that 10% will be profit
optimization_Df["a_orders_Q2"] = (optimization_Df.Q1_order_completed *(1+ optimization_Df.perc_min_10_max_growth_q2_cbv)) ## based on 10% growth, total new orders for qc
optimization_Df["a_abs_inc_orders"] = optimization_Df.a_orders_Q2-optimization_Df.Q1_order_completed
optimization_Df["a_Q2_costs"] = optimization_Df.Cost_per_100_inc_booking* optimization_Df.a_abs_inc_orders/100
##There is scope for improvement here, so This can be adjusted based on revenue or ranking from Q1
display(budget_max - optimization_Df.a_Q2_costs.sum())
optimization_Df#import csv
The answer is Yes. With only 28% of the available 40B budget, this can be achieved. Underutilization of the available budget is never an option, and no business leader would use only 28% of the available budget.
So, the maximum growth across all services cannot be achieved, and achieving 10% of the maximum growth rate will lead to an underutilized budget. Hence the need here is to optimize spending such that:
- The overall cash burn doesn’t cross 40B.
- The overall growth rate in Q2 across services is equal to or below the maximum growth rate.
- There are called constraints in Linear optimization.
- The objective is to Maximize profits.
Assumptions used here:
- Every service has a profit of 10%.
- AOV(revenue/orders) will remain the same as in Q1.
Pre-optimization data pipeline:
## Data prep for pulp optimization
perc_all_df = pd.DataFrame(data = list(range(1,optimization_Df.max_q2_growth_rate_upd.max()+1)), columns = ["growth_perc"])
## create a list of all percentage growth, from 1 to max to growth expected, this is to create simulation for optimization
display(perc_all_df.head(1))
optimization_Df_2 = optimization_Df.merge(perc_all_df, how = "cross") ## cross join with opti DF
## Filter and keeping all percentgaes upto maximum for each service
## Minimum percentage kept is 1
optimization_Df_2["filter_flag"] = np.where(optimization_Df_2.max_q2_growth_rate_upd >= (optimization_Df_2.growth_perc),1,0)
optimization_Df_2["abs_profit"] = (optimization_Df_2.q1_aov)*(optimization_Df_2.order_profitability)
optimization_Df_3 = optimization_Df_2[optimization_Df_2["filter_flag"] == 1]
display(optimization_Df_3.head(1))
display(optimization_Df_3.columns)
## Filter columns needed
optimization_Df_4 = optimization_Df_3[[
'Service', ## services offered
'Cost_per_100_inc_booking', ## cost of additional 100 orders
'Q1_order_completed', ## to calculate q2 growth based on q1 orders
'perc_min_10_max_growth_q2_cbv', ## minimum growth percent need
'perc_max_growth_q2_cbv', ## max growth percent allowed
'abs_profit', ## profit per order
'growth_perc' ## to simulative growth percet across
]]
display(optimization_Df_4.head(2))
optimization_Df_4["orders_Q2"] = (optimization_Df_4.Q1_order_completed *(1+ optimization_Df_4.growth_perc/100)) ## based on growth, total new orders for qc
optimization_Df_4["abs_inc_orders"] = optimization_Df_4.orders_Q2-optimization_Df_4.Q1_order_completed
optimization_Df_4["profit_Q2_cbv"] = optimization_Df_4.orders_Q2 * optimization_Df_4.abs_profit
optimization_Df_4["growth_perc"] = optimization_Df_4.growth_perc/100
optimization_Df_4["Q2_costs"] = optimization_Df_4.Cost_per_100_inc_booking* optimization_Df_4.abs_inc_orders/100
display(optimization_Df_4.head())
optimization_Df_5 = optimization_Df_4[[
'Service', ## services offered
'Q2_costs', ## cost total for the growth expected
'perc_min_10_max_growth_q2_cbv', ## minimum growth percent need
'perc_max_growth_q2_cbv', ## max growth percent allowed
'profit_Q2_cbv', ## total profit at the assumed order_profitability rate
'growth_perc' ## to simulative growth percet across
]]
optimization_Df_5
display(optimization_Df_5.head(10))
display(optimization_Df_5.shape)
Understanding the Optimization Dataset
- Service – Go product.
- 10% of max growth, is the minimum growth that each service should achieve. So Box should at least achieve 0.7% growth.
- This is a constraint.
- Max growth decided by business leaders for Box is 7%.
- This is a constraint.
- For Box, 1% to 7% is the range of growth.1% is more than 0.7% and 7% is the maximum. The optimizer will choose the best growth rate based on constraints.
- This is a decision variable. The algorithm will pick one among 7.
- For 1% growth(Incremental), the cash burn is 255M.
- This is a constraint.
- If incremental growth is 1%, then overall profit(organic + inorganic) is 2.4B.
- This is the objective.
## Best optimization for our case case. This is good.
prob = LpProblem("growth_maximize", LpMaximize) ## Initialize optimization problem - Maximization problem
optimization_Df_5.reset_index(inplace = True, drop = True)
markdowns = list(optimization_Df_5['growth_perc'].unique()) ## List of all growth percentages
cost_v = list(optimization_Df_5['Q2_costs']) ## List of all incremental cost to achieve the growth % needed
perc_min_10_max_growth_q2_cbv = list(optimization_Df_5['perc_min_10_max_growth_q2_cbv'])
growth_perc = list(optimization_Df_5['growth_perc'])
## lp variables
low = LpVariable.dicts("l_", perc_min_10_max_growth_q2_cbv, lowBound = 0, cat = "Continuous")
growth = LpVariable.dicts("g_", growth_perc, lowBound = 0, cat = "Continuous")
delta = LpVariable.dicts ("d", markdowns, 0, 1, LpBinary)
x = LpVariable.dicts ("x", range(0, len(optimization_Df_5)), 0, 1, LpBinary)
## objective function - Maximise profit, column name - profit_Q2_cbv
## Assign value for each of the rows -
## For all rows in the table each row will be assidned x_0, x_1, x_2 etc etc
## This is later used to filter the optimal growth percent
prob += lpSum(x[i] * optimization_Df_5.loc[i, 'profit_Q2_cbv'] for i in range(0, len(optimization_Df_5)))
## one unique growth percentahe for each service
## Constraint one
for i in optimization_Df_5['Service'].unique():
prob += lpSum([x[idx] for idx in optimization_Df_5[(optimization_Df_5['Service'] == i) ].index]) == 1
## Do not cross total budget
## Constraint two
prob += (lpSum(x[i] * optimization_Df_5.loc[i, 'Q2_costs'] for i in range(0, len(optimization_Df_5))) - budget_max) <= 0
## constraint to say minimum should be achived
for i in range(0, len(optimization_Df_5)):
prob += lpSum(x[i] * optimization_Df_5.loc[i, 'growth_perc'] ) >= lpSum(x[i] * optimization_Df_5.loc[i, 'perc_min_10_max_growth_q2_cbv'] )
prob.writeLP('markdown_problem') ## Write Problem name
prob.solve() ## Solve Problem
display(LpStatus[prob.status]) ## Problem status - Optimal, if problem solved successfully
display(value(prob.objective)) ## Objective, in this case what is the maximized profit with availble budget - 98731060158.842 @ 10% profit per order #import csv
print(prob)
print(growth)
Understanding How to Write An LP Problem is Key to Solving it
- Initialize the problem
- prob = LpProblem(“growth_maximize”, LpMaximize)
- growth_maximize is the name of the problem.
- LpMaximize is letting the solver know that it’s a maximization problem.
- Create a variable of the decision function
- growth = LpVariable.dicts(“g_”, growth_perc, lowBound = 0, cat = “Continuous”)
- For Pulp, pulp dicts needs to be created
- g_ is the prefix for the variable.
- growth_perc is the name of the list
- low bound is the minimum growth percent, it can start from 0.
- The variable is continuous.
- There are 60 unique growth percentages from 1%(minimum) to 60%(maximum). (Food has a 60% maximum growth rate).
- Variables – 0 <= x_0 <= 1 Integer for row 0 to 0 <= x_279 <= 1 Integer for row 279.
- Add objective function to the problem
- prob += lpSum(x[i] * optimization_Df_5.loc[i, ‘profit_Q2_cbv’] for i in range(0, len(optimization_Df_5)))
- An equation is created by pulp -> 2423147615.954*x_0 + 2447139176.5080004*x_1 + 225916468.96*x_3+ …. + 8576395.965000002*x_279. There are 280 rows in the dataset, so for each profit value, a variable is created.
- Add constraint:
- One – One growth percentage for each service
- for i in optimization_Df_5[‘Service’].unique(): prob += lpSum([x[idx] for idx in optimization_Df_5[(optimization_Df_5[‘Service’] == i) ].index]) == 1
- For each service, only select one growth percent.
- For Box out of 1 to 7 select only one.
- The equation for box – _C1: x_0 + x_1 + x_2 + x_3 + x_4 + x_5 + x_6 = 1
- The equation for GLAM – _C2: x_10 + x_11 + x_12 + x_13 + x_14 + x_15 + x_16 + x_7 + x_8 + x_9 = 1
- As there are 11 services so 11 constraints are created, one for each service.
- Two – Do not cross the total budget of 40B
- prob += (lpSum(x[i] * optimization_Df_5.loc[i, ‘Q2_costs’] for i inrange(0, len(optimization_Df_5))) – budget_max) <= 0
- The sum of all costs minus the total budget should be less than or equal to zero.
- Equation _C12: 255040000 x_0 + 510080000 x_1 + …. + 16604 x_279 <= 0
- _C12: is the only constraint here because, there is one total budget of 40B, and there is no constraint on how much each service can spend.
- Three – constraint to say minimum should be achieved
- for i in range(0, len(optimization_Df_5)): prob += lpSum(x[i] * optimization_Df_5.loc[i, ‘growth_perc’] ) >= lpSum(x[i] * optimization_Df_5.loc[i, ‘perc_min_10_max_growth_q2_cbv’] )
- For each row, the minimum growth percent constraint equation is created. There are 279 rows, so 279 constraints are created.
- _C13: 0.003 x_0 >= 0 from row 0 to _C292: 0.315 x_279 >= 0 to row 279.
- “Optimal‘” is the desired output.
- display(LpStatus[prob.status])
- 98731060158.842 is the maximized profit.
- display(value(prob.objective))
- One – One growth percentage for each service
var_name = []
var_values = []
for variable in prob.variables():
if 'x' in variable.name:
var_name.append(variable.name)
var_values.append(variable.varValue)
results = pd.DataFrame()
results['variable_name'] = var_name
results['variable_values'] = var_values
results['variable_name_1'] = results['variable_name'].apply(lambda x: x.split('_')[0])
results['variable_name_2'] = results['variable_name'].apply(lambda x: x.split('_')[1])
results['variable_name_2'] = results['variable_name_2'].astype(int)
results.sort_values(by='variable_name_2', inplace=True)
results.drop(columns=['variable_name_1', 'variable_name_2'], inplace=True)
results.reset_index(inplace=True)
results.drop(columns='index', axis=1, inplace=True)
# results.head()
optimization_Df_5['variable_name'] = results['variable_name'].copy()
optimization_Df_5['variable_values'] = results['variable_values'].copy()
optimization_Df_5['variable_values'] = optimization_Df_5['variable_values'].astype(int)# optimization_Df_6.head()
#import csv## with no budget contraint
optimization_Df_10 = optimization_Df_5[optimization_Df_5['variable_values'] == 1].reset_index()
optimization_Df_10["flag"] = np.where(optimization_Df_10.growth_perc >= optimization_Df_10.perc_min_10_max_growth_q2_cbv,1,0)
display(optimization_Df_10)
display(budget_max - optimization_Df_10.Q2_costs.sum())
display( optimization_Df_10.Q2_costs.sum())
- The maximum growth rate for respective services is in the chart above. For Box it’s 1%, for Clean it’s 1%, for Food it’s 17%, etc.
- The total cash burn is – 39999532404.0
- Underutilized budget – 467596.0
- Maximized profit – 98731060158.0
The Solution to Part Two
sales_df =pd.read_csv('https://raw.githubusercontent.com/chrisdmell/Project_DataScience/working_branch/09_gojek/model_analytics__data.csv')
time_to_pandas_time = ["date"]
for cols in time_to_pandas_time:
sales_df[cols] = pd.to_datetime(sales_df[cols])
sales_df['Month'] = sales_df['date'].dt.month
Q1_2016_df = sales_df[sales_df['Month'] !=900]
Q1_2016_df['Month'] = np.where(Q1_2016_df['Month'] == 1,"Jan",np.where(Q1_2016_df['Month'] == 2,"Feb",np.where(Q1_2016_df['Month'] == 3,"Mar","Apr")))
Q1_2016_df['test_control'] = np.where(Q1_2016_df['date'] <= "2016-03-30","train", "test")
display(Q1_2016_df.head(5))
display(Q1_2016_df.order_status.unique())
display(Q1_2016_df.service.unique())
display(Q1_2016_df.date.max())
#import csv
- Import dataset
- Convert date to pandas datetime
- Derive month columns
- Derive train and test columns
display(Q1_2016_df.head())
display(Q1_2016_df.date.max())
Q1_2016_df_2 = Q1_2016_df[Q1_2016_df["date"] <= "2016-04-01"]
display(Q1_2016_df_2.date.max())
Q1_2016_df_2 = Q1_2016_df_2[Q1_2016_df["order_status"] == "Cancelled"]
Q1_2016_df_date_unique = Q1_2016_df_2[["date"]].drop_duplicates()
Q1_2016_df_date_service = Q1_2016_df_2[["service"]].drop_duplicates()
Q1_2016_df_CJ = Q1_2016_df_date_unique.merge(Q1_2016_df_date_service, how = "cross") ## cross join with opti DF
display(Q1_2016_df_date_unique.head())
display(Q1_2016_df_date_unique.shape)
display(Q1_2016_df_date_unique.max())
display(Q1_2016_df_date_unique.min())
display(Q1_2016_df_2.shape)
Q1_2016_df_3 = Q1_2016_df_CJ.merge(Q1_2016_df_2, on=['date','service'], how='left', suffixes=('_x', '_y'))
display(Q1_2016_df_3.head())
display(Q1_2016_df_3.shape)
display(Q1_2016_df_CJ.shape)
Q1_2016_df_3["total_cbv"].fillna(0, inplace = True)
print("Null check ",Q1_2016_df_3.isnull().values.any())
nan_rows = Q1_2016_df_3[Q1_2016_df_3['total_cbv'].isnull()]
nan_rows
display(Q1_2016_df_3[Q1_2016_df_3.isnull().any(axis=1)])
Q1_2016_df_3["dayofweek"] = Q1_2016_df_3["date"].dt.dayofweek
Q1_2016_df_3["dayofmonth"] = Q1_2016_df_3["date"].dt.day
Q1_2016_df_3["Is_Weekend"] = Q1_2016_df_3["date"].dt.day_name().isin(['Saturday', 'Sunday'])
Q1_2016_df_3.head()
- Filter for only canceled orders.
- For all services, cross join with dates from Jan 01 to Apr 01, so that predictions for all days are available.
- Replace NULL with 0.
- Derive day of the month
- Derive day of the week.
- Create binary weekend/weekday column
Q1_2016_df_4 = Q1_2016_df_3[Q1_2016_df_3["service"] != "GO-TIX"]
Q1_2016_df_5 = pd.get_dummies(Q1_2016_df_4, columns=["Month","dayofweek"])
display(Q1_2016_df_5.head())
import numpy as np
import pandas as pd
# from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from numpy import mean
from numpy import std
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_predict
Q1_2016_df_5.columns
all_columns = ['date', 'service', 'num_orders', 'order_status', 'total_cbv',
'test_control', 'dayofmonth', 'Is_Weekend', 'Month_Apr', 'Month_Feb',
'Month_Jan', 'Month_Mar', 'dayofweek_0', 'dayofweek_1', 'dayofweek_2',
'dayofweek_3', 'dayofweek_4', 'dayofweek_5', 'dayofweek_6']
model_variables = [ 'dayofmonth', 'Is_Weekend', 'Month_Apr', 'Month_Feb',
'Month_Jan', 'Month_Mar', 'dayofweek_0', 'dayofweek_1', 'dayofweek_2',
'dayofweek_3', 'dayofweek_4', 'dayofweek_5', 'dayofweek_6']
target_Variable = ["total_cbv"]
all_columns = ['service',
'test_control', 'dayofmonth', 'Is_Weekend', 'Month_Apr', 'Month_Feb',
'Month_Jan', 'Month_Mar', 'dayofweek_0', 'dayofweek_1', 'dayofweek_2',
'dayofweek_3', 'dayofweek_4', 'dayofweek_5', 'dayofweek_6']
- Filter out GO-TIX
- One hot encode – Month and day of the week
- Import all the necessary libraries
- Create a list of columns, train, predictor, etc.
model_1 = Q1_2016_df_5[Q1_2016_df_5["service"] =="GO-FOOD"]
test = model_1[model_1["test_control"]!="train"]
train = model_1[model_1["test_control"]=="train"]
X = train[model_variables]
y = train[target_Variable]
train_predict = model_1[model_1["test_control"]=="train"]
x_ = X[model_variables]
sc = StandardScaler()
X_train = sc.fit_transform(X)
X_test = sc.transform(x_)
- Filter data for one service – GO-FOOD
- Create train and test dataframes
- Create X – with train columns, and y with predictor column.
- Use Standardscalar for z-score transformation.
#define custom function which returns single output as metric score
def NMAPE(y_true, y_pred):
return 1 - np.mean(np.abs((y_true - y_pred) / y_true)) * 100
#make scorer from custome function
nmape_scorer = make_scorer(NMAPE)
# prepare the cross-validation procedure
cv = KFold(n_splits=3, random_state=1, shuffle=True)
# create model
model = LinearRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring=nmape_scorer, cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
y_pred = cross_val_predict(model, X, y, cv=cv)
- cross_val_score doesn’t have MAPE as an in-built scorer, so define MAPE.
- Create CV instance
- Create LR instance
- Use cross_val_score to get the average MAPE scores across CV Folds for GO-Foods.
- For each service, this code can be lopped, create a function to create
def go_model(Q1_2016_df_5, go_service,model_variables,target_Variable):
"""
Q1_2016_df_5
go_service
model_variables
target_Variable
"""
model_1 = Q1_2016_df_5[Q1_2016_df_5["service"] ==go_service]
test = model_1[model_1["test_control"]!="train"]
train = model_1[model_1["test_control"]=="train"]
X = train[model_variables]
y = train[target_Variable]
train_predict = model_1[model_1["test_control"]=="train"]
x_ = X[model_variables]
X_train = sc.fit_transform(X)
X_test = sc.transform(x_)
# prepare the cross-validation procedure
cv = KFold(n_splits=3, random_state=1, shuffle=True)
# create model
model = LinearRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring=nmape_scorer, cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
y_pred = cross_val_predict(model, X, y, cv=cv)
return y_pred,mean(scores), std(scores)
a,b,c = go_model(Q1_2016_df_5, "GO-FOOD",model_variables,target_Variable)
b
- Modeling steps converted to a function:
- Q1_2016_df_5 – Base data
- go_service – go-tix, go-send etc
- model_variables – variables used to train the model
- target_Variable – predictor variable(total_cbv).
- For each service, the method can be run to get the average forecast MAPE across all 11 services.
The Solution to Part Three
Question 3 is an open-ended question and readers are encouraged to solve it on their own. Some of the hypotheses are:
- As this is specific to one particle area and geography, it’s safe to assume that the APP more or less remained the same, and product interventions could have played only a minor role. And if there was product intervention, it was just specific to this particular area.
- Good quality/famous restaurants and food chains were onboarded, and users now have lot of good selection to order from or order from familiar restaurants.
- The delivery speed was significantly improved by onboarding a higher number of delivery agents.
- Re-trained delivery agents effectively to reduce cancelations.
- Worked with restaurant partners, to handle peak-time chaos in a better way.
Useful Resources and References
- Working In The ‘Central Analytics and Science Team’
- How We Estimate Food Debarkation Time With ‘Tensoba’
- Business Case Study Assignments For Entry Level Data Analysts
- Solving Business Case Study Assignments For Data Scientists
- Using Data To Appreciate Our Customers
- Under the Hood of Gojek’s Automated Forecasting Tool
- Experimentation at Gojek
- GO-JEK’s Impact for Indonesia
- GO-FAST: The Data Behind Ramadan
- Pulp optimization.
- Linear programming using pulp.
- Marketing campaign optimization.
- Simple ways to optimize something using python.
Conclusion
Case studies, when done right, following the steps given above, will have a positive impact on the business. Recruiters aren’t looking for answers but an approach to those answers, the structure followed, the reasoning used, and business and practical knowledge using business analytics. This article provides an easy-to-follow framework for data analysts using a real business case study as an example.
Key Takeaways:
- There are two approaches to answer this case study, bottom-up, top-down. Here, bottom-up approach has been considered, because of unfamiliarity with data and unavailability of business context.
- Slicing and dicing the sales numbers across dimensions, identifying trends and patterns across services, is the best approach to figure out the challenges for growth.
- Be crisp and to the point, while providing recommendations.
- Let the data tell a story, instead of just proving data points – Eg: The top three services contribute towards more than 90% of revenue. While at the group level, growth is on the positive side, at various services, there are challenges with ride completion, driver cancellation, etc. For Food – Reducing cancelations by Y% will drive higher revenues in Q2 by x% etc.
- Optimization using pulp is intimidating when there are more than 3 constraints. Writing down an LP problem on a piece of paper, then coding it out will surely make the task easier.
Good luck! Here’s my Linkedin profile if you want to connect with me or want to help improve the article. Feel free to ping me on Topmate/Mentro; you can drop me a message with your query. I’ll be happy to be connected. Check out my other articles on data science and analytics here.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.