What’s Starbucks’ deal?

Mikołaj Andrzejewski
8 min read · Mar 26, 2021


A summary of my Udacity Data Science NanoDegree Capstone Project

Every coffee drinker is familiar with this logo :)

Project overview

This is part of the Udacity Data Science NanoDegree. The task at hand was to use the datasets provided by Starbucks to gain insight into the appeal of their promotional offers. By analyzing the data, I tried to figure out which offers are the most attractive to customers.

Problem Statement

There are several types of promotional offers available to Starbucks customers. With the help of the provided data, we need to find out which type of offer is the most appealing and which customer groups are most likely to benefit from the offers. We need to uncover the relationships between customers’ characteristics (age, gender, income, etc.) and their willingness to take advantage of an offer.

Metrics used for evaluation

In the last part of the project, I used modeling to check whether we can predict customers’ behavior from the available data.

The metrics I used to evaluate the models are listed below (a quick sketch of how to compute them follows the list):

  • MSE (mean squared error)
  • R2 (coefficient of determination)
  • Accuracy of prediction
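
All three are available in scikit-learn. Here is a quick, self-contained sketch with toy labels (not project data), just to show how each metric is computed:

from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

# Toy labels and predictions, purely for illustration
y_true = [1, 2, 3, 1, 2, 3, 1, 2]
y_pred = [1, 2, 3, 1, 2, 2, 1, 3]

print('MSE =', mean_squared_error(y_true, y_pred))
print('R2 =', r2_score(y_true, y_pred))
print('Accuracy =', accuracy_score(y_true, y_pred))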

Data Preprocessing and Analysis

For this project, I had three datasets available in the form of .json files:

  • portfolio.json
  • profile.json
  • transcript.json

Portfolio.json contains the data about the offers — their types, duration, etc.

Profile.json contains information about the users — their gender, age, income, and the date they became a member.

Transcript.json is essentially a log of transactions and offer events (received, viewed, completed).

I started by exploring the datasets in order to figure out what kind of data I was dealing with. Also, I wanted to find potential problems with the datasets, such as NaN values.

There were several issues with the profile.json dataset.

First of all, it contained missing values:

Only 14,825 non-null values in the gender and income columns

Also, there was a significant outlier in the age column — the age 118 appeared 2,175 times:

'118' seems to be a placeholder for NaN

This is exactly the number of missing values in the income and gender columns, so we can assume that '118' in this context means 'NaN'. Moreover, the NaNs always appear in the same rows:

All the NaN appear in the same rows

Another problem with this dataset is visible in the became_member_on column — the date is stored as an integer. I converted it into a proper date, split it into year and month columns, and dropped the original column.
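
A minimal sketch of that conversion, assuming the dates are stored as YYYYMMDD integers (the derived column names here are my own):

import pandas as pd

# Tiny stand-in for profile.json; the real dates look like 20170523
profile = pd.DataFrame({'became_member_on': [20170523, 20180112]})

profile['became_member_on'] = pd.to_datetime(profile['became_member_on'].astype(str), format='%Y%m%d')
profile['membership_year'] = profile['became_member_on'].dt.year
profile['membership_month'] = profile['became_member_on'].dt.month
profile = profile.drop(columns='became_member_on')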

When it comes to the missing values, I decided not to drop the affected rows for now. I changed the ‘None’ gender into ‘N/A’, but it didn’t really matter later on. As for income and age, I considered filling the missing values with the mode or mean, but that would only shift the position of the outliers, as seen in the plots below:

Missing Income filled with mean
118 value for age replaced with the mean

In the portfolio dataset, I split the ‘channels’ column into one column per channel, with the new columns encoded as 1 or 0. I kept the original column — it will be useful later on.
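
One way to do this split, sketched on a tiny stand-in dataframe (the real portfolio.json stores the channels as lists of names):

import pandas as pd

# Stand-in for portfolio.json; 'channels' holds a list of channel names per offer
portfolio = pd.DataFrame({'channels': [['email', 'web'], ['email', 'mobile', 'social']]})

# One 0/1 column per channel, keeping the original 'channels' column intact
channel_dummies = pd.get_dummies(portfolio['channels'].explode()).groupby(level=0).max().astype(int)
portfolio = portfolio.join(channel_dummies)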

In the transcript dataset, I split the ‘value’ column into ‘offer_id’, ‘amount’, and ‘reward’ columns.
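
A sketch of that split, again on a stand-in (in the real data, ‘value’ holds one small dict per event; the exact keys here are illustrative):

import pandas as pd

# Stand-in for transcript.json; 'value' holds one small dict per event
transcript = pd.DataFrame({'value': [
    {'offer_id': 'abc123'},
    {'amount': 1.59},
    {'offer_id': 'def456', 'reward': 2},
]})

# Expand each dict into its own columns: offer_id, amount, reward
expanded = pd.DataFrame(transcript['value'].tolist())
transcript = transcript.drop(columns='value').join(expanded)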

Then, I merged the datasets. At first, I thought I could merge portfolio and profile on ‘id’, but ‘id’ in portfolio stands for the offer id while ‘id’ in profile stands for the user id, so the following would give us an empty dataframe:

profile_portfolio_merged = portfolio.merge(profile, on='id', how='inner')

Instead, I needed to use the transcript dataframe as a ‘bridge’ to merge all the datasets.

After merging them, I removed the duplicated columns and renamed some of the remaining ones to avoid confusion.
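
The ‘bridge’ merge itself can be sketched like this (the key column names are assumptions based on the dataset description):

import pandas as pd

# Tiny stand-ins; transcript.person matches profile.id,
# and transcript.offer_id matches portfolio.id
portfolio = pd.DataFrame({'id': ['o1'], 'offer_type': ['bogo']})
profile = pd.DataFrame({'id': ['p1'], 'gender': ['F']})
transcript = pd.DataFrame({'person': ['p1'], 'offer_id': ['o1'], 'event': ['offer completed']})

df_all = (
    transcript
    .merge(portfolio, left_on='offer_id', right_on='id', how='left')
    .merge(profile, left_on='person', right_on='id', how='left', suffixes=('_offer', '_user'))
)

# Drop the now-duplicated key columns
df_all = df_all.drop(columns=['id_offer', 'id_user'])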

Visualization

At this point, I started to create visualizations for various statistics that we could get from the dataset. I started with the ones that didn’t require me to drop the rows containing the Null values.

First, I checked the overall completion status rate for each offer type:

Overall completion status rate for each offer type

Interestingly, even though more people viewed the BOGO (‘buy one, get one’) offers, it was the discount offers that had the highest completion rate. That may suggest that in general, the customers find the discount offers more appealing.
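
A plot like this boils down to a simple groupby on the merged dataframe. A minimal sketch on stand-in data (the column and event names are my assumptions):

import matplotlib.pyplot as plt
import pandas as pd

# Stand-in with the two columns the plot needs
df_all = pd.DataFrame({
    'offer_type': ['bogo', 'bogo', 'discount', 'discount', 'discount'],
    'event': ['offer received', 'offer viewed',
              'offer received', 'offer viewed', 'offer completed'],
})

# Count each event per offer type and draw a grouped bar chart
counts = df_all.groupby(['offer_type', 'event']).size().unstack(fill_value=0)
counts.plot(kind='bar')
plt.ylabel('count')
plt.tight_layout()
plt.show()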

Then, I dropped the rows with the informational offers — I didn’t really need them, as I was mainly interested in the offers that could actually be completed — discount and BOGO.

I checked the numbers of completed, received, and viewed offers per set of channels through which they were sent to customers — this is why I left the original ‘channels’ column before.

Overall completion status rate for each set of channels

Not surprisingly, the offers that were sent through all the channels were viewed and completed the most.

Next, I checked the completion rate related to each level of difficulty.

Surprisingly, it’s not the easiest offers that caught customers’ eyes and attention. It seems that Starbucks customers like a challenge.

I made a similar plot, replacing difficulty with duration:

It looks like the 7-day promotional offers were the most effective.

After creating those plots, I dropped the rows that contained the NaN values and explored the relations between customers’ profiles and completion rates of the offers.

I started with a simple bar plot showing the completion rates per gender. In this one, I decided to include the informational offers again:

The results show that even though more men received and viewed the offers (let’s not forget that, as we’ve seen earlier, the majority of customers are male), the women completed almost as many offers as the men did. The women clearly had a better viewed-to-completed ratio.

Then, I took the income and age columns and divided the customers into four groups for each of those parameters:

df_all_cleaned['age_group'] = pd.cut(df_all_cleaned['age'], bins=[11, 19, 40, 60, 118], labels=['teenager', 'adult', 'middle-aged', 'elderly'])
df_all_cleaned['income_group'] = pd.cut(df_all_cleaned['income'], bins=[0, 30000, 60000, 90000, 120000], labels=[1, 2, 3, 4])

Then, I checked the completion rate related to age and to income.

Completion status count per age group
Completion status count per income_group

Surprisingly, the offers reached more middle-aged and elderly people than young adults.

Data Modeling and Implementation

This is the final part of the project. I tried several models and chose the best one. Originally, I wanted to evaluate the models with the F1 score, but that would have required converting all the data into True/False values. That’s why I decided to use R2, MSE, and accuracy as the metrics instead.

Before implementing the models, I made some final adjustments to the dataset.

First, I one-hot-encoded two columns, ‘event’ and ‘gender’:

columns_to_dummies = ['event', 'gender']
df_model = pd.get_dummies(df_all_cleaned, columns=columns_to_dummies, prefix=columns_to_dummies)

Then, I encoded the values of the ‘offer_type’ and ‘age_group’ columns with numerical values, so that they could be useful for the model:

df_model['offer_type'].replace(['bogo', 'discount', 'informational'], [1, 2, 3], inplace=True)
df_model['age_group'].replace(['teenager', 'adult', 'middle-aged', 'elderly'], [1, 2, 3, 4], inplace=True)

Here is the train/test split I did originally. I chose a test_size of 0.2, as it’s a reasonable place to start.
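
A minimal sketch of that split (‘offer_type’ as the target is stated later in the analysis, but the exact feature set is my assumption):

from sklearn.model_selection import train_test_split

# df_model is the prepared dataframe from the steps above
X = df_model.drop(columns=['offer_type'])
y = df_model['offer_type']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)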

Then, I performed the tests using five different algorithms (a sketch of the comparison loop follows the list):

  • Random Forest Regressor
  • KNeighbors Classifier
  • Linear Regression
  • Decision Tree Classifier
  • GaussianNB
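
This is my reconstruction of how such a comparison loop might look, reusing the split from above (not the original notebook code):

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    'Random Forest Regressor': RandomForestRegressor(),
    'KNeighbors Classifier': KNeighborsClassifier(),
    'Linear Regression': LinearRegression(),
    'Decision Tree Classifier': DecisionTreeClassifier(),
    'GaussianNB': GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name)
    print('  R2 =', r2_score(y_test, y_pred))
    print('  MSE =', mean_squared_error(y_test, y_pred))
    # The regressors output floats, so round before computing accuracy
    print('  Accuracy =', accuracy_score(y_test, y_pred.round()))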

Random Forest and Decision Tree didn’t work with this data — they behaved as if the models were overfitting. Linear Regression performed poorly.
KNeighbors did pretty well, but GaussianNB performed best:

R2 = 0.7615
MSE = 0.1209
Accuracy = 87.97%

Refinement and Improvement

Unfortunately, there’s only one parameter we can tune — var_smoothing, and running it through GridSearchCV didn’t really improve the original model.
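
For reference, the search looked roughly like this (the grid itself is my choice; a log-spaced range is the usual approach for var_smoothing):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

param_grid = {'var_smoothing': np.logspace(0, -9, num=100)}

grid = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)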

I decided to revisit the train/test split and see if choosing different features would help the model. I removed the ‘income’ column from the X set and replaced it with ‘income_group’. I also thought about adding the ‘possible_reward’ column, but quickly dismissed the idea, as that column is directly related to ‘offer_type’, which is in the y set.

Then, I ran all the algorithms again. Here are the results:

GaussianNB algorithm:

R2 = 0.6311
MSE = 0.1870
Accuracy = 81.15%

Linear Regression:

R2 = 0.4219
MSE = 0.2930
Accuracy = 41.14%

KNeighbors:

R2 = 0.8514
MSE = 0.0753
Accuracy = 96.13%

Again, neither the Decision Tree Classifier nor the Random Forest Regressor worked — both showed perfect accuracy and R2 scores, which is an obvious sign of overfitting.

KNeighbors did pretty well, but its accuracy is suspiciously high, so I suspect overfitting.
GaussianNB performed best — its score is high but still reasonable. I decided to go with that algorithm once again.

Then, I performed a grid search on this model, and this time it worked well.

The ‘var_smoothing’ parameter was set to 2.310129700083158e-05, which gave me better results for the model:

R2 = 0.7615
MSE = 0.1209
Accuracy = 87.97%

Conclusions

The analysis produced some interesting results. I was surprised to see that women are more likely to benefit from the offers; I presumed the benefit would be more equally distributed between the genders. I was also really surprised to see that the richest group of Starbucks app users has a pretty high offer completion rate.

I am pleased with the scores of the model I chose. Almost 88% accuracy with a 0.2 test split is not a bad score at all.

As for possible improvements, I think the model could be improved further by working on the train/test split some more. Different encodings of the ‘age_group’ and ‘income_group’ values, as well as normalizing the values in the whole set, could potentially improve the model. Using MinMaxScaler could be a good idea.
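
That normalization step would be a one-liner per set; the key is to fit the scaler on the training features only:

from sklearn.preprocessing import MinMaxScaler

# Fit on the training features only, then apply the same transform to the
# test features, so no information leaks from the test set
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)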

The code with my findings can be found in my GitHub project repository:

https://github.com/91Mikand/NanoProject4
