foods-receipe-rating-analysis

by Hao Shen (haoshen@umich.edu)

Introduction

This analysis is based on the Recipes and Ratings dataset, which contains a large number of recipes along with reviews from many different users. In this analysis, we investigate whether we can predict a recipe's rating from the number of steps it takes and the minutes it takes to prepare. The original recipe dataset contains the following features:

| Column | Description |
| --- | --- |
| `'name'` | Recipe name |
| `'id'` | Recipe ID |
| `'minutes'` | Minutes to prepare recipe |
| `'tags'` | Food.com tags for recipe |
| `'nutrition'` | Nutrition information in the form [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]; PDV stands for "percentage of daily value" |
| `'n_steps'` | Number of steps in recipe |
| `'steps'` | Text for steps in recipe, in order |
| `'n_ingredients'` | Number of recipe ingredients |
| `'ingredients'` | List of recipe ingredients |
| `'rating'` | Rating given |

Data Cleaning and Exploratory Data Analysis

Data Cleaning

First, the recipe dataset and the review dataset are given as two separate datasets, but since they share the same recipe id, we can merge them to connect each rating with its recipe. Because a recipe can have more than one review, we take the mean rating per recipe and add it to the recipe dataframe as a new column called "avg_rating".
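The merge-and-average step above could be sketched roughly like this. This is a minimal example on made-up rows; the review dataset's column names (e.g. `recipe_id`) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets; the review schema
# (e.g. the "recipe_id" column name) is an assumption for illustration.
recipes = pd.DataFrame({
    "id": [137739, 31490],
    "name": ["arriba baked winter squash mexican style",
             "a bit different breakfast pizza"],
    "minutes": [55, 30],
    "n_steps": [11, 9],
})
reviews = pd.DataFrame({
    "recipe_id": [137739, 31490, 31490],
    "rating": [5, 4, 3],
})

# Merge on recipe id, then average the ratings per recipe into a new column.
merged = recipes.merge(reviews, left_on="id", right_on="recipe_id", how="left")
avg = (merged.groupby("id")["rating"].mean()
             .rename("avg_rating").reset_index())
recipes = recipes.merge(avg, on="id")
print(recipes[["name", "avg_rating"]])
```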

The resulting dataframe looks like this:

| name | id | minutes | n_steps | avg_rating |
| --- | --- | --- | --- | --- |
| arriba baked winter squash mexican style | 137739 | 55 | 11 | 5 |
| a bit different breakfast pizza | 31490 | 30 | 9 | 3.5 |
| all in the kitchen chili | 112140 | 130 | 6 | 4 |
| alouette potatoes | 59389 | 45 | 11 | 4.5 |
| amish tomato ketchup for canning | 44061 | 190 | 5 | 5 |
| apple a day milk shake | 5289 | 0 | 4 | 5 |
| aww marinated olives | 25274 | 15 | 4 | 2 |

Univariate Analysis

Distribution of minutes to prepare

First, let's look at the 'minutes' column to see how it is distributed:

From the distribution we can see that most recipes' preparation times lie in the 0-25 and 25-50 minute ranges.
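As a small sketch of how such a binned distribution can be computed, here is a minimal example using hypothetical preparation times rather than the real `minutes` column:

```python
import numpy as np

# Hypothetical preparation times (minutes); the real analysis would
# pass the dataframe's "minutes" column instead.
minutes = np.array([5, 10, 20, 30, 40, 45, 60, 15, 22, 35, 120])

# Bucket into 25-minute-wide bins, mirroring the histogram described above.
counts, edges = np.histogram(minutes, bins=[0, 25, 50, 75, 100, 125])
labels = [f"{int(a)}-{int(b)}" for a, b in zip(edges, edges[1:])]
print(dict(zip(labels, counts)))
```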

Distribution of number of steps:

Bivariate Analysis

minutes vs average rating:

We can see that rating 5 has outliers with extremely large minutes; this might be useful in our future analysis.

number of steps vs average rating:

Interesting Aggregates

From the original merged dataset, grouping by rating lets us investigate the mean number of steps and mean minutes for each rating class (rating is an integer from 0 to 5).

The resulting table looks like the following:

| rating | n_steps | minutes |
| --- | --- | --- |
| 0 | 10.4182 | 35499.7 |
| 1 | 9.92347 | 119.812 |
| 2 | 9.70268 | 104.012 |
| 3 | 9.46229 | 100.119 |
| 4 | 9.26096 | 95.1594 |
| 5 | 9.65566 | 47462.3 |

From this table we can see that as the rating increases, the mean number of steps tends to decrease, and the mean minutes also tends to decrease. The reason rating 5 has an abnormally large mean for minutes is that there are many outliers: some recipes have a rating of 5 but extremely large minutes (as shown in the minutes vs. average rating graph).
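This kind of aggregate can be reproduced with a `groupby`/`mean`; here is a minimal sketch on made-up rows rather than the full merged dataset:

```python
import pandas as pd

# Hypothetical merged recipe/review rows; the real table above uses
# the full merged dataset.
merged = pd.DataFrame({
    "rating":  [5, 5, 4, 4, 3],
    "n_steps": [11, 9, 7, 9, 6],
    "minutes": [55, 30, 45, 15, 130],
})

# Group by the (integer) rating class and take the mean steps/minutes.
agg = merged.groupby("rating")[["n_steps", "minutes"]].mean()
print(agg)
```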

Imputation

I didn't conduct imputation because there are no missing values in avg_rating, minutes, or n_steps.

Framing a Prediction Problem

"How do preparation minutes and number of steps affect the avg_rating?" This is a regression problem, and I am using root mean squared error (RMSE) to evaluate my model's performance.

At the time of prediction, we know the number of steps and the minutes it takes to prepare the recipe, because these are input values provided by the recipe creator.

I am choosing root mean squared error to evaluate my model's performance because it measures the average magnitude of the prediction errors and is easy to interpret. For example, we can read an RMSE as: "On average, our predictions are off by X stars."
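As a quick illustration, RMSE can be computed directly from its definition; this small sketch uses toy numbers, not the project's actual predictions:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average, take the root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example: two predictions, off by 1 star and 3 stars.
print(rmse([5, 2], [4, 5]))  # sqrt((1 + 9) / 2) ≈ 2.236
```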

Baseline Model

For the baseline model, I am using RandomForestRegressor. My features are "minutes" and "n_steps", both of which are quantitative.
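A baseline of this shape might be set up roughly as follows. The data here is synthetic, so the printed RMSE will not match the number reported below; the feature layout (columns for minutes and n_steps) is the only thing carried over from the writeup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real features come from the recipes dataframe.
rng = np.random.default_rng(0)
X = rng.uniform([0, 1], [120, 20], size=(200, 2))       # cols: minutes, n_steps
y = np.clip(5 - 0.1 * X[:, 1] + rng.normal(0, 0.5, 200), 0, 5)  # noisy "rating"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(random_state=42)
model.fit(X_tr, y_tr)

rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"baseline RMSE: {rmse:.4f}")
```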

My baseline model's RMSE is 1.0135.

I believe my baseline model did a good job, because it predicts recipe ratings with an RMSE of about 1.0. This means our predictions are typically within one star of what users actually rated, which is good enough to tell which recipes people will like versus dislike. Ratings are also very personal and can differ a lot from person to person, so being about one point off is indeed a good result.

Final Model

For my final model, I first standardize minutes and n_steps. From the earlier distribution of minutes, we can see there are many outliers with extremely large values, and standardization is a good choice to weaken their impact. Also, because I will try training on a neural network, standardized minutes and n_steps will help its performance: neural networks use gradient descent, and gradient descent works best when all features are on the same scale.
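The standardization step could look like this: a sketch on made-up rows (including one extreme minutes outlier), showing that each column ends up with zero mean and unit variance:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize minutes and n_steps before fitting, as described above.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

# Toy data: column 0 = minutes (with one extreme outlier), column 1 = n_steps.
X = np.array([[10, 4], [30, 9], [45, 11], [50000, 5]], dtype=float)
scaled = pipe.named_steps["scale"].fit_transform(X)
print(scaled.mean(axis=0), scaled.std(axis=0))  # ≈ zero mean, unit variance
```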

RandomForestRegressor is an ensemble machine learning algorithm that builds multiple decision trees on random subsets of the data and then averages their predictions to improve accuracy and reduce overfitting. It is robust for regression tasks and can handle non-linear relationships and feature interactions effectively.

Best performing hyperparameters: {'model__max_depth': 10, 'model__min_samples_leaf': 4}

Method for hyperparameter selection and model choice: I chose RandomForestRegressor because there is likely a non-linear relationship between the features (minutes and n_steps) and the target variable (rating).

I used a GridSearchCV approach for hyperparameter tuning.

The hyperparameter grid included: max_depth: [None, 10, 20, 30], which controls the maximum depth of each tree to balance model complexity and overfitting; and min_samples_leaf: [1, 2, 4], which specifies the minimum number of samples required at a leaf node to prevent overly specific splits.

The grid search was conducted with 5-fold cross-validation to ensure robust evaluation across different subsets of the training data.
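The grid search described above might be set up like this. The data is synthetic and `n_estimators` is reduced just to keep the sketch fast; the parameter grid, scoring, and 5-fold CV follow the writeup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset standing in for (minutes, n_steps) -> rating.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(120, 2))
y = 5 - 2 * X[:, 1] + rng.normal(0, 0.3, 120)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # n_estimators lowered from the sklearn default purely for speed here.
    ("model", RandomForestRegressor(n_estimators=25, random_state=42)),
])

# Grid from the writeup; scoring by negative RMSE, 5-fold cross-validation.
grid = GridSearchCV(
    pipe,
    param_grid={
        "model__max_depth": [None, 10, 20, 30],
        "model__min_samples_leaf": [1, 2, 4],
    },
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```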

Finally, the combination with the lowest root mean squared error on the validation folds was selected as the optimal model.

How my final model's performance improves on my baseline model's: we can compare the two RMSEs directly. The baseline model achieved an RMSE of 1.0135 and the final model achieved an RMSE of 0.9960, an improvement of around 1.73%.