This post was originally published by Andrew Udell at Towards Data Science
To demonstrate a random forest regression, a data set of e-commerce sales from the popular online retailer Wish will be used. The data comes from Kaggle and only features sales information on summer clothing. The attributes include product descriptions, ratings, whether ad boosts were used, whether urgency text was added to the product listing, and the number of units sold, among others.
To show the power of the random forest regression, the number of units sold will be predicted. Good, accurate predictions would be invaluable not only to inventory planners, who need to estimate how much product to order or produce, but also to sales teams, who need to understand how product moves in an e-commerce setting.
All data imports and manipulations will be done in Python with the pandas and numpy libraries.
import pandas as pd
import numpy as np

# import the data saved as a csv
df = pd.read_csv("Summer_Sales_08.2020.csv")
The first two lines simply import the pandas and numpy libraries. The final line reads a CSV file previously saved and renamed to “Summer_Sales_08.2020” and creates a data frame.
df["has_urgency_banner"] = df["has_urgency_banner"].fillna(0)
df["discount"] = (df["retail_price"] - df["price"])/df["retail_price"]
When reviewing the data, the “has_urgency_banner” column, which indicates whether an urgency banner was applied to the product listing, was coded improperly. Instead of using 1’s and 0’s, it simply leaves a blank when a banner wasn’t used. The first line of code fills those blanks with 0’s.
The second line creates a new column called “discount” which calculates the discount on the product compared to the listed retail price.
df["rating_five_percent"] = df["rating_five_count"]/df["rating_count"]
df["rating_four_percent"] = df["rating_four_count"]/df["rating_count"]
df["rating_three_percent"] = df["rating_three_count"]/df["rating_count"]
df["rating_two_percent"] = df["rating_two_count"]/df["rating_count"]
df["rating_one_percent"] = df["rating_one_count"]/df["rating_count"]
The original data set includes several columns dedicated to the products’ ratings. In addition to an average rating, it also includes the total number of ratings and the number of five, four, three, two, and one star reviews. Since the total number of reviews will already be considered, it’s better to express each star rating as a fraction of total ratings, so direct comparisons between products may be made.
The lines above simply create five new columns giving the percent of five, four, three, two, and one star reviews for every product in the data set.
ratings = [
    "rating_five_percent",
    "rating_four_percent",
    "rating_three_percent",
    "rating_two_percent",
    "rating_one_percent"
]

for rating in ratings:
    df[rating] = df[rating].apply(lambda x: x if x>= 0 and x<= 1 else 0)
pandas doesn’t throw an error when dividing by 0. Instead, it silently produces missing (NaN) or infinite values, which create issues when trying to analyze the data. In this case, products with 0 ratings end up with such values after the previous step.
The above code snippet goes through all the freshly made columns and checks that the values entered are between 0 and 1, inclusive. If they aren’t, they’re replaced with 0, which is an adequate substitute.
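To make the division-by-zero behaviour concrete, here is a minimal sketch on a tiny, made-up data frame (the values and the name `toy` are assumptions for illustration, not taken from the Wish data). It shows the NaN that appears for a product with no ratings, and one alternative way to replace it with 0.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values: the second product has no ratings at all.
toy = pd.DataFrame({
    "rating_count": [10, 0, 4],
    "rating_five_count": [7, 0, 1],
})

# 0 / 0 yields NaN in pandas instead of raising an error.
raw_share = toy["rating_five_count"] / toy["rating_count"]

# Turn infinities into NaN, then fill every NaN with 0.
clean_share = raw_share.replace([np.inf, -np.inf], np.nan).fillna(0)
toy["rating_five_percent"] = clean_share

print(toy)
```

Unlike the lambda used in the article, this variant does not clamp values outside the 0 to 1 range; it only handles the missing and infinite cases.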
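import seaborn as sns

# Distribution plot on price
# Note: distplot is deprecated in newer seaborn releases; histplot or displot are the suggested replacements
sns.distplot(df['price'])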
A Distribution Plot of Price. Figure produced by author.
The above code produces a distribution plot of the price across all the products in the data set. The most obvious and interesting insight is that there are no products that cost €10. This is probably a deliberate effort made by merchants to get their products on “€10 & Under” lists.
sns.jointplot(x = "rating", y = "units_sold", data = df, kind = "scatter")
A Scatter Plot Between Rating and Units Sold. Figure produced by author.
The above figure reveals that the vast majority of sales are made on items rated between three and four and a half stars. It also reveals that most products have fewer than 20,000 units sold, with a few items reaching 60,000 and 100,000 units sold, respectively.
As an aside, the tendency of the scatter plot to organize in lines is evidence that the units sold figures are more likely estimates than hard numbers.
sns.jointplot(x = "rating_count", y = "units_sold", data = df, kind = "reg")
A Scatter Plot between the Number of Ratings and Units Sold. Figure produced by author.
This graph demonstrates the other side of ratings. There’s a loose, but positive relationship between the number of ratings and the likelihood a product sells. This might be because consumers look at both the overall rating and the number of ratings when considering a purchase or because high-selling products just naturally produce more ratings.
Without additional data on when purchases were made and when ratings were posted, or further domain knowledge, it’s difficult to discern the cause of the correlation.
In brief, a random forest regression is the average result of a series of decision trees. A decision tree is like a flow chart that asks a bunch of questions and makes a prediction based on the answer to those questions. For example, a decision tree trying to predict if a tennis player will go to the court might ask: Is it raining? If so, is the court indoors? If not, can the player find a partner?
A simple decision tree. Figure produced by author.
The decision tree will then answer each of those questions before it arrives at a prediction. Decision trees are easy to understand and, according to some experts, model actual human behavior better than other machine learning techniques. However, they often overfit the data, which means they can give wildly different results on similar data sets.
To address this issue, multiple decision trees are trained on random samples drawn with replacement from the same data set (a technique known as bagging), and the average of their predictions is returned. This is known as random forest regression.
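The idea of bagging can be sketched by hand in a few lines. The example below is only an illustration on synthetic data (the sine-shaped target, the variable names, and the choice of 20 trees are assumptions). A full random forest implementation such as scikit-learn's can additionally randomize the features considered at each split, which this sketch leaves out.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic, non-linear toy data (not the Wish data set).
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

n_trees = 20
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# The bagged prediction is the mean of the individual tree predictions.
X_new = np.array([[1.0], [2.5], [4.0]])
per_tree = np.stack([t.predict(X_new) for t in trees])
bagged_prediction = per_tree.mean(axis=0)

print(bagged_prediction)
```

Each individual tree sees a slightly different version of the data, so their errors differ, and averaging tends to smooth those errors out.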
A simple random forest. Figure produced by author.
Its main advantage is making accurate predictions on highly non-linear data. In the Wish data set, a non-linear relationship is seen in the ratings. There isn’t a nice, easily seen correlation, but the cutoff below three stars and above four and a half is plainly visible. The random forest regression can recognize this pattern and incorporate it in its results. A more traditional linear regression, however, cannot represent such a cutoff, so the pattern only muddies its predictions.
In addition, the random forest regression is efficient, can handle many input variables, and usually makes accurate predictions. It’s an incredibly powerful tool and doesn’t take too much code to implement.
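A rough way to see this difference is to compare both models on synthetic data with a similar band-shaped pattern. The sketch below is an assumption-laden illustration, not a result from the Wish data: the ratings, the sales levels of 1000 and 50 units, and the noise are all invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical ratings; sales are high only between 3 and 4.5 stars.
rating = rng.uniform(1, 5, size=500)
in_band = (rating >= 3) & (rating <= 4.5)
units = np.where(in_band, 1000.0, 50.0) + rng.normal(0, 50, size=500)

X = rating.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, units, test_size=0.33, random_state=42
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_train, y_train)

# score() returns the R^2 on the held-out data for both models.
linear_r2 = linear.score(X_test, y_test)
forest_r2 = forest.score(X_test, y_test)

print(f"linear R^2: {linear_r2:.2f}, forest R^2: {forest_r2:.2f}")
```

On data shaped like this, the straight line can only capture a general upward trend, while the forest can learn the two cutoffs directly.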
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Divide the data between units sold and influencing factors
X = df.filter([
    "price",
    "discount",
    "uses_ad_boosts",
    "rating",
    "rating_count",
    "rating_five_percent",
    "rating_four_percent",
    "rating_three_percent",
    "rating_two_percent",
    "rating_one_percent",
    "has_urgency_banner",
    "merchant_rating",
    "merchant_rating_count",
    "merchant_has_profile_picture"
])
Y = df["units_sold"]

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state = 42)
Before running any model, the first two lines import the relevant libraries. The next set of lines creates two variables, X and Y, which are then split into training and testing data. A test size of 0.33 means that roughly two-thirds of the data will be used to train the model and one-third will be used to test it for accuracy.
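The effect of `test_size` and `random_state` can be checked on a small stand-in array. The sketch below uses 100 made-up rows (an assumption, purely for illustration) rather than the Wish data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical rows with two features; Y is tied to X so alignment can be checked.
X = np.arange(200).reshape(100, 2)
Y = np.arange(100)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42
)

# With 100 rows and test_size=0.33, 33 rows go to the test set and 67 to training.
print(len(X_train), len(X_test))

# Rows are shuffled, but each X row stays paired with its Y value.
rows_still_aligned = bool(np.all(X_train[:, 0] == 2 * Y_train))
print(rows_still_aligned)
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in the training and testing sets on every run.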
# Set up and run the model
RFRegressor = RandomForestRegressor(n_estimators = 20)
RFRegressor.fit(X_train, Y_train)
Next, the model is actually initialized and run. Note that the parameter n_estimators indicates the number of decision trees to be used.
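As a small illustration of what `n_estimators` controls, the sketch below fits a forest on synthetic data and inspects the fitted trees. The data and the added `random_state=42` are assumptions made for reproducibility; the article's own call does not set a seed, so its results can vary from run to run.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: only the first of three features drives the target.
X = rng.uniform(0, 1, size=(300, 3))
y = 10 * X[:, 0] + rng.normal(0, 0.1, size=300)

model = RandomForestRegressor(n_estimators=20, random_state=42)
model.fit(X, y)

# estimators_ holds the individual fitted decision trees.
n_trees = len(model.estimators_)

# feature_importances_ sums to 1 and indicates which inputs the trees relied on.
importances = model.feature_importances_

print(n_trees, importances.round(3))
```

Inspecting `feature_importances_` in the same way on the Wish model would be one option for seeing which listing attributes the forest leans on most.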
predictions = RFRegressor.predict(X_test)
error = Y_test - predictions
Finally, the newly fitted random forest regression is applied to the testing data and the difference is taken to produce an error array. That’s all there is to it!
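An error array is usually condensed into a few summary numbers. The sketch below shows one common way to do that with scikit-learn's metrics; the true and predicted values are invented for illustration and are not results from the Wish model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical units sold and predictions (not real model output).
y_true = np.array([100.0, 1000.0, 5000.0, 20000.0])
y_pred = np.array([150.0, 800.0, 5500.0, 18000.0])

# Same subtraction as in the article: actual minus predicted.
error = y_true - y_pred

mae = mean_absolute_error(y_true, y_pred)            # average absolute miss
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalizes large misses more
r2 = r2_score(y_true, y_pred)                        # share of variance explained

print(f"MAE: {mae:.1f}, RMSE: {rmse:.1f}, R^2: {r2:.3f}")
```

With the article's variables, `Y_test` and `predictions` would take the place of `y_true` and `y_pred`.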
The Wish data set presents a playground of numbers that can be used to solve real world problems. With only minimal data manipulation, the random forest regression proved to be an invaluable tool in analyzing this data and providing tangible results.