Building a Product Recommendation System for E-Commerce: Part I — Web Scraping


This post was originally published by ‍ Kessie Zhang at Towards Data Science


Image by noshad ahmed from Pixabay

Today, if we think of the most successful and widespread applications of machine learning in business, recommender systems are one of the first examples that come to mind. Each time you purchase something online, you might see a “products you might also like” section. Recommender systems help users discover items they might like but have not yet found, which helps companies maximize revenue through upselling and cross-selling. As a Data Science Intern at ScoreData, I wanted to take the opportunity to build a recommendation model and analyze data on ScoreData’s ML platform (ScoreFast™). Since we don’t have customers’ purchase history from any e-commerce website, I decided to build a content-based recommendation system using product descriptions and reviews. The idea underlying content-based systems is that if a user is interested in a product, we can recommend several products similar to the one the user liked.

The goals of this project were to:

  • Gather product information and review data from Backcountry.com through web scraping with Selenium and Beautiful Soup
  • Perform an exploratory data analysis using the ScoreFast™ platform
  • Convert the text data into vectors
  • Build a KNN model to find the most similar products
  • Run sentiment analysis on product reviews
  • Use each review’s sentiment score to predict its rating
  • Generate word clouds to find customers’ pain points
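As a quick preview of the “most similar products” step (covered in detail in the next post), content-based similarity ultimately reduces to comparing vectorized text. Below is a minimal bag-of-words sketch with made-up product descriptions; the real pipeline uses proper text vectorization and a KNN model rather than this brute-force loop:

```python
import math

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def bow(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

# toy product descriptions (hypothetical, for illustration only)
descriptions = {
    "dress_a": "lightweight summer dress with floral print",
    "dress_b": "lightweight breathable summer dress",
    "jacket":  "insulated waterproof winter jacket",
}
vocab = sorted({w for d in descriptions.values() for w in d.lower().split()})
vectors = {name: bow(text, vocab) for name, text in descriptions.items()}

def most_similar(query, k=1):
    """Return the k products whose descriptions are closest to the query's."""
    scores = [(cosine(vectors[query], vec), name)
              for name, vec in vectors.items() if name != query]
    return [name for _, name in sorted(scores, reverse=True)[:k]]
```

Swapping the raw counts for TF-IDF weights and the brute-force loop for a KNN index is essentially what the full model does.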

In this blog, I will only cover the data collection part. If you are interested in learning more about the model-building process, please check out my next blog.

I started the project by scraping the relevant data from an E-commerce site. In this post, I’ll share how to extract information from a website using Beautiful Soup, Selenium, and Pandas.

At the beginning of this project, I wanted to experiment with a smaller dataset first, so I only gathered data from the Women’s Dresses & Skirts category on Backcountry.com. I first went to the Women’s Dresses & Skirts page and extracted all the product URLs stored in anchor elements (<a>). For each product URL, I gathered product information such as the product description, product details, tech specs, rating, number of reviews, and review contents.

To extract this data from the site, we first need to inspect the web page to find where the information we want is stored. Sometimes the information won’t show up unless we scroll down the page; in that case, we need a web driver to scroll down for us.

Screenshot of Inspect (by Author)
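The scraping code below scrolls the page in fixed increments so lazily loaded content gets rendered. The offsets themselves are easy to factor into a small helper (the Selenium calls are shown as comments, since they need a live browser):

```python
def scroll_offsets(step=500, times=5):
    """Cumulative pixel offsets for incremental page scrolling."""
    return [step * i for i in range(1, times + 1)]

# With a live driver, each offset is sent to the browser:
# for offset in scroll_offsets():
#     driver.execute_script(f"window.scrollTo(0, {offset});")
#     time.sleep(1.5)  # give lazy-loaded content time to render
```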

```python
# get all the product links
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

link_list = []
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=options)

for url in url_list:
    driver.get(url)

    # scroll down so lazily loaded products are rendered
    scroll_to = 0
    for i in range(5):
        scroll_to += 500
        driver.execute_script('window.scrollTo(0, ' + str(scroll_to) + ');')
        time.sleep(1.5)

    # parse the page only after scrolling, so all products are in the DOM
    innerHTML = driver.execute_script("return document.body.innerHTML")
    soup = BeautifulSoup(innerHTML, 'lxml')

    atag = soup.findAll('div', attrs={'class': 'ui-pl-visible-content'})
    for link in atag:
        product_url = link.find('a')['href']
        product_url = 'https://www.backcountry.com' + product_url
        link_list.append(product_url)

driver.close()
```

Notice that if we add the “headless” argument, the web driver runs in the background. Otherwise, a new browser window pops up on each loop iteration and closes itself after extracting the information you specified.

I’ve written a product_information function and product_review function to gather the data. Here’s the code for the product_information function.

```python
def product_information(url):
    product = {}
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--window-size=1920,1200")

    driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=options)
    driver.get(url)

    # scroll down to where the review count is located on the page
    scroll_to = 0
    for i in range(5):
        scroll_to += 300
        driver.execute_script('window.scrollTo(0, ' + str(scroll_to) + ');')
        time.sleep(1.5)

    # use this for JavaScript-rendered pages
    innerHTML = driver.execute_script("return document.body.innerHTML")
    soup = BeautifulSoup(innerHTML, 'lxml')

    # product name
    product_name = soup.find('h1', {'class': 'product-name qa-product-title'})
    if product_name is None:
        product['product_name'] = None
    else:
        product['product_name'] = product_name.text

    # price (items on sale use a different class)
    price = soup.find('span', {'class': 'product-pricing__retail'})
    if price is None:
        price = soup.find('span', {'class': 'product-pricing__sale'})
    if price is None:
        product['price'] = None
    else:
        product['price'] = price.text

    # product description (not every product has one)
    product_description = soup.find('div', {'class': 'ui-product-details__description'})
    if product_description is None:
        product['product_description'] = None
    else:
        product['product_description'] = product_description.text

    # product details
    product_details = soup.find('ul', {'class': 'prod-details-accordion__list'})
    product['product_details'] = list(product_details.stripped_strings)

    # tech specs: each row of the tech-spec table holds a name/value pair
    product_first = soup.find_all('div', {'class': 'ui-product-details__techspec-row'})
    if not product_first:
        product['tech_spec'] = None
    else:
        tech_spec = {}
        for i in product_first:
            tech_name = i.find('dt', {'class': 'ui-product-details__techspec-name'}).text
            tech_value = i.find('dd', {'class': 'ui-product-details__techspec-value'}).text
            tech_spec[tech_name] = tech_value
        product['tech_spec'] = tech_spec

    # review count, e.g. "12 reviews" -> 12
    review_count = soup.find('span', {'class': 'review-count'})
    if review_count is None:
        product['review_count'] = None
    else:
        product['review_count'] = int(review_count.text.split(' ')[0])

    driver.close()
    return product
```


After I gathered the data, I noticed that a lot of the products don’t have any reviews. Therefore, I decided to scrape more data from the top 9 most popular outdoor brands’ best-selling products.

Once we’ve stored the relevant data in a dictionary, we can unpack the nested dictionary into a pandas DataFrame.
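The loop below unpacks the dictionary field by field. As an alternative, pandas can flatten one level of nesting directly with json_normalize; here is a sketch on a hypothetical two-product slice of the scraped data (the URLs and values are made up for illustration):

```python
import pandas as pd

# a hypothetical two-product slice of the scraped dictionary
product_dict = {
    "https://www.backcountry.com/item-1": {
        "brand_name": "BrandA",
        "description": {"product_name": "Dress A", "price": "$99.95", "review_count": 12},
    },
    "https://www.backcountry.com/item-2": {
        "brand_name": "BrandB",
        "description": {"product_name": "Skirt B", "price": "$59.95", "review_count": 3},
    },
}

# nested keys become dotted column names, e.g. "description.product_name"
df = pd.json_normalize(list(product_dict.values()))
```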

```python
product_name = []
brand_name = []
price = []
product_description = []
product_details = []
tech_spec = []
review_count = []

for link in link_list:
    product_name.append(product_dict[link]['description']['product_name'])
    brand_name.append(product_dict[link]['brand_name'])
    price.append(product_dict[link]['description']['price'])
    product_description.append(product_dict[link]['description']['product_description'])
    product_details.append(product_dict[link]['description']['product_details'])
    tech_spec.append(product_dict[link]['description']['tech_spec'])
    review_count.append(product_dict[link]['description']['review_count'])

# collect every unique tech-spec name across products
key_list = set()
for spec in tech_spec:
    for key in spec.keys():
        key_list.add(key)
key_list = list(key_list)

# one column per tech spec; None where a product lacks that spec
key_dictionary = defaultdict(list)
for spec in tech_spec:
    for key in key_list:
        if key not in spec:
            key_dictionary[key].append(None)
        else:
            key_dictionary[key].append(spec[key])

tech = pd.DataFrame.from_dict(key_dictionary)
```
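The tech-spec columns can then be joined back onto the rest of the product fields with pd.concat; a sketch with hypothetical values for two products:

```python
import pandas as pd

# hypothetical per-product lists, as collected in the loop above
base = pd.DataFrame({
    "product_name": ["Dress A", "Skirt B"],
    "price": ["$99.95", "$59.95"],
})
tech = pd.DataFrame({"Material": ["cotton", None], "Fit": [None, "regular"]})

# column-wise concatenation: rows are aligned by position (shared RangeIndex)
df = pd.concat([base, tech], axis=1)
```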


Now that we’ve combined all the data into one data frame, in the next step, I did some exploratory data analysis on ScoreFast to better understand the data.

In the next blog, I will explain how I built the product recommendation model using this dataset.

Thanks for reading, and we hope everyone is staying safe and healthy. We are all hoping we can get back to normal soon. In the meantime, please check out our other blogs and stay tuned for more!
