Human Rights First: A Data Science Approach


This post was originally published by Daniel Benson at Towards Data Science

The Journey Is As Much The Process And Struggle As It Is The End Result

The Data Science side of the project involved a number of time-consuming tasks. We began by exploring the notebooks we inherited from the previous team, recreating and modifying their work to build a thorough understanding. We asked questions such as, “How did they go about cleaning the data?”, “What features did they feel were important?”, “Why these features?”, “What parameters did they use to create their model?”, and “How accurate is their model?”. Using these questions as a roadmap for our exploration, we created new Google Colab notebooks and recreated the inherited notebooks one by one, putting together tests and making modifications as needed to ensure our thorough understanding. This process included using PRAW, the Python Reddit API Wrapper, to pull news articles and Reddit posts from “news” subreddits, along with pre-collected data from Reddit, Twitter, and various news and internet sources, and then cleaning the data and performing feature engineering as needed.

Below is the code we used to access the Reddit API and pull the 1,000 hottest submissions from the “news” subreddit; these were appended to a list called “data” and used to create a new dataframe:

# Grabbing 1000 hottest posts on Reddit
import praw
import pandas as pd

# Assumes `reddit` is an authenticated praw.Reddit instance (see below)
data = []

# Grab the data from the "news" subreddit
for submission in reddit.subreddit("news").hot(limit=1000):
    data.append([submission.id, submission.title, submission.score,
                 submission.subreddit, submission.url, submission.num_comments,
                 submission.selftext, submission.created])

# Create and assign column names
col_names = ['id', 'title', 'score', 'subreddit', 'url',
             'num_comments', 'text', 'created']
df_reddit = pd.DataFrame(data, columns=col_names)
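
The snippet above assumes an authenticated PRAW client already exists. For readers who want to reproduce this step, a minimal sketch of how that `reddit` instance might be created is shown here; the credential values and user agent string are placeholders, not the project’s actual keys:

# Hypothetical PRAW client setup -- all credential values below are placeholders
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
    user_agent="hrf-data-science-app",   # placeholder user agent string
)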


Next, we decided to recycle the previous team’s data collection, cleaning, and feature engineering while modifying their Natural Language Processing model to include a number of tags the previous team had left out. We followed this up by putting together our baseline predictive model using TfidfVectorizer and a RandomForestClassifier, with a RandomizedSearchCV for early parameter tuning. Using these methods we were able to create a CSV file we felt comfortable sending over to the web team for use in their baseline choropleth map. The code used to build our model can be found in the embedding below.

# Build model pipeline using RFC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', RandomForestClassifier(random_state=42, n_jobs=-1,
                                          max_depth=5, n_estimators=45)),
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

param_distributions = {
    'classifier__max_depth': [1, 2, 3, 4, 5]}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring='accuracy',
    verbose=10,
    return_train_score=True,
    n_jobs=-1
)

search.fit(X_train, y_train);

>> Best hyperparameters {'classifier__max_depth': 5}
>> Best Score 0.9075471698113208
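
For context, the predictions from the tuned pipeline were then written out as the CSV handed to the web team. A minimal sketch of what that export step could look like follows; the `df_final` dataframe, the column names, and the filename are assumptions for illustration, not the project’s exact schema:

# Hypothetical export step -- `df_final`, its columns, and the filename are assumptions
best_pipeline = search.best_estimator_                  # pipeline refit with the best max_depth
df_final['predicted_tag'] = best_pipeline.predict(df_final['text'])
df_final.to_csv('baseline_predictions.csv', index=False)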

On top of my contributions in the exploration, cleaning, and modeling phases, I took the lead in working with the Data Science API. Our project used FastAPI to create the Data Science app and Docker to hold an image of the app for deployment to AWS Elastic Beanstalk. Within my local environment I included the previously mentioned CSV file along with a file containing data cleaning and feature engineering methods put together by myself, a fellow team member, and the previous Data Science team. Using this I was able to create two new CSV files, one containing the raw final data and the other containing the final data cleaned and pre-processed for jsonification. This data was converted to a JSON object before being added to a GET endpoint for access by the web team’s back end. The router that was set up to achieve this task can be found in the following embedding:

from fastapi import APIRouter, HTTPException
import pandas as pd
import numpy as np
# from .update import backlog_path  # Use this when able to get the backlog.csv filled correctly
from ast import literal_eval
import os
import json

# Create router access
router = APIRouter()


@router.get('/getdata')
async def getdata():
    """
    Get jsonified dataset from all_sources_geoed.csv

    ### Response
    dataframe: JSON object
    """
    # Path to dataset used in our endpoint
    locs_path = os.path.join(os.path.dirname(__file__), '..', '..',
                             'all_sources_geoed.csv')
    df = pd.read_csv(locs_path)

    # Fix issue where "Unnamed: 0" is created when reading in the dataframe
    df = df.drop(columns="Unnamed: 0")

    # Remove the string type output from columns src and tags,
    # leaving them as arrays for easier use by the backend
    for i in range(len(df)):
        df['src'][i] = literal_eval(df['src'][i])
        df['tags'][i] = literal_eval(df['tags'][i])

    # Initial conversion to json - use records to jsonify by instances (rows)
    result = df.to_json(orient="records")

    # Parse the jsonified data, removing the escaped quotes ('\"') that make it
    # difficult for the backend to collect the data
    parsed = json.loads(result.replace('\\"', '"'))
    return parsed
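
For completeness, here is a minimal sketch of how a router like this might be wired into the FastAPI app that was containerized with Docker; the module path and the app title are assumptions for illustration, not the project’s actual layout:

# main.py -- hypothetical app wiring
from fastapi import FastAPI
from app.api import getdata  # assumed module path for the router file above

app = FastAPI(title="Human Rights First Data Science API")  # assumed title
app.include_router(getdata.router)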

One of the major challenges we faced came during the deployment stage of the project. I was able to get the data set up and deployed onto AWS Elastic Beanstalk, but several times there was a problem with the jsonification of the data that made it unusable for the web team. First, the data was returning with several out-of-place forward slashes (“/”) and backslashes (“\”). Second, some of the data features, specifically “src” and “tags”, were being returned as strings instead of arrays. The DS team sat down together in chat to research and brainstorm how to fix this issue. After a fair amount of trial and error in our deployment we found the preprocessing steps we needed to ensure the data being sent was formatted correctly. The embedded code for this process can be found below:

import os
import pandas as pd
import re

# Set up various things to be loaded outside of the function
# Geolocation data
locs_path = os.path.join(os.path.dirname(__file__),
                         'latest_incidents.csv')

# Read in the csv file into a dataframe
sources_df = pd.read_csv(locs_path)

# Remove instances occurring in which backslashes and newlines are
# being created together in the data
sources_df["desc"] = sources_df["desc"].replace("\\n", "  ")

# Remove the "Unnamed: 0" column created when reading in the csv
sources_df = sources_df.drop(columns=["Unnamed: 0"])

# Fix instances occurring in which new lines are being created in the data
for i in range(len(sources_df)):
    sources_df["desc"][i] = str(sources_df["desc"][i]).replace("\n", " ")

# Create csv file from dataframe
sources_df.to_csv("all_sources_geoed.csv")
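
As a quick sanity check after cleaning, something like the following could be run locally to confirm that the endpoint now returns real arrays rather than strings; the import path for `app` matches the hypothetical wiring sketched earlier and is an assumption:

# Hypothetical local check -- assumes the FastAPI app from the earlier sketch
from fastapi.testclient import TestClient
from main import app  # assumed module path

client = TestClient(app)
response = client.get("/getdata")
records = response.json()

# Each record's src and tags should now come back as lists, not strings
assert isinstance(records[0]["src"], list)
assert isinstance(records[0]["tags"], list)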