This post was originally published at Towards Data Science
The 2020 US elections are in full flow, and it’s been a critical year for the United States, with the global pandemic hitting the nation so hard. Presidential elections happen in the US every four years and are a real big deal in deciding the fate of the country. I have been working on the Illuminating project as a Research Assistant for the two years of my master’s degree in Data Science. The project originated back in 2014 during the gubernatorial elections in the states, and over the years it has grown into a social media election analytics powerhouse. It helps journalists keep track of the campaigning strategies of the candidates. The masterminds behind this project are Jennifer Stromer-Galley, Jeff Hemsley, and the whole Illuminating team, which comprises social scientists, behavioral scientists, linguistic scientists, data engineers, and machine learning engineers.
What is Illuminating?
Illuminating is a computational journalism project that empowers journalists covering US political campaigns. Its goal is to help journalists by providing a usable yet comprehensive summary of the content and character of campaign communication online that goes beyond counting likes or retweets. Illuminating provides an interactive database that allows for easy and quick tracking of what candidates are saying on social media through their free campaign accounts on Facebook and Twitter and their paid ads on Facebook and Instagram.
Tech that used to power Illuminating (Old)
As this project has been around for a while, the tech that used to power Illuminating was pretty straightforward and solid. We fueled our Twitter and Facebook collections with in-house open-source tools: scripts that hit the APIs and collect data, plus a few other scripts that tag the data using an SVM (support vector machine) to classify content into various categories and push it into our previous MySQL databases. All these processes ran on our servers (believe me, we have many servers for the collections to make sure we don’t miss anything). Sounds pretty simple! But it wasn’t, as we had to make sure all our collectors were running 24/7 and that we didn’t get any anomalous data.
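To give a feel for that old tagging step, here is a minimal sketch of a TF-IDF + linear SVM text classifier as scikit-learn would express it. The toy texts and the category names ("advocacy", "attack") are illustrative, not the project’s actual codebook or training data.

```python
# Sketch of the old pipeline's tagging step: TF-IDF features + linear SVM.
# Toy data and labels are illustrative, not the project's real codebook.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "Join us and donate to support our campaign today",
    "Proud to fight for affordable healthcare for every family",
    "My opponent voted against working families again",
    "Their record on jobs is a complete failure",
]
labels = ["advocacy", "advocacy", "attack", "attack"]

# One pipeline object handles vectorization and classification together.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

prediction = clf.predict(["Donate now to keep the fight going"])[0]
```

In the real system, scripts like this ran continuously on our servers and wrote the predicted categories into MySQL.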
New Tech that powers the Next-Gen Illuminating 2020
We have been working on making this infrastructure resilient and flawless, all with open-source technologies, and there was a pretty straightforward option: Apache Airflow. Airflow is used extensively in the industry for creating data pipelines and machine learning pipelines. Thanks, Airbnb!
We have Airflow deployed on our servers, hosting data pipelines and machine learning pipelines written entirely in Python (I love Python❤). It’s fully automated and amazing. I was introduced to Airflow during my internship at ViacomCBS Digital in the summer of 2019, and I instantly fell in love with it. It has a steep learning curve, but it is completely worth it.
SVM was not bad at natural language classification tasks. Still, we had to level up our game and try some of the most sought-after algorithms in the linguistic domain for the classification task. There was a clear winner at the time: BERT. We used a pre-trained BERT (Bidirectional Encoder Representations from Transformers) Base model (12 layers, 768 hidden units, 12 attention heads, 110M parameters), which is trained on English Wikipedia and a large corpus of books. Thanks, Google! We fine-tune BERT’s last layer for each of our classification tasks.
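Fine-tuning only the last layer amounts to freezing the encoder’s parameters and training just the classification head. A minimal PyTorch sketch of that freezing step, assuming the head is a submodule named `classifier` (as it is in Hugging Face’s `BertForSequenceClassification`); the commented usage lines are illustrative, not the project’s exact code:

```python
import torch.nn as nn

def freeze_all_but_head(model: nn.Module, head_prefix: str = "classifier") -> nn.Module:
    """Disable gradients for every parameter except the classification head,
    so only the head updates during fine-tuning."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
    return model

# Illustrative usage with Hugging Face transformers (commented out because it
# downloads the 110M-parameter BERT Base weights):
# from transformers import BertForSequenceClassification
# model = BertForSequenceClassification.from_pretrained(
#     "bert-base-uncased", num_labels=3)
# freeze_all_but_head(model)  # only model.classifier trains now
```

The same helper works on any `nn.Module` whose parameter names distinguish the head from the encoder.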
We switched from MySQL, a SQL database, to MongoDB, a NoSQL database. Why MongoDB? It handles extensive unstructured data well, and it is noticeably faster for our workloads, largely because it lets users query in a manner that is more sensitive to the workload. We have millions, maybe billions, of records on our infrastructure spanning all elections since 2014, including data from social media platforms like Twitter, Facebook, and Instagram.
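To illustrate that flexible, document-oriented querying, here is a sketch using plain dicts. The field names are hypothetical, not our actual schema; with pymongo the same `query` dict would be passed straight to `db.ads.find(query)`, and the tiny matcher below just mimics equality and `$gte` clauses to show the query shape.

```python
# A hypothetical ad document and a Mongo-style query, shown with plain dicts.
ad = {
    "platform": "facebook",
    "candidate": "example_candidate",
    "spend": {"lower_bound": 100, "upper_bound": 199},
}
query = {"platform": "facebook", "spend.lower_bound": {"$gte": 100}}

def get_path(doc, dotted_path):
    """Walk a dotted path like 'spend.lower_bound' into nested dicts."""
    for key in dotted_path.split("."):
        doc = doc[key]
    return doc

def matches(doc, query):
    """Evaluate equality and $gte clauses, MongoDB-style (toy matcher)."""
    for path, cond in query.items():
        value = get_path(doc, path)
        if isinstance(cond, dict):
            if "$gte" in cond and not value >= cond["$gte"]:
                return False
        elif value != cond:
            return False
    return True
```

Because documents are just nested structures, ads with different metadata shapes can live in the same collection without schema migrations.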
Current Architecture and Data Collection
Illuminating 2020 currently focuses on the presidential candidates’ political advertisements, categorizing them with our previously developed codebook. We use Apache Airflow to fuel our data pipelines, which collect streaming ads data from the Facebook Ad Library API. The database gets updated every 4 hours with new metrics and new posts, which are tagged by our machine learning models. This data contains ads from Facebook and Instagram for all presidential candidates with valid Facebook and Instagram accounts. We pull ads and their metadata from the main candidate pages as well as the ads the Trump and Biden campaigns purchased on other affiliated pages. We do not pull ads for other entities advertising on the candidates’ behalf, such as political action committees. We only pull data for candidates who ran long enough to be included in the debates. The Facebook Ad Library API provides spending and impressions data for each ad in ranges, each with a minimum and a maximum amount. We automated almost everything, from collection to machine learning to sampling the data, with Apache Airflow.
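Because the API reports spend and impressions only as ranges, a common trick is to collapse each range to a midpoint before aggregating. A small sketch, assuming the `lower_bound`/`upper_bound` field shape the public Ad Library API uses (the midpoint choice is our own convention, not an API feature):

```python
def range_midpoint(bound: dict) -> float:
    """Collapse an Ad Library-style range, e.g.
    {"lower_bound": "100", "upper_bound": "199"}, into one number for
    aggregation. Bounds may arrive as strings, and upper_bound can be
    absent on open-ended ranges, so we fall back to the lower bound."""
    lo = float(bound["lower_bound"])
    hi = float(bound.get("upper_bound", bound["lower_bound"]))
    return (lo + hi) / 2

spend_estimate = range_midpoint({"lower_bound": "100", "upper_bound": "199"})
```

Summing these midpoints across ads gives a rough spend estimate per candidate while keeping the min/max bounds available for error bars.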
In the Airflow world, pipelines are called DAGs (Directed Acyclic Graphs). Each pipeline consists of multiple tasks that collect data, tag it on the go with our various machine learning models for categories, civility, and topics, and push everything into our MongoDB servers.
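That collect → tag → store chain can be sketched as plain Python callables. The function bodies here are placeholders (the real tasks hit the Ad Library API, run the BERT models, and write to MongoDB), and the commented lines show how such callables would be wired into an Airflow DAG on the 4-hour schedule:

```python
def collect():
    """Placeholder for the task that pulls new ads from the Ad Library API."""
    return [{"id": "ad-1", "text": "example ad text"}]

def tag(ads):
    """Placeholder for the tasks running the category/civility/topic models."""
    for ad in ads:
        ad["category"] = "example_label"
    return ads

def store(ads):
    """Placeholder for the task that upserts tagged ads into MongoDB."""
    return len(ads)

stored = store(tag(collect()))

# Wired as an Airflow DAG (illustrative; requires apache-airflow):
# with DAG("ads_pipeline", schedule_interval="0 */4 * * *") as dag:
#     collect_task >> tag_task >> store_task
```

The `>>` operator is how Airflow expresses task ordering inside a DAG, which is what makes each pipeline an explicit directed acyclic graph.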
We maintain every kind of metadata in our database, and that metadata is what automates everything in Airflow. When we add a drop date to a candidate’s info collection, the Airflow pipeline automatically stops collecting for that candidate; the pipelines are programmed smartly.
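The drop-date behavior boils down to a guard the pipeline evaluates before collecting for each candidate. A minimal sketch, with a hypothetical `drop_date` field name standing in for whatever the metadata collection actually stores:

```python
from datetime import date

def should_collect(candidate_info: dict, today: date) -> bool:
    """Collect only while the candidate has no drop date, or the drop
    date is still in the future. The `drop_date` field name is
    illustrative of the metadata the pipelines read."""
    drop = candidate_info.get("drop_date")
    return drop is None or today < drop
```

A task at the head of each DAG can run this check against the candidate metadata and skip the rest of the pipeline when it returns False.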
After pushing the data to MongoDB, we fuel our Illuminating web app from there.
Illuminating 2020 is a one-stop shop for all things ad analytics for the 2020 presidential campaign and currently focuses on Biden’s and Trump’s ad strategies. It is an absolute powerhouse! You can get more information about the type of classification we do on ads here.
I was a master’s student at the School of Information Studies at Syracuse University and worked on the data engineering and machine learning infrastructure for this great academic research project.