Gather your Data: The “Not-So-Spooky” APIs!

This post was originally published by Divya Naidu at Towards Data Science

When Python plays with internet files.

A data analytics cycle starts with gathering and extraction. I hope my previous blog gave you an idea of how data from common file formats is gathered using Python. In this blog, I’ll focus on extracting data from files that are less common but have the most real-world applications.

Whether you are a data professional or not, you have probably come across the term API by now. Yet most people have a rather ambiguous or incorrect idea of what this fairly common term means.

Extracting data using APIs in Python

Photo by Yogas Design on Unsplash

API is the acronym for Application Programming Interface: a software intermediary (a middleman) that allows two applications to talk to each other.

Each time you use an app like Tinder, send a WhatsApp message, or check the weather on your phone, you’re using an API. APIs allow us to share important data and expose practical business functionality between devices, applications, and individuals. And although we may not notice them, APIs are everywhere, powering our lives from behind the scenes.

We can illustrate an API with a bank’s ATM (Automated Teller Machine). Banks make their ATMs accessible for us to check our balance, make deposits, or withdraw cash. Here the ATM is the middleman, helping both the bank and us, the customers.

Similarly, web applications use APIs to connect user-facing front ends with all-important back-end functionality and data. Streaming services like Spotify and Netflix use APIs to distribute content. Automotive companies like Tesla send software updates via APIs. For further instances, you can check out the article 5 Examples of APIs We Use in Our Everyday Lives.

In data analysis, APIs are most commonly used to retrieve data, and that will be the focus of this blog.

When we want to retrieve data from an API, we need to make a request. Requests are used all over the web. For instance, when you visited this blog post, your web browser made a request to the Towards Data Science web server, which responded with the content of this web page.

API requests work in the same way: you make a request to an API server for data, and it responds to your request. There are primarily two ways to use APIs:

  1. Through the command terminal using URL endpoints, or
  2. Through programming language-specific wrappers.

For example, Tweepy is a well-known Python wrapper for the Twitter API, whereas twurl is a command-line interface (CLI) tool; both can achieve the same outcomes.

Here we focus on the latter approach, using a Python library (a wrapper) called wptools, which is built around the original MediaWiki API. The MediaWiki action API is Wikipedia’s API, giving access to wiki features like authentication, page operations, and search.

Photo by Luke Chesser on Unsplash

wptools provides read-only access to the MediaWiki APIs. You can get info about wiki sites, categories, and pages from any Wikimedia project in any language via the good old MediaWiki API. You can extract unstructured data from page infoboxes, get structured, linked open data about a page via the Wikidata API, and get page contents from the high-performance RESTBase API.

In the code below, I have used Python’s wptools library to access the Mahatma Gandhi Wikipedia page and extract an image file from that page. For the Wikipedia URL https://en.wikipedia.org/wiki/Mahatma_Gandhi we only need to pass the last bit of the URL.
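
Here is a minimal sketch of that code (the variable names are mine, and the exact keys in the returned image data can vary by page):

import wptools

# Pass just the last bit of the Wikipedia URL
wiki_page = wptools.page('Mahatma_Gandhi')

# get() fetches extracts, images, infobox data, wikidata, etc.
wiki_page.get()

# data is a dict of everything retrieved; 'image' holds a list of image dicts
first_image = wiki_page.data['image'][0]
print(first_image)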

The get() function fetches everything, including extracts, images, infobox data, and wikidata, present on that page. Through the .data attribute we can then extract all the required information. The response we get for our request to the API is most likely to be in JSON format.

Reading data from JSON in Python

Photo by Christopher Gower on Unsplash

JSON is an acronym for JavaScript Object Notation, a lightweight data-interchange format. It is as easy for humans to read and write as it is for machines to parse and generate, and it has quickly become the de facto standard for information exchange.

When exchanging data between a browser and a server, the data can only be text. JSON is text, and we can convert any JavaScript object into JSON and send that JSON to the server.

For example, you can access GitHub’s API directly from your browser, without even needing an access token. Here’s the JSON response you get when you visit a GitHub user’s API route in your browser, https://api.github.com/users/divyanitin:

{
"login": "divyanitin",
"url": "https://api.github.com/users/divyanitin",
"html_url": "https://github.com/divyanitin",
"gists_url": "https://api.github.com/users/divyanitin/gists{/gist_id}",
"type": "User",
"name": "DivyaNitin",
"location": "United States",
}

The browser seems to have done just fine displaying a JSON response. A JSON response like this is ready for use in your code: it’s easy to extract data from this text, and then you can do whatever you want with it.

Python supports JSON natively: it ships with the built-in json package for encoding and decoding JSON data. JSON objects store data within {}, much like a dictionary in Python, and JSON arrays are translated into Python lists.

In my code above, wiki_page.data['image'][0] accesses the first image in the image attribute, i.e. a JSON array. With Python’s json module you can read JSON files just like simple text files.

The read function json.load() returns a Python dictionary, which can easily be converted into a Pandas dataframe using the pandas.DataFrame() function. You can even load a JSON file directly into a dataframe using the pandas.read_json() function.
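
For instance, a minimal sketch (the file name data.json is a placeholder, assumed to contain a list of records):

import json
import pandas as pd

# json.load() parses a JSON file into Python dicts and lists
with open('data.json') as f:
    data = json.load(f)

# Convert the parsed records into a DataFrame
df = pd.DataFrame(data)

# Or read the file directly into a DataFrame in one step
df = pd.read_json('data.json')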

Reading files from the Internet (HTTPS)

Photo by Edho Pratama on Unsplash

HTTPS stands for HyperText Transfer Protocol Secure. It is the language that web browsers and web servers use to speak to each other. A web browser may be the client, and an application on a computer that hosts a website may be the server.

We constantly write code that works with remote APIs: your maps app fetches the locations of nearby Indian restaurants, or the OneDrive app syncs your cloud storage. All this happens just by making HTTPS requests.

Requests is a versatile HTTP library for Python with various applications. HTTP works as a request-response protocol between a client and a server, and Requests provides methods for accessing web resources via that protocol. One of its applications is downloading or opening a file from the web using the file’s URL.

To make a GET request, we’ll use the requests.get() function, which requires one argument: the URL we want to request.

In the script below, the open() method is used to write binary data to a local file. We create a folder and save the extracted web data on the system using Python’s os library.
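
A sketch of such a script (the folder name and file URL here are placeholders):

import os
import requests

folder = 'web_data'                 # hypothetical folder name
os.makedirs(folder, exist_ok=True)  # create the folder if it doesn't exist

url = 'https://example.com/image.jpg'  # placeholder file URL
response = requests.get(url)           # make the GET request
response.raise_for_status()            # stop if the request failed

# open the local file in 'wb' mode to write the binary response content
with open(os.path.join(folder, 'image.jpg'), 'wb') as f:
    f.write(response.content)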

The json and requests import statements load Python code that allows us to work with the JSON data format and the HTTPS protocol. We use these libraries because we’re not interested in the details of how to send HTTPS requests or how to parse and create valid JSON; we just want to use them to accomplish these tasks.

A popular web architecture style called REST (Representational State Transfer) allows users to interact with web services via GET and POST calls (the two most commonly used).

GET is generally used to get information about some object or record that already exists. In contrast, POST is typically used when you want to create something.
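
As a sketch, the GET call below uses the GitHub endpoint shown earlier; the POST endpoint and payload are made up for illustration:

import requests

# GET: retrieve an existing record
response = requests.get('https://api.github.com/users/divyanitin')
print(response.json()['name'])

# POST: create something new (hypothetical endpoint and payload)
response = requests.post('https://example.com/api/items', json={'name': 'new item'})
print(response.status_code)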

REST is essentially a set of useful conventions for structuring a web API. By “web API,” I mean an API that you interact with over HTTP, making requests to specific URLs, and often getting relevant data back in the response.

For example, Twitter’s REST API allows developers to access core Twitter data, and its Search API provides methods for developers to interact with Twitter Search and trends data.

This blog focused on the so-called internet files: I introduced extracting data using simple APIs. Most APIs require authentication, just as in my earlier analogy the ATM requires us to enter a PIN to authenticate our access to the bank.

You can check out my other article, Twitter Analytics: “WeRateDogs”, which focused on data wrangling and analysis using the Twitter API. I used all the above-mentioned scripts in that project; you can find the code on my GitHub.

As we all know, one of the most common internet file formats is HTML. Extracting data from the internet by accessing a website’s HTML directly is commonly known as web scraping. I will cover it, from the “basics” to the “must-knows”, in my next blog.
