My odyssey, finding the most popular Python function


This post was originally published at Towards Data Science

We all love Python, but how often do we actually use each piece of its mighty functionality? This is an article about my quest to find out.

The most mentioned Python functions inside Python repositories, calculated via GitHub commits. Image by Author

The other day, while I was running some zip() with some lists through a map(), I couldn’t help noticing how much my Python style has changed over the years.

We have all asked ourselves this question before: what do other people do with this beautiful language? What functions do they use?

As a data scientist, I aimed at something slightly more measurable. What is the most mentioned Python functionality in GitHub commits?

In the following article, I will

  1. Discuss the limitations of such a question and in how many ways I failed to find the answer
  2. Show how I collected the data from GitHub
  3. And most importantly, teach you how to lure Medium readers to your article with cool racing bars

Initially, I started this project to figure out how often Python functions are called. I quickly noticed that on GitHub, you can look this up in no time. Use the search function!

Number of print() results on GitHub, Image by Author

Problem Solved!

Well not quite…

The issue is that these results are volatile: running this search several times can return almost any number of results! For example, when calling it again,

Number of print() results on GitHub when calling it again, Image by Author.

We get a very different result…

GitHub API

GitHub has a fantastic search API!

Problem Solved!

Well not quite…

The issue here is that they only offer roughly the first 34k results for code search. After trying for quite some time to get something useful out of it, I had to accept that GitHub won’t allow me to do it this way. Sadly, our question can’t be answered the easy way.

GitHub Search function via Commits

After quite some time, I discovered that one can search commits in the Python language, filtered by date!

Problem Solved!

Well not quite…

While this way of searching seems quite reliable, it produces a lot of false positives. For example, it will match commits to repositories that only contain a little Python, as long as the commit mentions the word or function in some sense.

While this is not ideal, I decided to take this route since it allowed for a comparison over time. I also tried every other way I could think of; if you find a better way, please let me know in the comments. Generally, this data has to be taken with a lot of skepticism, but I hope it teaches us some valuable lessons. Most certainly, it makes for a killer plot 😉

We have our approximation of how to find the answer. Now, all we have to do is call the GitHub API!

Problem Solved!

Well not quite…

The issue seems to be that this API is meant more for actual searches inside your own repositories. GitHub appears to have a hard limit on the number of results they return: they seem to search for X seconds, then stop and return whatever they have so far. This makes a lot of sense, since dealing with such vast amounts of data is very expensive. Sadly, it also makes our journey to an answer much harder.

Since we refuse to give up, we decide to call their website and parse the answer from the returned HTML! While this is neither elegant nor simple, we ain’t no quitters.

Let’s build our link. An example link might look like

https://github.com/search?q={function}%28+language%3A{Language}+type%3Acommits+committer-date%3A%3C{before_year}-01-01&type=commits

Example link, Image by Author

As we can see, we are looking for basically three things:

function: What function do we want to know about? e.g. len()
language: What programming language? e.g. Python
before_year: Before what year? e.g. 2000

When we feed these parameters to GitHub, it tells us how many commits mention the function before that date!

After calling this link, GitHub returns an HTML file that we can filter to get our answer. The code for doing this looks like:

import urllib.request

search_term = 'print'  # e.g. the function we want to count
language = 'Python'
before_year = 2000

# create the url using a search term, a language, and a year
url_base = f"https://github.com/search?l=Python&q={search_term}%28+language%3A{language}+type%3Acommits+committer-date%3A<{before_year}-01-01&type=commits"
fp = urllib.request.urlopen(url_base)
byte_html = fp.read()
org_html = byte_html.decode("utf8")
fp.close()

To filter the resulting HTML, we can then, for example, use regex. We could also use BeautifulSoup or some other lovely HTML-parsing library, but using regex simplifies the readability of this article quite a bit. In this specific case, we only care about one number, which makes it faster to simply look for that single number.

import re
find_count = re.compile(r'([0-9,]+) (available|commit)')

The above regex ‘find_count’ finds strings like “44,363 commits”. Using the matching group (everything inside the “()”), we can then extract the number, “44,363”, from that string.
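To make the extraction step concrete, here is a minimal sketch. The `extract_count` helper and the sample HTML snippet are my own illustrations (not GitHub’s actual markup), but the regex is the one from above:

```python
import re

# regex from the article: a number (possibly with thousands separators)
# followed by "available" or "commit"
find_count = re.compile(r'([0-9,]+) (available|commit)')

def extract_count(html: str) -> int:
    """Pull the first commit count out of a search-results page."""
    match = find_count.search(html)
    if match is None:
        return 0
    # group(1) holds just the number, e.g. "44,363"; drop the commas
    return int(match.group(1).replace(',', ''))

# hypothetical snippet of the returned HTML
snippet = '<span class="Counter">44,363 commits</span>'
count = extract_count(snippet)  # → 44363
```

Returning 0 on no match lets the collection loop keep running even when GitHub serves a page without a count.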

The full code to do this quickly is:

As we can see, we iterate over all terms and years to collect one data point for each function per year. Then we parse the result from the HTML and store it. The rest of the process is there to ensure that we comply with GitHub’s rate limiting and do not get banned while accumulating our data!
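As a rough sketch of that loop (the function names, the pause length, and the parsing helper here are my own assumptions, not the exact code from the gist), it could look like:

```python
import re
import time
import urllib.request

FIND_COUNT = re.compile(r'([0-9,]+) (available|commit)')

def build_url(search_term: str, language: str, before_year: int) -> str:
    """Assemble the GitHub commit-search URL described above."""
    return (
        "https://github.com/search?l=Python"
        f"&q={search_term}%28+language%3A{language}"
        f"+type%3Acommits+committer-date%3A<{before_year}-01-01"
        "&type=commits"
    )

def collect(terms, years, language='Python', pause=30):
    """Query GitHub once per (term, year) pair, sleeping between
    requests to respect rate limits. Returns {(term, year): count}."""
    counts = {}
    for term in terms:
        for year in years:
            url = build_url(term, language, year)
            with urllib.request.urlopen(url) as fp:
                html = fp.read().decode('utf8')
            match = FIND_COUNT.search(html)
            counts[(term, year)] = (
                int(match.group(1).replace(',', '')) if match else 0
            )
            time.sleep(pause)  # be polite: avoid getting banned
    return counts
```

Calling something like `collect(['print', 'len'], range(2000, 2021))` would then yield one data point per function per year.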

GitHub does not seem to enjoy us calling their relatively expensive search all the time 😉 I ran this for 20 years and 20 functions, and it took over 80 minutes, which I found quite surprising.

Finally, we have collected the data we desired and can now show off with some cool plots!

We now have a data frame that looks roughly like this:

date,print(),len(),join()
2000-01-01,677545,44165,23534
2001-01-01,859815,66593,40032
2002-01-01,1091170,93604,59618
2003-01-01,1391283,117548,80327
2004-01-01,1755368,152962,125238
2005-01-01,2049569,185497,173200

For each year, it contains the number of commits mentioning each function. This data is especially nice to visualize.
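As a quick stdlib-only sanity check (the rows below are copied from the sample above): because each query uses committer-date:&lt;{year}, the counts are cumulative, so each column should grow monotonically from year to year.

```python
import csv
import io

# first rows of the collected data, copied from the sample above
sample = """date,print(),len(),join()
2000-01-01,677545,44165,23534
2001-01-01,859815,66593,40032
2002-01-01,1091170,93604,59618
"""

rows = list(csv.DictReader(io.StringIO(sample)))
print_counts = [int(r['print()']) for r in rows]

# cumulative counts should never decrease from one year to the next
assert print_counts == sorted(print_counts)
```

A column that ever decreased would be a strong hint that GitHub served us one of its volatile result counts.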

To visualize data over time, I think racing bars are the coolest. While they may not be the most informative ones, they look incredible!

What we need is a CSV that has, for each date, several categories. Once we have such a CSV, we can easily use the fantastic bar_chart_race library.

Note: The library does not seem to be entirely up to date when installed via pip; therefore, install it via GitHub:

python -m pip install git+https://github.com/dexplo/bar_chart_race

Now, all that’s left to do is pass our CSV to the function, creating a beautiful gif.

import pandas as pd
import bar_chart_race as bcr

def plot_search_term_data(file):
    """
    This function plots our df as a racing bar chart
    :param file: file name of the csv, expects a "date" column
    """
    df = pd.read_csv(file).set_index('date')
    bcr.bar_chart_race(
        df=df,
        filename=file.replace('.csv', '.gif'),
        orientation='h',
        sort='desc',
        n_bars=len(df.columns),
        fixed_order=False,
        fixed_max=True,
        steps_per_period=10,
        period_length=700,
        interpolate_period=False,
        period_label={'x': .98, 'y': .3, 'ha': 'right', 'va': 'center'},
        period_summary_func=lambda v, r: {'x': .98, 'y': .17,
                                          's': f'Calls: {v.sum():,.0f}',
                                          'ha': 'right', 'size': 11},
        perpendicular_bar_func='median',
        title='Most Mentioned Python Functionality Over Time',
        bar_size=.95,
        shared_fontdict=None,
        scale='linear',
        fig=None,
        writer=None,
        bar_kwargs={'alpha': .7},
        filter_column_colors=False)

The most mentioned Python functions inside Python repositories, calculated via GitHub commits. Image by Author

We have seen how we can gather data directly from HTML using regex instead of the usual bs4. While this approach should not be used for larger projects, it is perfectly fine for simple quests such as this one. We have also seen that the most prominent data source may not always work.

Finally, we discovered a lovely new library and a way to create beautiful racing bars that will capture your viewers’ interest!


