Data Scientists, start using Profilers


This post was originally published by Austin Poor at Towards Data Science

Find the parts of your algorithm that are ACTUALLY slowing you down

Data scientists often need to write a lot of complex, slow, CPU- and I/O-heavy code — whether you’re working with large matrices, millions of rows of data, reading in data files, or web-scraping.

Wouldn’t you hate to waste your time refactoring one section of your code, trying to wring out every last ounce of performance, when a few simple changes to another section could speed up your code tenfold?

If you’re looking for a way to speed up your code, a profiler can show you exactly which parts are taking the most time, allowing you to see which sections would benefit most from optimization.

A profiler measures the time or space complexity of a program. There’s certainly value in theorizing about the big O complexity of an algorithm but it can be equally valuable to examine the real complexity of an algorithm.

Where is the biggest slowdown in your code? Is your code I/O bound or CPU bound? Which specific lines are causing the slowdowns?

Once you’ve answered those questions you’ll A) have a better understanding of your code and B) know where to target your optimization efforts in order to get the biggest boon with the least effort.

Let’s dive into some quick examples using Python.

You might already be familiar with a few methods of timing your code. You could check the time before and after a line executes like this:

In [1]: start_time = time.time()
...: a_function() # Function you want to measure
...: end_time = time.time()
...: time_to_complete = end_time - start_time
...: time_to_complete
Out[1]: 1.0110783576965332

Or, if you’re in a Jupyter Notebook, you could use the magic %time command to time the execution of a statement, like this:

In [2]: %time a_function()
CPU times: user 14.2 ms, sys: 41 µs, total: 14.2 ms
Wall time: 1.01 s

Or, you could use the other magic command %timeit which gets a more accurate measurement by running the command multiple times, like this:

In [3]: %timeit a_function()
1.01 s ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Alternatively, if you want to time your whole script, you can use the bash command time, like so…

$ time python

real 0m1.041s
user 0m0.040s
sys 0m0.000s

These techniques are great if you want to get a quick sense of how long a script or a section of code takes to run but they are less useful when you want a more comprehensive picture. It would be a nightmare if you had to wrap each line in time.time() checks. In the next section, we’ll look at how to use Python’s built-in profiler.

When you’re trying to get a better understanding of how your code is running, the first place to start is cProfile, Python’s built-in profiler. cProfile will keep track of how often and for how long parts of your program were executed.

Just keep in mind that cProfile shouldn’t be used to benchmark your code. It’s written in C which makes it fast but it still introduces some overhead that could throw off your times.

There are multiple ways to use cProfile but one simple way is from the command line.

Before we demo cProfile, let’s start by looking at a basic sample program that will download some text files, count the words in each one, and then save the top 10 words from each to a file. Now that being said, it isn’t too important what the code does, just that we’ll be using it to show how the profiler works.

Demo code to test our profiler

Now, with the following command, we’ll profile our script.

$ python -m cProfile -o profile.stat

The -o flag specifies an output file for cProfile to save the profiling statistics.

Next, we can fire up python to examine the results using the pstats module (also part of the standard library).

In [1]: import pstats
...: p = pstats.Stats("profile.stat")
...: p.sort_stats(
...:   "cumulative"   # sort by cumulative time spent
...: ).print_stats(
...:   ""    # only show fn calls in
...: )Fri Aug 07 08:12:06 2020    profile.stat46338 function calls (45576 primitive calls) in 6.548 secondsOrdered by: cumulative time
List reduced from 793 to 6 due to restriction <''>ncalls tottime percall cumtime percall filename:lineno(function)
1   0.008   0.008   5.521   5.521<module>)
1   0.012   0.012   5.468   5.468
5   0.000   0.000   4.848   0.970
5   0.000   0.000   0.460   0.092
5   0.000   0.000   0.112   0.022
1   0.000   0.000   0.000   0.000

Wow! Look at all that useful info!

For each function called, we’re seeing the following information:

  • ncalls: number of times the function was called
  • tottime: total time spent in the given function (excluding calls to sub-functions)
  • percall: tottime divided by ncalls
  • cumtime: total time spent in this function and all sub-functions
  • percall: (again) cumtime divided by ncalls
  • filename:lineo(function): the file name, line number, and function name

When reading this output, note the fact that we’re hiding a lot of data —in fact, we’re only seeing 6 out of 793 rows. Those hidden rows are all the sub-functions being called from within functions like urllib.request.urlopen or re.split. Also, note that the <module> row corresponds to the code in that isn’t inside a function.

Now let’s look back at the results, sorted by cumulative duration.

ncalls tottime percall cumtime percall filename:lineno(function)
1   0.008   0.008   5.521   5.521<module>)
1   0.012   0.012   5.468   5.468
5   0.000   0.000   4.848   0.970
5   0.000   0.000   0.460   0.092
5   0.000   0.000   0.112   0.022
1   0.000   0.000   0.000   0.000

Keep in mind the hierarchy of function calls. The top-level, <module>, calls read_books and save_results. read_books calls get_book, split_words, and count_words. By comparing cumulative times, we see that most of <module>’s time is spent in read_books and most of read_books’s time is spent in get_book, where we make our HTTP request, making this script (unsurprisingly) I/O bound.

Next, let’s take a look at how we can be even more granular by profiling our code line-by-line.

Once we’ve used cProfile to get a sense of what function calls are taking the most time, we can examine those functions line-by-line to get an even clearer picture of where our time is being spent.

For this, we’ll need to install the line-profiler library with the following command:

$ pip install line-profiler

Once installed, we just need to add the @profile decorator to the function we want to profile. Here’s the updated snippet from our script:

Note the fact that we don’t need to import the profile decorator function — it will be injected by line-profiler.

Now, to profile our function, we can run the following:

$ kernprof -l -v

kernprof is installed along with line-profiler. The -l flag tells line-profiler to go line-by-line and the -v flag tells it to print the result to the terminal rather than save it to a file.

The result for our script would look something like this:

The key column to focus on here is % Time. As you can see, 89.5% of our time parsing each book is spent in the get_book function — making the HTTP request — further validation that our program is I/O bound rather than CPU bound.

Now, with this new info in mind, if we wanted to speed up our code we wouldn’t want to waste our time trying to make our word counter more efficient. It only takes a fraction of the time compared to the HTTP request. Instead, we’d focus on speeding up our requests — possibly by making them asynchronously.

Here, the results are hardly surprising, but on a larger and more complicated program, line-profiler is an invaluable tool in our programming tool belt, allowing us to peer under the hood of our program and find the computational bottlenecks.

In addition to profiling the time-complexity of our program, we can also profile its memory-complexity.

In order to do line-by-line memory profiling, we’ll need to install the memory-profiler library which also uses the same @profile decorator to determine which function to profile.

$ pip install memory-profiler$ python -m memory_profiler

The result of running memory-profiler on our same script should look something like the following:

There are currently some issues with the accuracy of the “Increment” so just focus on the “Mem usage” column for now.

Our script had peak memory usage on line 28 when we split the books up into words.

Hopefully, now you’ll have a few new tools in your programming tool belt to help you write more efficient code and quickly determine how to best spend your optimization-time.

You can read more about cProfile here, line-profiler here, and memory-profiler here. I also highly recommend the book High Performance Python, by Micha Gorelick and Ian Ozsvald [1].

Thanks for reading! I’d love to hear your thoughts on profilers or data science or anything else. Comment below or reach out on LinkedIn or Twitter!

Spread the word

This post was originally published by Austin Poor at Towards Data Science

Related posts