5 tips for pandas users

towards-data-science

This post was originally published by Zolzaya Luvsandorj at Towards Data Science

We will use seaborn’s dataset on tips to exemplify my tips. (Did you get it? … my pun… 😆):

# Import packages
import pandas as pd
import seaborn as sns# Import data 
df = sns.load_dataset('tips')
print(f"{df.shape[0]} rows and {df.shape[1]} columns")
df.head()

Details about this dataset including data dictionary can be found here (this source is actually for R, but it appears to be referring to the same underlying dataset). I have quoted their data description below for quick access:

“One waiter recorded information about each tip he received over a period of a few months working in one restaurant.”

📍 Tip #1: Filter with query()

Let’s start with my favourite tip! Say we wanted to filter the data to those who tipped more than $6 with $30+ total bill. One common way to accomplish this is to use:

df.loc[(df['tip']>6) & (df['total_bill']>=30)]

This does the job, but don’t you think it’s little too verbose: each condition requires reference to the dataframe and a parenthesis wrapping if there are multiple conditions. Now, let me show you how we could achieve the same outcome with more elegant code with query():

df.query("tip>6 & total_bill>=30")

You see how clean, simple and readable this looks? We are not repeatedly typing df or overloading with brackets and parentheses anymore. With less keystrokes, it’s also quicker to write code and code will be less prone to mistakes. A few more additional tips on query():

# reference global variable name with @
median_tip = df['tip'].median()
display(df.query("tip>@median_tip").head())
# wrap column name containing . with backtick: `
df.rename(columns={'total_bill':'total.bill'}, inplace=True)
display(df.query("`total.bill`<20").head())
df.rename(columns={'total.bill':'total_bill'}, inplace=True)

# wrap string condition with single quotes (this is what I like)
display(df.query("day=='Sat'").head())
# could also do it the other way around (i.e. 'day=="Sat"')

📍 Tip #2: Show multiple dataframes with display()

I have already given this away in the previous code, so you could probably guess what this one is about. Assume we wanted to inspect both head and tail of the df in one cell of Jupyter Notebook. If we run the following code, it will only show the tail:

df.head()
df.tail()

We can get around this with display():

display(df.head())
display(df.tail())

In the last line, display() is redundant but it is there for consistency. It works the same way if we take out display() from the last line:

display(df.head())
df.tail()

📍 Tip #3a: Use a list of booleans when sorting by multiple columns

I have two tips on sorting. The first one is for sorting multiple columns.

Have you ever had to sort your data with multiple columns in different directions? Here is an example of what I mean: Sort the data by total bill in ascending order and break ties with amount of tip in descending order.

Before I knew tip #3a, I would create an interim column to flip the scale of either total bill or tip to make all the relevant columns to have the same direction and sort afterwards (I have flipped tip in this example):

df['rev_tip'] = -df['tip']
df.sort_values(by=['total_bill', 'rev_tip'], ascending=True).head()

This is a workaround but not very elegant way to tackle the task. Let’s delete rev_tip with del df['rev_tip']. Instead, we could pass a list of booleans to indicate the order for each variable for sorting:

df.sort_values(by=[‘total_bill’, ‘tip’], ascending=[True, False]).head()

Not only do we not need to create an extra column, the last code also looks cleaner and more readable.

It’s also possible to use the numerical representation of booleans. That is, if we change to ascending =[1,0], it will also give us the same output.

📍 Tip #3b: Use nsmallest() or nlargest()

This second tip will come in handy if you ever had to quickly check out data extract for records that have the smallest or largest values in a particular column. Using nsmallest(), we could check out 5 records with the smallest total bill like this:

df.nsmallest(5, 'total_bill')

This is a short form for:

df.sort_values(by='total_bill').head()

Similarly, the outputs of these two lines are identical:

display(df.nlargest(5, 'total_bill'))
display(df.sort_values(by='total_bill', ascending=False).head())

📍 Tip #4. Customise describe()

Any pandas user is probably familiar with df.describe(). This shows summary stats for numerical columns. But we can get more than that by specifying its arguments.

Firstly, let’s check out the column types:

df.info()

In our dataframe, we have numerical and categorical columns. Let’s see summary stats for all columns by adding include='all':

df.describe(include='all')

This is cool but a little messy. Let’s show the summary stats by column types separately with the following script:

display(df.describe(include=['category'])) # categorical types
display(df.describe(include=['number'])) # numerical types

Do you like this better? If we had both strings and categorical columns and wished to display the summary stats for both in one table, we can use either: include=['category', 'object'] or exclude=['number']. If you are curious to learn more, check out the documentation.

📍 Tip #5: Update default display settings

This last tip is probably more well-known than the rest. Let’s see some examples of useful display settings to change.

Firstly, we can check out the current default limit for maximum number of columns and rows to be displayed with the code below:

print(f"{pd.options.display.max_columns} columns")
print(f"{pd.options.display.max_rows} rows")

current pandas version: 1.0.3

This means if we try to display a dataframe with more than 20 columns, we only get to see the first 10 and final 10 (total of 20 columns shown) while the rest will be truncated as three dots. The same logic applies to rows. Often, we may want to see more than these maximums. If we want to change this behaviour, we can do so like this:

pd.options.display.max_columns = None
pd.options.display.max_rows = None

Here, we are asking pandas to display every row and column without any limit. This may or may not be a good idea depending on how big your dataframe is. We can also set these options to a number of our choice:

pd.options.display.max_columns = 50
pd.options.display.max_rows = 100

Secondly, depending on the scale of the numerical variables you are working on, you may sometimes encounter scientific notations for very large or very small numbers when working with pandas. If you find it easier to read numbers as 1200 and 0.012 compared to 1.2e3 and 1.2e-2 respectively, you are likely to find this line of code handy:

pd.options.display.float_format = ‘{:.4f}’.format # 4 decimal places

This ensures that you will see real numbers instead of scientific notations.

If you are curious to learn more about other options to customise, check out the documentation.

Voila❕ These were my current top tips for a pandas user!

Spread the word

This post was originally published by Zolzaya Luvsandorj at Towards Data Science

Related posts