A quick way to reformat columns in a Pandas DataFrame


This post was originally published by Byron Dolon at Towards Data Science


Using df.melt to compress multiple columns into one.

It may be tempting to dive straight into analysis, but an important step before any of that is pre-processing.

Pandas offers a lot of built-in functionality that allows you to reformat a DataFrame just the way you need it. Most data sets require some form of reshaping before you can perform calculations or create visualizations.

In this piece, we’ll be looking at how you can use the df.melt function to combine the values of many columns into one.

This means you’re changing a DataFrame from a “wide” format to a “long” format. This is helpful when you have many columns that technically have the same type of values.

For example, say you owned a coffee shop and started with two types of drinks. To keep track of the ingredients for each drink, you made a table that looked like this:

This might work for now, but what if you wanted to include a drink with three ingredients? You’d have to create another column like “Ingredient 3”, but what if you wanted a drink with 4, 5 or even 10 ingredients? Making new columns every time you need to enter more data isn’t the most elegant of solutions.

Instead, it would be great if you had a table like this:

Now, your table is a lot more flexible. Your drinks can have as many or as few ingredients as you want, and you don’t need to alter the table structure for different kinds of data entries.
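To make the coffee shop idea concrete, here is a minimal sketch of the wide-to-long change. The drinks and ingredients are invented for illustration, and the column names "Drink", "Ingredient 1" and "Ingredient 2" are assumptions based on the description above.

```python
import pandas as pd

# Hypothetical "wide" table: one column per ingredient slot.
wide = pd.DataFrame(
    {
        "Drink": ["Latte", "Americano"],
        "Ingredient 1": ["Espresso", "Espresso"],
        "Ingredient 2": ["Steamed milk", "Hot water"],
    }
)

# Melt every ingredient column into a single "Ingredient" column.
# With var_name left out, the old column names land in a column called "variable".
long = wide.melt(id_vars=["Drink"], value_name="Ingredient")

# The old column names carry no information here, so drop them and tidy up.
long = (
    long.drop(columns="variable")
    .sort_values(["Drink", "Ingredient"])
    .reset_index(drop=True)
)

print(long)
```

Adding a drink with a third ingredient now just means adding one more row, not a new column.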

Let’s take a look at how to do this with df.melt on a more complicated example. We’ll be using a modified version of this video game sales data, so download the csv file if you want to follow along. I’ve loaded and pre-processed it for this exercise, so you can use the code below to get started.
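If you don’t have the file handy, the sketch below builds a small stand-in table in the same spirit. Everything in it is an assumption: the rows are invented, the column names are guesses at what the modified data set looks like, and the pre-processing (summing sales per platform) is only one plausible way to arrive at aggregated sales by region.

```python
import pandas as pd

# Stand-in for the downloaded csv file. These rows and column names are
# hypothetical; with the real file you would start from pd.read_csv instead.
raw = pd.DataFrame(
    {
        "Name": ["Game A", "Game B", "Game C", "Game D"],
        "Platform": ["Wii", "Wii", "PS4", "PS4"],
        "North America Sales": [10.0, 5.0, 6.0, 2.0],
        "Europe Sales": [8.0, 3.0, 7.0, 1.0],
        "Japan Sales": [3.0, 1.0, 0.5, 0.5],
        "Other Sales": [2.0, 1.0, 2.0, 0.5],
    }
)

sales_cols = ["North America Sales", "Europe Sales", "Japan Sales", "Other Sales"]

# Aggregate the regional sales so that each platform gets exactly one row.
df = raw.groupby("Platform", as_index=False)[sales_cols].sum()

# Add a total column across all regions.
df["Global Sales"] = df[sales_cols].sum(axis=1)

print(df)
```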

Side note: if you’re unfamiliar with the df.loc[] functionality, you can check out this piece I wrote on it and other vectorized Pandas solutions below.

You Don’t Always Have to Loop Through Rows in Pandas!

A look at alternatives to “for loops” with vectorized solutions.

You can see in the table above we have several aggregated sales values for different regions. While this is easy to read, it’s not the best table structure possible.

Our goal is to put all the sales values into one column, and all the sales regions into another column. Before we do that, let’s briefly introduce the df.melt function. The function has four key parameters:

  • id_vars -> the columns to use to identify each row (similar to an index column) — pass a list;
  • value_vars -> the columns in your table that you want to compress (or unpivot). You can leave this blank if you want all columns besides id_vars to be compressed — pass a list;
  • var_name -> the name for your new “category” column, of which the values are the column names you passed to “value_vars” — pass a scalar value;
  • value_name -> the name for your new “values” column — pass a scalar value.
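Here is a small sketch of how those four parameters behave, using a made-up two-row table (the "Region A" and "Region B" columns are placeholders, not columns from the real data). It also shows what you get when you leave the optional parameters out.

```python
import pandas as pd

# Tiny hypothetical table, just to show what each parameter does.
toy = pd.DataFrame(
    {
        "Platform": ["PS4", "Wii"],
        "Region A": [1.0, 2.0],
        "Region B": [3.0, 4.0],
    }
)

# Only id_vars given: every other column is melted, and the two new
# columns get the default names "variable" and "value".
default_names = toy.melt(id_vars=["Platform"])

# All four parameters given: only "Region A" is melted, and the new
# columns get the names you chose.
custom = toy.melt(
    id_vars=["Platform"],
    value_vars=["Region A"],
    var_name="Region",
    value_name="Sales",
)

print(default_names)
print(custom)
```

Note that any column you list in neither id_vars nor value_vars ("Region B" in the second call) is simply dropped from the result.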

For our table, we need to identify each row with the “Platform” column, so we’ll pass that into id_vars. Our value columns will be all the sales columns except for “Global Sales”, because that’s not technically a region category. We’ll pass a list of the column names into value_vars to implement this.

To improve readability, we’ll also name our two new columns “Sales Region” and “Sales (millions)” by including them in the var_name and value_name parameters respectively.

The final code looks like this:
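As a rough sketch of that call, the example below runs on a small stand-in table. The sales figures are invented and the regional column names are assumptions. Only "Platform", "Global Sales", "Sales Region" and "Sales (millions)" come from the description above.

```python
import pandas as pd

# Hypothetical aggregated table: one row per platform, one column per region.
df = pd.DataFrame(
    {
        "Platform": ["PS4", "Wii"],
        "North America Sales": [8.0, 15.0],
        "Europe Sales": [8.0, 11.0],
        "Japan Sales": [1.0, 4.0],
        "Other Sales": [2.5, 3.0],
        "Global Sales": [19.5, 33.0],
    }
)

melted = df.melt(
    id_vars=["Platform"],
    # Every regional column except "Global Sales", which is a total, not a region.
    value_vars=[
        "North America Sales",
        "Europe Sales",
        "Japan Sales",
        "Other Sales",
    ],
    var_name="Sales Region",
    value_name="Sales (millions)",
)

print(melted)
```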

And there it is! We’ve moved all the old column names into a new category column and combined their values into a single values column.

For better presentation, we can also sort the table with one additional line of code:
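One way to do that sort is with sort_values. In this sketch the already-melted table is a made-up stand-in, and the sort order (platform first, then highest sales first) is an assumption about what reads best, so adjust it to taste.

```python
import pandas as pd

# Hypothetical melted table with invented values.
melted = pd.DataFrame(
    {
        "Platform": ["PS4", "Wii", "PS4", "Wii"],
        "Sales Region": [
            "Europe Sales",
            "Europe Sales",
            "Japan Sales",
            "Japan Sales",
        ],
        "Sales (millions)": [8.0, 11.0, 1.0, 4.0],
    }
)

# The one additional line: group rows by platform, largest sales first.
melted = melted.sort_values(
    by=["Platform", "Sales (millions)"], ascending=[True, False], ignore_index=True
)

print(melted)
```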
