4 techniques to enhance your research in Machine Learning projects


This post was originally published by Fran Pérez at Towards Data Science

This is the folder layout I tend to use at the beginning of any ML project. This layout is open to extension (adding a tests folder, a deploy folder, etc.) as the project grows.

project          # project root
├── data         # data files
├── models       # machine learning models
├── notebooks    # notebook files
└── src          # helper functions

Unlike regular software development projects, ML projects rest on 3 foundational stones: the source code (notebooks and src), the data consumed/produced by the code, and the model built/consumed by the code and the data.

After ingesting the data, my recommendation is to process the data in stages, where each stage has its own folder. For example:

data
├── raw        # original files
├── interim    # preprocessed files
└── processed  # result files

From this layout, you can follow the flow of the data, as in a pipeline: from raw to interim, and then to processed.

Firstly, the 📁 raw folder stores the data in its original format. If you can work with offline data, it is very handy to always keep a frozen (read-only) copy of it. Second, the 📁 interim folder is meant to store data resulting from the data transformations, which will probably end up enlarging your dataset. This is the reason I tend to use binary formats here, which offer better serialization/deserialization performance. One of the most widely used binary formats is parquet (check out how to read/save parquet data using pandas).

time to load a .csv file vs time to load a parquet file

Lastly, the 📁 processed folder is used to store the results of the machine learning model.

Even though the raw folder can store files in many formats (.csv, .json, .xls, etc.), my recommendation is to use a single common format in the interim folder (for example, binary formats such as .parquet or .feather, or raw formats such as .csv or .png) and a customer-friendly format in the processed folder (for example, a .csv or Excel file allows stakeholders to review the results of your model). Sometimes it makes sense to include summary plots about the results of your model (for example, when building a recommender system: does the distribution of your recommendations follow a pattern similar to your sales distribution?).

While working in the Research stage, I use Jupyter Notebooks as my execution platform/IDE. This is the reason most of the code that supports the Machine Learning lifecycle is hosted in Jupyter Notebooks.

Machine Learning (simplified) lifecycle

So, the notebooks folder resembles (up to some degree) the ML lifecycle:

notebooks
├── 1__ingestion                 # |-> data/raw
├── 1_1__eda
├── 2__preprocessing             # |<- data/raw
│                                  |-> data/interim
├── 2_1__eda
├── 3_1__model_fit_variant_1     # |-> model/variant_1.pkl
├── 3_2__model_fit_variant_2     # |-> model/variant_2.pkl
├── 3_3__models_validation
└── 4__model_predict             # |<- data/interim, model/*.pkl
                                   |-> data/processed

I won’t delve into what each notebook is responsible for, as most of you will already be familiar with the Machine Learning lifecycle.

And in any case, you should apply the layout and naming conventions that fit your way of working (or use more complex layout templates if you wish). You may need a couple of iterations to find your own blueprint, but take that as part of the learning process. For example, I like to split the EDA into two parts: the first one uses only raw data, and the second one focuses on the “new data” produced after the pre-processing stage. But if you prefer doing a single EDA, that is fine too. The project layouts shown here are meant to make you do things with a purpose, rather than improvising as you go. This will be important once you hand over the project to the next stage (Development), as your teammates will be able to recognize the shape and the components of your project.

The result of the modeling notebooks (the ML model after training) can be stored in this folder. Most ML frameworks (like scikit-learn, spaCy, PyTorch) have built-in support for model serialization (.pkl, .h5, etc.); otherwise, check out the magnificent cloudpickle package.
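A minimal sketch of the save/load pattern, using the stdlib pickle module and a stand-in model object (the dictionary below is just a placeholder for a real fitted estimator; in a real project the file would live under models/):

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model; in practice this would be,
# e.g., a trained scikit-learn estimator
model = {"intercept": 0.5, "coef": [1.2, -0.7]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "variant_1.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)    # serialize after training
    with open(path, "rb") as f:
        restored = pickle.load(f)  # deserialize before predicting

print(restored == model)  # True
```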

One of the differences between the Research and Development stages is that during the former the src folder will be pretty slim (containing helper and other common functions used by the notebooks), whilst during the latter it will be filled with other folders and .py files (code ready for production deployment).

Windows Subsystem for Linux (v2) is the new kid on the block. If you are already using Linux or macOS, you can skip this section. Otherwise, if you fall into the Windows user category, you should keep reading. Most Python packages are compatible with Windows, but you never know when you will run into a package that isn’t (for example, Apache Airflow doesn’t run in a Windows environment). In those moments, you will learn to love WSL, because it behaves as a fully-fledged Linux system without ever leaving your Windows environment. The performance is quite decent, and most IDEs are compatible with WSL.

Windows Terminal running WSL

For example, Visual Studio Code has native support for WSL. This means loading any Python project/folder, using your regular plugins, and executing or debugging code. Because WSL mounts the host drive in the /mnt folder, you will still have access to the Windows host folders. If you end up using the same project both in Windows and WSL, be aware that you might hit some interoperability issues. For example, git might inaccurately detect files as changed due to file permissions or CRLF line endings. To fix these issues, you can execute the following commands in WSL:

git config --global core.filemode false
git config --global core.autocrlf true

The future ahead of WSL is promising: native access to GPU (= train deep learning models using the GPU) and Linux GUI (= support not only for terminal applications but for GUI applications as well). Finally, don’t miss your chance to use the amazing Windows Terminal altogether with WSL.

Without a doubt, Jupyter Notebooks is my preferred tool for doing exploration and research. But at the same time, it is not the best-suited tool for taking your model to production. Between these opposite ends (Research/Development), there is a common ground where you can enhance how you use Jupyter Notebooks.

I recommend installing Jupyter Notebook using Anaconda and conda environments, but you can use any other package management tool (such as virtualenv, pipenv, etc.). Whatever you choose, you must use one, and use it in your projects too.

How to install Jupyter Notebook (or rather, how I installed it on my machine):

Install Jupyter Notebook using Anaconda (therefore, you first need to install Anaconda); then install Jupyter Notebook in the base/default (conda) environment by executing the following commands:

conda activate base
conda install -c conda-forge notebook

This sounds like it goes against all good practices (Jupyter Notebook should be a project dependency), but I consider that, like Visual Studio Code (or name-your-preferred-IDE-here) itself, Jupyter Notebook should be a dependency at the machine level, not at the project level. This makes later customizations much easier to manage: for example, when using Jupyter Extensions (more on this in the next section), you will configure the extensions only once, and then they will be available for all kernels/projects.

After installing Jupyter Notebook, it is the turn of the Jupyter Notebook extensions; run the following commands in your console:

conda install -c conda-forge jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
conda install -c conda-forge jupyter_nbextensions_configurator

Then, each time you create a new conda environment (you should create a new one every time you start a new project), you need to make it available as a Jupyter kernel by executing the following command:

python -m ipykernel install --user --name retail --display-name "Python (retail)"

Finally, to launch Jupyter Notebook, you should be in the base environment, at the project root, and then execute:

project          # project root (launch jupyter notebook from here,
│                # using the base/default conda environment)
├── data
├── models
├── notebooks
└── src

conda activate base
jupyter notebook

Once Jupyter Notebook is launched into your web browser, you can select the required environment (kernel) for your notebook:

Jupyter Notebook — Change Kernel

The first time you set the kernel in your notebook, it is recorded in the notebook metadata, so you won’t need to set it up every time you launch the notebook.

Use Jupyter Notebook Extensions, if only for the sake of enabling the Collapsible Headings extension. When you are working with large notebooks, it is extremely helpful for organizing the information in your notebook and not losing your mind paging back and forth inside it. I consider it a must. PERIOD.

Two of the most important properties you should provide when delivering a notebook are executability (once the dependencies — kernel and source files — are set, the notebook must be runnable from top to bottom) and reproducibility (when executed, the notebook should always return the same results).
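Seeding the random number generators at the top of the notebook is the usual first step towards reproducibility (seed the libraries you actually use — numpy, your ML framework, etc.); a minimal stdlib sketch:

```python
import random

SEED = 42  # fix the seed once, at the top of the notebook

random.seed(SEED)
run_1 = [random.random() for _ in range(3)]

random.seed(SEED)
run_2 = [random.random() for _ in range(3)]

print(run_1 == run_2)  # True: same seed, same results
```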

But as we are in the Research stage, we can allow some degree of uncertainty. A great tool that can support this is the Freeze extension, which allows us to literally preserve the results of past experiments. Using its toolbar, you can make a cell read-only (it can be executed, but its input cannot be changed) or frozen (it can be neither altered nor executed). So in case you can’t enforce reproducibility, at least you can keep some baseline results to compare your current execution against.

Freeze text Jupyter Notebook Extension

For example, in the above figure, you can compare the accuracy of the last epoch and the execution time. Also, consider that for the sake of logging/tracking your experiments, there are much better tools, like MLflow and wandb (although I consider these more relevant in the Development stage).

Finally, I encourage you to check out the other available extensions (such as Scratchpad, Autopep8, Codefolding, …). If you followed my installation setup, there should be a tab named Nbextensions available to you to configure Notebook extensions:

Jupyter Notebook Extensions manager

Otherwise, you can enable extensions via the command line.

Jupyter Notebooks play nicely with neither testing nor source control. For testing, there is some help in papermill and nbdev. I also highly recommend old-school tricks, such as the assert statement, to verify your code assumptions. For example, after every pd.merge, it is always good practice to check the cardinality of the resulting dataframe (initial number of rows == final number of rows):

nrows_initial = df.shape[0]
df = pd.merge(df, df_sub, how="left")
assert nrows_initial == df.shape[0]  # a left join must not add rows
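As a complement to the assert, pandas can enforce the expected cardinality for you through the validate parameter of pd.merge, raising a MergeError when the assumption is broken (toy data below):

```python
import pandas as pd

df = pd.DataFrame({"product_id": [1, 2, 2], "qty": [5, 3, 4]})
df_sub = pd.DataFrame({"product_id": [1, 2], "price": [9.9, 4.5]})

# validate="many_to_one" raises pandas.errors.MergeError
# if product_id is not unique on the right-hand side
merged = pd.merge(df, df_sub, how="left", validate="many_to_one")

assert merged.shape[0] == df.shape[0]  # cardinality preserved
```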

In the case of source control, check out nbdime for diffing and merging notebooks. Git usually offers a poor diffing experience for notebook files; in contrast, nbdime is a powerful tool that you can use from the command line (integration with git, bash, and PowerShell is provided) or from a web interface, which provides a much richer experience (integration with Jupyter Notebook is provided as well). I really appreciate that nbdime classifies updates into changes in input cells, changes in output cells, and changes to cell metadata.

nbdime — diffing and merging of Jupyter Notebooks

Another recommendation when using Jupyter Notebooks is to leverage the use of built-in %magic commands. My favorite magic commands are:

%load_ext autoreload
%autoreload 2
%matplotlib inline

The %autoreload magic command comes in very handy for re-loading modules and packages in memory without restarting the kernel. For example, if you’re working with code stored in a classic .py file, when the source file is updated, as soon as you execute a cell in the notebook the new source will be re-loaded into the current kernel and the changes will be available to use. Another plus of this technique is that you can install new packages into the notebook’s environment and (most of the time) they will become available for importing in the current notebook (again, without restarting the kernel). On the other hand, the %matplotlib magic command redirects matplotlib output to the current notebook canvas.
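Outside IPython, the same effect can be reproduced with the stdlib importlib.reload; a self-contained sketch (the module name and contents below are illustrative):

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # keep the sketch free of .pyc caching effects

with tempfile.TemporaryDirectory() as tmp:
    mod_path = Path(tmp) / "helpers_demo.py"
    mod_path.write_text("ANSWER = 1\n")

    sys.path.insert(0, tmp)
    import helpers_demo

    first = helpers_demo.ANSWER  # 1

    # Simulate editing the source .py file on disk...
    mod_path.write_text("ANSWER = 2  # edited\n")
    # ...and pick up the change without restarting the interpreter
    importlib.reload(helpers_demo)
    second = helpers_demo.ANSWER  # 2

print(first, second)  # 1 2
```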

Other lesser-known magic commands are:

%%time, for when you need to profile the time spent executing a cell in a notebook. I like to use this command with complex cells that require long execution times, so I have an idea of how long the execution is going to take. If you want more information on this, you can read the excellent Profiling and Timing Code chapter of the Python Data Science Handbook.

profiling execution time using %%time
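Outside a notebook, a similar measurement is available from the stdlib timeit module; a minimal sketch:

```python
import timeit

# Time a snippet the way %%time / %timeit would, but in plain Python
elapsed = timeit.timeit("sum(x * x for x in range(1000))", number=100)

print(f"100 runs took {elapsed:.4f} s")
```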

Another way to profile your code is by using tqdm, which shows a nice progress bar when executing “batches of code”. The output adapts nicely depending on the execution context (plain Python, interactive shell, or Jupyter Notebook).

tqdm

In case you need to execute “batches of code” in parallel and show their progression, you can use pqdm.

%debug (yes, you can debug): after a cell fails to execute due to an error, run the %debug magic command in the next cell. After this, you will get access to the (magnificent [sarcasm intended]) pdb debugging interface; bear in mind there is no fancy UI for setting breakpoints, just “old-fashioned” commands such as ‘s’ for step and ‘q’ for quit; you can check the rest of the pdb commands for your own amusement.

In my last projects, I’ve drawn strong inspiration from the post A framework for feature engineering and machine learning pipelines, which explains how to build machine learning pre-processing pipelines. The most important idea is NOT to postpone what can be done earlier, and to transform the data in this order:

  1. Pre-process: column-wise operations (i.e. map transformations)
  2. Feature engineering: row-wise operations (i.e. group by transformations)
  3. Merge: dataframe-wise operations (i.e. merge transformations)
  4. Contextual: cross-dataframe operations (i.e. map operations with cross-context)
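The ordering above can be sketched with pandas (all column names and toy data are illustrative):

```python
import pandas as pd

sales = pd.DataFrame(
    {"user_id": [1, 1, 2], "product_id": ["a", "b", "a"], "amount": ["10", "20", "30"]}
)
products = pd.DataFrame({"product_id": ["a", "b"], "category": ["toys", "books"]})

# 1. Pre-process: column-wise (map) operations
sales["amount"] = sales["amount"].astype(int)

# 2. Feature engineering: group-by transformations
user_totals = sales.groupby("user_id", as_index=False)["amount"].sum()

# 3. Merge: dataframe-wise operations
enriched = sales.merge(products, how="left", on="product_id")

# 4. Contextual: operations that cross dataframes
enriched["share_of_user_total"] = enriched["amount"] / enriched["user_id"].map(
    user_totals.set_index("user_id")["amount"]
)
```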

Starting an EDA from scratch is laborious, because you need to query the data beforehand in order to know what to show (or what to look for). You end up writing repetitive queries to show the histogram of a numerical variable, check missing values in a column, validate the type of a column, etc. Another option is to generate this information automatically, using packages like pandas-profiling, which reports all sorts of information.

pandas-profiling

The classic way of checking for missing data is using the pandas API, as in df.isna().sum(); in the same way, you can query data frequencies by executing df.species.value_counts(). But the output of these commands is “incomplete”, as they only return absolute figures. Enter sidetable, which enriches those queries in a nice tabular way:

pandas value_counts() vs sidetable freq()

The sidetable and pandas-profiling APIs are fully integrated into the pandas DataFrame API, and have support for Jupyter Notebooks.
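As a plain-pandas baseline, you can approximate sidetable’s combined output by pairing value_counts() with its normalize=True variant (toy data below):

```python
import pandas as pd

species = pd.Series(["setosa", "setosa", "virginica", "versicolor", "setosa"])

# Absolute counts side by side with relative frequencies
freq = pd.DataFrame(
    {
        "count": species.value_counts(),
        "percent": species.value_counts(normalize=True) * 100,
    }
)

print(freq)
```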

PS: this area is “hot” at the moment, so expect more packages to come in the future (e.g., klib).

Seaborn is an old companion for many of us. The next version (0.11) is going to bring something I’ve been expecting for a while: stacked bar charts.

stacked bar charts

If you are reading this while the feature is still unpublished, remember you can install “development” packages directly from GitHub, using the following command:

pip install git+https://github.com/mwaskom/seaborn.git@4375cd8f636e49226bf88ac05c32ada9baab34a8#egg=seaborn

You can also use this kind of URL in your requirements.txt or environment.yml file, although I recommend pinning down the commit hash of the repository (as shown in the snippet above). Otherwise, you will install whatever repository version is available at installation time. Also, be wary of installing “beta” or “development” versions in production. You have been warned.
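For example, a pinned entry in requirements.txt could look like this (reusing the commit hash from the snippet above):

```text
# requirements.txt — pin the exact commit, not a moving branch
git+https://github.com/mwaskom/seaborn.git@4375cd8f636e49226bf88ac05c32ada9baab34a8#egg=seaborn
```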

[updated] Seaborn version 0.11 is now available, so you don’t need to install the development version from GitHub. Nevertheless, I will leave the notes about installing development versions here for reference.

During my last project, I got to know a really handy package for visualizing maps alongside statistical information: kepler.gl, originally a node.js module that was recently ported to Python, and which also got a friendly extension to load maps into Jupyter Notebooks.

The most important features that I love about kepler.gl are:

  1. Tightly integrated with pandas API.
  2. I was able to make an impressive 3D map visualization, including a dynamic timeline that was automatically animated, in a short amount of time.
  3. The UI has many GIS features (layers and such), so the maps are highly interactive and customizable. But the best part is that you can save these UI settings and export them as a Python object (a Python dictionary, to be precise); next time you load the map, you can pass this object and avoid setting the map up again from scratch.
from keplergl import KeplerGl

sales_map = KeplerGl(
    height=900,
    data={
        "company sales": df,
        "box info": df.drop_duplicates(subset=["product_id"]),
        "kid_info": df.drop_duplicates(subset=["user_id"]),
    },
    config=config,  # configuration dictionary exported from the UI
)
sales_map