Create reproducible Machine Learning experiments using Sacred

This post was originally published by Max Leander at Medium [AI]

To give an example of how to use this powerful framework, I am going to use the dataset from a Kaggle competition, Real or Not? NLP with Disaster Tweets. This competition is a binary classification problem where you are supposed to decide whether a tweet is describing an actual disaster or not. Here are two examples:

Real disaster tweet:

Not a disaster tweet:


Let’s say that we want to run some experiments where we build a model to classify these tweets and measure the classifier’s F1-score using k-fold cross-validation. Most data scientists would probably fire up a Jupyter notebook and start to explore the data (which is indeed always the right thing to do, btw), run some ad-hoc experiments, and build and evaluate models. Sooner or later, they will notice that the performance of the models depends heavily on specific configurations and countless modifications of the data. This is where the power of reproducibility starts to pay off.

The following are the main features and advantages of using Sacred:

  • Easily define and encapsulate the configuration of each experiment
  • Automatically collect metadata of each run
  • Log custom metrics
  • Collect logs in various places using observers
  • Ensure deterministic runs with automatic seeding

We start off by creating a base experiment in Sacred as follows:

A Sacred experiment is defined by a configuration, so let’s create one:

Notice that the config attribute of the experiment object is used as a function decorator. This enables Sacred to automatically detect that the function should be used to configure the experiment.

This very simple config defines a scikit-learn pipeline with two steps: compute the TF-IDF representation of all tweets and then classify them using Logistic Regression. I added a variable for one of the hyperparameters, max_features, to showcase how you can easily create new experiments by modifying the config.

Now, before you can run this experiment, a main function must be defined:

As you can see, we once again use an attribute of the experiment object as a decorator, in this case automain. This lets the main function automatically access any variables defined in this experiment’s config. Here, we only pass classifier, which is evaluated on how well it classifies the Twitter data using 5-fold cross-validation on the training set. In the last line of code, the metric that we want to measure is logged using the log_scalar method.

To run the experiment, simply call its run() method. To run it with different parameter values, you can conveniently pass a dict config_updates specifying the exact configuration for this experiment run. Pretty neat!

I usually put the experiments themselves in different files, and then have a separate script which runs all of the experiments at once.

If you run the above, you will not see a lot of results. You first need to attach an observer to the experiment. The observer will then send the logs to some destination, usually a database. For local and non-production usage, you can use the FileStorageObserver to simply write to disk.

If you include this line in the runner script above and run it, a new folder logreg is created with one sub-folder per run: one for the default run and one for the run with the updated max_features value. Each sub-folder contains four separate files, with the following content:

  • config.json: The state of each object in the configuration, and the seed parameter which is automatically used in all non-deterministic functions to ensure reproducibility.
  • cout.txt: All standard output produced during the run.
  • metrics.json: Custom metrics that were logged during the run, e.g. the F1-score in our case.
  • run.json: Metadata e.g. about the source code (git repo, files, dependencies, etc.), the running host, start/stop time, etc.

For the sake of completeness, I will create a final example to show how you can run multiple experiments from the same runner script:

Now, let’s run both experiments with some config updates…

By looking at the metrics.json file of each run, we can conclude that the default logistic regression model was the best performing, with an F1-score of ~0.66, while the random forest with 100 estimators was the worst one, with an F1-score of ~0.53.

Of course, all of that JSON-formatted output is not very appealing to look at, but there are several visualization tools you can use with Sacred. They are outside the scope of this article, but do have a look here:

Experiment safely!

This article is part of a series on best practices when building and designing machine learning systems. Read the first part here:


