Managing software dependencies for Data Science projects


This post was originally published by Eduardo Blancas at Towards Data Science

A step-by-step guide to keeping project dependencies clean and reproducible.

Photo by Emile Perron on Unsplash

Virtual environments are a must when developing software projects. They allow you to create self-contained, isolated Python installations that prevent your projects from clashing with each other and let other people reproduce your setup.

However, using virtual environments is just the first step towards developing reproducible data projects. This post discusses another important subject: dependency management, which relates to properly documenting virtual environments to ease reproducibility and make them reliable when deployed to production.

For a more in-depth description of Python environments, see this other post.

tl;dr: Package managers use heuristics to install package versions that are compatible with each other. A large set of dependencies and version constraints might lead to failures when resolving the environment.

When you install a package, your package manager (pip, conda) has to install dependencies for the requested package and dependencies of those dependencies, until all requirements are satisfied.

Packages usually have version constraints. For example, at the time of writing, scikit-learn requires numpy 1.13.3 or higher. When packages share dependencies, the package manager has to find versions that satisfy all constraints (a process known as dependency resolution). Doing so is computationally expensive, so current package managers use heuristics to find a solution in a reasonable amount of time.

With large sets of dependencies, the solver may be unable to find a solution. Some package managers will throw an error, but others might just print a warning message. To avoid these issues, it is important to be deliberate about project dependencies.
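To make the idea concrete, dependency resolution can be pictured as a search over candidate versions. The following is a toy sketch, not how pip or conda work internally; the package names, versions, and the single constraint are illustrative:

```python
# Toy dependency resolution: find one version per package such that
# every constraint holds. Real solvers use heuristics because this
# brute-force search grows exponentially with the number of packages.
from itertools import product

# hypothetical available versions on the package index
available = {
    "numpy": ["1.13.3", "1.18.0", "1.19.0"],
    "scikit-learn": ["0.23.0"],
}

def satisfies(assignment):
    # illustrative constraint: scikit-learn requires numpy >= 1.13.3
    parse = lambda v: tuple(int(x) for x in v.split("."))
    return parse(assignment["numpy"]) >= (1, 13, 3)

def resolve():
    # try every combination of versions until one satisfies all constraints
    for combo in product(*available.values()):
        assignment = dict(zip(available.keys(), combo))
        if satisfies(assignment):
            return assignment
    return None  # the solver failed: no compatible set of versions exists
```

Every extra version pin removes candidate combinations from this search, which is why heavily constrained environments are more likely to end with "no solution found".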

tl;dr: Group your dependencies into development (i.e. packages needed to train a model) and production (i.e. packages needed to make predictions).

When working on a Data Science project, there might be packages that you only need for development but that won't be required in the production environment. For example, while developing a model you may generate evaluation charts in a Jupyter notebook using matplotlib, but for serving predictions through an API you don't need any of those.

This gives you a chance to simplify dependencies in the production environment. The next section discusses how to do this.

tl;dr: Keep separate files for dev/prod dependencies. Manually add/remove packages and keep them as flexible as possible (by not pinning specific versions) to ease dependency resolution and to test your project against the latest compatible versions available.

pip and conda are the most widely used package managers for Python; both can set up dependencies from a text file. My recommendation is to use conda (through miniconda), because it can handle more dependencies than pip; you can even install non-Python software such as R. If there is a package that you cannot install with conda install, you can still use pip install inside the conda environment.

In conda, you can document your dependencies in a YAML file like this:
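A minimal sketch of such a file, assuming a project named `my-project` (package names and comments are illustrative):

```yaml
# environment.yml
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.8
  - scikit-learn   # model training
  - matplotlib     # evaluation charts (development only)
  - pip
  - pip:
      - some-pip-only-package  # hypothetical package not available via conda
```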

While you can auto-generate these files, it is best to maintain them manually. A good practice is to add a short comment to let other people (or even your future self!) know why you need a package. During development, we might experiment with a package but discard it shortly thereafter; the best way to proceed is to remove it from the environment file, but if you forget to do so, the comment will be helpful in the future when deciding which dependencies to drop.

Keep your dependencies flexible and only pin specific versions when you have to: the more version constraints you add to your environment, the higher the chance of running into situations where the solver is unable to satisfy all of them.

tl;dr: Always look for errors when setting up environments; sometimes you might have to pin versions to resolve issues.

Once you have written the environment file, you can create the conda virtual environment with the following command:
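Assuming the file is named environment.yml, it looks like this:

```sh
# create the environment described in environment.yml
conda env create --file environment.yml

# activate it (the environment name comes from the "name:" key in the file)
conda activate my-project
```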

If you have a reasonable set of dependencies, your environment should install just fine, but a few factors might produce errors or warnings. Always check the command output for issues: depending on the solver configuration, the command could refuse to create the environment or just print a warning message.

I mentioned that solvers attempt to find a solution that satisfies all package requirements, but that's under the assumption that package maintainers keep their requirements up-to-date. Say package X depends on package Y, but X didn't set any version constraints for Y. A new version of Y is released that breaks X. The next time you install X, you will end up with a broken installation if the solver picks the incompatible version of Y (from the solver's perspective, this is not a problem, because X did not set any constraints for Y). These cases are laborious to debug, because you have to find a working version by trial and error and then pin the right one in the dependencies file.

The more time that passes without testing your environment setup from scratch, the higher the risk of it breaking. For this reason, it is important to continuously test your dependencies files.

tl;dr: Continuously run your project's tests in a recently created environment to detect breaks due to package updates.

Since your development packages are not pinned to a specific version, the package manager will attempt to install the latest compatible version. This is good from a development perspective: packages get improvements (new features, bug fixes, security patches), and it's a good idea to keep your dependencies updated. But updates might also introduce breaking changes. To detect them, make sure you run your project's tests in a fresh, recently installed environment. The process is as follows:

  1. Start with a clean virtual environment
  2. Install dependencies from the dependencies file
  3. Configure your project
  4. Run tests

Preferably, automate this process to run every time you modify your source code (this is called Continuous Integration). If that’s not an option, manually run the steps described above on a regular basis.

tl;dr: Use nox and pytest to run your tests inside a fresh environment, and package your project so you can easily import its modules in your tests.

The best tools I’ve found for automating environment setup and testing execution are nox and pytest.

Nox is a Python package that can automate conda environment creation, then run your tests inside that environment:
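A sketch of what a noxfile could look like; the session name, file names, and the use of `conda env update` as the installation step are assumptions based on the workaround mentioned below:

```python
# noxfile.py: build a conda environment from environment.yml,
# install the project, and run its test suite inside that environment
import nox

@nox.session(venv_backend="conda")
def tests(session):
    # workaround: update the session's conda environment from the
    # environment.yml file, since nox does not install it directly
    session.run(
        "conda", "env", "update",
        "--prefix", session.virtualenv.location,
        "--file", "environment.yml",
    )
    session.install(".")   # install the project itself (see packaging below)
    session.run("pytest")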

Note: At the time of writing, nox (version 2020.8.22) does not officially support installing from environment.yml files, but a workaround does the trick.

Once your environment is prepared, include the command to start the testing suite. I recommend using pytest; here's what a testing file looks like:
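A minimal sketch of a pytest file. In a real project you would import `preprocess` from your installed package (the function and its behavior here are hypothetical stand-ins, defined inline so the example is self-contained):

```python
# test_model.py: minimal pytest-style tests
# from my_project import preprocess  # once your project is pip-installed

def preprocess(values):
    # hypothetical stand-in for a function imported from your package:
    # scales a list of numbers so the maximum becomes 1.0
    return [v / max(values) for v in values]

def test_preprocess_scales_to_unit_range():
    result = preprocess([1, 2, 4])
    assert max(result) == 1.0

def test_preprocess_keeps_length():
    assert len(preprocess([3, 6, 9])) == 3
```

pytest discovers any function whose name starts with `test_` and reports each failing assertion.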

To execute the tests from the file above, you just have to run pytest. Inside nox, this translates to adding session.run('pytest').

So far we've covered steps 1, 2, and 4, but skipped 3 (configure your project). As you can see in the previous code snippet, to test your code you have to make it importable; the cleanest way to do so is by packaging your project (see our post on project packaging).

Once you have packaged your project, you can set it up with pip install path/to/project.

tl;dr: Develop separate test suites: test your development pipeline in an environment with the development dependencies, and your serving API with the production dependencies.

You can use nox to test your development and production environments separately. Here’s some sample code:
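A sketch with one session per environment; the session names, dependency file names, and test folder layout are all assumptions:

```python
# noxfile.py: one session per environment, each built from its own
# dependencies file and running its own test suite
import nox

@nox.session(venv_backend="conda")
def tests_dev(session):
    session.run("conda", "env", "update", "--prefix",
                session.virtualenv.location, "--file", "environment.dev.yml")
    session.install(".")
    session.run("pytest", "tests/dev")    # training pipeline tests

@nox.session(venv_backend="conda")
def tests_prod(session):
    session.run("conda", "env", "update", "--prefix",
                session.virtualenv.location, "--file", "environment.prod.yml")
    session.install(".")
    session.run("pytest", "tests/prod")   # serving API tests
```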

To run your tests, simply execute nox in the terminal.

tl;dr: Provide auto-generated lock files that contain the list of dev/prod dependencies with pinned versions, to avoid breaking your project in the future due to API changes.

Without pinning specific versions, there is no way to know the exact set of versions that the solver will install. This is fine for development purposes, since it gives you a chance to try out new versions and check whether you can upgrade, but you don't want this uncertainty in a production environment.

The way to approach this is to provide another file that lists all the dependencies with specific versions; this is called a lock file. That way, your environment will always resolve to the same set of versions.

You can generate a file with pinned dependencies with this command:
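With the environment activated, conda can export the exact versions it resolved (the output file name is an assumption):

```sh
# export the active environment, with every version pinned
conda env export > environment.dev.lock.yml
```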

The output file looks like this:
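A sketch of the exported file; the build strings and versions shown are illustrative, not real output:

```yaml
# environment.dev.lock.yml (auto-generated)
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.8.5=h85f3143_2_cpython
  - numpy=1.19.1=py38hbc27379_2
  - scikit-learn=0.23.2=py38h519568a_0
  # ...one pinned entry per installed package
```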

Lock files are very important to guarantee deterministic environments in production; make sure you generate them from environments that you already tested. For example, you could add the conda env export command at the end of your testing session:
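A sketch of those last lines inside a nox testing session (the lock file name is an assumption):

```python
# end of a nox testing session: after the tests pass, export the exact
# environment so the lock file reflects a tested configuration
session.run("pytest")
session.run("conda", "env", "export", "--prefix",
            session.virtualenv.location,
            "--file", "environment.dev.lock.yml")
```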

Apart from pinned versions, lock files usually provide other features, such as hashing each package and comparing the hash with the downloaded version to prevent the installation of tampered software. At the time of writing, there is a new project called conda-lock that aims to improve support for conda lock files.

Using virtual environments for your projects is a great first step towards developing more robust pipelines but that’s not the end of it. Including dependency installation as part of your testing procedure, keeping dependencies small, separating production from development dependencies and providing lock files are necessary steps to ensure your project has a robust setup.
