Engineering best practices for Data Science projects


This post was originally published by Nikita Sharma at Towards Data Science.

In this post, we will learn some best practices for improving code quality and reliability in production Data Science code.

Refactoring is the first step toward better code: it is the process of simplifying the design of existing code without changing its behavior.

Issues addressed

  • Improved code readability — makes the code easier for our teams to understand
  • Reduced complexity — smaller, more maintainable functions and modules

Action items

  • Break the code down into smaller functions, as sketched below
  • Add comments to functions
  • Adopt better naming standards
  • Remove unused code
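
As a minimal sketch of what this looks like (the function and column names below are hypothetical), one long, monolithic processing step can be split into small, well-named functions that are easier to read and to test:

```python
import pandas as pd


def load_orders(path: str) -> pd.DataFrame:
    """Read raw order data from a CSV file."""
    return pd.read_csv(path)


def add_total_price(orders: pd.DataFrame) -> pd.DataFrame:
    """Add a total_price column from quantity and unit_price."""
    orders = orders.copy()
    orders["total_price"] = orders["quantity"] * orders["unit_price"]
    return orders


def filter_large_orders(orders: pd.DataFrame, threshold: float = 100.0) -> pd.DataFrame:
    """Keep only orders whose total price exceeds the threshold."""
    return orders[orders["total_price"] > threshold]


def prepare_orders(path: str) -> pd.DataFrame:
    """Compose the small steps instead of one long, untestable block."""
    orders = load_orders(path)
    orders = add_total_price(orders)
    return filter_large_orders(orders)
```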

A unit test is a method of testing each individual function in the code. The purpose is to validate that each function performs as expected.

Issues addressed

  • Helps fix bugs early
  • Helps new starters understand what the code does
  • Enables quick code changes
  • Ensures bad code is not merged in

Action items

  • Create functions that accept all required parameters as arguments, rather than computing them inside the function. This makes the functions more testable
  • If a function reads a Spark data frame internally, change it to accept the data frame as a parameter. We can then pass handcrafted data frames to test these functions, as in the sketch below
  • We will write a set of unit tests for each function
  • We will use a Python framework like unittest or pytest for unit testing
  • Tests will be part of the code base and will ensure no bad code is merged
  • These tests will be used further down the line by our CI/CD pipeline to block the deployment of bad code
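
Here is a minimal sketch of such a test, assuming pyspark and pytest are installed; the function under test and the column names are made-up examples:

```python
# test_transforms.py: a minimal pytest sketch, assuming pyspark and pytest are installed.
# The function under test and the column names are hypothetical examples.
import pytest
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


def with_double_amount(df):
    """Example function under test: adds a column with twice the amount."""
    return df.withColumn("double_amount", F.col("amount") * 2)


@pytest.fixture(scope="session")
def spark():
    # A small local Spark session is enough for unit tests.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def test_with_double_amount(spark):
    # Handcrafted input data frame passed in as a parameter.
    df = spark.createDataFrame([(1, 10.0), (2, 5.0)], ["id", "amount"])

    result = with_double_amount(df).collect()

    assert [row["double_amount"] for row in result] == [20.0, 10.0]
```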

Integration testing tests the system as a whole: it checks that all the functions work correctly when combined.

Issues addressed

  • It makes sure that the whole project works properly.
  • It detects errors that only appear when multiple modules work together.

Action items

  • We will create local infrastructure to test the whole project
  • External dependencies can be run locally in Docker containers
  • A test framework like pytest or unittest will be used to write integration tests
  • Code will be run against the local infrastructure and tested for correctness, as in the sketch below
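
A rough sketch of such an integration test, assuming a local Postgres container as the external dependency and a hypothetical run_pipeline entry point for the project:

```python
# test_integration.py: a rough sketch, assuming psycopg2 and pytest are installed and
# a local Postgres dependency was started with something like:
#   docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=test postgres
# run_pipeline is a hypothetical entry point for the whole project.
import psycopg2
import pytest

from my_project.pipeline import run_pipeline  # hypothetical import


@pytest.fixture
def db_connection():
    conn = psycopg2.connect(
        host="localhost", port=5432, dbname="postgres", user="postgres", password="test"
    )
    yield conn
    conn.close()


def test_pipeline_end_to_end(db_connection):
    # Run the full pipeline against the local infrastructure.
    run_pipeline(db_connection)

    # Check that the pipeline produced the expected output table.
    with db_connection.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM predictions")
        assert cur.fetchone()[0] > 0
```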

Projects written in Jupyter notebooks don’t necessarily follow the best naming or programming patterns, since the focus of notebooks is speed. Linting helps us identify syntactic and stylistic problems in our Python code.

Issues addressed

  • Helps detect styling errors
  • Enforces a better, more consistent writing style
  • Detects structural problems like the use of an uninitialized or undefined variable
  • Makes the code pleasant to work with

Action items

  • Flake8 will be used to detect logical and stylistic issues, and black will be used to enforce a consistent code style; the example below shows the kind of problems a linter flags
  • Next, lint checks will be integrated into CI/CD to fail builds on bad writing style
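
As a small illustration, here is the kind of problem flake8 reports (F401 and F821 are standard pyflakes error codes; the function itself is a made-up example):

```python
# lint_example.py: a minimal sketch of problems a linter such as flake8 would flag.
import os  # F401: 'os' imported but unused


def compute_score(values):
    total = sum(values)
    return total / count  # F821: undefined name 'count'


# After fixing: remove the unused import and use a defined variable.
def compute_score_fixed(values):
    total = sum(values)
    return total / len(values)
```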

Code coverage measures how much of our code is exercised by our test cases. It’s a good quality indicator that shows which parts of the project need more testing.

Issues addressed

  • Monitors how much of the code is tested

Action items

  • Tools like coverage.py or pytest-cov will be used to measure the test coverage of our code, as in the sketch below
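
A minimal sketch of collecting coverage with coverage.py's Python API (the package and test paths are placeholders); the more common command-line form is shown in the comments:

```python
# run_coverage.py: a minimal sketch using coverage.py's Python API.
# In practice the command-line form is more common, e.g.:
#   pytest --cov=my_project --cov-report=term-missing
# "my_project" and "tests/" are placeholder names.
import coverage
import pytest

cov = coverage.Coverage(source=["my_project"])
cov.start()

pytest.main(["tests/"])  # run the test suite while coverage is recording

cov.stop()
cov.save()
cov.report(show_missing=True)  # print a per-file summary of untested lines
```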

We will set permissions to control who can read and update the code in a branch of our Git repo. This will keep our master (deployment) branch clean and enforce a Pull Request + build tests process for getting code merged into master.

Issues addressed

  • Master is always clean and ready to be deployed
  • Enforces best practices — Pull Requests + automated build tests
  • Prevents the branch from being deleted accidentally
  • Prevents bad code from being merged into master

Action items

We will configure the branch with the following settings:

  • Code cannot be merged into master directly, without a Pull Request
  • At least one approval is needed to merge code into master
  • Code will only be merged once all automated test cases have passed (a sketch of these settings follows below)
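
These rules are normally configured in the repository's branch protection settings; as a rough sketch (the repository names, status check names, and token are placeholders), the same settings can also be applied through GitHub's REST API:

```python
# protect_master.py: a rough sketch using GitHub's branch protection REST API.
# OWNER/REPO, the check names, and the token are placeholders; the same settings
# can be applied manually in the repository's branch protection UI.
import os
import requests

OWNER, REPO, BRANCH = "my-org", "my-repo", "master"
url = f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}/protection"

payload = {
    # Require the automated build/test checks to pass before merging.
    "required_status_checks": {"strict": True, "contexts": ["build", "tests"]},
    "enforce_admins": True,
    # Require at least one approving review on every Pull Request.
    "required_pull_request_reviews": {"required_approving_review_count": 1},
    "restrictions": None,
}

response = requests.put(
    url,
    json=payload,
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
)
response.raise_for_status()
```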

When a pull request is created, it is a good idea to test it before merging, to avoid breaking any existing code or tests.

Issues addressed

  • Automates the test runs
  • Prevents bad code from being merged into master

Action items

  • CI/CD will be set up on GitHub
  • Automated tests should be triggered on every code push to a new branch
  • Automated tests should be triggered when a Pull Request is created
  • Code is deployed to the production environment only if all tests are green; a minimal gate script is sketched below
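
As a minimal sketch (the package and test paths are placeholders), the CI job can run a small gate script that executes linting and tests and fails the build on any error, which in turn blocks the merge and the deployment:

```python
# ci_check.py: a minimal sketch of a gate script the CI job could run on every
# push and Pull Request. Paths and package names are placeholders; the CI system
# simply fails the build if this script exits with a non-zero status.
import subprocess
import sys

CHECKS = [
    ["flake8", "my_project", "tests"],          # lint checks
    ["pytest", "--cov=my_project", "tests/"],   # unit/integration tests with coverage
]

for cmd in CHECKS:
    print("Running:", " ".join(cmd))
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(result.returncode)  # block the merge and deployment on failure

print("All checks passed; safe to merge and deploy.")
```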

This is a very important step in the software engineering world, but it almost always gets skipped in Data Science projects. We will monitor our jobs and raise an alert if we get runtime errors in our code.

Issues addressed

  • More visibility, rather than black-box code execution
  • Monitor input and output processing stats
  • Monitor infrastructure availability and dependencies
  • Track the trend of past run failures/successes
  • Alert us when our ML pipeline fails or crashes

Action items

  • If you have a monitoring tool (highly recommended) — send events with input/output stats to it
  • If no monitoring tool is available — log all the important stats in your log files
  • If no monitoring tool is available — we could also store the important stats of each run in a database for future reference
  • Build a Slack/Microsoft Teams integration to alert us on pipeline pass/fail status, as in the sketch below
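
A rough sketch of such logging and alerting, where the Slack webhook URL, the pipeline function, and the stats are placeholders:

```python
# monitoring.py: a rough sketch of logging run stats and alerting on failure.
# The Slack webhook URL, pipeline function, and stats are placeholders.
import logging
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

logging.basicConfig(filename="pipeline.log", level=logging.INFO)
logger = logging.getLogger("ml_pipeline")


def alert(message: str) -> None:
    """Post a pass/fail message to a Slack incoming webhook."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": message})


def run_with_monitoring(run_pipeline, input_rows: int) -> None:
    """Log key stats for each run and send an alert on success or failure."""
    logger.info("Pipeline started, input rows: %d", input_rows)
    try:
        output_rows = run_pipeline()
        logger.info("Pipeline finished, output rows: %d", output_rows)
        alert(f"ML pipeline succeeded: {input_rows} rows in, {output_rows} rows out")
    except Exception:
        logger.exception("Pipeline failed")
        alert("ML pipeline FAILED, check pipeline.log for details")
        raise
```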

That’s all for this post. I hope these tips are useful. Please share your thoughts and the best practices you have applied to your own Data Science projects.


