Tips & resources for building authentic Data Science portfolio projects

This post was originally published by Vagif Aliyev at Towards Data Science

Learn about different tips & resources that will make your next portfolio project stand out from the crowd.

Introduction

Portfolio projects are key for any Data Scientist. Not only do they showcase your work, abilities and strengths to recruiters, but they are also a great way to apply your learning.

Unfortunately, many people don’t craft their portfolio projects to be the best they can be. Instead, they write some code, put it up on GitHub, and possibly write an article about it, or they do a Kaggle competition and drop the code on GitHub.

Unfortunately, this is not a great way of creating portfolio projects. Why? First, the model you have created lives only in a notebook, so you don’t learn about model deployment. Second, most of these competitions give you clean datasets, which is quite a contrast to real-world data, so you are not really doing the hard part of obtaining the data.

In this article, I will share with you the following:

  • How to come up with great ideas for a portfolio.
  • Where you can find free open datasets for your next portfolio project
  • How to approach a portfolio project
  • 2 essential steps to take in order to make the most of your portfolio project

So, sit back, relax, and enjoy the article!

Step 1: Thinking of great ideas for a portfolio project

This is arguably the hardest part of a portfolio project, because sometimes you may have all the skills to do a project, but you don’t know where and how to apply them!

Tip 1: Getting Inspiration

One approach is to take inspiration from other projects that people have made. Here are some great examples:

  • Stanford’s ML Class projects: This is the work of the machine learning class at Stanford. The class consists of a project, and the reports and posters can be seen online. This is one of my favourites, as some of the projects here are amazing, and they have kindly provided the GitHub code too.
  • AI Generated Recipes: This project made by Derek Jia is a really good example of how you can combine two things you love into a really inspiring project.
  • Lazy Predict: This project is a very good example that not all portfolio projects have to be deployed through Flask or made into a Docker image. Here, the author has made a PyPI package that aids beginners in selecting the best model for a project.
  • SnapStudy: Not only a great idea, but also a great name! This project allows users to take a picture of their notes, and automatically generates flashcards to help them remember! How cool is that!
  • Stock Trading Bot: This is a really good example of an end-to-end Reinforcement Learning project. With very clear instructions and information about the project, it shows how one can apply Reinforcement Learning knowledge to the real world.

Tip 2: Brainstorming and solving problems in life

Brainstorming is a really powerful tool for generating project ideas. Writing things down can make you more productive, and I am often surprised at the ideas I come up with when writing them down on some sticky notes! This tutorial is a perfect guide on how to make the most out of your brainstorming sessions.

Another approach is to find out what problems you may face in your day-to-day life, and see if you can solve them using Data Science. Questions you can ask could be the following:

  1. Could I utilise NLP to write essays for me?
  2. Could I create a spam email classifier using supervised learning?
  3. Could I create an app that recommends which Netflix movie I should watch? Or which Spotify songs I should listen to?
  4. Could I create an app to recommend me recipes based on food that I like?
  5. Could I create a computer vision app that can read my handwriting?
  6. Could I create a Reinforcement Learning Bot to trade stocks for me?

And so on. The point is that there are lots of things that may bother you, but instead of complaining about them, find out if you can utilise ML & Data Science to come up with innovative solutions!

Step 2: Open datasets for your portfolio project

The most important part of any project is data. Without data, everything else is useless. However, finding the right data for your projects can be a tough task. That’s why I have compiled a list of free, open and good quality datasets for you to use for your next portfolio project.

Without further ado, here is the list of free open datasets:

  1. FiveThirtyEight
  2. BuzzFeed News
  3. Socrata
  4. Awesome Data
  5. Google Public Datasets
  6. Quandl
  7. Data.gov
  8. Academic Torrents
  9. data.world
  10. AWS Public Datasets
  11. r/Datasets
  12. Data is Plural
  13. Wikipedia Datasets
  14. IMF Data
  15. World Bank Data
  16. NASA Datasets
  17. CERN Open Data Portal
  18. Global Health Observatory Data Repository

This is a list I have compiled from my own personal experience of using these resources. I encourage you to check some of them out and see what is hot and what is not. So now you definitely can’t use the excuse of “I can’t find good data!”

Once you have a dataset downloaded and ready to analyse, pause for a second, and ask some key questions:

  1. How was this dataset created?
  2. Where does the data come from?
  3. What data types are present in the data?
  4. Are there missing values? If so, are they missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR)?
  5. Does the data have categorical values?
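
As a minimal sketch of how some of these questions might be answered in code, the snippet below profiles a tiny in-memory CSV using only the Python standard library. The sample data, the column names and the `profile_columns` helper are all made up for illustration; in a real project you would more likely reach for a library such as pandas.

```python
import csv
import io

# Hypothetical sample data; an empty cell stands for a missing value.
SAMPLE_CSV = """age,city,income
34,Dublin,52000
,Cork,48000
29,,61000
41,Dublin,
"""


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def profile_columns(csv_text):
    """Summarise each column: missing count, rough type, unique values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    profile = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v != ""]
        # Treat a column as numeric only if every observed value parses.
        kind = "numeric" if all(is_number(v) for v in present) else "categorical"
        profile[column] = {
            "missing": len(values) - len(present),
            "type": kind,
            "unique": len(set(present)),
        }
    return profile


for name, summary in profile_columns(SAMPLE_CSV).items():
    print(name, summary)
```

Note that a count of missing values does not tell you whether they are MCAR, MAR or MNAR. That question usually needs knowledge of how the data was collected.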

Step 3: How to approach a portfolio project

When I was starting out with building portfolio projects, I would usually skip all the EDA and preprocessing and just run an XGBoost model (yes, how stupid indeed). However, I knew that I needed to follow a clear structure for my projects. And then I met CRISP-DM.

Cross-industry standard process for data mining

[Figure: CRISP-DM process diagram. Source: Wikipedia]

The cross-industry standard process for data mining (CRISP-DM) is a process model whose 6 steps describe the Data Science life cycle. It essentially helps you plan, organise, structure and implement your projects.

Great. But what do the steps really involve? Good question. So, let’s discuss exactly that!

The steps involved in CRISP-DM

1. Business Understanding / Goal Definition

Here, you define the actual objective of the project, and what you are aiming to achieve by creating this project. This can be split into 3 stages:

  • Objective: Here, you define the goal of the project and the task you are trying to achieve. You also investigate whether data can be used to solve this problem.
  • Resource Assessment: Here, you investigate the resources needed for your project, and where and how you will obtain them. Will you work locally, or on the cloud? Will your data need to be pooled from many locations?
  • Tools Assessment: Here, you select the tools and technologies that you think will best suit this project, based on its objective.

2. Data Understanding

Ok, so now that you have understood what you are trying to achieve, you begin to do your analysis. The steps can be broken down to the following tasks:

Data Collection: You begin by collecting your data. You may make use of REST APIs, data warehousing services, or other methods to collect and combine your data into a usable dataset.

Data Description: You begin by doing some EDA, looking at the basic structure of the data, getting a feel for it and understanding the type of data you are working with.

Data Exploring: Here, you take a deeper dive into your data. You begin to perform more complex queries, look for hidden patterns in the data, and try to find aspects of the data that give key insights into the problem at hand.

Data Assessment: You begin to assess the quality of the data. You ask questions such as:

  • Is the data clean?
  • Are there missing values?
  • Will I need to drop any features?
  • How will I need to preprocess my data?
  • Will feature scaling be required?

3. Data Preparation

This is the step where you prepare the final data to be used for the model. The stages here are the following:

  • Data selection: selecting the data that you will need, and removing unneeded features from your data.
  • Data cleaning: this stage is usually the longest. You may have to impute, correct, drop and format values.
  • Data construction: feature engineering. This is where you use the insights gained from your data analysis to see how you can intelligently utilise existing features to construct new features that will hopefully benefit your model. You may also need to apply transformations to certain values.
  • Data integration: combine different datasets together to form one whole dataset.
  • Data Formatting: you may need to format certain values, for example, you may want to encode/adjust string and categorical values to numeric values in order to gain useful information from the features
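
To make a few of these stages concrete, here is a small sketch of median imputation (cleaning), standardisation (a common transformation) and one-hot encoding (formatting) using only the standard library. The helper names and the sample values are hypothetical, and a real project would typically use a library such as scikit-learn for this.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def standardise(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode categories as 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


ages = [34, None, 29, 41]  # hypothetical feature with one missing value
cities = ["Dublin", "Cork", "Cork", "Dublin"]  # hypothetical categorical feature

ages_clean = impute_median(ages)
print(ages_clean)
print(standardise(ages_clean))
print(one_hot(cities))
```

One design point worth keeping in mind: statistics such as the median, mean and standard deviation should be computed on the training data only and then reused on validation and test data, otherwise information leaks between the splits.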

4. Modeling

Probably the most enjoyable and anticipated part of the project. This stage is split into 5 areas:

  1. Select modelling techniques: Determine which algorithms to try (e.g. regression, neural net). This can be done by trying several baseline approaches and comparing them side by side.
  2. Generate test design: Depending on your modelling approach, you might need to split the data into training, validation, and test sets. You may decide to use cross-validation to get a better sense of how your model will generalise to new data. Make sure not to tune or select your model on the test set! Keep it for the final evaluation.
  3. Build model: As glamorous as this might sound, this might just be executing a few lines of code like “reg = LinearRegression().fit(X, y)”. This can also consist of combining a group of models together, known as an ensemble.
  4. Assess model: Generally, multiple models are competing against each other, and a Data Scientist needs to interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design.
  5. Tune model: After you have selected an ideal model, you may want to tune its hyperparameters to get the best fit to your data.
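
The test design step can be sketched as follows: hold out a final test set first, then run k-fold cross-validation on the remaining data only. This is a standard-library illustration with made-up target values and a deliberately trivial baseline (predict the training mean); the function names are my own and not taken from any library.

```python
import random
from statistics import mean


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction as the final test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    for fold in range(k):
        validation = indices[fold::k]  # every k-th row, offset by the fold
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


def cross_validate_mean_baseline(targets, k=5):
    """Score a predict-the-training-mean baseline by mean absolute error."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(targets), k):
        prediction = mean([targets[i] for i in train_idx])
        errors = [abs(targets[i] - prediction) for i in val_idx]
        scores.append(mean(errors))
    return mean(scores)


targets = [float(x) for x in range(1, 51)]  # made-up target values
train, test = train_test_split(targets)
print("train size:", len(train), "test size:", len(test))
# The test set is not touched here; it is reserved for the final evaluation.
print("baseline CV MAE:", cross_validate_mean_baseline(train, k=5))
```

A baseline score like this gives you a number that any real model has to beat before it is worth tuning.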

5. Evaluation

This stage of the cycle focuses on deciding which model best fits the objective of the project, and on planning the next steps. This stage can be split up into the following:

  • Evaluate Models: You may use different metrics based on your domain problem to see which model performs best. You must select the model that best meets the objective of the project.
  • Review Project: This is an important step. You look back through the cycle and see if you could have done anything better. Did you do a thorough EDA? Were the best features selected? Summarise your findings and adjust your project where needed.
  • Determine next steps: Based on the performance of the model and how well it meets the goal of the project, you may decide to go ahead with deployment, or you may go back through the cycle and see if you can improve certain aspects of the project.
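
As an illustration of comparing models with a metric chosen for the problem, the sketch below computes precision, recall and F1 for two sets of predictions and picks the one with the higher F1. The labels, the predictions and the model names are invented for the example; which metric matters most depends on your project objective.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1), using 0.0 when a ratio is undefined."""
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical validation labels and predictions from two candidate models.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 1, 1, 0],
    "model_b": [1, 1, 1, 1, 1, 1, 1, 1],
}

scores = {name: precision_recall_f1(y_true, pred) for name, pred in predictions.items()}
best = max(scores, key=lambda name: scores[name][2])  # compare on F1
for name, (precision, recall, f1) in scores.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("best by F1:", best)
```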

6. Deployment

Sorry to burst your bubble, but a model sitting inside your Jupyter Notebook is not very useful to people! You need to have a plan for deploying the model. The steps involved in this are the following:

  • Plan deployment: Develop a clear and concise plan on how you are going to deploy the model.
  • Plan monitoring and maintenance: Develop a thorough monitoring and maintenance plan to avoid issues during the operational phase (or post-project phase) of a model.
  • Produce final report: The project team documents a summary of the project, which might include a final presentation of data mining results.
  • Review project: Conduct a project retrospective about what went well, what could have been better, and how to improve in the future.
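
A first small step out of the notebook is to separate the trained parameters from the code that serves predictions. The sketch below is only that, a sketch: it assumes a simple linear model whose parameters fit in a JSON file, and the parameter values, feature names and helper functions are hypothetical. A real deployment would wrap something like `predict` in a web framework such as Flask or package it in a Docker image.

```python
import json
import tempfile
from pathlib import Path


def save_model(params, path):
    """Write model parameters to disk as JSON."""
    Path(path).write_text(json.dumps(params))


def load_model(path):
    """Read model parameters back from a JSON file."""
    return json.loads(Path(path).read_text())


def predict(params, features):
    """Apply a linear model: intercept + sum(coefficient * feature)."""
    total = params["intercept"]
    for name, coefficient in params["coefficients"].items():
        total += coefficient * features[name]
    return total


# Hypothetical parameters, standing in for a model trained in a notebook.
trained = {"intercept": 1.0, "coefficients": {"rooms": 2.0, "age": -0.5}}

# Save in one place, load in another, as a serving process would.
with tempfile.TemporaryDirectory() as folder:
    model_path = Path(folder) / "model.json"
    save_model(trained, model_path)
    served = load_model(model_path)

print(predict(served, {"rooms": 3, "age": 10}))
```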

These principles will greatly help you in clearly structuring your portfolio projects, so that you are not jumping from one stage to another. If at any stage you are unsure about something, always traverse back through the cycle and see if you can improve on anything.

2 key assets to use in order to make the most of your project

The reason I put these two methods in a section of their own is simply that I have not seen enough portfolio projects making use of them. These 2 methods are very underrated and can help immensely in the following areas:

  1. Debugging
  2. Reducing Errors

Key asset 1: Logging

Undoubtedly the most underused strategy, and one I rarely see in portfolio projects, is logging. It is so easy, yet hardly used. Essentially, logging records key messages as your code runs, which you can later go through if you begin to face an issue in your project.

I use this all the time, and I have caught numerous nasty bugs that would otherwise have gotten into production and forced me to spend endless hours fixing them. If you want to learn how to use logging, here is a great article by Real Python on logging, and honestly, it is the only tutorial you will ever need to read, as it is truly comprehensive.
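
To give a feel for how little code this takes, here is a minimal sketch using Python's built-in `logging` module. The logger name and the `drop_missing` helper are invented for the example; the idea is simply to record what a preprocessing step did so you can trace problems later.

```python
import logging

# Configure logging once, near the start of your program.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("portfolio.preprocessing")  # hypothetical logger name


def drop_missing(rows):
    """Drop rows containing None and log how many were removed."""
    kept = [row for row in rows if None not in row]
    dropped = len(rows) - len(kept)
    logger.info("Dropped %d of %d rows with missing values", dropped, len(rows))
    if not kept:
        # A warning stands out in the log when something looks wrong.
        logger.warning("No rows left after dropping missing values")
    return kept


clean = drop_missing([(1, 2), (None, 3), (4, 5)])
```

If a later stage fails because the dataset is unexpectedly small, a log line like the one above tells you immediately where the rows went.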

Key asset 2: Setting up a CI/CD Pipeline

This is also a very underrated and underused strategy. I cannot put enough emphasis on the importance of Test-Driven Development and the countless benefits it brings. Establishing a solid CI/CD pipeline can help achieve the following objectives:

  1. Catch bugs early before they go into production
  2. Roll back to previous code if you have a significant bug
  3. Track your code at all stages of development
  4. Split the different aspects of the project into separate workspaces so that all stages of the process are tracked and assessed separately.

Personally, I like to use Travis CI and Git when setting up my CI/CD pipeline. However, feel free to use any CI or CD tool that you may desire.
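
A CI pipeline is only as useful as the tests it runs. As a rough sketch of what those tests might look like, here is a small unit test suite written with Python's built-in `unittest` module. The `clean_price` function is a made-up example of project code, and the suite is run in-process here only so the snippet is self-contained.

```python
import unittest


def clean_price(raw):
    """Convert a price string such as '$1,200.50' to a float."""
    return float(raw.replace("$", "").replace(",", "").strip())


class CleanPriceTests(unittest.TestCase):
    def test_plain_number(self):
        self.assertEqual(clean_price("42"), 42.0)

    def test_currency_symbol_and_commas(self):
        self.assertEqual(clean_price("$1,200.50"), 1200.5)

    def test_invalid_input_raises(self):
        with self.assertRaises(ValueError):
            clean_price("not a price")


def run_tests():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanPriceTests)
    return unittest.TextTestRunner().run(suite)


result = run_tests()
print("all tests passed:", result.wasSuccessful())
```

In a real repository the tests would normally live in their own files, and the CI service would run a command such as `python -m unittest` on every push, failing the build if any test fails.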

Conclusion

While nobody is perfect, I believe that one should always strive for it. Here, I have given tips and resources to make your projects stand out from the rest, and shown how to make the most of your project. I have described the end-to-end cycle that should be followed, and how you can use 2 underrated methods to really reach maximum efficiency in your next project.

I hope this article has helped you one way or another, and I hope that you are now ready to build the most authentic portfolio project ever witnessed! Make sure to stay updated for more content, and always be the best you can be!
