The Data Scientist’s Toolbox (part 1)


This post was originally published by Benedict Neo at Towards Data Science

Data is the core ingredient of every data science process, and understanding what data is can help you work more efficiently and appreciate what data science is all about.

1. According to the Cambridge English Dictionary:

Information, especially facts or numbers, collected to be examined and considered and used to help decision-making.

2. According to Wikipedia:

A set of values of qualitative or quantitative variables.

Based on Wikipedia's definition, data can be broken down into four terms: set, variables, quantitative, and qualitative.


Set

  • The population from which data is drawn

Variables

  • Input variable (X, predictor, independent variable)
  • Output variable (Y, response, dependent variable)

Quantitative

  • Information about quantity (can be counted or measured)
  • Examples: age, height, weight, number of cases

Qualitative

  • Descriptive variables (can be observed but not measured)
  • Examples: color, blood type, infected or not, address


Taking the COVID-19 pandemic as an example, let’s say we want to visualize the number of confirmed cases in the US with a simple scatter plot:

  • Set – the confirmed cases in the United States
  • Independent variable, X – time (days)
  • Dependent variable, Y – the number of confirmed cases
  • Both X and Y are quantitative variables

The resulting plot can also reveal the relationship between X and Y, such as a positive or negative correlation. With statistical learning techniques, algorithms such as linear regression can then be used to build models for prediction and inference.

Data is messy and not perfect

As you advance in the field of data science, you will realize that data is often messy and unstructured, and that it takes skill, patience, and time to clean and structure it so that it’s ready to use. Take image data, for example: if you were to build a facial recognition model that detects a face, the input images could be dark, grainy, or blurry, and such messy images can be difficult to deal with. Another aspect is missing data. Data that is mined or collected from the real world is often riddled with missing information, and several techniques exist to deal with it.
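The post does not name specific techniques for missing data, so the following is only a sketch of two common options, assuming missing entries are marked as `None`: dropping the incomplete entries, or filling them with the mean of the observed values (mean imputation). The `ages` list is hypothetical example data.

```python
from statistics import mean

# Hypothetical ages with missing entries marked as None.
ages = [34, None, 29, 41, None, 36]

# Option 1: drop the missing entries and keep only observed values.
observed = [age for age in ages if age is not None]

# Option 2: mean imputation, replacing each missing entry
# with the mean of the observed values.
fill_value = mean(observed)
imputed = [age if age is not None else fill_value for age in ages]

print("observed only:", observed)
print("mean-imputed: ", imputed)
```

Dropping entries is simple but shrinks the dataset, while imputation keeps every row at the cost of making the data look less variable than it really is. Which trade-off is acceptable depends on the question being asked.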

Sources of Data

Data comes from many places, especially now that smartphone usage has increased dramatically with the growth of social media and streaming services such as Netflix and Spotify. Data can be categorized as internal or external: internal data is information generated within a business, such as finance data, while external data is information that comes from customers, usage analytics, and so on. Good data is also often hard to find. In most cases, you’ll have to mine it from the internet before you can analyze it, and a lot of cleaning is required for it to be useful.

Data is of secondary importance

The most important rule that data scientists should adhere to is to ask questions first, before seeking out data. Just as the scientific method starts with a hypothesis, data science starts with the questions that are crucial to solving the problem at hand.

As a quote often attributed to Einstein puts it:

“If I had an hour to solve a problem and my life depended on the solution, I would spend the first 55 minutes determining the proper question to ask… for once I know the proper question, I could solve the problem in less than five minutes.”
