Data Science must-know

This post was originally published by Boadzie Daniel at Medium [AI]

Data Science is among the new technologies driving the “Industrial Revolution 4.0”. It has become a crucial part of every company and every successful economy. Billions of dollars are up for grabs in the data economy. In fact, the data scientist role has been called the “sexiest job of the 21st century”. The reasons for the data hype are numerous; they include:

  1. Data Deluge

Huge volumes of data in all forms (text, videos, images, etc.) are generated daily. All of this data needs to be analyzed to gain valuable insight for decision-making.

  2. Moore’s Law

The computing power required to process these huge data volumes is now available thanks to advances in computing devices and storage. For example, GPUs and TPUs have significantly reduced the time needed to run some algorithms that used to take weeks or months.

Data Science involves the activity of analyzing large and usually messy datasets in order to extract knowledge and insight for decision-making.

Data Science is an interdisciplinary field of study that includes maths, computer science, etc.

The following skills make a great data scientist:

  1. Curiosity: Data Science involves uncovering knowledge and insight buried deep in messy data. This means that to excel as a Data Scientist, you must learn to embrace your curiosity and explore options.
  2. Problem-Solving: Data Scientists are problem-solvers. They use insights from data to solve real-world problems.
  3. Communication Skills: The ability to communicate insight clearly, so that non-technical people can understand it, is crucial to the success of any Data Scientist.
  4. People Skills: Data Scientists rarely work alone. As such, the ability to work well with other people is important for any Data Scientist who wants to get far.
  5. Critical Thinking: The ability to analyze situations, solutions and problems, select the most viable alternative, and continually evaluate the solution in order to make incremental improvements is required of anyone who wants to be a Data Scientist.
  1. Coding (Python/R/etc.): Coding is an essential part of the skill set of a Data Scientist. Python and R are the most popular languages used in Data Science. Python has many tools that make the work of a Data Scientist easier. Examples include NumPy, Pandas, Matplotlib, etc.
  2. SQL: Databases are the primary storage for most of our data. It is therefore crucial that you learn to write SQL queries to retrieve data from databases.
  3. Big Data: Large volumes of data require large processing capacity. That is why technologies like Hadoop and Apache Spark were created to handle such huge datasets and their particular characteristics.
  4. Machine Learning: Machine Learning is a crucial part of Data Science. Machine Learning is the ability to make computers learn from data. Deep learning, a subset of machine learning, powers many of the applications we see today, such as self-driving cars and chatbots.
  5. Data Visualization: Communicating data through visualization is also an important part of a Data Scientist’s tool set. After all, if you can’t communicate your results to non-technical people, how would your work benefit them?
  6. Domain Knowledge: Knowledge of and intuition about the field in which you are working are also important for accurately interpreting Data Science results and applying them to the business.
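
As a small illustration of the Coding and SQL skills above, the sketch below uses Python's built-in sqlite3 module to create an in-memory database and query it. The `sales` table, its columns and its rows are hypothetical, invented only for this example.

```python
import sqlite3

# In-memory database; the "sales" table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# Retrieve the total amount per region, largest total first.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()
conn.close()

print(rows)  # [('north', 150.0), ('south', 80.0)]
```

The same SELECT ... GROUP BY pattern carries over to most relational databases, although connection details differ from one database to another.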


Scalar: A scalar is a single number. Examples include the numbers 1, 2, or 3 taken as single units rather than as part of a list.

Vector: Vectors are arrays of numbers arranged in some order.

data = [1, 2, 3, 4, 5]  # the variable data is a one-dimensional vector

Matrix: A matrix is a two-dimensional array (often called a 2D array). Elements of a matrix are stored in rows and columns.

myMatrix = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 5, 2],
    [5, 2, 0, 4, 8],
]  # a 3 x 5 matrix (3 rows, 5 columns)
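
To make the row-and-column layout concrete, here is a minimal sketch in plain Python that reuses the 3 x 5 matrix above. The names `n_rows`, `n_cols`, `element` and `second_column` are introduced only for illustration.

```python
# The same 3 x 5 matrix as above, stored as a list of rows.
myMatrix = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 5, 2],
    [5, 2, 0, 4, 8],
]

n_rows = len(myMatrix)      # number of rows
n_cols = len(myMatrix[0])   # number of columns

# Indexing is zero-based: row index first, then column index.
element = myMatrix[1][2]    # second row, third column

# A column is collected by taking the same position from every row.
second_column = [row[1] for row in myMatrix]

print(n_rows, n_cols)   # 3 5
print(element)          # 4
print(second_column)    # [2, 1, 2]
```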

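Tensor: A tensor is a multidimensional array that generalizes scalars (0 dimensions), vectors (1 dimension) and matrices (2 dimensions) to any number of dimensions.

# a tensor (the values below are illustrative)
from numpy import array
tensor = array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # a 2 x 2 x 2 tensor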

Random Variable: A variable whose value depends on chance. This variable can be discrete or continuous.

Probability Distribution: A probability distribution describes how likely a random variable is to take each of its possible values.

Probability Mass Function: This refers to a distribution over discrete random variables. It gives the probability of each individual value. Examples include the Binomial and Poisson distributions.

Probability Density Function: This refers to a distribution over continuous random variables. Examples include the Normal, Uniform and Student’s t distributions.
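
The difference between a mass function and a density function can be sketched with the Python standard library. The helper `binomial_pmf` and the coin-flip numbers are assumptions made for this example; `NormalDist` comes from the built-in statistics module.

```python
from math import comb
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# PMF: probability of exactly 2 heads in 4 fair coin flips.
p_two_heads = binomial_pmf(2, 4, 0.5)

# A PMF sums to 1 over all possible outcomes (0 to 4 heads here).
pmf_total = sum(binomial_pmf(k, 4, 0.5) for k in range(5))

# PDF: density of the standard Normal distribution at its mean.
# A density is not a probability; probabilities are areas under the curve.
standard_normal = NormalDist(mu=0.0, sigma=1.0)
density_at_zero = standard_normal.pdf(0.0)
prob_within_one_sd = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

print(p_two_heads)  # 0.375
```

For the discrete case the function value is itself a probability, whereas for the continuous case a probability only appears once the density is accumulated over an interval, as in `prob_within_one_sd`.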

Marginal Probability: Given several random variables, the marginal probability of one of them is its probability distribution on its own, obtained by summing (or integrating) the joint distribution over the other variables.

Conditional Probability: This refers to the probability that an event will take place given that another event has taken place. For example, the probability that you will buy a car depends on whether you have money.
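
Both ideas can be worked through on the car example. The survey counts below are hypothetical, chosen only to keep the arithmetic easy to follow; the sketch builds a joint distribution, then derives a marginal and a conditional probability from it.

```python
# Hypothetical survey counts: (has_money, buys_car) -> number of people.
counts = {
    (True, True): 30,
    (True, False): 20,
    (False, True): 5,
    (False, False): 45,
}
total = sum(counts.values())

# Joint probability of each combination.
joint = {outcome: n / total for outcome, n in counts.items()}

# Marginal probability: sum the joint over the variable being ignored.
p_money = joint[(True, True)] + joint[(True, False)]
p_car = joint[(True, True)] + joint[(False, True)]

# Conditional probability: P(car | money) = P(car and money) / P(money).
p_car_given_money = joint[(True, True)] / p_money

print(p_money, p_car, p_car_given_money)
```

With these counts the marginal probability of buying a car is 0.35, while the probability of buying a car given that the person has money rises to 0.6.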


Statistics is a branch of mathematics that deals with the collection, organization, analysis, interpretation and presentation of data. There are two types:

  1. Descriptive Statistics: This refers to the descriptive coefficients that summarize a given dataset for better understanding. It includes finding central tendencies like the mean, median, mode, and other coefficients that describe a dataset.
  2. Inferential Statistics: This involves testing hypotheses and drawing conclusions about a population from a sample of it.
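
The descriptive coefficients mentioned above can be computed with Python's built-in statistics module. The `orders` sample is hypothetical, and the sketch covers descriptive statistics only, not hypothesis testing.

```python
import statistics

# Hypothetical sample: daily number of orders over nine days.
orders = [12, 15, 11, 15, 20, 14, 15, 9, 15]

mean = statistics.mean(orders)      # arithmetic average
median = statistics.median(orders)  # middle value once sorted
mode = statistics.mode(orders)      # most frequent value
spread = statistics.stdev(orders)   # sample standard deviation

print(mean, median, mode)  # 14 15 15
```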

Data Mining

Data Mining involves processes, methodologies, tools and techniques to discover and extract patterns, knowledge and valuable insights from messy datasets.

Artificial Intelligence (AI)

Artificial Intelligence is the art, science and engineering of making intelligent agents and machines that perform tasks normally requiring human intelligence. It includes fields such as Machine Learning, Natural Language Processing, etc.

Natural Language Processing (NLP)

Natural Language Processing is a multidisciplinary field of AI that combines computational linguistics, machine learning and computer science to help computers process, understand and interpret natural human language. Applications of NLP include:

  • Machine translation
  • Speech recognition
  • Question answering systems
  • Context recognition and resolution
  • Text summarization
  • Text categorization
  • Information extraction
  • Sentiment and emotion analysis
  • Topic segmentation

Machine Learning

Machine Learning is the ability of a computer to learn to perform human-level tasks without being explicitly programmed (Arthur Samuel, 1959).
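
One way to picture "learning without being explicitly programmed" is a model whose parameters are estimated from examples. The sketch below uses ordinary least squares to fit a straight line, which is just one simple technique picked for illustration and is not taken from the original post; the `hours` and `scores` data are hypothetical.

```python
# Hypothetical training data: hours studied -> exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Learning": the parameters are estimated from the data, not hard-coded.
slope, intercept = fit_line(hours, scores)

# Prediction for an input that was not in the training data.
predicted = slope * 6.0 + intercept

print(slope, intercept, predicted)  # 2.0 50.0 62.0
```

Nothing in the code states that each extra hour is worth two points; that rule is recovered from the examples, which is the sense in which the program learns.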

Deep Learning

Deep Learning is a sub-field of Machine Learning that loosely mimics the biological brain to help machines extract insights from messy data.

Professional Data Science and Machine Learning development follows a standard process that is robust, iterative and efficient, and that leads to quality data projects. This methodology is used heavily in industry as well as for small projects. It helps Data Professionals build end-to-end data solutions.

The process is known as CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining. It has six phases.

Phase 1 — Business Understanding

This is the crucial initial stage of any data project. This phase has the following objectives:

  • Business Problem definition
  • Assessment and Analysis of Scenarios
  • Project Planning and Documentation

Phase 2 — Data Understanding

This second phase is concerned with understanding the data that will be used for analysis and model building. The objectives include:

  • Data Collection
  • Data Description
  • Exploratory Data Analysis
  • Data Quality Analysis

Phase 3 — Data Preparation

This phase involves pre-processing the data into a form that is suitable for analysis and model building. It includes cleaning, wrangling, curating and preparing data. Objectives include:

  • Data Integration
  • Data Wrangling
  • Attribute Generation and Selection
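
A minimal sketch of what data wrangling and attribute generation can look like in plain Python follows. The records, the `prepare` helper and the choice to drop rows with missing values are all assumptions made for illustration; real projects may impute missing values instead, and often use a library such as Pandas.

```python
# Hypothetical raw records with a missing value and inconsistent text.
raw_records = [
    {"name": " Ama ", "height_cm": "160", "weight_kg": "60"},
    {"name": "kofi", "height_cm": "175", "weight_kg": None},
    {"name": "Esi", "height_cm": "150", "weight_kg": "45"},
]


def prepare(records):
    """Clean the records and derive one new attribute (BMI)."""
    prepared = []
    for record in records:
        # Data wrangling: skip rows that contain a missing value.
        if any(value is None for value in record.values()):
            continue
        # Data wrangling: normalize text and convert strings to numbers.
        name = record["name"].strip().title()
        height_m = float(record["height_cm"]) / 100
        weight_kg = float(record["weight_kg"])
        # Attribute generation: BMI = weight / height squared.
        bmi = round(weight_kg / height_m**2, 1)
        prepared.append(
            {"name": name, "height_m": height_m,
             "weight_kg": weight_kg, "bmi": bmi}
        )
    return prepared


clean_records = prepare(raw_records)
print([r["name"] for r in clean_records])  # ['Ama', 'Esi']
```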

Phase 4 — Modeling

The fourth phase is where the cleaned data is used to build models that will solve the business problem. Objectives include:

  • Selecting the appropriate model for the task
  • Model Building
  • Model Evaluation and Tuning
  • Model Assessment

Phase 5 — Evaluation

This phase involves measuring the model’s performance (for example, its accuracy) against the business goals. It also includes a review of the entire process, including all the previous phases.
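
Checking accuracy against a business goal can be sketched as follows. The labels, the predictions and the 0.75 threshold are hypothetical, and accuracy is only one of several possible performance measures.

```python
# Hypothetical true labels and model predictions on held-out data.
actual = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


model_accuracy = accuracy(actual, predicted)

# Hypothetical target agreed during the Business Understanding phase.
required_accuracy = 0.75
meets_goal = model_accuracy >= required_accuracy

print(model_accuracy, meets_goal)  # 0.8 True
```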

Phase 6 — Deployment

The final phase involves deploying the model to a production environment for users to access it. It also includes plans for monitoring and maintenance.

