The Data Mining process

This post was originally published by Arne Wolfewicz at Medium [AI]

To this day, simply dumping a pile of data into even the most advanced machine is unlikely to give you back anything meaningful, let alone produce the outcome that you desire. Intelligent systems still need people to ask the right questions, set goals, and evaluate the performance.

We set ourselves the objective of democratizing machine learning and allowing users to prepare data, as well as train, evaluate, and put a model into production, without having to write a single line of code.

But how can we get from an idea to a functioning system?

In this article, we walk you through the CRISP-DM framework, highlight the critical elements in the process, and show how modern tools can take away much of the complexity.

Machine learning practitioners around the world have been — consciously or unconsciously — following a certain pattern in order to make a machine produce good results: CRISP-DM (cross-industry standard process for data mining). It suggests that certain steps have to be taken in the following order:

  1. Problem understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment

However, it is usually wrong to assume that one can get from idea to working system by just taking each step once — iteration is the rule rather than the exception (see illustration).

CRISP-DM: Cross-industry standard process for data mining.
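The iterative character described above can be sketched as a loop: the phases run in order, and the whole cycle repeats until evaluation passes. This is a toy illustration only; the phase behavior and the acceptance check are assumptions made for the example, not part of the framework itself.

```python
# Toy sketch of the iterative CRISP-DM cycle: the six phases run in
# order, and the whole cycle repeats until the evaluation result is
# acceptable. Phase behavior and the acceptance check are illustrative
# assumptions, not prescribed by CRISP-DM.
PHASES = [
    "problem understanding",
    "data understanding",
    "data preparation",
    "modeling",
    "evaluation",
    "deployment",
]

def crisp_dm(run_phase, acceptable, max_cycles=5):
    """Repeat the phase sequence until evaluation passes; return the
    number of cycles needed, or None if we give up."""
    for cycle in range(max_cycles):
        results = {}
        for phase in PHASES:
            if phase == "deployment" and not acceptable(results["evaluation"]):
                break  # a poor evaluation sends us back to an earlier phase
            results[phase] = run_phase(phase)
        if "deployment" in results:
            return cycle + 1
    return None
```

Here, `run_phase` stands in for whatever work each phase involves; the point is only that deployment is reached after, possibly, several full passes.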

Let’s go through the steps one by one.

Note to those who are familiar with some machine learning techniques: this is the time for objective analysis. It is quite common to reframe the problem several times as the project progresses, but any iteration that can be avoided should be avoided.

Also referred to as business understanding, this first step is about getting a clear view of the problem at hand. Each business problem is unique in some way and may not present itself as a data mining case from the start.

Some guiding questions during this phase:

  • What is the issue you are facing?
  • What is the input, processing, and desired output of the process?
  • How is the process being done today?
  • What steps would you like to automate?
  • Which aspects should be automated by machine learning and which can be handled by other tools?

After the problem is framed, it is important to understand (1) what data is available and (2) what that data looks like. In any business setting, there is data in a variety of formats: images, plain text, sound, videos, databases, both structured and unstructured. Today's technology can deal with all of these, but we first must determine what can be used.

Besides the format itself, it also helps to get an idea of where the data is coming from. In the simplest case, you have immediate access to it.

Business people are usually great at framing the problem and understanding the data involved in the process. At this point, however, projects turn a lot more “technical”: data needs to be retrieved in greater quantities, labeled, and transformed into a machine-digestible format. Afterwards, a number of models have to be set up, trained, and finally evaluated.
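As a minimal illustration of the “machine-digestible format” step, here is a sketch that turns a few labeled texts into bag-of-words vectors and numeric labels. The data and the encoding scheme are invented for the example, not taken from any particular project.

```python
# Sketch of data preparation: labeled texts become numeric feature
# vectors (bag-of-words counts) plus integer label indices. The data
# below is made up for illustration.
labeled = [
    ("great product, works well", "positive"),
    ("broke after two days", "negative"),
    ("exactly as described", "positive"),
]

# Build a vocabulary and a label index from the labeled examples.
vocab = sorted({word for text, _ in labeled for word in text.split()})
labels = sorted({label for _, label in labeled})

def encode(text):
    """Represent a text as word counts over the shared vocabulary."""
    words = text.split()
    return [words.count(word) for word in vocab]

X = [encode(text) for text, _ in labeled]       # feature matrix
y = [labels.index(label) for _, label in labeled]  # label indices
```

Real pipelines use far more robust tokenization and feature extraction, but the shape of the output, rows of numbers plus a label per row, is what “machine-digestible” means in practice.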

All these activities typically go beyond the skill set of managers and non-technical staff, which is why the project has to be handed off at some point. There is nothing bad about this per se; however, most companies do not even have these skills on their payroll. This is the very reason why many projects either stall at this point or, worse, never get considered in the first place.

This is why we established colabel: On our platform, all three stages can be handled without code:

  • Data can be labeled in Slack (if needed)
  • State-of-the-art models are automatically selected and trained
  • The user receives immediate feedback on how well the training went

As such, control remains with whoever thought of the problem in the first place without having to employ someone or apply for developer capacity.

Note: There is much more to be said about each step, but we skip the details for brevity. If you want to read up on them, have a look at IBM’s knowledge center.

Simply having a prediction machine is worth little to nothing. What ultimately drives performance, speed, and quality in processes is having the system work on automatic requests, possibly embedded in a no-touch workflow.

In the traditional setting, there are two popular ways of deploying a model:

  1. Embed the trained model into an existing program
  2. Set up a microservice that communicates via an API

While the first option is becoming less and less common, the second at least allows you to connect workflow automation tools like Zapier. Our software does that for you: you get the Zapier integration right out of the box.
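To make the microservice option concrete, here is a minimal prediction service built on Python’s standard library. The `predict` stub and the JSON request format are assumptions made for illustration; a real service would load a trained model and define its own interface.

```python
# Minimal sketch of a prediction microservice: a trained model's
# inference wrapped behind an HTTP API. `predict` is a stub standing in
# for real model inference; the request/response format is invented.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Stub: a real service would call the trained model here.
    return {"label": "positive" if sum(features) > 0 else "negative"}

class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

def make_server(port=8000):
    return HTTPServer(("localhost", port), PredictionHandler)

# make_server().serve_forever() would start answering POST requests
# such as {"features": [1.0, -0.5]} with a JSON prediction.
```

Because the service speaks plain HTTP and JSON, any workflow tool that can issue web requests can feed it work, which is exactly what makes the API option so compatible with automation platforms.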

Most users we speak to immediately get excited about the technology. This is understandable, given that such software is usually not available without major investments of time, money, or both. As a result, smaller or younger businesses in particular do not get to enjoy the benefits of such systems and end up either hiring manual labor or not tackling the task at all.

Having said that, we strongly recommend that you not “jump the gun” by skipping the initial steps. Doing so typically means additional iterations that you might want to avoid.

We are working hard to make the flow as user-friendly as humanly possible. Our ultimate goal is to make the software entirely self-explanatory. Until then, feel free to connect with us to discuss your idea!
