This post was originally published by Arne Wolfewicz at Medium [AI]
Most people know that machine learning (ML) creates value, but few know how. This article gives an overview of the machine learning value chain: where does value creation start, and where does it end? Contrary to the beliefs of many, programming algorithms is only a small part.
A value chain describes the sequence of activities through which companies add value to a product, from start to finish.
The value chain for traditional industries is rather straightforward. For your local bakery around the corner, selling the fresh buns is the last step of a long sequence of activities: Procurement of raw materials, inbound logistics, storage, baking, sales, and maybe even distribution.
But what are the different value-adding activities of machine learning? Contrary to the beliefs of many, the actual algorithm programming comprises only a minor part of the ML value chain. There are other value-adding steps both before and after the ML programming takes place.
The ML value chain consists of six major steps:
- Problem definition
- Data collection
- Data storage
- Data preparation
- Algorithm programming
- Application development
Let’s go through each step, one by one.
Machine learning can be a valuable tool for solving a multitude of tasks. However, clearly understanding the problem, defining goals, and outlining a plan of action are not trivial.
Before you start thinking about how you will implement an ML solution, you need to clearly define the business objectives you want to reach with it. Set milestones along the way against which performance can be measured.
Make sure you understand the current solution and at which exact steps ML could provide a benefit. Think about the people involved. How will they interact with the new solution?
Data collection is about gathering the raw data. It is an important step, as machine learning usually requires huge amounts of data. This is especially true for deep learning, a subfield of machine learning. Normally, deep learning algorithms need thousands or even millions of data points to learn (read about the differences between deep learning and machine learning here).
Hence, data collection is a relevant value-adding activity.
Back in 2017, The Economist published a story titled, “The world’s most valuable resource is no longer oil, but data.” Hence, the crucial question is: Where do you get the data from?
Firms can generate their own data, e.g. on their customers or internal events. Yet in most cases, this isn’t enough, and firms rely on external data to train their ML models.
External data refers to public datasets, which can be found, for example, through Google’s Dataset Search, or to data that can be purchased or scraped from the web.
Due to the shortage of internal data and the high costs associated with getting external data, some companies resort to a third method: cooperating with similar firms, at times even with their competitors, to pool data.
Once the data is collected, it needs a secure place for storage. Data storage is the process of compiling raw data in data centers. Given the massive amount of data involved in machine learning, data storage is an integral part of the whole value chain.
In the early ML days, most companies used to store data on their own servers — not ideal, to put it mildly. With the rise of cloud technology, however, data can be stored and accessed at high speed and low cost, and some tools — like colabel — offer storage as an integral part of the product at no extra charge.
Still, masses of raw data are worth nothing. Raw data, oftentimes, is inconsistent, incomplete, and unstructured (read about the challenges of dealing with unstructured data here). Most machine learning models are not able to work with these data flaws.
This is where data preparation comes into play. It describes all efforts that make data utilizable for the ML algorithms. This could include data conversion, cleaning, enhancement, formatting, and labeling:
- Data conversion: the conversion of data from one format to another, most often to make it readable for a specific computer program
- Data cleaning: correcting inaccurate or incomplete data as well as removing any irrelevant data
- Data enhancement: adding information to data by matching it against an existing database, allowing the desired missing data fields to be added (e.g. your company’s customer data enhanced by information from a public business database)
- Data formatting: the organization of information according to specifications. Think, for instance, of ZIP codes in a spreadsheet column. Without data formatting, the AI might falsely interpret the ZIP codes as large numbers.
- Data labeling: tagging data with one or more labels, e.g. dog pictures with the label “dog”. This step is crucial since (supervised) ML models need input and related output to learn. Many of today’s ML applications are built upon data that is labeled by human labelers who regularly intervene to improve model performance. This concept is called human in the loop.
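To make the data preparation steps a bit more tangible, here is a minimal sketch in Python using only the standard library. The sample data, the column names, and the `prepare` function are made up for illustration; a real pipeline would look different and would likely use dedicated tooling. The sketch drops incomplete records (cleaning), removes an irrelevant column, keeps ZIP codes as text (formatting), and normalizes the spelling of existing labels.

```python
import csv
import io

# Hypothetical raw export: inconsistent casing, a missing name,
# an irrelevant column, and ZIP codes that start with zero.
RAW_CSV = """name,zip,animal,internal_note
Rex,02134,Dog,checked
Tom,10001,cat,
,94105,dog,duplicate?
Bello,00501, DOG ,
"""

def prepare(raw_text):
    """Clean and format rows so they can serve as labeled training data."""
    prepared = []
    # csv.DictReader yields every field as a string,
    # so "02134" keeps its leading zero.
    for row in csv.DictReader(io.StringIO(raw_text)):
        name = row["name"].strip()
        if not name:
            # Data cleaning: drop incomplete records.
            continue
        prepared.append({
            "name": name,
            # Data formatting: keep ZIP codes as five-character text.
            "zip": row["zip"].strip().zfill(5),
            # Data cleaning: one consistent spelling per label.
            "label": row["animal"].strip().lower(),
            # The irrelevant "internal_note" column is simply not copied.
        })
    return prepared

rows = prepare(RAW_CSV)
for r in rows:
    print(r)

# What goes wrong without formatting: read as a number,
# the ZIP code loses its leading zero.
print(int("02134"))
```

The last line shows the ZIP code problem in isolation: once the value is treated as an integer, the leading zero is gone and the code no longer identifies the right area.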
With prepared data at hand, software engineers can finally devote themselves to the topic of programming the algorithm. In machine learning, algorithms perform tasks without being explicitly programmed. While ML code might be perceived as the step where the “magic” happens, it is only one of several activities of the ML value chain.
Once the algorithm is developed, the model needs to be trained on data. There are three broad categories of how algorithms can be trained: supervised learning, unsupervised learning, and reinforcement learning. In case you are interested in learning more, we have a separate article on that here. But the ML value chain does not end with training.
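For readers who like to see the idea in code, here is a small sketch of the supervised case: the model receives inputs together with the related outputs (labels) and learns from them. It uses a nearest-centroid classifier, chosen here only because it fits in a few lines; the features (weight and height) and all numbers are invented for illustration.

```python
from collections import defaultdict
from math import dist

# Hypothetical labeled examples: (weight in kg, height in cm) -> label.
TRAINING_DATA = [
    ((30.0, 60.0), "dog"),
    ((25.0, 55.0), "dog"),
    ((4.0, 25.0), "cat"),
    ((5.0, 23.0), "cat"),
]

def train(examples):
    """Learn one centroid (mean feature vector) per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*points))
        for label, points in grouped.items()
    }

def predict(model, features):
    """Return the label whose centroid is closest to the input."""
    return min(model, key=lambda label: dist(model[label], features))

model = train(TRAINING_DATA)
print(predict(model, (28.0, 58.0)))
print(predict(model, (4.5, 24.0)))
```

Note that nobody wrote a rule such as "heavy animals are dogs". The decision boundary comes entirely from the labeled examples, which is why the quality of the data preparation step matters so much.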
Just because the baker has removed freshly baked buns from the oven does not mean he or she is finished. Similarly, the ML code itself is not the end of the value chain, no matter how good it is. What comes next is application development.
Application development is the process of turning the ML model into a commercially viable product. The code comes to life. In this step, software engineers and business people work hand in hand. Great raw data and high-quality code are worth nothing if there isn’t a use case for them.
From a management perspective, understanding what the ML value chain looks like is not enough. Business executives should also know who delivers the value at each step.
There are highly specialized companies, each serving a specific activity of the ML value chain. By focusing on their core competencies, these companies can provide best-in-class service in a particular area. Also, companies can configure their suppliers in a way that suits them best.
To make it more practical: Imagine you want to build up an initial training database of labeled animal images, i.e. all dog pictures are tagged as “dog”, cat pictures as “cat”. In terms of the ML value chain, this would translate to data collection and data preparation. A set of images could be obtained by using a web scraping tool. Next, you could hire a data labeling service. This might be a company with access to a large pool of workers or a platform allowing you to do the work yourself.
The advantage is obvious: These companies excel at what they are doing. However, this comes with drawbacks, too. The process of collecting and labeling data from the example above already involves two companies, and it still covers only a small part of the value chain. This adds complexity and inefficiencies.
Some companies promise to solve these problems by offering an end-to-end solution. Visually, this translates to a vertical representation, as you can see in the graphic above. In simple terms, those companies cover the whole ML value chain as a complete functional solution.
- Data collection & storage: OK, we don’t collect your data ourselves. But: We use pre-trained models that are then tweaked according to your data. This concept is called transfer learning. It reduces data needs dramatically, from millions of data points to hundreds. Also, we created a free Dataset Builder, so you can quickly build datasets with Google Images.
- Data preparation: We help you prepare the data, mostly with classification. When dealing, for instance, with image classification, our software allows you to label your pictures with the corresponding classes. You can do so on our platform or with our Slack integration, which sends your employees images for labeling within the Slack environment.
- Algorithm training: This is where the magic happens. The good news: colabel provides a no-code solution. You can train your algorithms without a single line of code.
- Application development: We also help you apply what you have built: We are aware that this step ultimately drives the value. For instance, you can use colabel’s ML models and integrations to automate processes and boost productivity. Your business might be unique but many of your activities aren’t. Our templates speed things up even more.
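The transfer learning idea mentioned in the list above can be illustrated with a toy sketch. To be clear, this is not colabel's implementation: the "pre-trained" weights, the dataset, and all function names are invented, and a real system would use a large neural network instead of a two-by-two matrix. What the sketch does show is the core mechanism: the pre-trained part stays frozen, and only a small head is trained on a handful of your own examples.

```python
import math

# Stand-in for a pre-trained model: fixed weights that are never updated.
# In a real system these would come from training on a large dataset.
PRETRAINED_WEIGHTS = [[0.8, -0.2], [-0.3, 0.9]]

def extract_features(x):
    """Frozen feature extractor: a fixed linear map followed by tanh."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)))
        for row in PRETRAINED_WEIGHTS
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, epochs=200, lr=0.5):
    """Train only a small logistic-regression head on the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in examples:
            feats = extract_features(x)
            p = sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * f for w, f in zip(weights, feats)]
            bias -= lr * err
    return weights, bias

def predict(head, x):
    """Return True for class 1 and False for class 0."""
    weights, bias = head
    feats = extract_features(x)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias) >= 0.5

# A deliberately tiny dataset: four labeled examples.
SMALL_DATASET = [
    ([1.0, 0.0], 1),
    ([0.9, 0.1], 1),
    ([0.0, 1.0], 0),
    ([0.1, 0.9], 0),
]

head = train_head(SMALL_DATASET)
print([predict(head, x) for x, _ in SMALL_DATASET])
```

Because only the head's three parameters are learned, very little data is needed. That is the intuition behind the reduced data requirements, even though the numbers in a real application depend heavily on the task.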
What does this look like in practice? End-to-end solutions offer the benefit of speed and simplicity. You can easily build an application from beginning to end and see whether it drives value.
Many traditional value chains don’t have end-to-end solutions that would enable such a procedure. As you hopefully understand at this point, machine learning doesn’t have these limitations.
To briefly sum up what we have covered:
- The AI market is large and growing, with machine learning being its main driver.
- The machine learning value chain consists of six main steps: problem definition, data collection, data storage, data preparation, algorithm programming, and application development.
- Contrary to the beliefs of many, the ML model is only a small part of the ML value chain.
- There are both specialized players as well as end-to-end solutions serving the value chain.
- End-to-end solutions offer the benefit of speed and simplicity.