The next industry shift for machine learning performance enhancement

This post was originally published by Federico Riveroll at Towards Data Science

It’s simpler than you think.

Featured image by Museums Victoria on Unsplash

If you don’t like the road you’re walking, start paving another one.

Dolly Parton

Thousands of companies around the world, from small startups to global corporations, find great value in improving the performance of their supervised or unsupervised ML models, whether it’s a sales or demand forecast, a market basket analysis recommender, a customer classifier, a sales optimizer, a chatbot, an algorithmic trading pipeline, a document labeler, an elections forecast, a spam filter, a medical diagnosis solution, a route optimizer, a face recognizer or a self-driving car. And I’m not even going to get started on IoT.

However, all of them seem to try to increase accuracy (reduce error) by focusing mainly on two things:

1) Feature engineering (getting the most out of your features by crunching your dataset to death)

2) Model/parameter optimization (choosing the best model and best parameters even if you have to come up with a hybrid of several algorithms and iterate to infinity)

Both of the above are very necessary indeed, but there is a third process that adds value in a complementary way, which has traditionally been wildly underused in most data science projects and is now starting to take off.

Adding external data.

Over 90% of the world’s data has been created in the last two years alone, and volumes are expected to continue growing exponentially. Every 6 hours, one quintillion bytes of data are generated globally. You can’t come up with an intuitive reference for how much that is without resorting to stars or atoms, and even so, that figure will seem laughable in a couple of years.

On the other side, we have broad access to cutting-edge systems, like neural networks with genetic algorithms, that are remarkable at explaining one variable with other variables on the same dataframe (once they are in a tidy, numerical format).

So the question isn’t IF the two worlds are going to meet, the question is WHEN, and the answer is starting to look like NOW.

With so many sudden changes impacting this highly uncertain and socially-distanced way of life, it is especially challenging to generate accurate predictions relying solely on internal data. Therefore, it is now more relevant (and feasible) than ever to enhance ML models with external data that can provide a more complete view of the problem at hand.

“Good data scientists are looking to find good, clean influencing data to blend with their own data to make more accurate predictions.”

‘4 Ways to Differentiate Your Analytics Product by Including External Datasets’ Gartner research report by Kevin Quinn and Emil Berthelsen, 24 July 2020.

Data scientists tend to be discouraged from adding external data to their models because they believe the benefit/effort ratio is low: it’s a lot of work to gather, process, profile and join unstructured data in completely different formats. Moreover, the decision to add data is ‘only based on a hunch’, and there could turn out to be no relationship at all.

But the thing is, it can be waaay simpler than you’d think. Here’s a technical tutorial by Jack Shepherd on ways to blend external data into your dataset using Python or R. Spoiler alert: one-liners.
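To give a rough idea of what such a one-liner can look like in Python, here is a minimal sketch using pandas. It is not taken from the tutorial mentioned above: the `sales` and `weather` tables, their column names and all the values are made up for illustration, and a real external source would first have to be downloaded and cleaned.

```python
import pandas as pd

# Hypothetical internal data: daily sales (illustrative values only).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-01", "2020-07-02", "2020-07-03", "2020-07-04"]),
    "units_sold": [120, 135, 90, 160],
})

# Hypothetical external data: daily average temperature.
# One day (2020-07-03) is deliberately missing, as often happens with open data.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-01", "2020-07-02", "2020-07-04"]),
    "avg_temp_c": [24.0, 26.5, 29.0],
})

# The blend itself is a one-liner: a left join keeps every internal row
# and leaves NaN where the external source has no matching date.
enriched = sales.merge(weather, on="date", how="left")

# A quick check of the "hunch": does the new column move with the target at all?
# Series.corr ignores rows where either value is missing.
correlation = enriched["units_sold"].corr(enriched["avg_temp_c"])

print(enriched)
print(f"Correlation between units_sold and avg_temp_c: {correlation:.2f}")
```

The left join is a deliberate choice here: it never drops your own rows, so gaps in the external source show up as missing values you can inspect rather than as silently lost data. A correlation on a handful of rows proves nothing by itself, but on a real dataset it is a cheap first look before investing in a full model comparison.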

So, now that model enrichment with useful variables from open data is available to everyone, the time has come for ML-dependent enterprises to adapt or be outperformed.

Big things are coming.
