20 AutoML libraries for Data Scientists

This post was originally published by Saurav Anand at Medium [AI]

“One of the holy grails of machine learning is to automate more and more of the feature engineering process.” ― Pedro Domingos

Introduction

AutoML refers to automated machine learning: the idea that the end-to-end process of applying machine learning can be automated, so that it is usable at both the organizational and the educational level. A typical machine learning pipeline basically includes the following steps:

  • data collection and preprocessing
  • feature engineering and feature selection
  • model (algorithm) selection
  • hyper-parameter tuning
  • model evaluation and deployment

Initially, all these steps were done manually, but with the advent of AutoML they can now be automated. AutoML currently falls into three categories:

a) AutoML for automated parameter tuning (a relatively basic type)

b) AutoML for non-deep learning, for example, AutoSKlearn. This type is mainly applied in data pre-processing, automated feature analysis, automated feature detection, automated feature selection, and automated model selection.

c) AutoML for deep learning/neural networks, including neural architecture search methods such as NAS and ENAS, as well as frameworks such as Auto-Keras.

Why is AutoML necessary?

The demand for machine learning is increasing day by day, and organizations have adopted it at the application level. Even so, there is plenty of room for improvement, and many companies are still struggling to deploy their machine learning models effectively.

For deployment, an enterprise needs a team of experienced data scientists, who expect high salaries. Even if an enterprise does have an excellent team, deciding which model best fits the enterprise usually requires more experience than AI knowledge. The success of machine learning in a variety of applications has led to ever higher demand for machine learning systems that are easy to use even for non-experts. AutoML aims to automate as many steps of the ML pipeline as possible while retaining good model performance with minimal manpower.

AutoML has three major advantages:

  • It improves efficiency by automating the most repetitive tasks, allowing data scientists to devote more time to the problems rather than to the models.
  • Automated ML pipelines also help avoid potential errors caused by manual work.
  • AutoML is a big step toward the democratization of machine learning and allows everyone to use ML features.

Let’s look at some of the most common AutoML libraries, grouped by programming language:

Python

1. auto-sklearn

auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator. It frees the machine learning user from algorithm selection and hyper-parameter tuning, and includes feature-engineering methods such as one-hot encoding, numeric feature standardization, and PCA. The model uses scikit-learn estimators to handle classification and regression problems. auto-sklearn builds a pipeline and optimizes it with Bayesian search; on top of this framework, two components are added for hyper-parameter tuning: meta-learning is used to warm-start the Bayesian optimizer, and an ensemble is automatically constructed from the configurations evaluated during the optimization process.

auto-sklearn performs well on small and medium-sized datasets, but it cannot yet produce the kind of state-of-the-art deep learning systems needed for very large datasets.

Example

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.regression

def main():
    X, y = sklearn.datasets.load_boston(return_X_y=True)
    feature_types = (['numerical'] * 3) + ['categorical'] + (['numerical'] * 9)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.regression.AutoSklearnRegressor(
        time_left_for_this_task=120,
        per_run_time_limit=30,
        tmp_folder='/tmp/autosklearn_regression_example_tmp',
        output_folder='/tmp/autosklearn_regression_example_out',
    )
    automl.fit(X_train, y_train, dataset_name='boston',
               feat_type=feature_types)

    print(automl.show_models())
    predictions = automl.predict(X_test)
    print("R2 score:", sklearn.metrics.r2_score(y_test, predictions))

if __name__ == '__main__':
    main()

Official website : https://automl.github.io/auto-sklearn/

2. FeatureTools

Featuretools is a Python library for automated feature engineering.

Installation

Install with pip

python -m pip install featuretools

or from the Conda-forge channel on conda:

conda install -c conda-forge featuretools

Add-ons

We can install add-ons individually or all at once by running

python -m pip install featuretools[complete]

Update checker — Receive automatic notifications of new Featuretools releases

python -m pip install featuretools[update_checker]

TSFresh Primitives — Use 60+ primitives from tsfresh within Featuretools

python -m pip install featuretools[tsfresh]

Example

>>> import featuretools as ft
>>> es = ft.demo.load_mock_customer(return_entityset=True)
>>> es.plot()

Featuretools can automatically create a single table of features for any “target entity”

>>> feature_matrix, features_defs = ft.dfs(entityset=es, target_entity="customers")
>>> feature_matrix.head(5)

Official Website : https://featuretools.alteryx.com/en/stable/

3. MLBox

MLBox is a powerful automated machine learning Python library. According to the official documentation, it provides the following features:

  • Fast reading and distributed data preprocessing/cleaning/formatting
  • Highly robust feature selection and leak detection, as well as accurate hyper-parameter optimization
  • State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM, …)
  • Prediction with model interpretation

MLBox has been tested on Kaggle and shows good performance (see the Kaggle competition “Two Sigma Connect: Rental Listing Inquiries”, rank 85/2488).

MLBox architecture

MLBox main package contains 3 sub-packages:

  • Pre-processing: reading and pre-processing data
  • Optimization: testing or optimizing a wide range of learners
  • Prediction: predicting the target on a test dataset
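
Putting the three sub-packages together, a run looks roughly like the following sketch. The file names (train.csv, test.csv), the "target" column, and the tiny search space are placeholders, and the calls follow the MLBox documentation at the time of writing:

from mlbox.preprocessing import Reader, Drift_thresholder
from mlbox.optimisation import Optimiser
from mlbox.prediction import Predictor

paths = ["train.csv", "test.csv"]   # placeholder file paths
target_name = "target"              # placeholder target column

# Pre-processing: read and clean the files, then drop drifting variables
data = Reader(sep=",").train_test_split(paths, target_name)
data = Drift_thresholder().fit_transform(data)

# Optimisation: search a (deliberately tiny) hyper-parameter space
space = {
    "est__strategy": {"search": "choice", "space": ["LightGBM"]},
    "est__max_depth": {"search": "choice", "space": [5, 7, 9]},
}
best_params = Optimiser(scoring="accuracy", n_folds=3).optimise(space, data, max_evals=10)

# Prediction: fit on the train set and predict the target on the test set
Predictor().fit_predict(best_params, data)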

Official website : https://github.com/AxeldeRomblay/MLBox

4. TPOT

TPOT stands for Tree-based Pipeline Optimization Tool. It uses genetic programming to optimize machine learning pipelines. TPOT is built on top of scikit-learn and uses its own regressor and classifier methods. TPOT explores thousands of possible pipelines and finds the one that best fits the data.

TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for our data.

Once TPOT has finished searching, it provides us with the Python code for the best pipeline it found, so we can tinker with the pipeline from there.

TPOT is built on top of scikit-learn, so all of the code it generates should look familiar… if we are familiar with scikit-learn, anyway.

TPOT is still under active development.

Examples

Classification

This is a working example with the optical recognition of handwritten digits dataset.

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')

This code will discover a pipeline that achieves about 98% testing accuracy. The corresponding Python code is exported to the tpot_digits_pipeline.py file and should look similar to the following:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003,
                           min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Regression

TPOT can optimize pipelines for regression problems. Below is a minimal working example with the practice Boston housing prices data set.

from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

which should result in a pipeline that achieves about 12.77 mean squared error (MSE), and the Python code in tpot_boston_pipeline.py should look similar to:

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2,
                        min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Github Link :- https://github.com/EpistasisLab/tpot

5. Lightwood

Lightwood is like Legos for Machine Learning.

A PyTorch-based framework that breaks down machine learning problems into smaller blocks that can be glued together seamlessly, with one objective:

  • Make it so simple that you can build predictive models with as little as one line of code.

Installation

We can install Lightwood from pip:

pip3 install lightwood

Note: depending on our environment, we might have to use pip instead of pip3 in the above command.

Given a simple sensor_data.csv, let’s predict sensor3 values.

Import Predictor from Lightwood

from lightwood import Predictor

Train the model.

import pandas
sensor3_predictor = Predictor(output=['sensor3']).learn(from_data=pandas.read_csv('sensor_data.csv'))

We can now predict what the sensor3 value will be.

prediction = sensor3_predictor.predict(when={'sensor1':1, 'sensor2':-1})

Official link : https://github.com/mindsdb/lightwood

6. mindsdb

MindsDB is an open-source AI layer for existing databases that allows you to effortlessly develop, train and deploy state-of-the-art machine learning models using SQL queries.
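
The SQL interface is the main entry point, but as a rough illustration of the idea, the older native Python API (the mindsdb / mindsdb_native package) looked approximately like the sketch below; the CSV file, column names, and exact keyword arguments are placeholders and have changed between versions:

from mindsdb_native import Predictor

predictor = Predictor(name='home_rentals')                     # hypothetical model name
predictor.learn(from_data='home_rentals.csv',                  # placeholder training CSV
                to_predict='rental_price')                     # placeholder target column
result = predictor.predict(when_data={'number_of_rooms': 2})   # query the trained model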

Official Link : https://github.com/mindsdb/mindsdb

7. mljar-supervised

The mljar-supervised is an Automated Machine Learning Python package that works with tabular data. It is designed to save time for a data scientist 😎. It abstracts the common way to preprocess the data, construct the machine learning models, and perform hyper-parameters tuning to find the best model 🏆. It is no black-box as you can see exactly how the ML pipeline is constructed (with a detailed Markdown report for each ML model).

The mljar-supervised will help you with:

  • explaining and understanding your data,
  • trying many different machine learning models,
  • creating Markdown reports from analysis with details about all models,
  • saving, re-running and loading the analysis and ML models.

It has three built-in modes of work:

  • Explain mode, which is ideal for explaining and understanding the data, with many data explanations, like decision trees visualization, linear models coefficients display, permutation importances and SHAP explanations of data,
  • Perform for building ML pipelines to use in production,
  • Compete mode that trains highly-tuned ML models with ensembling and stacking, with a purpose to use in ML competitions.
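
A minimal usage sketch (the dataset, target column, and chosen mode are illustrative; the AutoML class and its fit/predict calls follow the mljar-supervised README):

import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML  # mljar-supervised

df = pd.read_csv("train.csv")               # placeholder tabular dataset
X = df.drop(columns=["target"])             # placeholder target column
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

automl = AutoML(mode="Explain")   # or "Perform" / "Compete"
automl.fit(X_train, y_train)      # trains several models and writes Markdown reports
predictions = automl.predict(X_test)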

Official link :- https://github.com/mljar/mljar-supervised

8. Auto-Keras

Auto-Keras is an open source software library for automated machine learning (AutoML) developed by DATA Lab. Built on top of the deep learning framework Keras, Auto-Keras provides functions to automatically search for architecture and hyper-parameters of deep learning models.

Auto-Keras follows the classic Scikit-Learn API design and therefore is easy to use. The current version provides the function to automatically search for hyper-parameters during deep learning.

In Auto-Keras, the trend is to simplify ML by using automatic Neural Architecture Search (NAS) algorithms. NAS uses a set of algorithms that automatically adjust model architectures, replacing much of the manual design work of deep learning engineers and practitioners.
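
For illustration, a short sketch in the style of the Auto-Keras (1.x) tutorials, using MNIST; max_trials and epochs are kept tiny here only to make the search cheap:

import autokeras as ak
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Search for an image-classification architecture (1 trial keeps the demo cheap)
clf = ak.ImageClassifier(overwrite=True, max_trials=1)
clf.fit(x_train, y_train, epochs=1)
print(clf.evaluate(x_test, y_test))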

Official link: https://github.com/keras-team/autokeras

9. Neural Network Intelligence

NNI (Neural Network Intelligence) is an open source AutoML toolkit for neural architecture search and hyper-parameter tuning. NNI provides a command-line tool as well as a user-friendly WebUI to manage training experiments. With its extensible API, you can customize your own AutoML algorithms and training services. To make it easy for new users, NNI also provides a set of built-in state-of-the-art AutoML algorithms and out-of-the-box support for popular training platforms.
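
To give a flavour of the workflow, an NNI trial script asks the tuner for the next set of hyper-parameters and reports its result back; the objective below is a made-up placeholder, and the search space and experiment configuration live in separate NNI files that are not shown here:

import nni

def run_trial(params):
    # Placeholder objective: in practice you would train a model with `params`
    # and return its validation metric.
    lr = params.get("lr", 0.01)
    hidden = params.get("hidden_size", 64)
    return (1.0 / (1.0 + abs(lr - 0.1))) * min(hidden, 128) / 128.0

if __name__ == "__main__":
    params = nni.get_next_parameter() or {}   # hyper-parameters proposed by the tuner
    score = run_trial(params)
    nni.report_final_result(score)            # report the metric back to NNI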

Official website :- https://nni.readthedocs.io/en/latest/

10. Ludwig

Ludwig is a toolbox that allows users to train and test deep learning models without the need to write code. It is built on top of TensorFlow. Ludwig is built with extensibility principles in mind and is based on datatype abstractions, making it easy to add support for new datatypes as well as new model architectures. It can be used by practitioners to quickly train and test deep learning models, as well as by researchers to obtain strong baselines to compare against, with an experimentation setting that ensures comparability by performing the same data processing and evaluation.

Ludwig provides a set of model architectures that can be combined together to create an end-to-end model for a given use case. As an analogy, if deep learning libraries provide the building blocks to make your building, Ludwig provides the buildings to make your city, and you can choose among the available buildings or add your own building to the set of available ones.

  • No coding required: no coding skills are required to train a model and use it for obtaining predictions.
  • Generality: a new datatype-based approach to deep learning model design makes the tool usable across many different use cases.
  • Flexibility: experienced users have extensive control over model building and training, while newcomers will find it easy to use.
  • Extensibility: easy to add new model architecture and new feature datatypes.
  • Understandability: deep learning model internals are often considered black boxes, but Ludwig provides standard visualizations to understand their performance and compare their predictions.
  • Open Source: Apache License 2.0
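
Ludwig is usually driven from the command line with a YAML configuration, but it also exposes a programmatic API. A hedged sketch of the Python API follows (the config contents, CSV paths, and exact keyword names are illustrative and vary across Ludwig versions):

from ludwig.api import LudwigModel

# Declarative config: which columns are inputs/outputs and their datatypes
config = {
    "input_features": [{"name": "text", "type": "text"}],
    "output_features": [{"name": "class", "type": "category"}],
}

model = LudwigModel(config)
train_results = model.train(dataset="train.csv")   # placeholder CSV path
predictions = model.predict(dataset="test.csv")    # placeholder CSV path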

Official link:- https://github.com/uber/ludwig

11. AdaNet

AdaNet is a lightweight TensorFlow-based framework for automatically learning high-quality models with minimal expert intervention. AdaNet builds on recent AutoML efforts to be fast and flexible while providing learning guarantees. Importantly, AdaNet provides a general framework for not only learning a neural network architecture, but also for learning to ensemble to obtain even better models.

AdaNet has the following goals:

  • Ease of use: Provide familiar APIs (e.g. Keras, Estimator) for training, evaluating, and serving models.
  • Speed: Scale with available compute and quickly produce high quality models.
  • Flexibility: Allow researchers and practitioners to extend AdaNet to novel subnetwork architectures, search spaces, and tasks.
  • Learning guarantees: Optimize an objective that offers theoretical learning guarantees.
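
As a loose illustration of the ensembling idea (not the full subnetwork-search API), AdaNet also ships an AutoEnsembleEstimator that searches over and ensembles ordinary TensorFlow estimators. The toy data, candidate pool, and step counts below are made up for the sketch, and the exact estimator classes depend on the TensorFlow version:

import numpy as np
import tensorflow as tf
import adanet

# Toy regression data (purely illustrative): y = 2x + noise
x = np.random.rand(256, 1).astype(np.float32)
y = (2.0 * x + np.random.normal(scale=0.1, size=(256, 1))).astype(np.float32)

def input_fn():
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).batch(32).repeat()

feature_columns = [tf.feature_column.numeric_column("x")]
head = tf.estimator.RegressionHead()

# AdaNet iteratively evaluates the candidates and learns how to ensemble them
estimator = adanet.AutoEnsembleEstimator(
    head=head,
    candidate_pool=lambda config: {
        "linear": tf.estimator.LinearEstimator(
            head=head, feature_columns=feature_columns, config=config),
        "dnn": tf.estimator.DNNEstimator(
            head=head, feature_columns=feature_columns,
            hidden_units=[16, 8], config=config),
    },
    max_iteration_steps=100,
)
estimator.train(input_fn=input_fn, max_steps=300)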

Official link : https://github.com/tensorflow/adanet

12. darts (Differentiable Architecture Search)

The algorithm is based on continuous relaxation and gradient descent in the architecture space. It is able to efficiently design high-performance convolutional architectures for image classification (on CIFAR-10 and ImageNet) and recurrent architectures for language modeling (on Penn Treebank and WikiText-2). Only a single GPU is required.

Official link :- https://github.com/quark0/darts

13. automl-gs

Give an input CSV file and a target field you want to predict to automl-gs, and get a trained high-performing machine learning or deep learning model plus native Python code pipelines allowing you to integrate that model into any prediction workflow. No black box: you can see exactly how the data is processed, how the model is constructed, and you can make tweaks as necessary.

automl-gs is an AutoML tool which, unlike Microsoft’s NNI, Uber’s Ludwig, and TPOT, offers a zero code/model definition interface to getting an optimized model and data transformation pipeline in multiple popular ML/DL frameworks, with minimal Python dependencies.
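
Usage is intentionally minimal. Per the automl-gs README it can be run from the command line or from Python; a sketch of the Python entry point (the CSV file and target column below are placeholders):

from automl_gs import automl_grid_search

# Point automl-gs at a CSV and the column to predict; it trains models and
# writes a native Python pipeline (model code plus train/predict scripts)
# into the working directory.
automl_grid_search("titanic.csv", "Survived")   # placeholder dataset/target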

Official link :- https://github.com/minimaxir/automl-gs

R

14. R Interface to AutoKeras

AutoKeras is an open source software library for automated machine learning (AutoML). It is developed by DATA Lab at Texas A&M University and community contributors. The ultimate goal of AutoML is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background. AutoKeras provides functions to automatically search for architecture and hyperparameters of deep learning models.

Check out the AutoKeras blogpost at the RStudio TensorFlow for R blog.

Official Documentation : https://github.com/r-tensorflow/autokeras

Scala

15. TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Apache Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation, and an API that enforces compile-time type-safety, modularity, and reuse. Through automation, it achieves accuracies close to hand-tuned models with almost 100x reduction in time.

Use TransmogrifAI if you need a machine learning library to:

  • Build production ready machine learning applications in hours, not months
  • Build machine learning models without getting a Ph.D. in machine learning
  • Build modular, reusable, strongly typed machine learning workflows

Official link : https://github.com/salesforce/TransmogrifAI

Java

16. Glaucus

Glaucus is a data-flow-based machine learning suite that incorporates an automated machine learning pipeline, simplifies the complex processes of machine learning algorithms, and applies excellent distributed data-processing engines. It aims to help non-data-science professionals across different domains benefit from powerful machine learning tools in a simple way.

Users only need to upload their data, do some simple configuration, select an algorithm, and train it with automatic or manual parameter adjustment. The platform also provides a wealth of evaluation metrics for the trained model, so that non-professionals can get the most out of machine learning in their field. Its main functions are:

  • Receives multi-source datasets, including structured, document, and image data;
  • Provides rich mathematical-statistics functions and a graphical interface that lets users easily grasp the state of their data;
  • In automatic mode, implements a fully automated pipeline from preprocessing and feature engineering to the machine learning algorithm;
  • In manual mode, dramatically simplifies the machine learning pipeline, and provides automated data cleaning, semi-automated feature selection, and deep learning suites.

Official Website :- https://github.com/ccnt-glaucus/glaucus

Other Tools

17. H2O AutoML

The H2O AutoML interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or limit on the number of total models trained.

In both the R and Python API, AutoML uses the same data-related arguments, x, y, training_frame, validation_frame, as the other H2O algorithms. Most of the time, all you’ll need to do is specify the data arguments. You can then configure values for max_runtime_secs and/or max_models to set explicit time or number-of-model limits on your run.
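
A short Python sketch of that workflow (the file path and response column are placeholders; the arguments shown are the ones described in the H2O AutoML docs):

import h2o
from h2o.automl import H2OAutoML

h2o.init()

train = h2o.import_file("train.csv")    # placeholder path
y = "response"                          # placeholder response column
x = [c for c in train.columns if c != y]

# Limit the run by number of models and/or wall-clock time
aml = H2OAutoML(max_models=20, max_runtime_secs=300, seed=1)
aml.train(x=x, y=y, training_frame=train)

print(aml.leaderboard)                  # all trained models, ranked
preds = aml.leader.predict(train)       # predictions from the best model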

Official Link : https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/automl.rst

18. PocketFlow

PocketFlow is an open-source framework for compressing and accelerating deep learning models with minimal human effort. Deep learning is widely used in areas such as computer vision, speech recognition, and natural language translation. However, deep learning models are often computationally expensive, which limits further applications on mobile devices with limited computational resources.

PocketFlow aims at providing an easy-to-use toolkit for developers to improve inference efficiency with little or no performance degradation. Developers only need to specify the desired compression and/or acceleration ratios, and PocketFlow will automatically choose proper hyper-parameters to generate a highly efficient compressed model for deployment.

Official link :- https://github.com/Tencent/PocketFlow

19. Ray

Ray provides a simple, universal API for building distributed applications.

Ray is packaged with the following libraries for accelerating machine learning workloads:

  • Tune: Scalable Hyperparameter Tuning
  • RLlib: Scalable Reinforcement Learning
  • RaySGD: Distributed Training Wrappers
  • Ray Serve: Scalable and Programmable Serving

Install Ray with: pip install ray
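
Of these, Tune is the piece most relevant to AutoML. A minimal sketch of a Tune search over a toy objective (the objective and search space are made up; the tune.run/tune.report interface shown here is the classic Ray 1.x API and has since evolved):

from ray import tune

def objective(config):
    # Toy objective: the best "x" is 3, so the loss is minimised there
    tune.report(loss=(config["x"] - 3) ** 2)

analysis = tune.run(
    objective,
    config={"x": tune.uniform(-10, 10)},   # search space
    num_samples=20,                        # number of sampled configurations
)
print(analysis.get_best_config(metric="loss", mode="min"))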

Official link: https://github.com/ray-project/ray

20. SMAC3

SMAC is a tool for algorithm configuration to optimize the parameters of arbitrary algorithms across a set of instances. This also includes hyperparameter optimization of ML algorithms. The main core consists of Bayesian Optimization in combination with an aggressive racing mechanism to efficiently decide which of two configurations performs better.

For a detailed description of its main idea, we refer to

Hutter, F. and Hoos, H. H. and Leyton-Brown, K.
Sequential Model-Based Optimization for General Algorithm Configuration
In: Proceedings of the conference on Learning and Intelligent OptimizatioN (LION 5)

SMAC v3 is written in Python 3 and is continuously tested with Python 3.6. Its random forest component is written in C++.
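
A hedged sketch of minimising a toy function with the pre-2.0 SMAC3 API (the objective, parameter ranges, and budget are illustrative; newer SMAC releases use a different interface):

from ConfigSpace import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformFloatHyperparameter
from smac.scenario.scenario import Scenario
from smac.facade.smac_hpo_facade import SMAC4HPO

def toy_objective(cfg):
    # Toy function to minimise; a real use case would train and validate a model
    return (cfg["x"] - 2.0) ** 2 + (cfg["y"] + 1.0) ** 2

cs = ConfigurationSpace()
cs.add_hyperparameters([
    UniformFloatHyperparameter("x", -5.0, 5.0),
    UniformFloatHyperparameter("y", -5.0, 5.0),
])

scenario = Scenario({
    "run_obj": "quality",     # optimise solution quality rather than runtime
    "runcount-limit": 50,     # at most 50 evaluations of the objective
    "cs": cs,
    "deterministic": True,
})

smac = SMAC4HPO(scenario=scenario, tae_runner=toy_objective)
best_config = smac.optimize()
print(best_config)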

Conclusion

AutoML libraries are important because they automate repetitive tasks such as pipeline creation and hyper-parameter tuning. They save time for data scientists, who can then devote more of it to business problems. AutoML also opens machine learning technology to everyone instead of a small group of people, and data scientists can accelerate ML development by using it to build efficient models.

How successful AutoML becomes will depend on how widely organizations use and demand it; time will decide its fate. But for now, it is safe to say that AutoML is significant in the field of machine learning.
