PyCaret 2.2 is here — What’s new?


This post was originally published at Towards Data Science

PyCaret 2.2 is now available for download using pip.

We are excited to announce PyCaret 2.2 — update for the month of Oct 2020.

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the machine learning experiment cycle and makes you more productive.

Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with just a few. This makes experiments much faster and more efficient.

Release Notes:


Installing PyCaret is easy and takes only a few minutes. We strongly recommend using a virtual environment to avoid potential conflicts with other libraries: for example, create a conda environment, activate it, and run pip install pycaret inside it.

PyCaret’s default installation is a slim version of pycaret that installs only the hard dependencies listed here. To install the full version of pycaret, run pip install pycaret[full].

When you install the full version of pycaret, all the optional dependencies as listed here are also installed.

PyCaret is evolving very fast. Often, you want to have access to the latest features but want to avoid compiling PyCaret from source or waiting for the next release. Fortunately, you can now install pycaret-nightly using pip.

# install the nightly build
pip install pycaret-nightly

# or install the full version of the nightly build
pip install pycaret-nightly[full]

PyCaret 2.2 provides the option to use GPU for select model training and hyperparameter tuning. The API itself does not change; however, in some cases additional libraries have to be installed, as they are not installed with either the default slim version or the full version. The following models can now be trained on GPU:

  • Extreme Gradient Boosting (requires no further installation)
  • CatBoost (requires no further installation)
  • Light Gradient Boosting Machine (requires a GPU installation of LightGBM)
  • Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, K Neighbors Regressor, Support Vector Machine, Linear Regression, Ridge Regression, Lasso Regression, K-Means Clustering, and Density-Based Spatial Clustering (requires cuML >= 0.15)

To enable Light Gradient Boosting Machine on GPU, you will have to install the GPU enabled version of LightGBM. The official step-by-step tutorial to do that is here.

If you are using Google Colab, you can install Light Gradient Boosting Machine for GPU, but you first have to uninstall the CPU version of LightGBM. Before doing that, ensure that GPU is enabled in your Colab session. Use the following code to install GPU-enabled LightGBM:

As of today, cuML 0.15 is not supported on Google Colab. This may change in the future but for now, you can use blazingSQL Notebook for free which comes pre-installed with cuML 0.15.

Once you sign-in to your account, initiate the Python 3 Notebook and use the following code to install pycaret:

Alternatively, if you have a GPU on your local machine or you are planning to use any other cloud service with a GPU, you can follow the official installation guide for cuML.

Assuming the installation is successful, the only thing that needs to be done to train models on GPU is to enable GPU when initializing the setup function.
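A minimal sketch of that step, assuming PyCaret 2.2 and a working GPU stack are installed. The use_gpu flag is the setup parameter introduced in this release; the wrapper name setup_on_gpu is hypothetical and exists only to keep the example self-contained.

```python
def setup_on_gpu(data, target):
    """Initialize a classification experiment with GPU training enabled."""
    # Imported lazily so that defining this helper does not require PyCaret.
    from pycaret.classification import setup

    # use_gpu=True asks PyCaret to pick GPU-enabled algorithms where available.
    return setup(data, target=target, use_gpu=True)


# Example usage (needs pycaret and a GPU-enabled environment):
# from pycaret.datasets import get_data
# experiment = setup_on_gpu(get_data("juice"), target="Purchase")
```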

That’s it. You can now use pycaret exactly as you would on CPU. It automatically uses the GPU for model training where possible and otherwise falls back to the CPU-equivalent algorithms. Even before you start training, you can check which models are GPU-enabled by calling models(internal=True).
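The check can be sketched as follows, assuming setup has already been run with GPU enabled. The "GPU Enabled" column name is our reading of the 2.2 output and may differ in other versions; gpu_enabled_models is a hypothetical helper name.

```python
def gpu_enabled_models():
    """Return model names with their GPU flag (call after setup(use_gpu=True))."""
    # Imported lazily so that defining this helper does not require PyCaret.
    from pycaret.classification import models

    table = models(internal=True)
    # Assumption: the internal table exposes "Name" and "GPU Enabled" columns.
    return table[["Name", "GPU Enabled"]]
```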

Output from models(internal=True)

Benchmark Comparisons CPU vs GPU (Time in Seconds)

Benchmark Comparisons CPU vs. GPU, Dataset: poker (100K x 88)

New methods for hyperparameter tuning are now available. Up until PyCaret 2.1, the only way to tune the hyperparameters of your model in PyCaret was the Random Grid Search from scikit-learn. The new methods added in 2.2 are:

  • scikit-learn (grid)
  • scikit-optimize (bayesian)
  • tune-sklearn (random, grid, bayesian, hyperopt, bohb)
  • optuna (random, tpe)

To use these new methods, two new parameters ‘search_library’ and ‘search_algorithm’ have been added.

tune_model output (dt with default hyperparameters AUC = 0.7401)

The available search_algorithm values depend on the search_library. The following search algorithms are available for the respective search libraries:

  • scikit-learn → random (default), grid
  • scikit-optimize → bayesian (default)
  • tune-sklearn → random (default), grid, bayesian, hyperopt, bohb
  • optuna → random, tpe (default)
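The list above can be written down as a small lookup, followed by a call that passes both parameters to tune_model. This is a sketch rather than PyCaret's own validation logic: the dictionaries simply restate the list, and the helper names are hypothetical. The optional search libraries have to be installed separately.

```python
# Search algorithms per search library, restating the list above.
SEARCH_ALGORITHMS = {
    "scikit-learn": ["random", "grid"],
    "scikit-optimize": ["bayesian"],
    "tune-sklearn": ["random", "grid", "bayesian", "hyperopt", "bohb"],
    "optuna": ["random", "tpe"],
}
DEFAULT_ALGORITHM = {
    "scikit-learn": "random",
    "scikit-optimize": "bayesian",
    "tune-sklearn": "random",
    "optuna": "tpe",
}


def is_valid_combination(search_library, search_algorithm):
    """True if the algorithm is listed for the given search library."""
    return search_algorithm in SEARCH_ALGORITHMS.get(search_library, [])


def tune_with(model, search_library="scikit-learn", search_algorithm=None):
    """Tune a model created with create_model using the chosen tuner."""
    algorithm = search_algorithm or DEFAULT_ALGORITHM[search_library]
    if not is_valid_combination(search_library, algorithm):
        raise ValueError(f"{algorithm!r} is not available for {search_library!r}")
    # Imported lazily so that the lookup above works without PyCaret.
    from pycaret.classification import tune_model

    return tune_model(
        model, search_library=search_library, search_algorithm=algorithm
    )


# Example usage (after setup):
# dt = create_model("dt")
# tuned_dt = tune_with(dt, search_library="optuna", search_algorithm="tpe")
```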

Early stopping is also supported for estimators with the partial_fit attribute. Read more about it in the release notes.

Benchmark Comparisons of different tuners

Benchmark comparisons of available search libraries / search algorithms in PyCaret, dataset: diabetes

PyCaret 2.2 is all about performance and functionality. A significant amount of code was refactored to improve memory footprint and optimize performance without impacting user-experience.

For example, all numeric data is now dynamically cast to 32 bit instead of 64 bit, which reduces the memory footprint significantly. Another performance improvement is that cross-validation in all functions is now automatically parallelized across multiple cores, whereas previously folds were trained sequentially.
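To illustrate why the 32-bit cast matters, here is a small NumPy sketch. It is not PyCaret's internal code, only a demonstration that the same values take half the memory as float32.

```python
import numpy as np

# One million values; NumPy produces float64 by default.
values64 = np.random.default_rng(0).normal(size=1_000_000)
values32 = values64.astype(np.float32)

bytes64 = values64.nbytes  # 8 bytes per value
bytes32 = values32.nbytes  # 4 bytes per value
print(f"float64: {bytes64 / 1e6:.1f} MB, float32: {bytes32 / 1e6:.1f} MB")
```

The trade-off is reduced numeric precision, which is usually acceptable for model training.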

We have compared the performance of all released versions of PyCaret on 5M sampled rows from the famous New York Taxi Dataset. The figure below compares the time taken to complete the setup initialization:

Benchmark performance comparison on 5M rows from NY Taxi dataset

All comparisons were run on an AMD64 machine with 8 CPU cores.

You can now fully customize (add or remove) the metrics evaluated during cross-validation. This means that you are no longer limited to PyCaret’s default model evaluation metrics. Three new functions, get_metrics, add_metric, and remove_metric, have been added for this.
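A minimal sketch of adding a metric, assuming PyCaret 2.2 and scikit-learn are installed. It registers scikit-learn's log_loss under the ID "logloss" and the display name "LogLoss"; the wrapper name compare_with_logloss is hypothetical.

```python
def compare_with_logloss(data, target):
    """Run compare_models with an extra LogLoss column in the score grid."""
    # Imported lazily so that defining this helper does not require PyCaret.
    from pycaret.classification import (
        add_metric,
        compare_models,
        get_metrics,
        setup,
    )
    from sklearn.metrics import log_loss

    setup(data, target=target)
    # Arguments: metric ID, display name, score function.
    # Lower log loss is better, hence greater_is_better=False.
    add_metric("logloss", "LogLoss", log_loss, greater_is_better=False)
    print(get_metrics())  # lists all metrics currently registered
    return compare_models()


# Example usage:
# from pycaret.datasets import get_data
# best = compare_with_logloss(get_data("juice"), target="Purchase")
```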

compare_models output after adding custom metric

Notice that a new column “LogLoss” (all new metrics are added on the right, before TT) is added in the compare_models score grid because we added the metric using the add_metric function. You can use any metric available in scikit-learn or you can create your own using the make_scorer function. To remove the metric again, use the remove_metric function.
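As a sketch of the full round trip, the example below registers a hand-written metric and removes it again. It assumes that add_metric accepts a plain function with the signature score_func(y_true, y_pred) and that labels are encoded as 0/1. The metric false_negative_rate and the ID "fnr" are made up for illustration.

```python
def false_negative_rate(y_true, y_pred):
    """Share of actual positives (label 1) that were predicted as 0."""
    positives = [pred for true, pred in zip(y_true, y_pred) if true == 1]
    if not positives:
        return 0.0
    return sum(1 for pred in positives if pred == 0) / len(positives)


def add_then_remove_custom_metric():
    """Register the custom metric, then remove it (call after setup)."""
    # Imported lazily so that the metric itself works without PyCaret.
    from pycaret.classification import add_metric, remove_metric

    add_metric("fnr", "FNR", false_negative_rate, greater_is_better=False)
    # ... run compare_models() or create_model() here ...
    remove_metric("fnr")
```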

Iterative imputation is a technique of imputing missing data using regression and classification estimators to model each feature as a function of other features. Each feature is imputed in a round-robin fashion, previous predictions being used in new ones. This process is repeated several times in order to increase the quality of imputation. Compared to simple imputation, it can create synthetic values that are closer to real values, at a cost of additional processing time.

Staying true to the spirit of PyCaret, the usage is super simple:
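A minimal sketch, assuming PyCaret 2.2 is installed: iterative imputation is switched on through the imputation_type parameter of setup. The wrapper name and its argument names are hypothetical; the two estimator parameters are passed through with their documented default of "lightgbm".

```python
def setup_with_iterative_imputation(
    data,
    target,
    categorical_estimator="lightgbm",
    numeric_estimator="lightgbm",
):
    """Initialize a classification experiment with iterative imputation."""
    # Imported lazily so that defining this helper does not require PyCaret.
    from pycaret.classification import setup

    return setup(
        data,
        target=target,
        imputation_type="iterative",
        categorical_iterative_imputer=categorical_estimator,
        numeric_iterative_imputer=numeric_estimator,
    )
```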

By default, it uses Light Gradient Boosting Machine as the estimator for both categorical features (Classification) and numeric features (Regression). This can be changed using the categorical_iterative_imputer and numeric_iterative_imputer parameters in the setup.

Benchmark comparisons of iterative imputation vs. simple imputation

To compare the results of the simple mean imputation with iterative imputation we have used the horse colic dataset that contains a large number of missing values. The figure below compares the performance of the Logistic Regression with different imputation methods.

A blog post by Antoni Baum:

Using Iterative Imputer with KNN as an estimator for both categorical and numeric features improved the mean AUC score by 0.014 (1.59%) compared to simple mean imputation. To learn more about this feature, you can read the complete blog post here.

PyCaret 2.2 provides the flexibility to define the fold strategy. Up until PyCaret 2.1, you could not define the cross-validation strategy: it used ‘StratifiedKFold’ for Classification and ‘KFold’ for Regression, which limited the use of PyCaret for certain use cases, for example, time-series data.

To overcome this problem, a new parameter ‘fold_strategy’ is added to the setup function. It can take the following values:

  • kfold for KFold CV;
  • stratifiedkfold for Stratified KFold CV;
  • groupkfold for Group KFold CV;
  • timeseries for TimeSeriesSplit CV; or
  • a custom CV generator object compatible with scikit-learn.
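The options above can be sketched as follows, assuming PyCaret 2.2 is installed. The helper names are hypothetical, and the check for a custom CV object (presence of split and get_n_splits) is our own approximation of "compatible with scikit-learn", not PyCaret's actual validation.

```python
FOLD_STRATEGIES = ("kfold", "stratifiedkfold", "groupkfold", "timeseries")


def is_valid_fold_strategy(fold_strategy):
    """Accept one of the named strategies or a scikit-learn style CV object."""
    if isinstance(fold_strategy, str):
        return fold_strategy in FOLD_STRATEGIES
    # scikit-learn CV generators expose split() and get_n_splits().
    return hasattr(fold_strategy, "split") and hasattr(fold_strategy, "get_n_splits")


def setup_time_series(data, target, n_splits=5):
    """Initialize a regression experiment with TimeSeriesSplit cross-validation."""
    # Imported lazily so that the check above works without PyCaret.
    from pycaret.regression import setup

    return setup(
        data,
        target=target,
        fold_strategy="timeseries",
        fold=n_splits,
        # Keep rows in their original order for the train/test split as well.
        data_split_shuffle=False,
    )
```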

If you have used PyCaret before, you are probably familiar with its most used function, compare_models. This function trains and evaluates all estimators available in the model library using cross-validation. The problem is that on very large datasets compare_models can take a very long time to finish, because it fits 10 folds for each estimator in the model library. For Classification, this means 15 x 10 = 150 fitted estimators in total.

In PyCaret 2.2 we have introduced a new parameter, cross_validation, in the compare_models function. When set to False, it evaluates all metrics on the holdout set instead of cross-validating. Relying solely on holdout metrics may not be advisable, especially when the dataset is small, but it is a huge time saver when working with large datasets.
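A short sketch of the arithmetic and the call, assuming PyCaret 2.2 is installed and setup has already been run. count_model_fits and quick_compare are hypothetical helper names.

```python
def count_model_fits(n_estimators, n_folds, cross_validation=True):
    """Number of model fits compare_models needs for one comparison run."""
    return n_estimators * n_folds if cross_validation else n_estimators


def quick_compare():
    """Compare all models on the holdout set only (call after setup)."""
    # Imported lazily so that the arithmetic above works without PyCaret.
    from pycaret.classification import compare_models

    return compare_models(cross_validation=False)


print(count_model_fits(15, 10))                          # 150 fits with 10-fold CV
print(count_model_fits(15, 10, cross_validation=False))  # 15 fits on the holdout set
```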

To quantify the impact, we compared the performance of compare_models in both scenarios (cross_validation = True and cross_validation = False). The dataset used for this comparison is here (45K x 50).

With Cross-Validation (It took 7 min 13s):

Output from compare_models(cross_validation = True)

Without Cross-Validation (It took 1 min 19s):

Output from compare_models(cross_validation = False)

This is a home run when it comes to flexibility. A new parameter, custom_pipeline, has been added to the setup function; it accepts any transformer and adds it to PyCaret’s preprocessing pipeline. All custom transformations are applied after train_test_split, on each CV fold separately, to avoid the risk of target leakage. Usage is simple: pass the transformer to setup.
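A minimal sketch, assuming PyCaret 2.2 and scikit-learn are installed and that all features are numeric. StandardScaler and the step name "scaler" are illustrative choices, not the transformer used in the original example; the wrapper name is hypothetical.

```python
def setup_with_custom_step(data, target):
    """Initialize an experiment with one extra transformer in the pipeline."""
    # Imported lazily so that defining this helper does not require PyCaret.
    from pycaret.classification import setup
    from sklearn.preprocessing import StandardScaler

    # custom_pipeline takes a (name, transformer) tuple or a list of them.
    return setup(
        data,
        target=target,
        custom_pipeline=[("scaler", StandardScaler())],
    )
```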

This is a long-awaited feature and one of the most requested since the first release. You can now pass a separate test set instead of relying on pycaret’s internal train_test_split. A new parameter, ‘test_data’, has been added to the setup. When a DataFrame is passed as test_data, it is used as the test set and the train_size parameter is ignored. test_data must be labeled.
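A minimal sketch of passing a separate test set, assuming PyCaret 2.2 is installed and both inputs are pandas DataFrames. The helper names are hypothetical; the label check only mirrors the requirement that test_data must contain the target column.

```python
def is_labeled(frame, target):
    """True if the frame contains the target column."""
    return target in frame.columns


def setup_with_test_set(train_df, test_df, target):
    """Initialize an experiment with a user-supplied test set."""
    if not is_labeled(test_df, target):
        raise ValueError(f"test_data must contain the target column {target!r}")
    # Imported lazily so that the check above works without PyCaret.
    from pycaret.classification import setup

    # train_size is ignored when test_data is provided.
    return setup(train_df, target=target, test_data=test_df)
```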

Previously, it was not possible to skip PyCaret’s default preprocessing pipeline, even if you already had a transformed dataset and just wanted to use PyCaret’s modeling capabilities. Now we have you covered: simply turn off the ‘preprocess’ parameter in the setup. When preprocess is set to False, no transformations are applied except for train_test_split and the custom transformations passed in the custom_pipeline parameter.

However, when you turn off preprocessing in the setup, you have to ensure that your data is modeling-ready (i.e., no missing values, no dates/timestamps, categorical data is encoded, etc.).
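A rough sketch of that workflow, assuming PyCaret 2.2 and pandas are installed. The readiness check is our own simplification (no missing values, only numeric or boolean columns) and does not cover every case; both helper names are hypothetical.

```python
def is_modeling_ready(df):
    """Rough check: no missing values and only numeric/boolean columns."""
    from pandas.api.types import is_numeric_dtype

    no_missing = not df.isna().any().any()
    all_numeric = all(is_numeric_dtype(dtype) for dtype in df.dtypes)
    return no_missing and all_numeric


def setup_without_preprocessing(df, target):
    """Initialize an experiment on data that is already transformed."""
    if not is_modeling_ready(df):
        raise ValueError("Data has missing values or non-numeric columns")
    # Imported lazily so that the check above works without PyCaret.
    from pycaret.classification import setup

    return setup(df, target=target, preprocess=False)
```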

  • New plots ‘lift’, ‘gain’, and ‘tree’ have been added to plot_model.
  • CatBoost is now compatible with the plot_model function. It requires catboost >= 0.23.2.
  • In order to make both the usage and development easier, type hints have been added to all updated pycaret functions, in accordance with best practices. Users can leverage those by using an IDE with support for type hints.
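The new plots can be requested like any other plot_model plot. A minimal sketch, assuming PyCaret 2.2 is installed and a classifier has been trained with create_model; we assume the ‘tree’ plot applies only to tree-based models. draw_new_plots is a hypothetical helper name.

```python
NEW_PLOTS = ("lift", "gain", "tree")


def draw_new_plots(model, plots=("lift", "gain")):
    """Draw the requested plots, restricted to those added in 2.2."""
    unknown = [plot for plot in plots if plot not in NEW_PLOTS]
    if unknown:
        raise ValueError(f"Not one of the new 2.2 plots: {unknown}")
    # Imported lazily so that the check above works without PyCaret.
    from pycaret.classification import plot_model

    for plot in plots:
        plot_model(model, plot=plot)


# Example usage (after setup):
# dt = create_model("dt")
# draw_new_plots(dt, plots=("lift", "gain", "tree"))
```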

To learn more about all the updates in PyCaret 2.2, please see the release notes.

There is no limit to what you can achieve using the lightweight workflow automation library in Python. If you find this useful, please do not forget to give us ⭐️ on our GitHub repo.

To hear more about PyCaret, follow us on LinkedIn and YouTube.

User Guide
Official Tutorials
Example Notebooks
Other Resources

Click on the links below to see the documentation and working examples.

Anomaly Detection
Natural Language Processing
Association Rule Mining
