Hyperparameter Optimization with Scikit-Learn, Scikit-Optimize and Keras


This post was originally published by Luke Newman at Towards Data Science

Explore practical ways to optimize your model’s hyperparameters with grid search, randomized search, and Bayesian optimization.

Hyperparameter optimization is often one of the final steps in a data science project. Once you have a shortlist of promising models you will want to fine-tune them so that they perform better on your particular dataset.

In this post, we will go over three techniques used to find optimal hyperparameters, with examples of how to implement them on Scikit-Learn models and, finally, on a neural network in Keras. The three techniques we will discuss are as follows:

  • Grid Search
  • Randomized Search
  • Bayesian Optimization

You can view the Jupyter notebook here.

One option would be to fiddle around with the hyperparameters manually until you find a great combination of hyperparameter values that optimize your performance metric. This would be very tedious work, and you may not have time to explore many combinations.

Instead, you should get Scikit-Learn’s GridSearchCV to do it for you. All you have to do is tell it which hyperparameters you want to experiment with and what values to try out, and it will use cross-validation to evaluate all the possible combinations of hyperparameter values.

Let’s work through an example where we use GridSearchCV to search for the best combination of hyperparameter values for a RandomForestClassifier trained using the popular MNIST dataset.
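The notebook’s data-loading code isn’t shown in this post. A minimal sketch, assuming the flattened 784-pixel version of MNIST from OpenML and the variable names X_train, X_test, y_train, and y_test reused in the examples below, might look like this:

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Fetch the 70,000 flattened 28x28 MNIST digit images (784 features each)
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data
y = mnist.target.astype(np.uint8)  # labels come back as strings

# Hold out a test set; the searches below tune on the training split only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10000, random_state=42)
```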

To give you a feel for the complexity of the classification task, the figure below shows a few images from the MNIST dataset:

To implement GridSearchCV we need to define a few things, the first being the hyperparameters we want to experiment with and the values we want to try out. Below we specify these in a dictionary called param_grid.
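The exact grid from the original notebook isn’t reproduced here; the values below are illustrative placeholders chosen to match the 32 combinations and the maxima (n_estimators=350, max_depth=10) discussed next, and forest_grid_search is a sketch of how the search might be set up:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "bootstrap": [False],              # 1 value
    "max_depth": [8, 10],              # 2 values
    "max_features": ["sqrt", "log2"],  # 2 values
    "min_samples_leaf": [1, 2],        # 2 values
    "min_samples_split": [2, 4],       # 2 values
    "n_estimators": [300, 350],        # 2 values
}

forest_clf = RandomForestClassifier(random_state=42)

# Five-fold cross-validation over all 1 x 2 x 2 x 2 x 2 x 2 = 32 combinations
forest_grid_search = GridSearchCV(
    forest_clf, param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=1)
forest_grid_search.fit(X_train, y_train)
```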

The param_grid tells Scikit-Learn to evaluate 1 x 2 x 2 x 2 x 2 x 2 = 32 combinations of the specified bootstrap, max_depth, max_features, min_samples_leaf, min_samples_split, and n_estimators values. The grid search will explore all 32 combinations of the RandomForestClassifier’s hyperparameter values, and it will train each model 5 times (since we are using five-fold cross-validation). In other words, all in all, there will be 32 x 5 = 160 rounds of training! It may take a long time, but when it is done you can get the best combination of hyperparameters like this:

forest_grid_search.best_params_

Since n_estimators=350 and max_depth=10 are the maximum values that were evaluated, you should probably try searching again with higher values; the score may continue to improve.

You can also get the best estimator directly:

forest_grid_search.best_estimator_

And of course, the evaluation score is also available:

forest_grid_search.best_score_

Our best score here is 94.59% accuracy, which is not bad for such a small parameter grid.

The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter space is large, it is often preferable to use RandomizedSearchCV instead. This class can be used in much the same way as the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration. This approach has two main benefits:

  • If you let a randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).
  • Simply by setting the number of iterations, you have more control over the computing budget you want to allocate to the hyperparameter search.

Let’s walk through the same example as before, but instead use RandomizedSearchCV. Since we are using RandomizedSearchCV, we can search a larger parameter space than we did with GridSearchCV:
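The exact distributions from the notebook aren’t shown here; the ranges and n_iter below are assumptions, sketched with scipy distributions to illustrate a wider space on a comparable training budget:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# A wider space than the grid above; the ranges are illustrative
param_distribs = {
    "bootstrap": [True, False],
    "max_depth": randint(low=5, high=30),
    "max_features": ["sqrt", "log2", None],
    "min_samples_leaf": randint(low=1, high=10),
    "min_samples_split": randint(low=2, high=20),
    "n_estimators": randint(low=100, high=500),
}

forest_rand_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distribs,
    n_iter=32,            # same training budget as the 32-combination grid
    cv=5, scoring="accuracy", random_state=42, n_jobs=-1)
forest_rand_search.fit(X_train, y_train)
```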

Same as above, we can see the best hyperparameters that were explored:

forest_rand_search.best_params_

Also the best estimator:

forest_rand_search.best_estimator_

And check the best score:

forest_rand_search.best_score_

Our best performance was 96.21% accuracy, beating GridSearchCV by about 1.6 percentage points. As you can see, RandomizedSearchCV lets us explore a larger hyperparameter space in roughly the same amount of time and generally yields better results than GridSearchCV.

You can now save this model, evaluate it on the test set, and, if you are satisfied with its performance, deploy it into production. Using a randomized search is not too hard, and it works well for many fairly simple problems.

When training is slow, however (e.g., for more complex problems with larger datasets), this approach will only explore a tiny portion of the hyperparameter space. You can partially alleviate this problem by assisting the search process manually: first, run a quick random search using wide ranges of hyperparameter values, then run another search using smaller ranges of values centered on the best ones found during the first run, and so on. This approach will hopefully zoom in on a good set of hyperparameters. However, it is very time-consuming, and probably not the best use of your time.

Fortunately, there are many techniques to explore a search space much more efficiently than randomly. Their core idea is simple: when a region of the space turns out to be good, it should be explored more. Such techniques take care of the “zooming” process for you and lead to much better solutions in much less time.

One such technique is called Bayesian Optimization, and we will use Scikit-Optimize (Skopt, https://scikit-optimize.github.io/) to perform it. Skopt is a general-purpose optimization library that performs Bayesian Optimization with its BayesSearchCV class, using an interface similar to GridSearchCV.

If you don’t have Skopt already installed go ahead and run the following line of code in your virtual environment:

! pip install scikit-optimize

There are two main differences when performing Bayesian Optimization with Skopt’s BayesSearchCV. First, when creating your search space, you need to define each hyperparameter’s space as a probability distribution rather than as a list, as you would with GridSearchCV. Skopt makes this easy with its skopt.space module, which lets us import Real, Integer, and Categorical to create these distributions:

  • Real: Continuous hyperparameter space.
  • Integer: Discrete hyperparameter space.
  • Categorical: Categorical hyperparameter space.

Below you can see examples of using both the Categorical and Integer functions. For Categorical spaces, simply pass a list to the function. For Integer spaces, pass the minimum and maximum values you want BayesSearchCV to explore.
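The search space from the original notebook isn’t reproduced here; the bounds below are assumptions, but they show how Categorical and Integer are used and how BayesSearchCV mirrors the GridSearchCV interface:

```python
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer
from sklearn.ensemble import RandomForestClassifier

# Categorical takes a list of options; Integer takes (min, max) bounds
search_space = {
    "bootstrap": Categorical([True, False]),
    "max_depth": Integer(5, 30),
    "max_features": Categorical(["sqrt", "log2"]),
    "min_samples_leaf": Integer(1, 10),
    "min_samples_split": Integer(2, 20),
    "n_estimators": Integer(100, 500),
}

forest_bayes_search = BayesSearchCV(
    RandomForestClassifier(random_state=42),
    search_space,
    n_iter=32,            # same budget as the randomized search above
    cv=5, scoring="accuracy", random_state=42, n_jobs=-1)
```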

The function on_step allows us to implement a form of early stopping and also prints the score after each iteration. Here we specify that after each iteration we want to print the best score so far, and that if the best score exceeds 98% accuracy, training is no longer necessary.
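A minimal sketch of that callback and of fitting the search with it (the 98% threshold comes from the text above; the rest follows skopt’s callback convention, where returning True stops the optimization):

```python
def on_step(optim_result):
    """Run after each iteration: report the best score so far and
    stop the search early once it exceeds 98% accuracy."""
    score = forest_bayes_search.best_score_
    print(f"Best score: {score:.4f}")
    if score >= 0.98:
        print("At least 98% accuracy reached, stopping the search.")
        return True  # returning True tells skopt to stop

forest_bayes_search.fit(X_train, y_train, callback=on_step)
```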

Just like in Scikit-Learn we can view the best parameters:

forest_bayes_search.best_params_

And the best estimator:

forest_bayes_search.best_estimator_

And the best score:

forest_bayes_search.best_score_

Bayesian Optimization allowed us to improve our accuracy by another full percentage point in the same number of iterations as the randomized search. I hope this convinces you to step outside the comfort zone of GridSearchCV and RandomizedSearchCV and try something new like BayesSearchCV in your next project. Hyperparameter searching can be tedious, but there are tools that can do the tedious work for you.

The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Not only can you use any imaginable network architecture, but even in a simple MLP you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and much more. It can be hard to know what combination of hyperparameters is best for your task.

One option is to simply try many combinations of hyperparameters and see which one works best on the validation set (or use K-fold cross-validation). For example, we can use GridSearchCV or RandomizedSearchCV to explore the hyperparameter space. To do this, we need to wrap our Keras models in objects that mimic regular Scikit-Learn classifiers. The first step is to create a function that will build and compile a Keras model, given a set of hyperparameters:
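The build_model() function itself isn’t shown in this post; the sketch below is consistent with the description that follows, but the default values, the ReLU activation, and the flat 784-feature input shape are assumptions:

```python
from tensorflow import keras

def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[784]):
    """Build and compile a simple MLP for 10-class digit classification."""
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=input_shape))
    for _ in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(10, activation="softmax"))
    optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=optimizer,
                  metrics=["accuracy"])
    return model
```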

This function creates a simple Sequential model for multi-class classification with the given input shape and the given number of hidden layers and neurons, and it compiles it using an SGD optimizer configured with the specified learning rate.

Next, let’s create a KerasClassifier based on this build_model() function:

keras_clf = keras.wrappers.scikit_learn.KerasClassifier(build_model)

The KerasClassifier object is a thin wrapper around the Keras model built using build_model(). This will allow us to use this object as a regular Scikit-Learn classifier: we can train it using its fit() method, then evaluate it using its score() method, and use it to make predictions using its predict() method.

We don’t want to train and evaluate a single model like this, though; we want to train hundreds of variants and see which one performs best on the validation set. Since there are many hyperparameters, it is preferable to use a randomized search rather than a grid search. Let’s try to explore the number of hidden layers, the number of neurons, and the learning rate:
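The search code isn’t reproduced here; in the sketch below the distributions and the training settings (n_iter, cv, epochs, the validation split, and the early-stopping patience) are all assumptions:

```python
import numpy as np
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV
from tensorflow import keras

# Scale pixel values to [0, 1] for the neural network
X_train_scaled = X_train / 255.0

# Illustrative distributions for the three hyperparameters we want to explore
param_distribs = {
    "n_hidden": [1, 2, 3],
    "n_neurons": np.arange(1, 100),
    "learning_rate": reciprocal(3e-4, 3e-2),  # log-uniform over this range
}

keras_rand_search = RandomizedSearchCV(
    keras_clf, param_distribs, n_iter=10, cv=3, random_state=42)

# Extra keyword arguments are passed through to the underlying Keras fit()
keras_rand_search.fit(
    X_train_scaled, y_train, epochs=30, validation_split=0.1,
    callbacks=[keras.callbacks.EarlyStopping(patience=5)])
```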

Now we can access the best parameters, estimator and score as we did in Scikit-Learn:

keras_rand_search.best_params_
keras_rand_search.best_score_

Our accuracy increased by another 0.5%! The last step is to see how each model performed on the test set (see below).
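The comparison code isn’t shown in this post; a minimal sketch, reusing the search objects and the held-out test split defined above, might be:

```python
from sklearn.metrics import accuracy_score

for name, search in [("Grid", forest_grid_search),
                     ("Randomized", forest_rand_search),
                     ("Bayesian", forest_bayes_search)]:
    y_pred = search.best_estimator_.predict(X_test)
    print(f"{name} search test accuracy: {accuracy_score(y_test, y_pred):.4f}")

# The Keras wrapper's score() returns accuracy, so the search's score() does too
print("Keras randomized search test accuracy:",
      keras_rand_search.score(X_test / 255.0, y_test))
```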

Hyperparameter tuning is still an active area of research, and new algorithms are being developed all the time. But having these basic algorithms in your back pocket can alleviate a lot of the tedious work of searching for the best hyperparameters.

Remember, a randomized search is almost always preferable to a grid search unless you have very few hyperparameters to explore. If you have a more complex problem with a larger dataset, you might want to turn to a technique that explores the search space more efficiently, such as Bayesian Optimization.

As always, any feedback and constructive criticism are greatly appreciated.

Feel free to check out the GitHub repository if you would like to see the presentation slides or the Jupyter notebook complete with the code and descriptions here.

Additional Resources

https://scikit-optimize.github.io/stable/
