This post was originally published by Frank Odom at Towards Data Science
We demonstrate an efficient method for simultaneously tuning discrete and continuous hyperparameters for machine learning models using policy gradients.
In my previous article, I showed how to build policy gradients from scratch in Python, and we used it to tune discrete hyperparameters for machine learning models. (If you haven’t read it already, I’d recommend starting there.) Now, we’ll build on that progress, and extend policy gradients to optimize continuous parameters as well. By the end of this article, we’ll have a full-fledged method for simultaneously tuning discrete and continuous hyperparameters.
Review of Policy Gradients
From last time, recall that policy gradients minimizes the following cost function for tuning hyperparameters:

$$ J(\theta) = -\mathbb{E}_{a \sim p_\theta}\left[ r(a) \right] $$

where $a$ is the set of hyperparameters chosen for a particular experiment, and $\theta$ represents all trainable parameters for our PG model. Then, $p_\theta(a)$ denotes the probability of selecting action $a$, and $r(a)$ is the "reward" received for that action. We then showed that:

$$ \nabla_\theta J(\theta) = -\mathbb{E}_{a \sim p_\theta}\left[ r(a) \, \nabla_\theta \log p_\theta(a) \right] $$
The above equation tells us how to update our PG model, given a set of actions and their observed rewards. For discrete hyperparameters, we directly updated the relative log-probabilities (logits) for each possible action: under a softmax policy, the logit $\theta_i$ for each choice $i$ moves in proportion to $r \left( \mathbb{1}[i = a] - p_\theta(i) \right)$, which is exactly $r \, \nabla_{\theta_i} \log p_\theta(a)$.
This approach will not work for continuous hyperparameters, because we cannot possibly store the log-probability for every possible outcome! We need a new method for generating continuous random variables and their relative log-probabilities.
Extending to Continuous Hyperparameters
In the field of reinforcement learning, continuous action spaces are commonly modeled using Gaussian policies. The idea is pretty straightforward: our model predicts the mean and standard deviation of a Gaussian distribution, and we draw actions/predictions from it using a random number generator.
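The original code listing is not reproduced here, so the following is a minimal sketch of what such a `GaussianActor` might look like, based on the `self.mu` and `self.log_sigma` attributes and the update rules discussed in the text. The learning rate, its 1000x multiplier on the mean (discussed below), and all defaults are illustrative assumptions.

```python
import numpy as np

class GaussianActor:
    """Samples one continuous hyperparameter from a Gaussian policy.

    The model is as simple as possible: just a mean and a
    log-standard deviation. All defaults here are illustrative.
    """

    def __init__(self, mu=0.0, log_sigma=0.0, lr=1e-4, mu_lr_scale=1000.0):
        self.mu = mu
        self.log_sigma = log_sigma
        self.lr = lr
        # The mean needs a much larger effective learning rate than the
        # log-standard deviation; see the discussion below.
        self.mu_lr_scale = mu_lr_scale

    def action(self):
        # Sample from N(mu, sigma^2).
        return float(np.random.normal(self.mu, np.exp(self.log_sigma)))

    def update(self, action, reward):
        # Move (mu, log_sigma) along reward * grad(log p(action)).
        sigma_sq = np.exp(2 * self.log_sigma)
        d_mu = (action - self.mu) / sigma_sq
        d_log_sigma = (action - self.mu) ** 2 / sigma_sq - 1.0
        self.mu += self.lr * self.mu_lr_scale * reward * d_mu
        self.log_sigma += self.lr * reward * d_log_sigma
```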
Again, we've chosen the simplest possible model: just store the mean and log-standard deviation for each parameter. To understand the update rules for `self.log_sigma` above, let's compute some gradients. The probability density function for a Gaussian distribution is:

$$ p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) $$
Or equivalently, written in terms of the log-standard deviation:

$$ p(x) = \frac{e^{-\log\sigma}}{\sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2 \, e^{-2\log\sigma}}{2} \right) $$
To obtain the log-probability, simply take the natural logarithm:

$$ \log p(x) = -\log\sigma - \frac{1}{2}\log(2\pi) - \frac{(x - \mu)^2 \, e^{-2\log\sigma}}{2} $$
where we've used the following mathematical identities to help simplify the expression:

$$ \log(ab) = \log a + \log b, \qquad \log(a^b) = b \log a, \qquad \log e^x = x $$
After all that work, computing gradients from here isn't too difficult. Take the partial derivative with respect to $\mu$ and $\log\sigma$ (or just use WolframAlpha), and we find:

$$ \frac{\partial \log p(x)}{\partial \mu} = \frac{x - \mu}{\sigma^2}, \qquad \frac{\partial \log p(x)}{\partial \log\sigma} = \frac{(x - \mu)^2}{\sigma^2} - 1 $$
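These gradients are easy to sanity-check numerically. The snippet below compares the closed-form expressions against central finite differences of the log-density; all function names are illustrative.

```python
import math

def log_prob(x, mu, log_sigma):
    # Gaussian log-density, written in terms of the log-standard deviation.
    return (-log_sigma - 0.5 * math.log(2 * math.pi)
            - 0.5 * (x - mu) ** 2 * math.exp(-2 * log_sigma))

def analytic_grads(x, mu, log_sigma):
    # Closed-form gradients of log_prob with respect to mu and log_sigma.
    sigma_sq = math.exp(2 * log_sigma)
    return (x - mu) / sigma_sq, (x - mu) ** 2 / sigma_sq - 1.0

def numeric_grads(x, mu, log_sigma, eps=1e-6):
    # Central finite differences of log_prob, for comparison.
    d_mu = (log_prob(x, mu + eps, log_sigma)
            - log_prob(x, mu - eps, log_sigma)) / (2 * eps)
    d_ls = (log_prob(x, mu, log_sigma + eps)
            - log_prob(x, mu, log_sigma - eps)) / (2 * eps)
    return d_mu, d_ls
```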
This explains how parameters are updated inside of `GaussianActor`… almost. There's a practical issue with our formulation, despite all of the nice mathematics: the mean of our `GaussianActor` changes very slowly compared to the standard deviation. Essentially, this is because `self.log_sigma` is in log-space, and updates to `self.log_sigma` cause exponentially larger changes to the standard deviation. We can manually increase the learning rate for `self.mu` to encourage it to keep up. Multiplying by a factor of 1000 tends to work well in practice.
This is, in my opinion, an ugly band-aid for an otherwise elegant solution. But it really does work. I don’t recommend using this hack for most applications of policy gradients, because it degrades the stability of our model during training. In our case, efficiency is much more important than stability, because we’re limited to a fixed number of experiments. And unlike most applications, we won’t be re-using the trained model for anything — we just use it to estimate hyperparameters, and then discard the model altogether.
Toy Problem: Continuous Hyperparameters
Let's apply what we've learned. We can make up a set of hyperparameters, ground-truth values, and an objective function. This time, we'll use mean squared error as the objective rather than mean absolute error. The choice of objective function is not critical, and any sensible option should work, although some choices are more sample-efficient than others. Empirically, mean squared error gave the best results for these experiments.
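The original setup code is not shown here, so the sketch below is a hypothetical reconstruction: the parameter names and ground-truth values are invented, and the [0, 100] search range is an assumption.

```python
import numpy as np

# Hypothetical search space: three continuous hyperparameters with
# made-up ground-truth values in the range [0, 100].
params = [
    {"name": "alpha", "type": "continuous", "target": 65.0},
    {"name": "beta", "type": "continuous", "target": 12.0},
    {"name": "gamma", "type": "continuous", "target": 41.0},
]

def objective(actions, params=params):
    # Mean squared error between sampled and ground-truth values.
    return float(np.mean([(a - p["target"]) ** 2
                          for a, p in zip(actions, params)]))
```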
Notice that we added the “type” key to each parameter dictionary. This will prove helpful later on, when we incorporate discrete hyperparameters as well. Let’s also set up a function to create actors for each parameter type.
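One plausible shape for that factory function is sketched below. The actor classes are passed in as arguments here so the sketch stays self-contained; the original presumably referenced its discrete and Gaussian actor classes directly.

```python
def make_actor(param, discrete_cls, continuous_cls):
    # Dispatch on the "type" key in each parameter dictionary.
    if param["type"] == "discrete":
        return discrete_cls(param["choices"])
    if param["type"] == "continuous":
        return continuous_cls()
    raise ValueError(f"Unknown parameter type: {param['type']!r}")
```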
Finally, we can show the complete policy gradients code. Only a small change is needed to allow for continuous parameters.
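As a sketch of how the pieces fit together, here is one way the optimization loop could look, assuming each actor exposes `action()` and `update(action, reward)` methods. The batch size and the batch-normalized reward are illustrative choices, not the original implementation.

```python
import numpy as np

def tune(actors, objective, budget=250, batch_size=10):
    """Illustrative policy-gradient tuning loop.

    `objective` maps a list of actions to an error we want to minimize,
    and `budget` caps the total number of objective evaluations.
    """
    best_actions, best_error = None, float("inf")
    for _ in range(budget // batch_size):
        samples = []
        for _ in range(batch_size):
            actions = [actor.action() for actor in actors]
            error = objective(actions)
            if error < best_error:
                best_actions, best_error = actions, error
            samples.append((actions, error))
        errors = np.array([error for _, error in samples])
        # Use the negative, batch-normalized error as the reward, so
        # better-than-average samples receive positive updates.
        rewards = -(errors - errors.mean()) / (errors.std() + 1e-8)
        for (actions, _), reward in zip(samples, rewards):
            for actor, action in zip(actors, actions):
                actor.update(action, reward)
    return best_actions, best_error
```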
To measure the performance of our algorithm, let's repeat the full tuning procedure roughly 1000 times and observe how accurate the predictions are. Again, we'll use a budget of 250: that's the maximum number of times that new hyperparameters can be evaluated in each run. This will allow for an apples-to-apples comparison with our previous implementation for discrete hyperparameters only.
Average MAE: 2.155
How does this compare against a random search? To achieve the same MAE, you'd need to guess within ±4.31 of each ground-truth value, since a uniform guess inside that window has an expected absolute error of 2.155. There's roughly a 0.0862³ = 6.41E-4 probability of randomly selecting such a set of parameters, so on average it would take 1561 random experiments to match the accuracy of policy gradients. In this case, PG is roughly 6.25x more efficient than random search. That's lower efficiency than we saw for discrete parameters (>10x), but still pretty respectable.
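The comparison is easy to verify as a back-of-the-envelope calculation. The search range width of 100 per hyperparameter is an assumption here, inferred from the quoted per-parameter probability of 0.0862.

```python
# Assumed search range width per hyperparameter (inferred, not stated).
range_width = 100.0
tolerance = 4.31        # +/- window whose uniform-guess MAE is 2.155
n_params = 3
budget = 250

p_single = 2 * tolerance / range_width      # 0.0862
p_all = p_single ** n_params                # ~6.41e-4
expected_trials = 1.0 / p_all               # ~1561 random experiments
efficiency = expected_trials / budget       # ~6.25x over random search
```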
Another Toy Problem: Mixed Hyperparameters
At this point, we have confirmed that policy gradients works for tuning discrete and continuous hyperparameters separately. Let's put those pieces together and see how simultaneous tuning performs. No additional changes to the code are needed: just update the hyperparameter definitions and re-run the experiments. We'll also increase the budget to 1000, since we're optimizing a larger set of parameters.
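The mixed search space itself is not shown here; the sketch below is a hypothetical example of what it might look like, with three discrete and three continuous parameters. All names, choices, and targets are invented for illustration.

```python
# Hypothetical mixed search space: the "type" key lets the actor
# factory pick a logit-based or Gaussian actor for each parameter.
mixed_params = [
    {"name": "num_layers", "type": "discrete", "choices": [1, 2, 3, 4], "target": 3},
    {"name": "batch_size", "type": "discrete", "choices": [16, 32, 64, 128], "target": 32},
    {"name": "num_trees", "type": "discrete", "choices": [10, 25, 50, 100], "target": 50},
    {"name": "alpha", "type": "continuous", "target": 65.0},
    {"name": "beta", "type": "continuous", "target": 12.0},
    {"name": "gamma", "type": "continuous", "target": 41.0},
]
```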
Average MAE: 2.088
Pause for a brief second, and appreciate the significance of that result. Using only 1000 experiments, policy gradients achieves a mean error of just 2.1% across 6 separate hyperparameters. There’s roughly a 0.0835⁶ = 3.39E-7 probability of randomly drawing a similar set of parameters, which means PG is now nearly 3000x more efficient than random search!
The benefit of using policy gradients sharply increases as the size of our search space grows. That’s because we’re concurrently (not sequentially) tuning our hyperparameters. If we had simply tuned the discrete parameters first, and then tuned the continuous parameters, PG would have been >60x more efficient than random search. (Just multiply the relative efficiencies of discrete and continuous searches.) Instead, we’re able to use every experiment in our budget to optimize all hyperparameters simultaneously, and efficiency skyrockets.
I'm constantly amazed by the power and flexibility of policy gradients. In a few hundred lines of Python, we've created an efficient, production-ready optimizer for discrete and continuous hyperparameters. It works equally well with any machine learning algorithm (neural networks, decision trees, k-nearest neighbors, etc.), because external gradients are not needed for optimization. If you need faster performance, each batch of experiments can also be parallelized using built-in Python modules like `concurrent.futures`. I avoided doing that here for simplicity.
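As a quick illustration of that idea (the function names here are invented, and a `ProcessPoolExecutor` would be the better fit for CPU-bound training jobs):

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(actions):
    # Stand-in for training a model with the given hyperparameters and
    # returning its validation error.
    return sum((a - 3.0) ** 2 for a in actions)

def evaluate_batch(batch, max_workers=4):
    # Evaluate a whole batch of candidate hyperparameter settings at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_experiment, batch))
```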
After this demo, I hope you're as enamored with policy gradients as I am. For more information on PG, I strongly recommend OpenAI Spinning Up, which contains a plethora of background material, research papers, and open-source software relating to reinforcement learning.