*This post was originally published by Frank Odom at Towards Data Science*

## We demonstrate an efficient method for simultaneously tuning discrete and continuous hyperparameters for machine learning models using policy gradients.

### Overview

In my previous article, I showed how to build policy gradients from scratch in Python, and we used it to tune *discrete* hyperparameters for machine learning models. (If you haven’t read it already, I’d recommend starting there.) Now, we’ll build on that progress, and extend policy gradients to optimize continuous parameters as well. By the end of this article, we’ll have a full-fledged method for simultaneously tuning discrete and continuous hyperparameters.

### Review of Policy Gradients

From last time, recall that policy gradients optimizes the following cost function for tuning hyperparameters:

$$J(\theta) = -\,\mathbb{E}_{a \sim p(\cdot;\,\theta)}\big[\, r(a) \,\big]$$

where *a* is the set of hyperparameters chosen for a particular experiment, and *theta* represents all trainable parameters for our PG model. Then, *p* denotes the probability of selecting action *a*, and *r* is the “reward” received for that action. We then showed that:

$$\nabla_\theta J(\theta) = -\,\mathbb{E}\big[\, r(a)\, \nabla_\theta \log p(a;\,\theta) \,\big]$$

The above equation tells us how to update our PG model, given a set of *actions* and their observed *rewards.* For discrete hyperparameters, we directly updated the relative log-probabilities (*logits*) for each possible action:
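As a quick refresher, a discrete actor along those lines can be sketched as follows (the class and method names here are illustrative, not the exact code from the previous article):

```python
import numpy as np

class DiscreteActor:
    """Stores one logit per possible action; samples through a softmax."""

    def __init__(self, num_actions: int, lr: float = 0.1):
        self.logits = np.zeros(num_actions)
        self.lr = lr

    def probs(self) -> np.ndarray:
        # Numerically stable softmax over the stored logits
        e = np.exp(self.logits - self.logits.max())
        return e / e.sum()

    def act(self, rng: np.random.Generator) -> int:
        return rng.choice(len(self.logits), p=self.probs())

    def update(self, action: int, reward: float) -> None:
        # Gradient of log softmax w.r.t. logits: one-hot(action) - probs
        grad = -self.probs()
        grad[action] += 1.0
        self.logits += self.lr * reward * grad
```

A positive reward nudges the chosen action's logit up relative to the rest; a negative reward pushes it down.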

This approach will not work for continuous hyperparameters, because we cannot possibly store the log-probability for every possible outcome! We need a new method for generating *continuous* random variables and their relative log-probabilities.

### Extending to Continuous Hyperparameters

In the field of reinforcement learning, continuous action spaces are commonly modeled with *Gaussian distributions*. The idea is pretty straightforward: our model predicts the mean and standard deviation of a Gaussian distribution, and we sample actions/predictions from it using a random number generator.
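A minimal sketch of such an actor is shown below. The name `GaussianActor` and the attributes `self.mu` / `self.log_sigma` follow the post’s description; the learning rates and the `mu_lr_scale` knob (which anticipates the learning-rate hack discussed later in this section) are illustrative assumptions:

```python
import numpy as np

class GaussianActor:
    """Continuous actor: stores the mean and log-standard deviation of a
    Gaussian, samples actions from it, and follows the policy gradient."""

    def __init__(self, mu: float = 0.0, log_sigma: float = 0.0,
                 lr: float = 0.01, mu_lr_scale: float = 1000.0):
        self.mu = mu
        self.log_sigma = log_sigma
        self.lr = lr
        self.mu_lr_scale = mu_lr_scale  # boosts the slow-moving mean (see below)

    def act(self, rng: np.random.Generator) -> float:
        return rng.normal(self.mu, np.exp(self.log_sigma))

    def update(self, action: float, reward: float) -> None:
        sigma2 = np.exp(2.0 * self.log_sigma)
        # d(log p)/d(mu) = (x - mu) / sigma^2
        grad_mu = (action - self.mu) / sigma2
        # d(log p)/d(log_sigma) = (x - mu)^2 / sigma^2 - 1
        grad_log_sigma = (action - self.mu) ** 2 / sigma2 - 1.0
        self.mu += self.lr * self.mu_lr_scale * reward * grad_mu
        self.log_sigma += self.lr * reward * grad_log_sigma
```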

Again, we’ve chosen the simplest possible model — just store the mean and log-standard deviation for each parameter. To understand the update rules for `self.mu` and `self.log_sigma` above, let’s compute some gradients. The probability density function for a Gaussian distribution is:

$$p(x) = \frac{1}{\sigma \sqrt{2\pi}}\, \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

Or equivalently, written in terms of the *log-standard deviation* $s = \log \sigma$:

$$p(x) = \frac{e^{-s}}{\sqrt{2\pi}}\, \exp\left( -\frac{(x - \mu)^2\, e^{-2s}}{2} \right)$$

To obtain the log-probability, simply take the natural logarithm,

$$\log p(x) = -s - \tfrac{1}{2} \log(2\pi) - \tfrac{1}{2} (x - \mu)^2\, e^{-2s}$$

where we’ve used the following mathematical identities to help simplify the expression:

$$\log(ab) = \log a + \log b, \qquad \log e^x = x, \qquad \frac{1}{\sigma} = e^{-\log\sigma}$$

After all that work, computing gradients from here isn’t too difficult. Take the partial derivative with respect to *mu* and *log-sigma* (or just use WolframAlpha), and we find:

$$\frac{\partial \log p(x)}{\partial \mu} = (x - \mu)\, e^{-2s} = \frac{x - \mu}{\sigma^2}, \qquad \frac{\partial \log p(x)}{\partial s} = (x - \mu)^2\, e^{-2s} - 1 = \frac{(x - \mu)^2}{\sigma^2} - 1$$
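These expressions are easy to sanity-check numerically. A quick finite-difference comparison (purely illustrative):

```python
import math

def log_prob(x, mu, s):
    """Log-density of a Gaussian parameterized by mean and log-std."""
    return -s - 0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 * math.exp(-2 * s)

x, mu, s = 1.3, 0.4, -0.2
eps = 1e-6

# Analytic gradients from the derivation above
grad_mu = (x - mu) * math.exp(-2 * s)
grad_s = (x - mu) ** 2 * math.exp(-2 * s) - 1.0

# Central finite differences
fd_mu = (log_prob(x, mu + eps, s) - log_prob(x, mu - eps, s)) / (2 * eps)
fd_s = (log_prob(x, mu, s + eps) - log_prob(x, mu, s - eps)) / (2 * eps)

assert abs(grad_mu - fd_mu) < 1e-5
assert abs(grad_s - fd_s) < 1e-5
```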

This explains how parameters are updated inside of `GaussianActor`… almost. There’s a practical issue with our formulation, despite all of the nice mathematics: the *mean* of our `GaussianActor` changes very slowly compared to the *standard deviation*. Essentially, this is because `self.log_sigma` lives in log-space, and updates to `self.log_sigma` cause exponentially larger changes to the standard deviation. We can manually increase the learning rate for `self.mu` to encourage it to keep up. Multiplying by a factor of 1000 tends to work well in practice.

This is, in my opinion, an ugly band-aid for an otherwise elegant solution. But it really does work. I don’t recommend using this hack for most applications of policy gradients, because it degrades the stability of our model during training. In our case, efficiency is much more important than stability, because we’re limited to a fixed number of experiments. And unlike most applications, we won’t be re-using the trained model for anything — we just use it to estimate hyperparameters, and then discard the model altogether.

### Toy Problem: Continuous Hyperparameters

Let’s apply what we’ve learned. We can make up a set of hyperparameters, ground truth values, and an objective function. This time, we’ll use *mean squared error* as an objective, rather than *mean absolute error*. The choice of objective function is not critical: any sensible option should work, though some are more sample-efficient than others. MSE and MAE are both safe choices, and empirically, mean squared error gave the best results for these experiments.
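A setup in that spirit might look like this sketch (parameter names, ranges, and ground-truth values are invented for illustration; the original gist is not reproduced here):

```python
import numpy as np

# Hypothetical search space: each parameter gets a "type" key, a search
# range, and a hidden ground-truth value the optimizer must recover.
PARAMS = {
    "alpha": {"type": "continuous", "low": 0.0, "high": 100.0, "truth": 37.2},
    "beta":  {"type": "continuous", "low": 0.0, "high": 100.0, "truth": 61.5},
    "gamma": {"type": "continuous", "low": 0.0, "high": 100.0, "truth": 12.8},
}

def objective(actions: dict) -> float:
    """Mean squared error against the ground truth, negated so that
    larger values are better (i.e., a reward)."""
    errors = [(actions[k] - PARAMS[k]["truth"]) ** 2 for k in PARAMS]
    return -float(np.mean(errors))
```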

Notice that we added the “type” key to each parameter dictionary. This will prove helpful later on, when we incorporate discrete hyperparameters as well. Let’s also set up a function to create actors for each parameter type.

Finally, we can show the complete policy gradients code. Only a small change is needed to allow for continuous parameters.
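To make the structure concrete, here is a self-contained sketch of the whole loop for three continuous parameters. Everything here (names, learning rates, the batch-wise reward standardization, and the step clipping that tames the boosted mean updates) is an illustrative stand-in for the post’s actual gist:

```python
import numpy as np

class GaussianActor:
    """Continuous actor: samples from N(mu, sigma) and applies the
    policy-gradient updates derived above."""

    def __init__(self, mu: float, log_sigma: float,
                 lr: float = 0.01, mu_lr_scale: float = 100.0):
        self.mu = mu
        self.log_sigma = log_sigma
        self.lr = lr
        # The post boosts the mean's learning rate by 1000x; a smaller
        # boost plus step clipping keeps this toy example tame.
        self.mu_lr_scale = mu_lr_scale

    def act(self, rng: np.random.Generator) -> float:
        return rng.normal(self.mu, np.exp(self.log_sigma))

    def update(self, action: float, reward: float) -> None:
        sigma2 = np.exp(2.0 * self.log_sigma)
        grad_mu = (action - self.mu) / sigma2
        grad_ls = (action - self.mu) ** 2 / sigma2 - 1.0
        step_mu = self.lr * self.mu_lr_scale * reward * grad_mu
        self.mu += float(np.clip(step_mu, -2.0, 2.0))
        self.log_sigma = float(np.clip(
            self.log_sigma + self.lr * reward * grad_ls,
            np.log(0.3), np.log(50.0)))

# Made-up tuning problem: recover three hidden values on a [0, 100] scale.
TRUTH = {"alpha": 37.2, "beta": 61.5, "gamma": 12.8}

def objective(actions: dict) -> float:
    """Reward = negative mean squared error against the ground truth."""
    return -float(np.mean([(actions[k] - TRUTH[k]) ** 2 for k in TRUTH]))

def tune(budget: int = 250, batch_size: int = 10, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    actors = {k: GaussianActor(mu=50.0, log_sigma=np.log(25.0)) for k in TRUTH}
    for _ in range(budget // batch_size):
        # Sample a batch of candidate hyperparameter sets
        batch = [{k: a.act(rng) for k, a in actors.items()}
                 for _ in range(batch_size)]
        rewards = np.array([objective(b) for b in batch])
        # Standardize rewards within the batch so their scale doesn't matter
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        for acts, r in zip(batch, rewards):
            for k in actors:
                actors[k].update(acts[k], float(r))
    return {k: a.mu for k, a in actors.items()}
```

Calling `tune()` returns the estimated hyperparameters; with a budget of 250, the means should drift toward the hidden ground-truth values.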

To measure the performance of our algorithm, let’s repeat the full tuning procedure 1000 or so times and observe how accurate the final predictions are. Again, each run uses a *budget* of 250 — that’s the maximum number of times that new hyperparameters can be evaluated. This will allow for an apples-to-apples comparison with our previous implementation for discrete hyperparameters only.

```
Average MAE: 2.155
```

How does this compare against a random search? To achieve the same MAE, you’d need to guess within ±4.31 of each ground truth value. There’s roughly a *0.0862³ = 6.41E-4* probability of randomly selecting such a set of parameters. On average, it would take *1561* random experiments before matching the accuracy of policy gradients! In this case, PG is roughly *6.25x* more efficient than random search. That’s lower efficiency than we saw for discrete parameters (*>10x*), but still pretty respectable.
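The arithmetic behind that comparison is easy to reproduce, assuming (as the percentage errors suggest) that each parameter lives on a [0, 100] scale:

```python
# An MAE of 2.155 corresponds to uniform guesses within +/- 4.31 of each
# ground-truth value, since the mean of |U(-4.31, 4.31)| is 4.31 / 2.
per_param = 2 * 4.31 / 100          # probability of one lucky guess
p_all = per_param ** 3              # three continuous parameters at once
expected_guesses = 1 / p_all        # expected random experiments needed
speedup = expected_guesses / 250    # our budget was 250 experiments
```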

### Another Toy Problem: Mixed Hyperparameters

At this point, we have confirmed that policy gradients works for tuning continuous and discrete hyperparameters *separately*. Let’s put those pieces together, and see how *simultaneous* parameter tuning works. No additional changes are needed — just update the hyperparameters and re-run the experiments. We’ll also increase the *budget* to 1000, since we’re optimizing a larger set of parameters.
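For concreteness, a mixed search space of that shape might be declared like this (names, choices, and ground-truth values are made up; discrete choices are kept numeric so the percent-error metric still applies):

```python
# Hypothetical mixed search space: three continuous and three discrete
# hyperparameters, each tagged with the "type" key used to pick an actor.
PARAMS = {
    "alpha": {"type": "continuous", "truth": 37.2},
    "beta":  {"type": "continuous", "truth": 61.5},
    "gamma": {"type": "continuous", "truth": 12.8},
    "delta": {"type": "discrete", "choices": list(range(0, 101, 10)), "truth": 40},
    "eps":   {"type": "discrete", "choices": list(range(0, 101, 5)),  "truth": 85},
    "zeta":  {"type": "discrete", "choices": list(range(0, 101, 25)), "truth": 75},
}
```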

```
Average MAE: 2.088
```

Pause for a brief second, and appreciate the significance of that result. Using only *1000* experiments, policy gradients achieves a mean error of just *2.1%* across *6* separate hyperparameters. There’s roughly a *0.0835⁶ = 3.39E-7* probability of randomly drawing a similar set of parameters, which means PG is now nearly **3000x more efficient** than random search!

The benefit of using policy gradients sharply increases as the size of our search space grows. That’s because we’re *concurrently* (not sequentially) tuning our hyperparameters. If we had simply tuned the discrete parameters first, and then tuned the continuous parameters, PG would have been *>60x* more efficient than random search. (Just multiply the relative efficiencies of discrete and continuous searches.) Instead, we’re able to use *every* experiment in our budget to optimize *all* hyperparameters simultaneously, and efficiency skyrockets.

### Takeaways

I’m constantly amazed by the power and flexibility of policy gradients. In a few hundred lines of Python, we’ve created an efficient, production-ready optimizer for discrete and continuous hyperparameters. It works equally well with *any* machine learning algorithm (neural networks, decision trees, k-nearest neighbors, etc.), because external gradients are not needed for optimization. If you need faster performance, each batch of experiments can also be parallelized using built-in Python modules like `multiprocessing` or `concurrent.futures`. I avoided doing that here for simplicity.

After this demo, I hope you’re as enamored with policy gradients as I am. For more information on PG, I strongly recommend OpenAI Spinning Up, which contains a plethora of background material, research papers, and open-source software relating to reinforcement learning.
