Nevergrad: A Python library for performing derivative-free ML optimization (fb.com)
134 points by jimarcey on Dec 20, 2018 | 26 comments




Are there any practical PyTorch examples? Say my network's training time is 12 hours; I wonder how beneficial this would be for hyperparameter tuning over simple grid/random search. Or would I instrument my network to iterate over hyperparameters faster than once per epoch/run?


We have not yet released examples of interfacing with PyTorch. With a moderate number of hyperparameters, the benefit over random search may be moderate, whereas with a high number of hyperparameters it will be very significant. It also depends on how parallel your setup is. In all cases we have a wide range of algorithms behind a common interface, so you can compare them.

We also use it to train the weights of a network directly in reinforcement learning, not only for hyperparameter tuning.
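
A minimal sketch of what that common interface looks like, assuming a recent nevergrad release (the parametrization API has changed across versions) and using a toy stand-in for a real PyTorch training run:

    import nevergrad as ng

    # Stand-in for a real training run: train with the given hyperparameters
    # and return the validation loss (lr and momentum are just examples).
    def train_and_eval(lr, momentum):
        return (lr - 0.01) ** 2 + (momentum - 0.9) ** 2

    # The common interface makes it easy to compare algorithms:
    for name in ("OnePlusOne", "TwoPointsDE", "RandomSearch"):
        param = ng.p.Instrumentation(
            lr=ng.p.Log(lower=1e-5, upper=1e-1),          # searched on a log scale
            momentum=ng.p.Scalar(lower=0.0, upper=0.99),
        )
        opt = ng.optimizers.registry[name](parametrization=param, budget=50)
        rec = opt.minimize(train_and_eval)
        print(name, rec.kwargs)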


Can you elaborate on the benefit for a high number of hyperparameters?


A fundamental problem is that as the number of parameters increases, the probability of sampling from near the edge of the hypercube increases: in high dimensions, most of a hypercube's volume lies close to its boundary, so uniform samples do not effectively explore the parameter space. This might be somewhat alleviated by a concentrated multivariate normal, but I suspect that has its own caveats.

If you instead have a sampling algorithm informed by the loss function, you avoid this problem. (You might instead have to worry about local minima.)
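
A quick numerical illustration of the edge effect (assuming numpy): uniformly sampled points in the unit hypercube land within 0.05 of the boundary with probability 1 - 0.9^d, which rapidly approaches 1 as the dimension d grows:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100):
        x = rng.uniform(size=(100_000, d))
        # A sample is "near the edge" if any coordinate is within 0.05
        # of the boundary of the unit hypercube.
        near_edge = np.any((x < 0.05) | (x > 0.95), axis=1).mean()
        print(f"d={d:3d}: fraction near the edge = {near_edge:.3f}")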


For small numbers of hyperparameters, plain random search is sometimes enough. This is not an absolute rule; sometimes random search fails miserably with just 4 parameters. My empirical rule of thumb is that for hyperparameters in machine learning (certainly not in the general case), random search is often enough for 4 to 12 hyperparameters when the budget for the search is ~100 trainings.
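
For reference, the random-search baseline that rule of thumb refers to is just independent sampling under a fixed budget; a minimal sketch (the configuration space here is hypothetical):

    import random

    def random_search(sample_config, train_and_eval, budget=100):
        # Draw independent configurations and keep the best one seen.
        best_loss, best_config = float("inf"), None
        for _ in range(budget):
            config = sample_config()
            loss = train_and_eval(config)
            if loss < best_loss:
                best_loss, best_config = loss, config
        return best_config, best_loss

    best, loss = random_search(
        sample_config=lambda: {"lr": 10 ** random.uniform(-5, -1),
                               "dropout": random.uniform(0.0, 0.5)},
        train_and_eval=lambda c: (c["lr"] - 0.01) ** 2 + c["dropout"],  # toy stand-in
    )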


How does this compare to Hyperopt?


To the best of my knowledge, Hyperopt is limited to random search and Parzen variants. We have more algorithms, include test functions, and deal with noise. On the other hand, Hyperopt handles conditional variables naturally, whereas for the moment Nevergrad requires manual work from the user for this. Both frameworks are asynchronous.


... incidentally, Hyperopt has the advantage of handling conditional domains; we might either do the same or combine Nevergrad with Hyperopt...


Nevergrad?

I wonder if the maker never graduated.

;-)


Also, 'grad' means 'town' in Russian (example: Leningrad). So I can read it as either Neverton or Nikograd.


> Nevergrad offers an extensive collection of algorithms that do not require gradient computation


I like my idea better


You are not alone


Don’t know about the maker of this tool, but his or her CEO never did


Would this type of thing be suited for program synthesis or property-based testing?


For property-based testing I would say yes, with an objective function equal to the margin by which the properties are satisfied.

Program synthesis only in some particular cases, like the parametrization of programs for speed or another criterion, but not in the general case of program synthesis.
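
As a sketch of the property-based-testing idea (the function under test and the property here are made up for illustration), minimizing the satisfaction margin makes the optimizer hunt for inputs where the property is closest to failing:

    import nevergrad as ng

    def sort_under_test(xs):
        return sorted(xs)  # stand-in for the implementation being tested

    def margin(x1, x2, x3):
        # Property: the output must be non-decreasing. The margin is the
        # smallest gap between consecutive outputs; negative means violated.
        out = sort_under_test([x1, x2, x3])
        return min(b - a for a, b in zip(out, out[1:]))

    param = ng.p.Instrumentation(ng.p.Scalar(), ng.p.Scalar(), ng.p.Scalar())
    opt = ng.optimizers.OnePlusOne(parametrization=param, budget=200)
    rec = opt.minimize(margin)   # drives the margin toward (or below) zero
    print(rec.args, margin(*rec.args))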


Has anyone tried this? Interested to know if results were in line with the benchmarks.


We have a wide range of experiments on plenty of objective functions in games, reinforcement learning, real-world design, and machine learning hyperparameter tuning; these reports will come soon.


Isn't this the same thing as black-box learning?


It's black-box optimization. This means we only have an objective function, with no access to derivatives or any other information. It is not relevant for training the weights in deep learning for image classification, or other settings where the gradient works well.
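
Concretely, the optimizer only ever sees candidate points and the objective values reported back for them; nevergrad exposes this directly through its ask/tell interface (a sketch, assuming a recent release):

    import nevergrad as ng

    def objective(x):           # any black box; no derivatives available
        return sum(v ** 2 for v in x)

    opt = ng.optimizers.OnePlusOne(parametrization=2, budget=100)
    for _ in range(opt.budget):
        candidate = opt.ask()               # optimizer proposes a point
        loss = objective(candidate.value)   # we only report a number back
        opt.tell(candidate, loss)
    print(opt.provide_recommendation().value)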


There was a recent paper from Uber showing that GA works well for weights, so I wouldn't drop that area right away.


Sure, GA can be great for weights as well, but mainly when the gradient is unreliable. I would not use Nevergrad for training the weights of a convolutional network for image classification, for example; whereas I do use Nevergrad for WorldModels.


Doesn't the model Uber used begin with a bunch of convolutional layer sets, since it processes raw images?


What’s GA here?


GA stands for genetic algorithms.
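
For context, the Uber result ("Deep Neuroevolution") used a deliberately simple GA: keep the best weight vectors, add Gaussian noise to copies of them, and repeat, with no crossover. A toy sketch of that loop (the fitness function is a stand-in for an RL episode return):

    import numpy as np

    def fitness(w):
        # Stand-in for running an episode with weights w (higher is better).
        return -np.sum(w ** 2)

    rng = np.random.default_rng(0)
    pop = rng.normal(size=(50, 10))                  # 50 candidate weight vectors
    for generation in range(100):
        scores = np.array([fitness(w) for w in pop])
        elite = pop[np.argsort(scores)[-10:]]        # truncation selection: top 10
        # Next generation: mutated copies of randomly chosen elites.
        parents = elite[rng.integers(0, len(elite), size=len(pop))]
        pop = parents + 0.02 * rng.normal(size=pop.shape)
    print(max(fitness(w) for w in pop))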



