Disclaimer. This tutorial is primarily about multifidelity optimization: settings where you can access, evaluate and benefit from cheaper low-fidelity objective function evaluations to obtain high-quality solutions at a fraction of the cost that high-fidelity-only evaluations would require. The title of this tutorial mentions DNNs (Deep Neural Networks) because we use the optimization of DNNs for illustration purposes; the same material can be applied to other domains. We intentionally don't provide any PyTorch or TensorFlow example code because it would be expensive to reproduce and would likely be irrelevant (in terms of demonstrating state-of-the-art deep learning results) by the time you read this tutorial. We also don't provide a comprehensive introduction to, or comparison with, the relevant literature on the topic; that would require us to write a journal paper here instead.
ELI5 explanation of this tutorial. If you want to find a solution which will best satisfy a large group of people at once (an expensive high-fidelity evaluation of what you want), you can first try various solutions on a few people, or even just one (a cheaper low-fidelity evaluation of what you want), and then select the best solutions for the large audience. Whether you will save a lot of resources or waste them depends on how well the few people you pick represent your target audience.
The source code of this tutorial is available for
Optimization of DNNs consists of finding solutions which satisfy a set of user-defined requirements on their performance metrics. In the simplest scenario, the network architecture is fixed and we are searching for hyperparameter settings (e.g., the learning rate, weight decay, regularization coefficient, batch size, etc.) of the training algorithm in use, such as momentum SGD, Adam or a more powerful method. In a more advanced scenario, we may have a higher-level system/controller which sets/schedules these hyperparameters as a function of the task at hand (e.g., dataset properties, constraints on our computational budget, etc.). The hyperparameters of that controller then become our hyperparameters to be optimized. Similarly, we may have hyperparameters (e.g., the network's depth and layer-wise width) which are used to set up the architecture of our DNNs as an instance in some user-defined search space of possible architectures. In an even more advanced scenario, we may have hyperparameters which govern a system/controller used to grow (and train with derivative-based methods) our network, and thus search in a space of possible architectures primarily constrained by growth operators.
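To make the simplest scenario above concrete, a fixed-architecture hyperparameter search space could be written down as follows. This is only a sketch: the parameter names, ranges and the dictionary format are our own illustrative choices, not part of the tutorial's benchmark.

```python
import math
import random

# A hypothetical search space for the fixed-architecture scenario
# (names and ranges are purely illustrative).
search_space = {
    "learning_rate": {"type": "float", "min": 1e-5, "max": 1.0, "scale": "log"},
    "weight_decay":  {"type": "float", "min": 1e-6, "max": 1e-2, "scale": "log"},
    "momentum":      {"type": "float", "min": 0.0,  "max": 0.99, "scale": "linear"},
    "batch_size":    {"type": "int",   "min": 16,   "max": 512,  "scale": "log"},
}

def sample_uniform(space, rng):
    """Draw one candidate solution uniformly (in the declared scale) from the space."""
    candidate = {}
    for name, spec in space.items():
        lo, hi = spec["min"], spec["max"]
        if spec["scale"] == "log":
            # Sample uniformly in log space for scale-free parameters.
            value = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:
            value = rng.uniform(lo, hi)
        candidate[name] = int(round(value)) if spec["type"] == "int" else value
    return candidate
```

Sampling in log space for parameters such as the learning rate is a common convention because their useful values span several orders of magnitude.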
Let us assume that each candidate solution of one of the above-described scenarios is encoded by a vector
To complete the definition of our toy benchmark problem, we should come up with some way to schedule
In other words, optimal/target hyperparameter settings
Note that the optimal parameter values at
Our optimization problem has 4 parameters to be optimized and 1 cost parameter which tells the optimizer that there is a way to control the fidelity of objective function evaluations. We introduce not only the objective function to be minimized but also a cost function which measures the actual cost of each evaluation. For instance, if the cost parameter determines the number of batch passes, one should expect that for very short training runs of, e.g., 1/10 of an epoch, the relative cost and overhead of constructing the DNN model and loading the data can be huge. The solver assumes/learns that there is some link (e.g., linear or quadratic) between cost variables and cost metrics.
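A minimal toy problem in the spirit just described (our own construction, not the tutorial's exact benchmark formulas) can be sketched as follows: a 4-dimensional error whose optimum depends on the cost parameter, evaluations that are noisier at low fidelity, and a measured cost that is linear in the requested cost with a fixed overhead.

```python
import numpy as np

FULL_COST = 100.0  # assumed cost parameter value of a full-fidelity evaluation

def target(cost):
    """Hypothetical cost-dependent optimum: it drifts towards 0.5 per
    coordinate as the fidelity approaches the full cost."""
    return np.full(4, 0.5) * (cost / FULL_COST)

def objective(x, cost, rng=None):
    """Toy error to minimize: squared distance to the cost-dependent optimum,
    plus noise that shrinks as the fidelity grows."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(scale=1.0 / cost)
    return float(np.sum((np.asarray(x) - target(cost)) ** 2) + noise)

def measured_cost(cost):
    """Measured cost of one evaluation: linear in the requested cost parameter,
    plus a fixed overhead (e.g., model construction and data loading)."""
    return 2.0 + cost
```

The fixed overhead term models the observation above that very short training runs pay a disproportionate price for model construction and data loading.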
Candidate solutions will be evaluated in a function called "evaluate" which implements
In order to communicate with Indie Solver, you should initialize a worker and tell it to connect to a particular IP address and port while providing your secret authentication token. All this information is available on your
The following JSON request describing the problem to be created will be sent to Indie Solver:
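The exact request schema is defined by the service; purely for illustration, a problem description matching the setup above (4 parameters plus 1 cost parameter, an error metric and a cost metric) could be assembled and serialized roughly like this. Every field name below is our hypothetical guess, not the actual Indie Solver API.

```python
import json

# Purely illustrative problem description; the real request schema used by
# Indie Solver may differ, and every field name below is hypothetical.
problem = {
    "problem_name": "toy_multifidelity_benchmark",
    "parameters": [
        {"name": "x1", "type": "float", "min": 0.0, "max": 1.0},
        {"name": "x2", "type": "float", "min": 0.0, "max": 1.0},
        {"name": "x3", "type": "float", "min": 0.0, "max": 1.0},
        {"name": "x4", "type": "float", "min": 0.0, "max": 1.0},
        {"name": "cost", "type": "float", "min": 1.0, "max": 100.0, "is_cost": True},
    ],
    "metrics": [
        {"name": "error", "goal": "minimize"},
        {"name": "measured_cost", "is_cost_metric": True},
    ],
}

request_body = json.dumps(problem, indent=2)  # serialized JSON sent to the server
```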
Your worker can request solutions iteratively, one after another, by calling "ask_new_solutions". The reply to this call will include the solutions to be evaluated. One should check the status of the reply in case the server detected errors, in order to avoid unnecessary and potentially expensive calls to the objective function. Once some solutions are evaluated, one can communicate their metrics to the server by calling "tell_metrics".
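Schematically, the ask/evaluate/tell loop just described could look as follows. Apart from the two method names mentioned above ("ask_new_solutions" and "tell_metrics"), everything here, including the shape of the reply, is our own hypothetical sketch of a client interface.

```python
def optimization_loop(worker, evaluate, n_iterations=1000):
    """Schematic ask/evaluate/tell loop; `worker` stands for an already
    connected client object (hypothetical interface apart from the two
    method names used in the text)."""
    for _ in range(n_iterations):
        reply = worker.ask_new_solutions()
        # Check the reply status first to avoid wasting expensive evaluations
        # when the server has reported an error.
        if reply.get("status") != "ok":
            break
        for solution in reply["solutions"]:
            metrics = evaluate(solution)  # the expensive objective/cost evaluation
            worker.tell_metrics(solution["id"], metrics)
```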
The loop described above will request and evaluate solutions 1000 times. If you are currently on our free subscription plan, then only 80 function evaluations will be performed because of the limit of at most 20x the number of parameters function evaluations per problem (here, 20 x 4 = 80). If you are a researcher, contact us to get a free extension of the limit to 1000x, which corresponds to 4000 function evaluations on this problem.
The entire code of this tutorial example is given below:
The figure below demonstrates the optimization results available in your
We change the x-axis to use the measured costs. We also add a transformation to sum up the values on the x-axis, in order to plot the results w.r.t. the total cost used so far. Note that one full-fidelity run corresponds to
One can add another transformation to show the best objective function values found so far. The figure below shows that the total cost needed to reach an error value of about 1 is on the order of 500, i.e., about 5 full-fidelity evaluations:
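The two plot transformations described above amount to simple cumulative operations over the per-evaluation records: a cumulative sum of measured costs on the x-axis and a running minimum of error values on the y-axis. A minimal sketch (assuming you have the raw per-evaluation costs and errors as arrays):

```python
import numpy as np

def convergence_curve(costs, errors):
    """x: total cost spent so far (cumulative sum of measured costs);
    y: best objective value found so far (running minimum of errors)."""
    total_cost = np.cumsum(costs)
    best_so_far = np.minimum.accumulate(np.asarray(errors))
    return total_cost, best_so_far
```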
The following parallel coordinates figure will be shown by default:
We reorder some of the parallel coordinates, swap the colors, and select a range of interest for error values (the corresponding solutions are given in the table below):
While we demonstrated that, thanks to the use of low-fidelity evaluations, one can find a good approximation of the optimum in about 5 full-fidelity evaluations, here we propose to take a more critical look at the results. We consider four different scenarios: i) running random search with only high-fidelity full-cost evaluations (referred to as the full-cost scenario), ii) the same scenario but with Indie Solver, iii) an optimistic scenario when
Given the number of variables and the number of available function evaluations, random search does not look like a competitive option. The remaining three scenarios demonstrate much faster convergence than random search, and the differences among the three of them are smaller.
The scenario of multifidelity A is an interesting case where low-fidelity evaluations give worse objective function values than high-fidelity evaluations, but the landscape of the search space is the same in terms of isolines of the objective function values. As an example, consider the case when your hyperparameters are used to set up the training algorithm and they are very stable across different budgets of epochs, i.e., work equally well for 10 epochs and 1000 epochs. While you don't get the same validation error after 10 epochs as after 1000 epochs because
The scenario of multifidelity B is the one where optimal hyperparameter values change as a function of the allowed cost. The figure given above shows that one can still benefit from faster convergence, but when the best low-fidelity solutions don't predict the best high-fidelity solutions, one could actually run the full-cost optimization without spending time on low-fidelity evaluations at all. We provide this rather negative result in order to illustrate that multifidelity optimization can provide a great speedup (any factor from very small to orders of magnitude) when low-fidelity evaluations are informative, but in certain cases it is overkill to use it. The latter observation is important because it motivates formulating the search space in a way that accounts for scalability across different cost budgets, e.g., you can formulate hyperparameters as a function of the cost budget (e.g., epochs or wall-clock time) and thus make them more robust/invariant to it.
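One hypothetical way to make a hyperparameter a function of the cost budget, as suggested above, is to optimize the parameters of a schedule expressed in relative training progress rather than an absolute per-budget value. In the sketch below (our own illustration, using a standard cosine-decay form), the optimized quantity `base_lr` then transfers across epoch budgets:

```python
import math

def effective_learning_rate(base_lr, epoch, total_epochs):
    """Hypothetical budget-aware formulation: `base_lr` parameterizes a
    schedule defined over *relative* progress, so the same optimized setting
    applies whether the budget is 10 epochs or 1000 epochs."""
    progress = epoch / total_epochs  # lies in [0, 1] for any epoch budget
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

With this formulation, a low-fidelity run (small `total_epochs`) and a high-fidelity run traverse the same schedule shape, which makes the optimized `base_lr` more robust/invariant to the cost budget.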
In this tutorial we demonstrated how to set up and solve a simple multifidelity optimization problem relevant to hyperparameter and architecture search of deep neural networks.
Let us know at