Multifidelity optimization on toy surrogates of Deep Neural Networks

We invite you to read our introduction tutorials [1, 2] to get started with Indie Solver. If you have any questions or suggestions, let us know at contact@indiesolver.com.

Disclaimer. This tutorial is primarily about the use of multifidelity optimization, where you can access, evaluate and benefit from cheaper low-fidelity objective function evaluations to obtain high-quality solutions at a fraction of the cost that would be required with high-fidelity-only evaluations. The title of this tutorial mentions DNNs (Deep Neural Networks) because we refer to the optimization of DNNs for illustration purposes; the same material can be applied to other domains. We intentionally don't provide any PyTorch or TensorFlow example code because it would be expensive to reproduce and would likely be irrelevant (in terms of demonstrating state-of-the-art deep learning results) by the time you read this tutorial. We also don't provide a comprehensive introduction to and comparison with the relevant literature on the topic, as that would require writing a journal paper here instead.

ELI5 explanation of this tutorial. If you want to find a solution which will best satisfy a large group of people at once (an expensive high-fidelity evaluation of what you want), you can first try various solutions on a few people or only one person (a cheaper low-fidelity evaluation of what you want) and then select the best solutions for the large audience. Whether you will save a lot of resources or waste them depends on how well the few people that you pick represent your target audience.

The source code of this tutorial is available for download and includes two files: IndieSolver.js (our thin wrapper code to communicate with Indie Solver) and DNN1.js (our code for this tutorial). You will need to install system-sleep, a module for sleep() events. Consider modifying our wrapper code if our implementation does not fit well into your asynchronous Node code.

Optimization of DNNs consists of finding solutions which satisfy a set of user-defined requirements on their performance metrics. In the simplest scenario, the network architecture is fixed and we are searching for hyperparameter settings (e.g., the learning rate, weight decay, regularization coefficient, batch size, etc.) of the training algorithm used, such as momentum SGD, Adam or a more powerful method. In a more advanced scenario, we may have a higher-level system/controller which sets/schedules these hyperparameters as a function of the task at hand (e.g., dataset properties, constraints on our computational budget, etc.). Thus, the hyperparameters of the controller can become our hyperparameters to be optimized. Similarly, we may have some hyperparameters (e.g., the network's depth and layer-wise width) which are used to set up the architecture of our DNNs as an instance in some user-defined search space of possible architectures. In an even more advanced scenario, we may have hyperparameters which govern a system/controller used to grow (and train with derivative-based methods) our network and thus search in a space of possible architectures primarily constrained by growth operators.

Let us assume that each candidate solution of one of the above-described scenarios is encoded by a vector x. Then, the error to be minimized can be defined by the following toy objective function

\begin{aligned} & \text{minimize} \,\, f(x)= \sum^{n}_{i=1} \left| w_i (x_{i} - x^*_i) \right|\\[0.5em] & \text{subject to} \,\, x \in [0, 1]^n, \end{aligned}

where x^* is the optimal/target solution and w is a vector defining the weights of individual hyperparameters. This optimization problem is simple, noiseless (a great over-simplification) and separable, i.e., the optimum can be achieved by a one-dimensional search along each variable separately (we could introduce a rotation matrix to make the problem non-separable, but we keep it simple here).

If f(x) models the validation error on some dataset, then the fact that f(x)=0 when x = x^* is uncomfortable because we rarely expect zero validation error. In practice, we invest some resource, e.g., GPU time, wall-clock time, money or any other costly resource. For the amount c of the invested resource (e.g., corresponding to performing 100 full epoch passes on our dataset), we expect to get some good validation error, but not better than some optimal validation error f^*(c) bounded by what is achievable given all the imperfections of our training algorithms and network architectures (note that we constrain the architecture search space in some suboptimal way). We modify our objective so that f(x)=f^*(c) when x = x^*. Then, we should consider the practical aspect that optimal hyperparameter settings x^* for some small computational budget c (e.g., 1 epoch) are not necessarily the same as for much larger budgets (e.g., 100 epochs). This means that our optimal x^* is a function of c. By taking the cost variable c into account, we arrive at the following formulation:

\begin{aligned} & \text{minimize} \,\, f(x,c)= f^*(c) + \sum^{n}_{i=1} \left| w_i (x_{i} - x^*(c)_i) \right|\\[0.5em] & \text{subject to} \,\, x \in [0, 1]^n, \,\, c \geq 0. \end{aligned}

If x controls the network's architecture and not only the hyperparameter settings of the training algorithm, then interpreting c as the total number of batch passes can be problematic: there is a good chance that the optimizer will suggest larger and larger networks if they tend to provide better error values, because the training is bounded by the total number of epochs and not by wall-clock time. Instead, you can consider limiting your training by wall-clock time and then deriving the total number of epochs and other components accordingly (e.g., let the cosine-annealed learning rate decay as a function of wall-clock time and not of batch pass index). Measuring the performance of different networks under the same wall-clock time cost is a reasonable option, but any other cost metric can be considered.

To complete the definition of our toy benchmark problem, we should come up with some way to schedule x^*(c) and f^*(c). For c changing between c_0=0 and c_{max}=100 (an arbitrary choice), we pick the following scheduling rule:

\begin{aligned} & x^*(c) = x^*(c_{max}) - (x^*(c_{max}) - x^*(c_{0})) / (1 + \sqrt{c})\\[0.5em] & f^*(c) = 10.0 / (1 + \sqrt{c}). \end{aligned}

In other words, optimal/target hyperparameter settings x^*(c) change from some value x^*(c_{0}) at the low-cost (or low-fidelity) budget c_0 (e.g., 1 hour) to some value x^*(c_{max}) at the high-cost (or high-fidelity) budget c_{max} (e.g., 100 hours).

We set w = [0.5, 1, 1.5, 2.0], x^*(c_{0})=[0.7, 0.8, 0.8, 0.5] and x^*(c_{max})=[0.9, 0.8, 0.7, 0.6], which corresponds to the following dynamics:

Note that the optimal parameter values at c=20 seem to be similar to the ones at c=100, but this visually small difference corresponds to a factor-of-2 difference in objective function value. Such significant sensitivity can happen in practice if, e.g., the raw parameter value x_1\in[0,1] corresponds to the learning rate LR whose values are mapped to [10^{-5}, 10^0] as LR=10^{-5 + 5x_1}. Alternatively, it could be the number of convolutional filters varied in [2^5, 2^{12}].
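For illustration, the learning-rate mapping mentioned above can be written as follows (the helper name is hypothetical):

```js
// Map a raw parameter x1 in [0, 1] to a learning rate in [1e-5, 1] on a
// log scale: LR = 10^(-5 + 5 * x1). A visually small change of x1 from
// 0.7 to 0.9 changes LR by a factor of 10 (from ~0.0316 to ~0.316).
function toLearningRate(x1) {
    return Math.pow(10, -5 + 5 * x1);
}
```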

Our optimization problem has 4 parameters to be optimized and 1 cost parameter, which tells the optimizer that there is a way to control the fidelity of objective function evaluations. We introduce not only the objective function to be minimized but also a cost function which measures the actual cost of each evaluation. For instance, if the cost determines the number of batch passes, one should expect that for very short training runs of, e.g., 1/10 of an epoch, the relative cost and overhead of the DNN's model construction and data loading can be huge. The solver assumes/learns that there is some link (e.g., linear or quadratic) between cost variables and cost metrics.
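A hypothetical sketch of the corresponding "problem" object is given below; all field names here are illustrative guesses, so consult the Indie Solver documentation for the actual schema:

```js
// Sketch of a possible problem definition: 4 continuous parameters,
// 1 cost parameter, 1 objective metric and 1 cost metric.
const problem = {
    "problem_name": "DNN1",
    "parameters": {
        "x1": {"type": "continuous", "min": 0.0, "max": 1.0},
        "x2": {"type": "continuous", "min": 0.0, "max": 1.0},
        "x3": {"type": "continuous", "min": 0.0, "max": 1.0},
        "x4": {"type": "continuous", "min": 0.0, "max": 1.0},
        // The cost parameter tells the solver that evaluation
        // fidelity can be controlled through this variable.
        "c":  {"type": "cost", "min": 0.0, "max": 100.0}
    },
    "metrics": {
        "error": {"goal": "minimize"}, // the objective f(x, c)
        "cost":  {"type": "cost"}      // the measured cost of the evaluation
    }
};
```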

Candidate solutions will be evaluated in a function called "evaluate" which implements f(x,c).
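Given the definitions above, a minimal sketch of this function could look as follows (the returned metric names match the hypothetical problem definition above):

```js
// Toy objective f(x, c) with the schedules x*(c) and f*(c) defined earlier.
const w = [0.5, 1.0, 1.5, 2.0];        // weights of individual hyperparameters
const xOpt0 = [0.7, 0.8, 0.8, 0.5];    // x*(c_0): optimum at the low-fidelity end
const xOptMax = [0.9, 0.8, 0.7, 0.6];  // x*(c_max): optimum at the high-fidelity end

function evaluate(x, c) {
    const s = 1.0 / (1.0 + Math.sqrt(c));
    let error = 10.0 * s;              // f*(c) = 10 / (1 + sqrt(c))
    for (let i = 0; i < x.length; i += 1) {
        // x*(c) = x*(c_max) - (x*(c_max) - x*(c_0)) / (1 + sqrt(c))
        const xOpt = xOptMax[i] - (xOptMax[i] - xOpt0[i]) * s;
        error += Math.abs(w[i] * (x[i] - xOpt));
    }
    // We assume the measured cost of an evaluation is simply c itself.
    return { error: error, cost: c };
}
```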


In order to communicate with Indie Solver, you should initialize a worker and tell it to connect to a particular IP address and port while providing your secret authentication token. All this information is available on your profile page. Then, you can create a problem according to your previously defined "problem" object. Note that every time you create a problem with some "problem_name", the server will overwrite all the information available for that "problem_name". Therefore, when you want to continue optimizing an already existing problem, you can request its description by calling "ask_problem_description" and set it for your worker by calling "set_problem", as shown in the last two commented-out lines below:
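A sketch of these steps (the wrapper method names come from this tutorial, but the exact signatures are guesses; see IndieSolver.js for the actual interface):

```js
// Initialize a worker and connect to Indie Solver. Replace the
// placeholder address, port and token with the values from your
// profile page.
const IndieSolver = require('./IndieSolver.js');

const worker = new IndieSolver.Worker();
worker.connect('xx.xx.xx.xx', 8080, 'YOUR_SECRET_TOKEN');
worker.create_problem(problem); // overwrites any existing "DNN1" problem

// To continue optimizing an already existing problem, replace the
// create_problem call with the following two lines:
// const description = worker.ask_problem_description('DNN1');
// worker.set_problem(description);
```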

The following JSON request describing the problem to be created will be sent to Indie Solver (the exact wire format is defined by the Indie Solver API; the sketch below is an illustrative guess that simply wraps the "problem" object defined above):
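```json
{
    "request": "create_problem",
    "problem": {
        "problem_name": "DNN1",
        "parameters": {
            "x1": {"type": "continuous", "min": 0.0, "max": 1.0},
            "x2": {"type": "continuous", "min": 0.0, "max": 1.0},
            "x3": {"type": "continuous", "min": 0.0, "max": 1.0},
            "x4": {"type": "continuous", "min": 0.0, "max": 1.0},
            "c":  {"type": "cost", "min": 0.0, "max": 100.0}
        },
        "metrics": {
            "error": {"goal": "minimize"},
            "cost":  {"type": "cost"}
        }
    }
}
```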

Your worker can ask for solutions iteratively, one after another, by calling "ask_new_solutions". The reply to this call will include solutions to be evaluated. In case some errors were detected by the server, one should check the status of the reply in order to avoid unnecessary and potentially expensive calls to the objective function. Once some solutions are evaluated, one can communicate the results to the server by calling "tell_metrics".
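A sketch of this ask/tell loop might look as follows (the reply field names and status convention are assumptions):

```js
// Ask/tell optimization loop: request a solution, evaluate it,
// report the metrics back to the server.
const sleep = require('system-sleep');

for (let iEval = 0; iEval < 1000; iEval += 1) {
    const reply = worker.ask_new_solutions();
    if (reply.status !== 'success') { // assumed status convention
        sleep(1000);                  // wait one second before asking again
        continue;                     // skip expensive evaluations on error
    }
    for (const solution of reply.solutions) {
        // Field names x1..x4, c and id are illustrative.
        const x = [solution.x1, solution.x2, solution.x3, solution.x4];
        const metrics = evaluate(x, solution.c);
        worker.tell_metrics(solution.id, metrics); // report error and cost
    }
}
```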


The loop described above will request and evaluate solutions 1000 times. If you are currently on our free subscription plan, only 80 function evaluations will be performed because of the limit of at most 20x the number of parameters in function evaluations per problem (20 x 4 = 80 here). If you are a researcher, contact us to get a free extension of the limit to 1000x, which corresponds to 4000 function evaluations on this problem.

The entire code of this tutorial example is given in DNN1.js, which is included in the source code download mentioned at the beginning of this tutorial.


The figure below demonstrates the optimization results available in your dashboard. Note that the x-axis represents the total number of evaluations of the objective function f(x,c), and that the calls are made at very different fidelity levels: a call at c=100 is 100 times more costly than one at c=1.

We change the x-axis to use the measured costs. We also add a transformation to sum up the values given on the x-axis in order to plot the results w.r.t. the total cost used so far. Note that one full-fidelity run corresponds to c=100, and thus in order to estimate the total number of full-fidelity runs one should divide the number given on the x-axis by 100. The vast majority of function calls belong to low-fidelity evaluations, which are cheap.

One can add another transformation to show the best objective function values found so far. The figure below shows that the total cost needed to reach an error value of about 1 is on the order of 500, i.e., about 5 full-fidelity evaluations:

The following parallel coordinates figure will be shown by default:

We swap some parallel coordinates, swap colors and select a range of interest for error values (the corresponding solutions are given in the table below):

While we demonstrated that, thanks to the use of low-fidelity evaluations, one can find a good approximation of the optimum in about 5 full-fidelity evaluations, here we take a more critical look at the results. We consider four different scenarios: i) running random search with only high-fidelity full-cost evaluations (referred to as the full-cost scenario), ii) the same scenario but with Indie Solver, iii) an optimistic scenario where x^*(c=100)=x^*(c=0) (referred to as the multifidelity A scenario), and iv) the scenario considered in this tutorial where x^* and f^* change over time (referred to as the multifidelity B scenario). The results for all four scenarios are given below (the results of random search are averaged over 1000 runs).

Given the number of variables and the number of available function evaluations, random search does not look like a competitive option. The remaining three scenarios demonstrate much faster convergence than random search, and the differences among the three of them are smaller.

The multifidelity A scenario is an interesting case where low-fidelity evaluations give worse objective function values than high-fidelity evaluations, but the landscape of the search space is the same in terms of isolines of the objective function values. As an example, consider the case when your hyperparameters are used to set up the training algorithm and they are very stable across different epoch budgets, i.e., work equally well for 10 epochs and for 1000 epochs. While you don't get the same validation error after 10 epochs as after 1000 epochs because f^* changes, you can exploit short 10-epoch runs to guess the right hyperparameter settings for longer runs. This scenario is optimistic and typically does not hold in practice, especially if your hyperparameters control the network's architecture. The speedup that we demonstrate here for scenario A is about a factor of 5 (depending on how you measure it).

The multifidelity B scenario is the one where optimal hyperparameter values change as a function of the allowed cost. The figure given above shows that one can benefit from faster convergence, but in the case when the best low-fidelity solutions don't predict the best high-fidelity solutions, one could actually run the full-cost optimization without wasting time on low-fidelity evaluations at all. We provide this rather negative result to illustrate that multifidelity optimization can provide a great speedup (any factor from very small to orders of magnitude) when low-fidelity evaluations are informative, but in certain cases it is overkill. The latter observation is important because it motivates formulating the search space in a way that accounts for scalability and different cost runs; e.g., you can formulate hyperparameters as functions of the cost budget (e.g., epochs or wall-clock time) and thus make them more robust/invariant to cost budgets.

In this tutorial we demonstrated how to set up and solve a simple multifidelity optimization problem relevant to hyperparameter and architecture search for deep neural networks. Let us know at contact@indiesolver.com if you need help setting up your own experiment.