In this post, I want to explain why diversifying the investments in a portfolio is a good idea. You might have heard that "diversification is the only free lunch in investing", but why is that the case? To explain it, I need to give you a very short introduction to Modern Portfolio Theory. It is a mathematical topic involving statistics, but I will make sure that anybody with the attention span to reach the end of the article can grasp the fundamental reason why diversification helps. No equations involved.

Diversification is the only free lunch in investing.

– Nobel Prize laureate Harry Markowitz

Take the following three examples. Each has the same growth rate of 7% per year, but a different volatility.

The first asset we could invest in has an expected return on investment of 7% per year, with a volatility of 2%. Shown above are 5 different possible futures for the next 10 years. In some cases we double our investment; in others we gain 50%. The volatility is considerably smaller than the return on investment, so the underlying exponential curve is clearly visible.

The second asset we could invest in has an expected return on investment of 7% per year, with a volatility of 15%. This matches the volatility and growth of the S&P 500. Shown here are 5 different possible futures for the next 10 years. In one case, we make a small loss. In some of the other cases, we quadruple our original investment.

The third thing we could invest in has an expected return on investment of 7% per year, with a volatility of 38%. This is akin to the volatility of Bitcoin. In these 5 cases, one breaks even, one goes to nearly zero after a crash in year 4, one does as well as the low-volatility case, and one grows sixfold after a spurt in year 8. As this illustrative example shows, the growth is now completely dominated by the volatility.

There is a strict mathematical definition for the 'volatility' I am using here, namely the standard deviation of the logarithmic return on investment over one year. But no worries if this definition is not clear; it is not essential for the story.

What is important to grasp is that volatility is a measure of how uncertain you are of your returns. When volatility is low, you will typically get a growth close to your expected growth. When volatility is high, you will get almost random results. If it is too high, the probability of making a profit approaches fifty-fifty. At that point, you might as well bet on coin flips rather than invest.

Therefore, Harry Markowitz argued that you want to limit the volatility of your investments. The thing you want is a *guaranteed* growth of your portfolio, not just an expected growth in some average case. After all, *your* future might not be an average one.

Let's take a look at a simple example. Instead of thinking about stochastic processes like stocks, let us look at a much simpler stochastic problem. Imagine I have a six-sided die and an imaginary asset whose return on investment, in percent, is the number of pips I roll.

What is the return of such a dice-stock? Well, the expected return on investment is 3.5%. You can trust me that the volatility is about 1.7%. If my portfolio only invests in this stock, the return and volatility of my portfolio will be exactly 3.5% and 1.7% as well.

Now, what happens if I diversify my portfolio over more of these dice-stocks? So instead of investing my whole portfolio into this one dice-stock, I invest half of it into a red die stock, and half of it into a blue die stock.

Well, my expected return on investment is still 3.5%. But this time, the volatility is only 1.2%! How can we see this on a more intuitive level?

Well, take a look at all possible returns my single-die-portfolio has, and how likely it is to get each of these returns:

This kind of probability distribution should not be too unexpected when you throw a single die. Now, let us see what happens when we roll two dice and look at the average number of pips:

The reason for this distribution is that there are more ways to throw a 7 with two dice (1+6, 2+5, 3+4, 4+3, 5+2, 6+1) than there are to throw a 2 (only 1+1) or a 12 (only 6+6). Therefore, if you look at the average pips of both dice, you have more ways to get a 3.5% return than a 1% or 6% return.

So apparently, when you diversify your portfolio over two separate dice-stocks, your expected return on investment is still the same, *but you are more likely to get that expected return*. In other words, your return stays the same, but your volatility becomes lower!
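If you want to check these numbers yourself, a few lines of Python enumerating every possible outcome confirm them (this snippet is my own illustration, not part of the original argument):

```
from itertools import product

faces = range(1, 7)

# Single-die portfolio: returns are 1..6 percent, each with probability 1/6.
mean_1 = sum(faces) / 6
vol_1 = (sum((f - mean_1) ** 2 for f in faces) / 6) ** 0.5

# Two-dice portfolio: the return is the average of two independent rolls.
averages = [(a + b) / 2 for a, b in product(faces, repeat=2)]
mean_2 = sum(averages) / len(averages)
vol_2 = (sum((x - mean_2) ** 2 for x in averages) / len(averages)) ** 0.5

print(round(mean_1, 1), round(vol_1, 2))  # 3.5 1.71
print(round(mean_2, 1), round(vol_2, 2))  # 3.5 1.21
```

The return is unchanged, while the volatility drops by a factor of the square root of two, exactly as claimed.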

Let us look at the best case scenario. What if the blue die would always do the exact opposite of the red die? Mind you, it is still a blue die, with an expected return of 3.5% and a volatility of 1.7%, but it just magically happens to always do the opposite of the red die. So if red throws 1, blue throws 6. If red throws 5, blue throws 2. Let's visualise the returns of a two-dice portfolio with this magic blue die.

Wow, so if the blue die does the opposite of the red die, you always get exactly your expected return! There is zero volatility here. Not surprisingly, as the sum of the red and blue die will always be 7. Here, you are guaranteed to get exactly your expected return.

This magic blue die is however the best case scenario. Let's imagine the worst case scenario: the blue die magically always throws the same as the red die. Still, both dice have a return of 3.5%, both dice have a volatility of 1.7%. Let's have a look at the behaviour of our portfolio.

Well, that looks exactly the same as our original portfolio, where we were only investing in the red die: an expected return of 3.5% with a volatility of 1.7%. Why is this? Well, if both dice always show the same number of pips, and you average their pips, you might as well throw a single die and look at that one. Nothing has changed.

While the above dice-based stocks might be artificial, what they say about portfolios is true in general! If you have multiple funds with similar return and volatility:

- If you diversify over similar stocks, your return stays the same, but the volatility goes down!
- In the absolute best case of this diversified approach, you can keep your returns while removing all volatility.
- In the absolute worst case, you cannot do worse than having picked a single fund. So diversifying can only improve your portfolio, it cannot make it worse.

And that is why diversification is called "a free lunch": you can only lower your volatility, and it does not come at a cost in returns. And as we showed with the examples at the beginning of the article, low volatility is desirable because it makes returns more reliable.

However, based on the above, this free lunch does come with some caveats:

- First, past volatility and returns will not necessarily stay the same in the future. In this post, we effectively assumed that we know the future volatility and expected return of our assets, which of course we do not.
- Second, the above works for *similar* investments. But when stocks are dissimilar, the explanation becomes a bit more complex. Modern Portfolio Theory is essentially the exact approach to the problem of dissimilar assets.

And that's it! If you have questions, feel free to leave a comment below, no account needed. There is no algorithm that needs nudging to send more people this way, but I can update this post based on your comments for future readers.

I figured I needed to know more about the history of democracy if I was ever going to fix this broken system of ours, or at least gain an understanding of *why* "the system" is the way it is. That much was obvious after hearing David Stasavage talk on the Mindscape podcast.

This post is a quick review of his latest book, "The Decline and Rise of Democracy". In short, I would rate it 9/10. Extremely interesting book; however, I wish it were a bit longer and spent a little more time on the current state of democracy around the world. Nonetheless, it is a thorough overview of the origins of democracy and gives a peek into how democracy and autocracy worked throughout history and around the world.

First, the author lays out that, fundamentally, the many ways of organising politics can be divided into two larger groups: more towards democracy, and more towards autocracy. Don't think of democracy as our modern, parliamentarian, particratic, corporate, oligarchic way of organising society, but more as very low levels of hierarchy and tribal councils. Similarly, for autocracy, think of pharaohs and emperors, where a single person wields a lot of decision-making power.

The main thesis of the book is that, from a statistical point of view, in the times before bureaucratic technologies like writing, three things led to tribes being more democratic.

- First off, if people can walk away, there is a bigger chance of them living in an early democracy. That makes total sense if you think about it. If I am in a tribe where I have to pay taxes but get no say in any decisions, I am sodding off the first chance I get. Conversely, if I have fewer chances to escape and start anew, I am more likely to be stuck in an autocracy.
- Second, if people have some kind of leverage on their leader, having a democracy is more likely. For example, say a rowdy tribe lives next door to us, waiting for a chance to violently take over our clan. If I do not get to have an impact on decision making in our clan, I am more likely to be all pacifist when this enemy tribe is passing by my front door on the way to our leader. The tribe leader is probably less enthusiastic about being killed by the enemy tribe, so he might set up a more democratic system where I get to feel like I have a say, to prevent the above from happening.
- Third, if it is hard to raise taxes, you are more likely to have a democracy. This one takes a bit more to explain.

So why, when it is hard to raise taxes, are you more likely to have a democracy? Imagine you run a small tribe: you will need to set a level of taxes for your subjects. You need these taxes to run your military and the bureaucracy that collects the taxes. If you set this level too low, you are not collecting enough to run your bureaucracy. If you set it too high, you incite a riot or your peons starve. You need to get this level right.

Now, when you are an early democracy, this level of taxes is set by agreement and in the common interest. If everyone skimps, the tribe might be overrun by neighbouring tribes or break down due to internal bickering. At the same time, nobody wants to contribute too much either. However, everyone is also well aware of how much they *can* pay, based on the land they work and the last harvest. So starving due to taxes is less likely, and riots even more so. The decisions are made by the people who have the best information.

However, if you are an autocrat, *you* need to set the tax levels. Since these taxes are collected by your tax-collecting bureaucrats, the system cannot be too complex or arbitrary either. And if you get it wrong, you will go under, either to an unpaid military and a collapsing bureaucracy, or to a rioting population.

Jean-Baptiste Colbert (finance minister to Louis XIV) is said to have described the problem of taxation as plucking a goose to obtain as many feathers as possible, without making it hiss.

One of the interesting places where this third point becomes clear is in comparing regions where harvests are easy to predict with regions where harvests are unpredictable from afar. Take for example China or Egypt, with uniform, fertile loess soil around the yearly flooding Nile or Yellow River. It is very easy to set up a tax system in this situation, as you do not need to know how fertile a farmer's soil is this year; all soil does pretty much the same every year. As a consequence, you find giant autocratic systems in these places.

On the other hand, take a look at Huron tribes around the Great Lakes in North America, where the fertility varies quickly depending on where you are and depending on the weather every year. A farmer a little higher up might have a lot more rocky soil, and some years the farmer down below can get ruined by a local drought. How could you set up taxes that your small bureaucracy can handle? The autocrats do not have the information available to assess a good tax level, as it is too variable.

When you look at the data of these examples and the dozens more in the book, you will find a tendency of more early democratic approaches among tribes that live in more unpredictable conditions.

Where it becomes really interesting is the transition from early democracy to something more mature, when technology enters the stage. And the result of Stasavage's analysis is fascinating.

Unlike the common narrative in which Europe's industrial revolution and democratisation went hand in hand, Stasavage claims that if you dig deeper, the effect goes the other way. It turns out that civilisations that have been democratic can still become autocratic once they pick up technologies such as map-making or writing. However, once autocracies pick up these technologies, the odds of them becoming more democratic go down. So on a larger scale, technology seems to inhibit democracy.

And that does not have to be surprising. Writing helps decrease the information asymmetry autocrats have concerning how productive their subjects are. Writing makes it possible for information to flow from the low levels to the high levels. Map-making empowers bureaucrats to streamline the economy by forcing people to move to places where they can be more productive, but also easier to control. On top of that, technological improvements such as crop rotation made farming more reliable, and thus easier to tax.

This decline in democracy, which happened thousands of years ago, is what the first half of the book's title refers to. So, how did democracy rise again?

The book goes on to describe how the West stayed more democratic all the way to the 21st century. For one, Europe lagged considerably behind the rest of the world in technological know-how far into the Middle Ages. On top of that, the European patchwork of soil types made centralisation hard to achieve, despite a plethora of medieval kings trying really hard to get their way. So when industrialisation really took off, Europe already had some strong early democracies, and it has mostly kept up that tradition in the Western modern democracies we know today.

A variety of other subjects are touched upon in the book: democracy in the United States (obviously), the rise of democracy in Africa in recent decades, and what happened in the Arab world, Russia and China. But also voting rights for women, slavery, inequality, the role of mandates, *Quod Omnes Tangit* and many other topics that dig down to the fundamentals of democracy.

In general, with broad strokes, it seems that autocracies do not go away unless the people go really hungry due to a weak state. These insights lead to some poignant observations one could make about the democratic optimism in the West, like the one below:

At the time of the Tiananmen Square protests thirty years ago Western observers sometimes suggested that economic development would necessarily push China in the direction of democracy. [...] It is always easy to pass judgment on erroneous predictions like these, but in retrospect, a deeper look at history might have led to different predictions, even in 1989.

According to the book, China is a prime example of the bureaucratic alternative to democracy. And while they went from empire to communist state, the bureaucratic heritage is something that goes back for thousands of years, all the way back to before even writing was invented there. The same holds pretty much for democracy in the West.

Finally, the best feature of the book is that it does not treat democracy as a holy relic to be treasured at all costs. It makes some good points about cases in history where a little less democracy led to more economic prosperity for everyone involved. For example, when industrialisation took off, the Netherlands was slower to pick it up *because* it had a more decentralised way of governing, in which regional cities had a bigger say.

The book concludes with the following quote, nicely stating that democracy is not the end goal in itself; it is what you do with it that matters more.

Finally, we need to remember that just because modern democracy survives does not mean that we will be happy with how it performs if the 1% dominates over the 99. Instead of only asking whether democracy will survive, we need to also ask whether we will be satisfied with the democracy that does survive.

In the previous blog post, we established that machine learning algorithms are often hard to tune, and hopefully explained the mechanism for why gradient descent has difficulty with linear combinations of losses. In this blog post, we will lay out some possible solutions.

But first, let us identify what we mean by "tunable algorithm". What is it that makes a hyper-parameter easy to tune? In order to have tunable hyper-parameters for our algorithms, we have the following desiderata:

- Preferably, **the hyper-parameter has semantic meaning**. This would allow finding the preferred solution in one go, without having to iterate multiple times over various parameters to narrow down the area of interest.
- By tuning the hyper-parameter, **you should be able to find any solution on the Pareto front**. In other words, every solution on the Pareto front should have a value for the hyper-parameter for which the optimisation algorithm finds that solution.

In order to achieve this, we reframe our optimisation problem as a Lagrangian optimisation problem instead. We choose one of the losses as the primary loss and put a constraint on the other loss. The goal is to minimise the primary loss subject to the second loss being less than a value **ε**. In symbols:

$$\text{Minimize}\;L_0(\theta)\;\text{subject to}\;L_1(\theta)\leq\epsilon$$

In the end, the Lagrangian we end up with looks the same as our original linearly combined total loss.

$$L(\theta,\lambda)=L_0(\theta)-\lambda(\epsilon-L_1(\theta))$$

However, we can be smarter about how we tackle this total loss when thinking about it as a constrained optimisation problem.

For instance, where and whether we converge on this constrained problem has been mathematically formalised in the Karush–Kuhn–Tucker (KKT) conditions. This is a bit technical and not necessary for the rest of the post, but from those conditions we know that the optimum we are looking for will be a saddle point of this Lagrangian, which is itself a linearly weighted combination of losses. Combine that insight with the inability of gradient descent to find saddle points, and you can see that there might be an issue with problems that have concave Pareto fronts.
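For the mathematically inclined, here is a sketch of what the KKT conditions say for this particular problem, in the notation above (with \(\theta^*\) and \(\lambda^*\) the optimal parameters and multiplier):

$$\nabla_\theta L_0(\theta^*) + \lambda^* \nabla_\theta L_1(\theta^*) = 0, \quad L_1(\theta^*) \leq \epsilon, \quad \lambda^* \geq 0, \quad \lambda^* \left( L_1(\theta^*) - \epsilon \right) = 0$$

The first condition (stationarity) is exactly a weighted linear combination of the two loss gradients, and the last (complementary slackness) says the multiplier is non-zero only when the constraint is active.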

So, we are not in the clear yet. Mixing Lagrangian optimisation with gradient descent is also a domain fraught with peril.

In this blog post, we will therefore give an overview of many approaches found in various places in literature. We will show the problems with most of these approaches, and give the approach we would suggest taking to make machine learning algorithms tunable.

Take for starters the following intuitive solve-the-dual approach, which can be found in many papers.

In this approach, a typical method from Lagrangian optimisation is used: a dual is constructed and solved to find the ideal **λ**. We then optimise our Lagrangian with gradient descent. Intuitively, we might think that plugging in this value for **λ** would give us the solution to the constrained problem.

```
def loss(θ, λ, ε):
    return loss_1(θ) - λ * (ε - loss_2(θ))

loss_derivative = grad(loss)

ε = 0.3
λ = solve_dual(ε)  # The crux
for gradient_step in range(200):
    gradient = loss_derivative(θ, λ, ε)
    θ = θ - 0.02 * gradient
```

To visualise how this works, we have plotted the evolution of the losses for our optimisation process, colour-coded according to where the parameters were initialised. In the figure below, this constraint on loss 2 is depicted as a black hatched line, which hopefully is helpful to gain intuition on how this constrained optimisation problem relates to our original problem.

So yes, in this case, the problem is solved nicely, and it seems we could use **ε** to tune the optimum. We can nail the trade-off we want, without running our optimisation process multiple times to tune this **ε** parameter.

However, there is an issue when we look at what happens on our reparametrised model with the concave Pareto front. It turns out that not only did we not seem to converge on the Pareto front, but some of the solutions found with gradient descent are downright ignoring our hard constraint!

The reason for this should now be clear if you read the previous blog post. Even though we solved a dual to find the optimal **λ** parameter matching our constraint **ε**, the resulting Lagrangian is still a fixed linear combination of losses, and gradient descent on it cannot converge to solutions on the concave parts of the Pareto front.

An alternative method, which does work, is to **first optimise for the constraint with gradient descent, and to optimise for the primary loss only as long as this constraint is satisfied**. With this approach, the constraint will always be satisfied upon convergence, and the other loss minimised. This approach works for both the concave and the convex case.

```
def constraint(θ, ε):
    return ε - loss_2(θ)

optimization_derivative = grad(loss_1)
constraint_derivative = grad(constraint)

ε = 0.7
for gradient_step in range(200):
    while constraint(θ, ε) < 0:
        # Maximise the constraint until it is satisfied again.
        gradient = constraint_derivative(θ, ε)
        θ = θ + 0.02 * gradient
    gradient = optimization_derivative(θ)
    θ = θ - 0.02 * gradient
```

There is a major downside to this method, in that you do not really treat the trade-off between the losses during your gradient descent. Only one of the losses is being considered at every step. In general, this makes this approach less desirable for many applications.

For example, a common failure case is when the constraint has an obvious solution which brings you to a part of the space where the primary loss has barely any gradient to follow. Convergence can be another issue: every time you re-satisfy the constraint, you may undo the progress you have made on the primary loss.

Additionally, **this method does not work that well when you want to use stochastic gradient descent rather than gradient descent**. Since the constraint is defined on the average loss across all data, you do not want to enforce the hard constraint on a sample of your data where it is not satisfied yet, as long as it is satisfied in the general case. And this issue is hard to overcome.

However, this method is easy to implement and might actually work sufficiently for your particular problem.

So, from the previous section, we know that solutions do exist that can handle this constrained optimisation problem using gradient descent. If we return to the idea of using a Lagrangian to solve the constrained version of our problem, we can see that the fundamental issue is the interaction between a fixed **λ**, found by solving the dual function, and using gradient descent to minimise the other parameters. Could we not use a single gradient descent to find both the optimal parameters and the Lagrange multiplier simultaneously?

Yes, that is in fact possible. Take the following algorithm:

```
def lagrangian(θ, λ, ε):
    return loss_1(θ) - λ * (ε - loss_2(θ))

derivative = grad(lagrangian, (0, 1))

ε = 0.7
λ = 0.0
for gradient_step in range(200):
    gradient_θ, gradient_λ = derivative(θ, λ, ε)
    θ = θ - 0.02 * gradient_θ  # Gradient descent
    λ = λ + 1.0 * gradient_λ   # Gradient ascent!
    if λ < 0:
        λ = 0
```

We follow the gradient of the Lagrangian downwards for the parameters, but upwards for **λ**: gradient descent for the parameters, gradient ascent for the Lagrange multiplier. As long as we take care that **λ** stays non-negative, as the clamp in the code does, this joint process seeks out exactly the saddle point we are after.

This approach does work nicely on the convex case. It has the upside that at every gradient step, both of the losses are considered. This makes this approach applicable to stochastic gradient descent as well, at the cost of a little additional intricacy of understanding how Lagrangians work. Also, note that it is still possible to use a single gradient evaluation for both the gradient descent and the gradient ascent. Therefore, the computational complexity broadly stays the same.

However, it does not solve our original problem in the general case, and that becomes clear when we take a look at how this algorithm behaves on the concave Pareto front case.

While the performance is fair, it does not converge to a solution. This optimisation method keeps oscillating on the Pareto front, unable to settle on a good solution. Therefore, this method can give the appearance of finding better solutions, but it still is not tunable. The method can nevertheless be found in various papers, because if one cherry-picks when to stop the optimisation process, one can probably find the solution most suitable to convince the peer reviewers. Yet, it is disappointingly hard to use this method when manual intervention is not possible, for example when the optimisation process is just one piece in a larger puzzle, as in reinforcement learning.

Finally, this section introduces a solution which, as far as we can tell, was first introduced to the field of machine learning in a 1988 NIPS paper by John C. Platt and Alan H. Barr.

The method is easier to understand once we identify, on an intuitive level, what causes the oscillations in the previous figure. This can be seen by following the behaviour of the parameters as they oscillate around the optimum. As long as the constraint is violated, **λ** keeps increasing. But when the constraint suddenly becomes satisfied again, **λ** is still large, so the optimisation keeps pushing well past the constraint boundary before **λ** can shrink back. The multiplier effectively stores up energy, which keeps the system swinging.

From that perspective, the authors of the NIPS paper introduce damping on this energy in their Modified Differential Method of Multipliers. With this damping, you can prevent the system from oscillating eternally, and make it converge instead.

The paper has become harder to read as machine learning has settled on different conventions for optimisation problems, so we will not discuss the notation used in the paper. However, we can reformulate their ideas from over 30 years ago into more contemporary Jax code in the following way.

```
damping = 10.0

def lagrangian(θ, λ, ε):
    damp = damping * stop_gradient(ε - loss_2(θ))
    return loss_1(θ) - (λ - damp) * (ε - loss_2(θ))

derivative = grad(lagrangian, (0, 1))

ε = 0.7
λ = 0.0
for gradient_step in range(200):
    gradient_θ, gradient_λ = derivative(θ, λ, ε)
    θ = θ - 0.02 * gradient_θ
    λ = λ + 1.0 * gradient_λ
    if λ < 0:
        λ = 0
```

Indeed, this method does work nicely for both convex and concave Pareto fronts, as illustrated in the following figures.

It does come with a slight downside, namely that you now have a damping hyper-parameter. This additional parameter trades the time to find the Pareto front with the time to converge to a solution on that front. Note that the damping parameter does not alter which solution is found, only how fast it is found.

However, that perspective of “oh no, not an additional hyperparameter” is only muddying the waters. **It is the first time in this article that we have a tunable algorithm that works for stochastic gradient descent!** We can use this Modified Differential Method of Multipliers to tune the balance between the losses in a semantically useful way using stochastic gradient descent, *no matter the shape of the invisible Pareto front*.

In our limited experience, this is the method which should be used more often. In fact, we would postulate that wherever you see a linear combination of losses being optimised with gradient descent, this more principled approach could be used.

In those cases where a linear combination of losses is currently used, the constrained reformulation of that problem is going to yield a more general and tunable algorithm.

In this blog post, we hopefully illustrated how to trade poorly tunable algorithms, caused by linearly combined losses, for more robust approaches with semantically relevant hyper-parameters. Note that while this does require a bit more effort to understand the optimisation process, it does not add computational complexity. In fact, in the Jax examples used here, it was implemented with only 6 additional lines of code.

The upside of this approach is that significantly less effort will be spent on tuning hyper-parameters. Optimisation procedures don't need to be iterated on as much, and they will be more robust to where the parameters were initialised as well.

If you like this blog post, it would mean the world to us if you could leave a comment below. Let us know what you think, e.g. whether you learned something or if we were stating the obvious, and if we should write more blog posts like this.

In machine learning, linear combinations of losses are all over the place. In fact, they are commonly used as the standard approach, even though they are a perilous area full of dicey pitfalls, especially regarding how these linear combinations make your algorithm hard to tune.

Therefore, in this post we hope to lay out the following arguments:

- A lot of problems in machine learning should be treated as multi-objective problems, while they currently are not.
- This lack of multi-objective treatment leads to difficulties in tuning the hyper-parameters for these machine learning algorithms.
- It is nigh on impossible to detect when these problems are occurring, making it tricky to work around them.
- There are methods to solve this which might be slightly involved, but do not require more than a few lines of code. One of these methods is laid out in a follow-up blog post.

Nothing in this article is novel. You might already be aware of everything we want to say. However, we have the impression that most machine learning curricula do not discuss optimisation methods very well (I know mine did not), and consequently gradient descent is treated as the one method to solve all problems. The general message becomes: if an algorithm does not work for your problem, you need to spend more time tuning its hyper-parameters.

In the next blog post, a solution is introduced, based on the NIPS’88 paper which introduced the Modified Differential Method of Multipliers.

So we hope that this blog post can remove some confusion on how to handle this issue in a more foundational and principled way. And hey, maybe it can make you spend less time tuning your algorithms, and more time making research progress.

While there are single-objective problems, it is common for these objectives to be given additional regularisation. We have picked a selection of such optimisation objectives from across the field of machine learning.

First off, we have the regularisers: weight decay and lasso. It is obvious that when you add these regularisations, you have effectively created a multi-objective loss for your problem. After all, what you really care about is that both the original loss \(L_0\) and the regulariser loss are kept low, and you tune the balance between the two using a \(\lambda\) parameter.

$$ L(\theta) = L_0(\theta) + \lambda \sum \left| \theta \right| $$

$$ L(\theta) = L_0(\theta) + \lambda \sum \theta^2 $$
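As a toy illustration of how such a \(\lambda\) behaves, here is a sketch using ridge regression (the data and names are made up for this example; the closed-form solution stands in for running an optimiser):

```
import numpy as np

# Hypothetical data: a linear model with known weights plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def ridge_solution(lam):
    # Closed-form minimiser of mean((Xθ - y)²) + lam * sum(θ²):
    # the data-fit loss L0 plus λ times the weight-decay regulariser.
    n = X.shape[0]
    return np.linalg.solve(X.T @ X / n + lam * np.eye(3), X.T @ y / n)

# A larger λ trades data fit for smaller weights: one knob, two objectives.
print(np.linalg.norm(ridge_solution(0.01)) > np.linalg.norm(ridge_solution(10.0)))  # True
```

The single scalar \(\lambda\) silently decides where on the trade-off curve between the two objectives you end up.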

Similarly, losses found in e.g. VAEs are effectively multi-objective, with a first objective to maximally cover the data, and a second objective to stay close to the prior distribution. In this case, KL annealing is occasionally used to introduce a tunable parameter \(\beta\) to help handle the multi-objectiveness of this loss.

$$ L(\theta) =\mathbb{E}_{q_{\phi}(z | x )} \left[ \log p_\theta ( x | z) \right] - \beta D_{KL} \left( q_\phi ( z | x) \| p(z) \right) $$

Also in reinforcement learning, you can see this multi-objectiveness. Not only is it common for many environments to simply sum the rewards received for obtaining partial goals; the policy loss is usually also a linear combination of losses. Take as an example the policy losses of PPO, SAC and MPO, entropy- and KL-regularised methods with their tunable parameter **α**.

$$ L(\pi) = - \sum_t \mathbb{E}_{(s_t, a_t)} \left[ r(s_t, a_t) + \alpha \mathcal{H}\left(\pi(\cdot \mid s_t)\right)\right]$$

$$ L(\pi) = - \sum_t \mathbb{E}_{(s_t, a_t)} \left[ \mathbb{E}_\pi\left(Q(s_t, a_t)\right) - \alpha D_{KL} \left( q \| \pi \right) \right]$$
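As a small illustration of how **α** trades reward against entropy, consider a hypothetical three-action categorical policy (plain NumPy, a toy stand-in rather than the actual PPO/SAC/MPO losses):

```python
import numpy as np

def entropy(p):
    # Entropy H(π) of a categorical policy π.
    return -np.sum(p * np.log(p))

def objective(q_values, p, alpha):
    # E_π[Q] + α H(π): expected value plus an entropy bonus.
    return np.dot(p, q_values) + alpha * entropy(p)

q = np.array([1.0, 0.5, 0.0])               # hypothetical action values
greedy = np.array([0.999, 0.0005, 0.0005])  # near-deterministic policy
uniform = np.ones(3) / 3                    # maximum-entropy policy

# With small α the greedy policy scores higher; with large α the
# uniform policy wins, so α sets the exploration/exploitation balance.
```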

Finally, the GAN loss is of course a sum of the discriminator and the generator loss:

$$ L(\theta) = - \mathbb{E}_x \left[ \log D_\theta(x)\right] - \mathbb{E}_z \left[ \log \left( 1- D_\theta(G_\theta(z))\right)\right]$$

All of these losses have something in common: they effectively try to optimise for multiple objectives simultaneously, and argue that the optimum is found in balancing these often contradicting forces. In some cases, the sum is more ad hoc and a hyper-parameter is introduced to weigh the parts against each other. In other cases, there are clear theoretical foundations for why the losses are combined this way, and no hyper-parameter is used for tuning the balance between the parts.

In this post, we hope to show you that this approach of combining losses may sound appealing, but that such a linear combination is actually precarious and treacherous. The balancing act is often more like a tightrope walk.

Let us consider a simple case, where we are trying to optimise such a linear combination of losses. We take the straightforward approach: we sum the losses into one total loss, optimise it with gradient descent, and observe the following behaviour.

Our code in Jax would look something like this:

```python
from jax import grad

def loss(θ):
    return loss_1(θ) + loss_2(θ)

loss_derivative = grad(loss)

for gradient_step in range(200):
    gradient = loss_derivative(θ)
    θ = θ - 0.02 * gradient
```

As is usually the case, we are not immediately happy with the trade-off between the two losses. So we introduce a scaling coefficient **α** on the second loss and run the following code:

```python
from jax import grad

def loss(θ, α):
    return loss_1(θ) + α * loss_2(θ)

loss_derivative = grad(loss)

for gradient_step in range(200):
    gradient = loss_derivative(θ, α=0.5)
    θ = θ - 0.02 * gradient
```
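To see this loop end to end, here is a self-contained version with two hypothetical quadratic losses and a hand-written gradient (plain NumPy instead of Jax, so nothing is hidden):

```python
import numpy as np

# Two hypothetical losses pulling θ towards different targets.
def loss_1(theta): return np.sum((theta - 1.0) ** 2)
def loss_2(theta): return np.sum((theta + 1.0) ** 2)

def gradient(theta, alpha):
    # ∇θ [loss_1(θ) + α · loss_2(θ)], written out by hand.
    return 2.0 * (theta - 1.0) + alpha * 2.0 * (theta + 1.0)

results = {}
for alpha in [0.1, 0.5, 2.0]:
    theta = np.zeros(1)
    for gradient_step in range(200):
        theta = theta - 0.02 * gradient(theta, alpha)
    results[alpha] = float(theta[0])  # converges to (1 - α) / (1 + α)
```

For these convex losses, every **α** yields a distinct intermediate trade-off, which is exactly the behaviour the tuning loop banks on.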

The behaviour we hope to see is that by tuning this **α**, we can choose the trade-off between the two losses and select the point we are most happy with for our application. We effectively go on a hyper-parameter tuning loop: manually select an **α**, rerun the optimisation, and repeat until we are happy with the trade-off.

However, that is not always what happens. The actual behaviour we sometimes observe for our problem looks like the one below.

It seems that no matter how we fine-tune our **α**-parameter, we cannot make a good trade-off between our two losses.

We see two clusters of solutions: one where the first loss is ignored, and one where the second loss is ignored. Neither of these solutions is useful for most applications. Most of the time, a point where the two losses are more balanced would be preferred.

In fact, this diagram of the two losses over the course of training is hardly ever plotted, so the dynamics illustrated in this figure often go unnoticed. We just look at the training curve of the total loss, and we might conclude that this hyper-parameter needs more tuning time, as it seems to be really sensitive. Alternatively, we could settle for an approach of early stopping to make the numbers in the paper work. After all, reviewers love data efficiency.

Where did it go wrong though? **Why does this method sometimes work, and why does it sometimes fail to give you a tunable parameter?** For that, we need to look deeper into the difference between the two figures.

Both figures are generated for the same problem, using the same losses and are optimising these losses using the same optimisation method. So none of these aspects are to blame for the difference. The thing which has changed between these problems is the model. In other words, the effect the model parameters **θ** have on the output of the model is different.

Therefore, let us *cheat* and visualise something which is normally not visualisable, the Pareto front for both of our optimisations. This is the set of all solutions achievable by our model, which are not dominated by any other solution. In other words, it is the set of achievable losses, where there is no point where *all* of the losses are better. No matter how you choose to trade off between the two losses, your preferred solution always lies on the Pareto front. By tuning the hyper-parameter of your loss, you usually hope to merely find a different point on that same front.
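For a toy problem where we can enumerate candidate solutions, the non-dominated set can be computed directly. A sketch (both losses to be minimised; the loss pairs are made up):

```python
import numpy as np

def pareto_front(points):
    # Keep each point that no other point dominates, i.e. for which no
    # other point is at least as good on all losses and strictly better
    # on at least one.
    points = np.asarray(points, dtype=float)
    keep = []
    for p in points:
        dominated = np.any(np.all(points <= p, axis=1) &
                           np.any(points < p, axis=1))
        if not dominated:
            keep.append(p)
    return np.array(keep)

candidates = [(1.0, 5.0), (2.0, 2.0), (5.0, 1.0), (4.0, 4.0)]
front = pareto_front(candidates)  # (4, 4) is dominated by (2, 2)
```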

The difference between the two Pareto fronts is what is causing the tuning to turn out well for the first case, but to fail horribly after changing our model. It turns out that when the Pareto front is convex, we can achieve all possible trade-offs by tuning our **α**-parameter. However, when the Pareto front is concave, that approach does not seem to work well anymore.

We can illustrate why that is the case by looking at the total loss in the third dimension, the quantity which is actually optimised with gradient descent. In the following figure, we visualise the plane of the total loss in relation to each of the two losses. While we actually descend on this plane using the gradient with respect to the parameters, every gradient descent step we take necessarily also moves downwards on this plane. You can imagine the gradient descent optimisation process as putting a spherical pebble on that plane, letting it wobble down under gravity and waiting until it comes to a halt.

The point where the optimisation process halts is the result of the optimisation process, here indicated by a star. As you can see in the following figure, no matter how you wobble down the plane, you will always end up in the optimum.

By tuning **α**, this space stays a plane. After all, changing **α** only changes the inclination of this plane, and in the convex case the pebble will simply come to rest at a different point on the Pareto front.

However, if we take a look at the differently modeled problem with a concave Pareto front, it becomes apparent where our problem is coming from.

If we imagine our pebble following the gradients on this plane, sometimes rolling more to the left, sometimes more to the right, but always rolling downwards, it is clear it will end up in one of the two corner points, either the red star or the blue star. When we tune **α**, our plane tilts in exactly the same way as in the convex case, but because of the shape of the Pareto front, only two points on that front will ever be reached, namely the points on the ends of the concave curve. The \(\times\)-point on the curve, the one which you would actually want to reach, cannot be found with gradient descent based methods. Why? Because it is a saddle point.

Also important to note is what happens when we tune **α**: we change how many of the starting points end up in one solution versus the other, but we cannot tune to find other solutions on the Pareto front.
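We can reproduce this corner-seeking behaviour in a few lines. Assume a hypothetical one-parameter model whose achievable losses trace the quarter circle \((\cos\phi, \sin\phi)\), a concave Pareto front when minimising; gradient descent on the total loss then always runs to one of the two endpoints:

```python
import numpy as np

def total_loss(phi, alpha):
    # Linear combination of the two losses along the concave front.
    return np.cos(phi) + alpha * np.sin(phi)

def d_total_loss(phi, alpha):
    # Derivative of the total loss with respect to the parameter φ.
    return -np.sin(phi) + alpha * np.cos(phi)

alpha = 1.0  # the critical point of the total loss sits at φ = π/4
endpoints = []
for phi0 in [0.6, 1.0]:  # initialisations on either side of it
    phi = phi0
    for _ in range(2000):
        phi = np.clip(phi - 0.01 * d_total_loss(phi, alpha), 0.0, np.pi / 2)
    endpoints.append(phi)
# Both runs end at a corner (φ = 0 or φ = π/2); the balanced point at
# φ = π/4 is never reached, since it is a maximum along this 1-D slice.
```

Tuning **α** only tilts the plane and moves the watershed between the two basins; it never makes the interior of the concave front reachable.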

To conclude this blog post, we would like to list the problems with this linear combination of losses approach:

- First off, even if you do not introduce a hyper-parameter to weigh off between the losses, **it is not correct to say that gradient descent will try to balance between counteracting forces**. Depending on the solutions which are achievable by your model, it may very well completely ignore one of the losses to focus on the other, or vice versa, depending on where you initialised the model.
- Second, even when a hyper-parameter is introduced, **this hyper-parameter is tuned on a try-and-see basis**. You run a complete optimisation process, decide if you are happy, and then finetune your hyper-parameter. You repeat this optimisation loop until you are happy with the performance. It is a wasteful and laborious approach, usually involving multiple iterations of running gradient descent.
- Thirdly, **the hyper-parameter cannot tune for all optima**. No matter how much you tune and keep fine-tuning, you will not find the intermediate solutions you might be interested in. Not because they do not exist, as they most certainly do, but because a poor approach was chosen for combining these losses.
- Fourthly, it is important to stress that for practical applications, **it is always unknown whether the Pareto front is convex and therefore whether these loss weights are tunable**. Whether they are good hyper-parameters depends on how your model is parameterised and how this affects the Pareto curve. However, it is impossible to visualise or analyse the Pareto curve for any practical application; visualising it is considerably harder than solving the original optimisation problem. So if a problem occurred, it would go unnoticed.
- Finally, if you would really like to use those linear weights to find trade-offs, you need an explicit proof that the whole Pareto curve is convex **for the specific model in use**. Using a loss which is convex with respect to the output of your model is not enough to avoid the problem. If your parametrisation space is large, which is always the case if your optimisation involves weights inside a neural network, you can forget about attempting such a proof. It is important to stress that showing convexity of the Pareto curve for these losses based on some intermediate latents is not sufficient either. The convexity really depends on the parameter space, and on what the Pareto front of the achievable solutions looks like.

Note that in most applications, Pareto fronts are neither purely convex nor purely concave, but a mix of both. This amplifies the problem. Take for example a Pareto front with concave pieces between convex pieces. Not only does each concave piece make sure that none of its solutions can be found with gradient descent, it also splits the space of parameter initialisations in two: one part which finds solutions on the convex piece on one side, and one which only finds solutions on the other. Having more than one concave piece in the Pareto front compounds the problem, as illustrated in the figure below.

So not only do we have a hyper-parameter **α** which cannot find all solutions; depending on the initialisation, it might find a different convex part of the Pareto curve. To make things even harder, this parameter and the initialisation interact in confusing ways. If you tune your parameter slightly, hoping to move the optimum slightly, you can suddenly jump to a different convex part of the Pareto front, even when keeping the initialisation the same.

None of these issues need to exist. There are excellent ways to deal with this problem, ways that could make our lives easier by making our optimisations spit out better solutions even on the first attempt.

In the next blog post, we detail one of these solutions, based on an old NIPS’88 paper. Meanwhile, this is our first blog post of 2021. It would mean the world to us if you could leave a comment below and let us know whether this helped you learn something new, and whether we should write more of these.
