*This post was originally published by Alireza Modirshanechi at Towards Data Science*

## A proof of the existence of the optimal policy for finite MDPs

In a finite Markov Decision Process (MDP), the optimal policy is defined as a policy that maximizes the value of all states at the same time¹. In other words, if an optimal policy exists, then the policy that maximizes the value of state *s* is the same as the policy that maximizes the value of state *s’.*² But why should such a policy exist?

The famous introductory book of Sutton and Barto on reinforcement learning¹ takes the existence of optimal policies for granted and let this question unanswered. I had difficult times to believe them and be able to continue reading!

In this article, I am going to give a proof of the existence of the optimal policy in finite MDPs ³.

### Markov decision processes and policies

A finite MDP is characterized by a finite set of states (usually shown by curvy *S*), a finite set of actions for each state (usually shown by curvy *A*), and a probability distribution over the immediate reward value *r* and the next state *s’,* given the current state *s* and the currently chosen action *a*, denoted as *p(s’,r|s,a).*

Given the current state *s*, a policy π is a probability distribution over possible actions at state *s*, denoted as π*(a|s).* Then, given a policy, an agent can navigate in the environment (i.e. go from one state to another) and get rewards through each transition.

We show random variables by capital letters and their values by small letters. Time is added to each variable with a subscript. Then, given a policy and an MDP, and given the initial state (at time *t=1*) *s*, for any *T > 1*, the joint distribution of states, actions, and reward values is

### Values and Bellman equation

Given a policy π and a discount factor 0 ≤ γ < 1, the value of each state is defined as

and the value of each pair of state and action as

It is easy to show that the values of states and action-state pairs can be written in a recursive way

These sets of equations are known as the Bellman equations.

We will use later the fact that

**Equation 1. State value as a function state-action value.**

### Optimal policy

The policy π* is an optimal policy if and only if we have

for any state *s* and any other policy π.

We show the set of all possible states by curvy *S* and the set of all possible actions at state *s* by curvy *A(s).* We denote the Kronecker delta by δ and start this section with the following theorem.

** Proof comment:** We used Eq. 1 in the 1st line of the proof, and then repeatedly used the fact that the value of the pair of state and action

*(s*, a*)*is greater than or equal to the value of state

*s*.*

Theorem 1 states that, whenever there is a pair of state and action *(s*, a*)* with a value greater than the value of state *s**, with respect to policy π, then there is another policy π’ that is better than or equal to (in terms of state-values) π in all states. As a result, if an optimal policy π* exists, its values should satisfy, for any state *s,*

Equation 2. The compact form of the Bellman optimality equations.

where curvy *A(s)* stands for the set of all possible actions at state *s —* one can easily prove this statement by contradiction*.* Using the Bellman equations, we can expand Equation 2 as

Equation 3. The expanded form of the Bellman optimality equations.

This set of non-linear equations (as many as the number of states) is called the “Bellman optimality equations”. So, if an optimal policy exists, its values should satisfy this set of equations³.

Therefore, to show that an optimal policy exists, one must prove the following two statements:

- The set of the Bellman optimality equations has solutions, and
- One of its solutions has values greater than or equal to the values of the other solutions in all states.

In this section, we prove that the set of the Bellman optimality equations has a **unique** solution. By doing so, we prove the two aforementioned statements at the same time.

### The Bellman optimality operator

Given a set of values over states, we define the vector of values as

which is simply a real-value vector whose elements are equal to the values of different states. Then, we define the “Bellman optimality operator” *T* as a mapping

The operator *T* takes a vector of values and maps it to another vector of values. Using this new notation, it is easy to see that Equations 2 and 3 are equivalent to

Equation 4. The Bellman optimality equations as the fixed point of the Bellman optimality operator.

This observation means that the solutions of the Bellman optimality equations are the same as the fixed points of the Bellman optimality operator. Therefore, to prove the existence and uniqueness of the solution to the Bellman optimality equations, one can prove that the Bellman optimality operator has a unique fixed point.

To do so, we need to introduce another concept and another theorem.

### Contraction mapping and Banach fixed-point theorem

Consider a metric space *(M,d)*, i.e. *M* is a set, and *d* is a metric defined on this set for computing the distance of every two elements of *M ⁵.* A mapping *T: M → M* is a contraction mapping if there exists *0* ≤ *k < 1* such that for any *x* and *y* in *M*, we have

Intuitively, contraction mapping makes points closer to each other. Figure 1 shows an illustration of repeatedly applying a contraction mapping on two points.

Figure 1. An illustration of contraction mapping and the statement of Banach fixed-point theorem

The reason that we are interested in contraction mappings is the following famous theorem, known as Banach fixed-point theorem.

** Proof comment:** The proof of the theorem is not hard, but I do not include it in this article, because the theorem is well known and the proof can easily be found elsewhere, e.g. see here.

The whole idea behind the theorem is what is illustrated in Figure 1: All points are getting closer to each other after mapping, and hence, by repeating the mapping all points converge to one point which is the unique fixed point of *T*.

As a result, to prove the existence and uniqueness of the solution of the Bellman optimal equations, it is sufficient to show that there is a metric in which the Bellman optimality operator is a contraction mapping.

### The Bellman optimality operator is a contraction mapping in infinity norm

For any pair of value vectors *V* and *V’*, their infinity norm is defined as

In this section, we want to prove that the Bellman optimality operator is a contraction mapping in this norm. To do so, we first need the following technical Lemma.

** Proof comment:** Although the Lemma is pretty non-trivial, its proof is not difficult and needs only elementary techniques. I had some fun proving it and thought it might be good to leave its proof as an exercise for interested readers⁶.

Now, having the lemma, we can finally go to our main theorem.

** Proof comment:** To go from the 2nd to the 3rd line of the proof, we used the Lemma, and to go from 4th to 5th line we used the convexity of the absolute value function. The rest is straightforward.

As a result, the Bellman optimality operator has a unique fixed point⁷, and the Bellman optimality equations have a unique solution. It is easy to show that any greedy policy on a solution of the Bellman optimality equations has values equal to that solution. *Therefore, optimal policies exist!*

We show that (**1)** the values of an optimal policy should satisfy the Bellman optimal equations. We then show that (**2)** the solutions to the Bellman optimal equations are the fixed point of the Bellman optimal operator. By showing that (**3)** the Bellman optimal operator is a contraction mapping in infinity norm and using (**4)** Banach fixed-point theorem, we proved that (**5)** the Bellman optimal operator has a unique fixed point. As a result, (**6)** there exist policies that maximize the values of all states at the same time.

*This post was originally published by Alireza Modirshanechi at Towards Data Science*