First of all, efficiency and convergence are two different things. There is also the rate of convergence: one algorithm may converge faster than another and, in that sense, be more efficient. I will focus on the proof that policy evaluation (PE) converges. If you want to know about its efficiency, maybe ask another question, but the proof below also tells you about the rate of convergence of PE.

> What is the proof that policy evaluation converges to the optimal solution?

To provide some context, I will briefly describe policy evaluation and what you need to know to understand the proof.

## Policy evaluation

Policy evaluation (PE) is an iterative numerical algorithm to find the value function $v^\pi$ for a given (and arbitrary) policy $\pi$. This problem is often called the *prediction problem* (i.e. you want to predict the rewards you will get if you behave in a certain way).

### Two versions: synchronous and asynchronous

There are (at least) two versions of policy evaluation: a synchronous one and an asynchronous one.

In the **synchronous** version (SPE), you maintain two arrays for the values of the states: one array holds the current values of the states and the other array will contain the next values of the states. Two arrays are used so that every state's value can be updated from the *old* values, as if all updates happened at the same time.

In the **asynchronous** version (APE), you update the value of each state in place. So, first, you update the value of e.g. $s_1$, then $s_2$, etc., by changing your only array of values (so you do not require a second array).
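To make the difference concrete, here is a minimal sketch of both versions in NumPy. The 3-state MDP (the transition matrix `T`, reward vector `r`, and discount `gamma`) is made up for illustration; `T[j, k]` is the probability of moving from state $s_j$ to $s_k$ under the policy's action.

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi:
# T[j, k] = P(s_k | s_j, pi(s_j)), r[j] = E[reward | s_j, pi(s_j)].
T = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, 2.0, 0.0])
gamma = 0.9

def sync_policy_evaluation(T, r, gamma, n_sweeps=500):
    """Synchronous PE: every state is updated from the previous array."""
    v = np.zeros(len(r))
    for _ in range(n_sweeps):
        v = r + gamma * T @ v  # a fresh array each sweep
    return v

def async_policy_evaluation(T, r, gamma, n_sweeps=500):
    """Asynchronous PE: each update immediately uses the latest values."""
    v = np.zeros(len(r))
    for _ in range(n_sweeps):
        for s in range(len(r)):
            v[s] = r[s] + gamma * T[s] @ v  # overwrite in place
    return v
```

Both versions converge to the same value function, which you can check against the direct solution of the linear system $v = (\mathbf{I} - \gamma \mathbf{T}^\pi)^{-1} \mathbf{r}^\pi$.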

SPE is analogous to the Jacobi method, a general iterative method for finding a solution to a system of linear equations (which is exactly what PE is doing, as also explained in the cited book by Sutton and Barto). Similarly, APE is analogous to the Gauss–Seidel method, another method for solving a system of linear equations.

Both of these general numerical methods for solving a system of linear equations are studied in detail in *Parallel and Distributed Computation: Numerical Methods* (1989) by Bertsekas and Tsitsiklis, which I haven't read yet, but which provides convergence results for these methods.

The book Reinforcement learning: an introduction by Sutton and Barto provides a more detailed description of policy evaluation (PE).

## Proof of convergence

I will provide a proof for the SPE based on these slides by Tom Mitchell. Before proceeding, I suggest you read the following question What is the Bellman operator in reinforcement learning? and its answer, and you should also get familiar with vector spaces, norms, fixed points and maybe contraction mappings.

The proof that PE finds a unique fixed-point is based on the **contraction mapping theorem** and on the concept of **$\gamma$-contractions**, so let me first recall these definitions.

**Definition ($\gamma$-contraction)**: An operator $F$ on a normed vector space $\mathcal{X}$ is a $\gamma$-contraction, for $0 < \gamma < 1$, provided that, for all $x, y \in \mathcal{X}$,

$$\| F(x) - F(y) \| \leq \gamma \| x - y\|$$

**Contraction mapping theorem**: For a $\gamma$-contraction $F$ in a complete normed vector space $\mathcal{X}$, the iterative application of $F$ converges to a unique fixed point in $\mathcal{X}$, at a linear rate determined by $\gamma$.

Now, consider the vector space $\mathcal{V}$ of state-value functions $v$ (i.e. $v \in \mathcal{V}$). So, each point in this space fully specifies a value function $v : \mathcal{S} \rightarrow \mathbb{R}$ (where $\mathcal{S}$ is the state space of the MDP).

**Theorem (convergence of PE)**: The Bellman operator is a $\gamma$-contraction operator, so an iterative application of it converges to a unique fixed-point in $\mathcal{V}$. Given that PE is an iterative application of the Bellman operator (see What is the Bellman operator in reinforcement learning?), PE finds this unique fixed-point solution.

So, we just need to show that the Bellman operator is a $\gamma$-contraction operator in order to show that PE finds this unique fixed-point solution.

### Proof

We will measure the distance between state-value functions $u$ and $v$ by the $\infty$-norm, i.e. the largest difference between state values:

$$\|u - v\|_{\infty} = \max_{s \in \mathcal{S}} |u(s) - v(s)|$$

**Definition (Bellman operator)**: We define the Bellman expectation operator as

$$F^\pi(v) = \mathbf{r}^\pi + \gamma \mathbf{T}^\pi v$$

where $v \in \mathcal{V}$, $\mathbf{r}^\pi$ is an $|\mathcal{S}|$-dimensional vector whose $j$th entry gives $\mathbb{E} \left[ r \mid s_j, a=\pi(s_j) \right]$ and $\mathbf{T}^\pi$ is an $|\mathcal{S}| \times |\mathcal{S}|$ matrix whose $(j, k)$ entry gives $\mathbb{P}(s_k \mid s_j, a=\pi(s_j))$.
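The construction of $\mathbf{r}^\pi$ and $\mathbf{T}^\pi$ from a full MDP specification can be sketched as follows. The 2-state, 2-action MDP and the deterministic policy `pi` are made up for illustration; the indexing just picks out, for each state, the row corresponding to the policy's action.

```python
import numpy as np

# Hypothetical tabular MDP: P[a, j, k] = P(s_k | s_j, a),
# R[j, a] = E[r | s_j, a], pi[j] = action taken in state s_j.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # transitions under action 0
              [[0.1, 0.9], [0.6, 0.4]]])   # transitions under action 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([0, 1])  # deterministic policy
gamma = 0.9

n_states = len(pi)
r_pi = R[np.arange(n_states), pi]          # r^pi: expected reward per state
T_pi = P[pi, np.arange(n_states)]          # T^pi: one transition row per state

def bellman_operator(v):
    """F^pi(v) = r^pi + gamma * T^pi v."""
    return r_pi + gamma * T_pi @ v
```

Note that each row of `T_pi` is a probability distribution over next states, so its rows sum to 1; this is exactly the stochastic-matrix property used in the proof below.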

Now, let's measure the distance (with the $\infty$-norm defined above) between any two value functions $u \in \mathcal{V}$ and $v \in \mathcal{V}$ after the application of the Bellman operator $F^\pi$

\begin{align}
\| F^\pi(u) - F^\pi(v) \|_{\infty}
&= \| (\mathbf{r}^\pi + \gamma \mathbf{T}^\pi u) - (\mathbf{r}^\pi + \gamma \mathbf{T}^\pi v)\|_{\infty} \\
&= \gamma \| \mathbf{T}^\pi (u - v)\|_{\infty} \\
&\leq \gamma \| \mathbf{T}^\pi ( \mathbb{1} \cdot \| u - v \|_{\infty})\|_{\infty} \\
&= \gamma \| u - v \|_{\infty} \cdot \| \mathbf{T}^\pi \mathbb{1} \|_{\infty} \\
&= \gamma \| u - v \|_{\infty}
\end{align}

where $\mathbb{1} = [1, \dots, 1]^T$. Note that $\mathbf{T}^\pi \cdot \mathbb{1} = \mathbb{1}$ because $\mathbf{T}^\pi$ is a stochastic matrix.
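The contraction property can also be checked numerically. As a sketch (the stochastic matrix and rewards below are randomly generated, not from any particular MDP), the ratio $\| F^\pi(u) - F^\pi(v) \|_{\infty} / \| u - v \|_{\infty}$ should never exceed $\gamma$:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Random 4x4 stochastic matrix T^pi (rows sum to 1) and reward vector r^pi.
T = rng.random((4, 4))
T /= T.sum(axis=1, keepdims=True)
r = rng.random(4)

def F(v):
    """Bellman expectation operator F^pi(v) = r^pi + gamma * T^pi v."""
    return r + gamma * T @ v

def max_observed_ratio(F, n_trials=1000):
    """Largest ||F(u) - F(v)||_inf / ||u - v||_inf over random pairs."""
    worst = 0.0
    for _ in range(n_trials):
        u, v = rng.normal(size=4), rng.normal(size=4)
        worst = max(worst, np.max(np.abs(F(u) - F(v))) / np.max(np.abs(u - v)))
    return worst
```

Over many random pairs, the worst observed ratio stays at or below $\gamma$, in line with the derivation above.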

By the Bellman expectation equation (see Sutton and Barto's book and What is the Bellman operator in reinforcement learning?), $v^\pi$ is **a** fixed point of the Bellman operator $F^\pi$. Given the contraction mapping theorem, the iterative application of $F^\pi$ converges to a **unique** fixed point, so $v^\pi$ must be this unique fixed point, i.e. SPE finds $v^\pi$.

## Notes

I didn't prove the contraction mapping theorem, but you can find more info about the theorem and its proof in the related Wikipedia article.
