Let X,Y,U be Polish spaces and let P be a probability measure on B(X×Y), and let c:X×U→R be a bounded, Borel cost function. Then, for any Borel function γ:X×Y→U, there exists another Borel function γ∗:X→U s.t. X∫c(x,γ∗(x))dPX(dx)≤X×Y∫c(x,γ(x,y))dP(dx,dy)where PX is the marginal of P on X. Thus, policies based only on x are a.s. optimal.
\begin{proof}Goal: Construct a γ∗ given γ Let u=γ(x,y).
To emphasize the random nature of the realizationsx,y,u we denote their corresponding rvs as X,Y,U.
Given some Policyγ, we write for any Borel subsetD⊂U and x∈X, the following Stochastic KernelU∫1{u∈D}P(U∈du∣x)=Pγ(U∈D∣x)=P(γ(X,Y)∈D∣X=x)=Y∫1{γ(x,y)∈D}P(Y∈dy∣X=x).We then have X×Y∫c(x,γ(x,y))P(dx,dy)=X∫Y∫c(x,γ(x,y))P(dy∣x)P(dx)=X∫Y∫c(x,γ(x,y))P(Y∈dy∣x)P(dx)=X∫U∫c(x,u)Pγ(U∈du∣x)P(dx)=X∫U∫c(x,u)Pγ(du∣x)P(dx)We then denote the inner integral as follows: hγ(x):=U∫c(x,u)Pγ(du∣x)(1)to get X∫hγ(x)P(dx).Now, consider the set D={(x,u)∈X×U:c(x,u)≤hγ(x)}and its X-sections Dx:={u∈U:(x,u)∈D}∀x∈X,then Pγ(Dx∣x)>0,∀x∈X(2)since otherwise we would have U∫c(x,u)Pγ(du∣x)=Dxc∫c(x,u)Pγ(du∣x)>hγ(x)Dxc∫Pγ(du∣x)=hγ(x)U∫Pγ(du∣x)=hγ(x)which contradicts our definition in (1). By some measurable selection theorem, (2) implies that ∃γ∗:X→U such that {(x,γ∗(x)):x∈X}⊆Di.e. c(x,γ∗(x))≤hγ(x) for every x∈X: X∫c(x,γ∗(x))P(dx)≤X∫hγ(x)P(dx)≡X×Y∫c(x,γ(x,y))P(dx,dy)
\end{proof} >[!theorem] Ryll-Nardzewski Measurable Selection Theorem >Let X,Y be standard Borel. Let A be a countably-generated sub-σ-algebra of the Borel σ-algebra of X and let B be the class of Borel subsets of Y. For any function X×B s.t. >1. P(x∣⋅) is for each x a Probability Measure on B and >2. for each B∈B, P(⋅∣B) is an A-Measurable Function on X, and any set D∈A×B s.t. P(x,Dx)>0,∀x∈Xwhere Dx={y:(x,y)∈D}. > >Then, there is a A-measurable function g:X→Y whose ‘graph’ is a subset of D (i.e. (x,g(x))∈D,∀x∈X).
This is used to prove the following theorem.
Theorem (5.12)
Let {(xt,ut)} be a controlled Markov chain. Consider the Finite Horizon Optimization problem: JN(X,γ)=Exγ[k=0∑N−1c(Xk,Uk)+cN(XN)]where we seek to minimize the cost over all admissible policies. Any such policy can be replaced with one which is Markov and which is at least as good as the original policy. i.e. there is no loss in restricting policies to be Markov.