🌱
Consider an MDP with finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, and some discount parameter $\gamma \in (0,1)$. Let $Q_t \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$ denote the $Q$-factor matrix at time $t$. We assume that the decision maker applies an arbitrary policy $\pi$ and updates its $Q$-factors as follows for $t = 0, 1, 2, \dots$:

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha_t(s,a)\Bigl[\, r(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} Q_t(s_{t+1}, a') - Q_t(s,a) \Bigr], \qquad \forall (s,a),$$

where:
- the initial condition $Q_0$ is given (i.e. user-defined);
- $\alpha_t(s,a)$ is the step-size for the pair $(s,a)$ at time $t$ (a.k.a. the learning rate);
- $a_t$ is chosen according to our arbitrary policy $\pi$;
- $s_{t+1}$ evolves according to the transition kernel $P(\cdot \mid s_t, a_t)$.
We can rewrite the update as

$$Q_{t+1}(s,a) = \bigl(1 - \alpha_t(s,a)\bigr)\, Q_t(s,a) + \alpha_t(s,a)\Bigl[\, r(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} Q_t(s_{t+1}, a') \Bigr],$$

essentially thinking of the update as a weighted average between the current $Q$-values and the new information.
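To make the update concrete, here is a minimal tabular sketch in Python. The environment interface (`env.reset`, `env.step`), the $\varepsilon$-greedy behaviour policy, and the constant step size are illustrative assumptions of mine, not part of the setup above; note in particular that a constant `alpha` is a practical simplification, while the convergence result below requires a decaying schedule satisfying the conditions listed next.

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.9, alpha=0.1,
               epsilon=0.1, n_steps=10_000, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    `env` is assumed (hypothetically) to expose `reset() -> s` and
    `step(a) -> (s_next, reward, done)` with integer-indexed states.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))   # user-defined initial condition Q_0
    s = env.reset()
    for _ in range(n_steps):
        # Behaviour policy: arbitrary; here epsilon-greedy w.r.t. current Q.
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # Weighted-average form of the update: only the visited pair (s, a)
        # changes, matching the "defined only at current time" condition.
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = env.reset() if done else s_next
    return Q
```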
We usually make the following assumptions.
In our Q-Learning problem, we assume that our learning rates $\alpha_t$ satisfy:

1. Unit Length: $\alpha_t(s,a) \in [0,1]$.
2. Defined only at current time: $\alpha_t(s,a) = 0$ unless $(s,a) = (s_t, a_t)$.
3. Strictly Causal: $\alpha_t$ is a deterministic function of the history $(s_0, a_0, \dots, s_t, a_t)$.
4. Not L1: $\sum_{t \ge 0} \alpha_t(s,a) = \infty$ almost surely, for every pair $(s,a)$.
5. Yes L2: $\sum_{t \ge 0} \alpha_t(s,a)^2 \le C < \infty$ almost surely, for some deterministic constant $C$.
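As an example (my own, not from the text above), the standard visit-count schedule satisfies all five conditions. Let $n_t(s,a)$ be the number of visits to $(s,a)$ up to and including time $t$, and set

$$\alpha_t(s,a) = \begin{cases} \dfrac{1}{n_t(s,a)} & \text{if } (s,a) = (s_t, a_t), \\[4pt] 0 & \text{otherwise}. \end{cases}$$

If every pair $(s,a)$ is visited infinitely often, the nonzero step sizes along the visits are $1, \tfrac12, \tfrac13, \dots$, so $\sum_t \alpha_t(s,a) = \sum_{n \ge 1} \tfrac1n = \infty$ (Not L1) while $\sum_t \alpha_t(s,a)^2 = \sum_{n \ge 1} \tfrac1{n^2} = \pi^2/6 < \infty$ (Yes L2), and the first three conditions hold by construction.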
This brings us to the key result:
1. Under assumptions 1-5 above, the sequence of $Q$-factors $\{Q_t\}$ converges a.s. to the optimal $Q$-factor matrix $Q^*$.
2. A stationary policy $\pi^*$ which satisfies $\pi^*(s) \in \arg\max_{a \in \mathcal{A}} Q^*(s,a)$ for all $s \in \mathcal{S}$ is an optimal policy.
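The second part is what we use in practice: once the $Q$-factors have (approximately) converged, the greedy policy read off the matrix is (approximately) optimal. A one-line sketch, assuming the `Q` array produced by the earlier snippet:

```python
import numpy as np

def greedy_policy(Q):
    """Stationary deterministic policy: pi(s) = argmax_a Q(s, a)."""
    return np.argmax(Q, axis=1)   # one action index per state

# Usage: if Q has (approximately) converged to Q*, this policy is
# (approximately) optimal.
# pi_star = greedy_policy(Q)
```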
So pretty much, applying Q-learning allows us to recover the optimal policy from the Q-matrix!