Q-Learning

Definition (Q-Learning Algorithm)

Consider an MDP with finite state space $\mathbb{X}$, finite action space $\mathbb{U}$, and some discount parameter $\beta \in(0,1)$. Let $Q:\mathbb{X}\times \mathbb{U}\to \mathbb{R}$ denote the $Q$-factor matrix. We assume that the decision maker applies an arbitrary policy $\gamma$ and, for $t\ge 0$, updates its $Q$-factors as follows: $Q_{t+1}(x,u)=Q_{t}(x,u)+\alpha_{t}(x,u)\left( c(x,u)+\beta\min_{v}Q_{t}(X_{t+1},v)-Q_{t}(x,u) \right)$ where:

the initial condition $Q_{0}$ is given (i.e. user-defined);
$\alpha_{t}(x,u)$ is the step-size for $(x,u)$ at time $t$ (a.k.a. the learning rate);
$u_{t}$ is chosen according to our arbitrary policy $\gamma$;
$X_{t+1}$ evolves according to the transition kernel $\mathcal{T}$.

>[!rmk]
>We can rewrite the update as $Q_{t+1}(x,u)=(1-\alpha_{t}(x,u))Q_{t}(x,u)+\alpha_{t}(x,u)[c(x,u)+\beta \min_{u'\in \mathbb{U}}Q_{t}(X_{t+1},u')]$, so each update is a weighted average of the current $Q$-value and the new information.
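
To make the update concrete, here is a minimal sketch of a single step in its weighted-average form, assuming states and actions are indexed by integers and the $Q$-factors are stored as a NumPy array. The function name `q_update` and its arguments are hypothetical, chosen only for illustration.

```python
import numpy as np

def q_update(Q, x, u, cost, x_next, alpha, beta):
    """One Q-learning step for the visited pair (x, u).

    Q      : array of shape (num_states, num_actions), the current Q_t
    cost   : observed stage cost c(x, u)
    x_next : observed next state X_{t+1}
    alpha  : step size alpha_t(x, u) in [0, 1]
    beta   : discount parameter in (0, 1)
    """
    # New information: c(x,u) + beta * min_v Q_t(X_{t+1}, v)
    target = cost + beta * Q[x_next].min()
    Q_new = Q.copy()
    # Weighted average of the old value and the new information;
    # every other entry is left unchanged (its step size is zero).
    Q_new[x, u] = (1.0 - alpha) * Q[x, u] + alpha * target
    return Q_new

# Illustrative numbers only: 2 states, 2 actions, Q_0 = 0.
Q0 = np.zeros((2, 2))
Q1 = q_update(Q0, x=0, u=1, cost=1.0, x_next=1, alpha=0.5, beta=0.9)
```

Only the entry for the visited pair changes, which is how the algorithm stays asynchronous: one $(x,u)$ is updated per time step.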

We usually make the following assumption:

Assumption (21.2.1)

In our Q-Learning problem, we assume that the learning rate $\alpha_{t}(x,u)$ satisfies the following for all $(x,u)$ and $t\ge 0$:

Unit Length: $\alpha_{t}(x,u)\in[0,1]$
Defined only at current time: $\alpha_{t}(x,u)=0$ unless $(x,u)=(x_{t},u_{t})$
Strictly Causal: $\alpha_{t}(x,u)$ is a deterministic function of $(x_{0},u_{0}),\dots,(x_{t},u_{t})$
Not L1: $\sum_{t\ge 0}\alpha_{t}(x,u)=\infty$ almost surely
Yes L2: $\sum_{t\ge 0}\alpha_{t}^{2}(x,u)\le C$ almost surely for some deterministic $C<\infty$.
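
As an illustration (one common choice, not something prescribed by the assumption itself), take $\alpha_{t}(x,u)=1/(1+N_{t}(x,u))$ at the visited pair and $0$ elsewhere, where $N_{t}(x,u)$ counts the earlier visits to $(x,u)$. The sketch below uses the hypothetical names `step_size` and `visit_counts`. Along the visits to a fixed pair the step sizes are $1, 1/2, 1/3, \dots$, so their sum diverges (provided the pair is visited infinitely often) while the sum of squares stays below $\pi^{2}/6$.

```python
import math
import numpy as np

def step_size(visit_counts, x, u):
    """Step size for the currently visited pair (x, u).

    Depends only on the history through the visit count, and lies in (0, 1].
    """
    return 1.0 / (1.0 + visit_counts[x, u])

# Simulate 1000 visits to the single pair (0, 1) in a 2x2 problem.
visit_counts = np.zeros((2, 2), dtype=int)
alphas = []
for _ in range(1000):
    alphas.append(step_size(visit_counts, 0, 1))
    visit_counts[0, 1] += 1

partial_sum = sum(alphas)                      # harmonic partial sum, grows without bound
partial_sq_sum = sum(a * a for a in alphas)    # stays below pi^2 / 6
```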

This brings us to the key result:

Theorem (21.2.1)

Under Assumption (21.2.1), the Q-Learning algorithm converges a.s. to $Q^{*}$.
A stationary policy $f^{*}$ which satisfies $\min_{u\in \mathbb{U}}Q^{*}(x,u)=Q^{*}(x,f^{*}(x)),\,\forall x\in \mathbb{X}$ is an optimal Policy.

So pretty much, running Q-learning lets us recover an optimal policy from the limiting Q-matrix: in each state, pick an action that minimizes $Q^{*}(x,\cdot)$!
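
A minimal sketch of that last step, assuming $Q^{*}$ (or an approximation of it) is stored as a NumPy array with one row per state and one column per action. The helper name `greedy_policy` and the numbers in `Q_star` are made up for illustration.

```python
import numpy as np

def greedy_policy(Q):
    """Return f with Q[x, f[x]] = min_u Q[x, u] for every state x.

    Ties are broken toward the smallest action index; any minimizer
    satisfies the condition in the theorem.
    """
    return Q.argmin(axis=1)

# Illustrative Q-factors for 3 states and 2 actions (not from a real MDP).
Q_star = np.array([[1.0, 0.2],
                   [0.3, 0.7],
                   [0.5, 0.5]])
f_star = greedy_policy(Q_star)
```

Since we are minimizing cost, the policy uses `argmin` rather than the `argmax` seen in reward-maximizing formulations.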

Theorem (Optimality of Q-Learning Algorithm)

Consider the Q-Learning algorithm and its dynamics. Then:

Q-Learning Converges to an Optimal Solution: Under Assumption (21.2.1), the Q-Learning algorithm converges almost surely to the optimal $Q^{*}$.
Greedy Stationary Policy is Optimal: A stationary policy $f^{*}$ which satisfies $\min_{u}Q^{*}(x,u)=Q^{*}(x,f^{*}(x))$ for all $x\in \mathbb{X}$ is an optimal policy.
