FIND ME ON

GitHub

LinkedIn

Q-Learning

🌱

Definition
StochasticControl

Consider a MDP with finite X\mathbb{X} and U\mathbb{U} and some discount parameter β(0,1)\beta \in(0,1). Let Q:X×URQ:\mathbb{X}\times \mathbb{U}\to \mathbb{R} denote the QQ-factor matrix. We assume that the decision maker applies an arbitrary policy γ\gamma and updates its QQ-factors as follows for t0t\ge 0 Qt+1(x,u)=Qt(x,u)+αt(x,u)(c(x,u)+βminvQt(Xt+1,v)Qt(x,u))Q_{t+1}(x,u)=Q_{t}(x,u)+\alpha_{t}(x,u)\left( c(x,u)+\beta\min_{v}Q_{t}(X_{t+1},v)-Q_{t}(x,u) \right)where: - the initial condition Q0Q_{0} is given (i.e. user-defined); - αt(x,u)\alpha_{t}(x,u) is the step-size for (x,u)(x,u) at time tt (a.k.a. the learning rate); - utu_{t} is chosen according to our arbitrary γ\gamma; - Xt+1X_{t+1} evolves according to T\mathcal{T}

We can rewrite the update as Qt+1(x,u)=(1αt(x,u))Qt(x,u)+αt(x,u)[c(x,u)+βminuUQt(Xt+1,u)]Q_{t+1}(x,u)=(1-\alpha_{t}(x,u))Q_{t}(x,u)+\alpha_{t}(x,u)[c(x,u)+\beta \min_{u'\in \mathbb{U}}Q_{t}(X_{t+1},u')]essentially thinking of the update as a weighted average between current Q values and new information.

We usually take the following assumption

In our Q-Learning problem, for our learning rate, α(x,u)\alpha(x,u) we assume (x,u),t0\forall (x,u),t\ge 0 that: 1. Unit Length: αt(x,u)[0,1]\alpha_{t}(x,u)\in[0,1] 2. Defined only at current time: αt(x,u)=0\alpha_{t}(x,u)=0 unless (x,u)=(xt,ut)(x,u)=(x_{t},u_{t}) 3. Strictly Causal: αt(x,u)\alpha_{t}(x,u) is a deterministic function of (x0,u0),,(xt,ut)(x_{0},u_{0}),\dots,(x_{t},u_{t}) 4. Not L1: t0αt(x,u)\sum_{t\ge 0}\alpha_{t}(x,u)\to \infty almost surely 5. Yes L2: t0αt2(x,u)C\sum_{t\ge 0}\alpha_{t}^{2}(x,u)\le C almost surely for some deterministic C<C<\infty.

This brings us to the key result:

1. Under , the , converges a.s. to QQ^{*}. 2. A stationary policy ff^{*} which satisfies minuUQ(x,u)=Q(x,f(x)),xX\min_{u\in \mathbb{U}}Q^{*}(x,u)=Q^{*}(x,f^{*}(x)),\,\forall x\in \mathbb{X}is an optimal policy.

So pretty much, applying Q-learning allows us to recover the optimal policy from the Q-matrix!

Linked from