Definition (Q-Learning Algorithm)
Consider a MDP with finite X and U and some discount parameter β∈(0,1). Let Q:X×U→R denote the Q-factor matrix. We assume that the decision maker applies an arbitrary policy γ and updates its Q-factors as follows for t≥0 Qt+1(x,u)=Qt(x,u)+αt(x,u)(c(x,u)+βvminQt(Xt+1,v)−Qt(x,u))where:
- the initial condition Q0 is given (i.e. user-defined);
- αt(x,u) is the step-size for (x,u) at time t (a.k.a. the learning rate);
- ut is chosen according to our arbitrary γ;
- Xt+1 evolves according to T
>[!rmk] >We can rewrite the update as Qt+1(x,u)=(1−αt(x,u))Qt(x,u)+αt(x,u)[c(x,u)+βu′∈UminQt(Xt+1,u′)]essentially thinking of the update as a weighted average between current Q values and new information.
We usually take the following assumption
Assumption (21.2.1)
In our Q-Learning problem, for our learning rate, α(x,u) we assume ∀(x,u),t≥0 that:
- Unit Length: αt(x,u)∈[0,1]
- Defined only at current time: αt(x,u)=0 unless (x,u)=(xt,ut)
- Strictly Causal: αt(x,u) is a deterministic function of (x0,u0),…,(xt,ut)
- Not L1: ∑t≥0αt(x,u)→∞ almost surely
- Yes L2: ∑t≥0αt2(x,u)≤C almost surely for some deterministic C<∞.
This brings us to the key result:
Theorem (21.2.1)
- Under , the , converges a.s. to Q∗.
- A stationary policy f∗ which satisfies u∈UminQ∗(x,u)=Q∗(x,f∗(x)),∀x∈Xis an optimal Policy.
So pretty much, applying Q-learning allows us to recover the optimal policy from the Q-matrix!
Theorem (Optimality of Q-Learning Algorithm)
Consider the Q-Learning algorithm and its dynamics. Then:
- Q-Learning Converges to an Optimal Solution: Under the , the Q-Learning algorithm converges almost surely to an optimal Q∗.
- Optimal Stationary Policy is Optimal: A Stationary policy f∗ which satisfies uminQ∗(x,u)=Q∗(x,f∗(x)) is an optimal policy.