🌱
Consider an MDP with finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, and some discount parameter $\gamma \in (0,1)$. Let $Q_t \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$ denote the $Q$-factor matrix at time $t$. We assume that the decision maker applies an arbitrary policy $\pi$ and updates its $Q$-factors as follows for $t = 0, 1, 2, \dots$:

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha_t(s,a)\Bigl[\, r(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} Q_t(s_{t+1}, a') - Q_t(s,a) \Bigr], \qquad \forall (s,a),$$

where:
- the initial condition $Q_0$ is given (i.e. user-defined);
- $\alpha_t(s,a)$ is the step-size for the pair $(s,a)$ at time $t$ (a.k.a. the learning rate);
- $a_t$ is chosen according to our arbitrary policy $\pi$;
- $s_{t+1}$ evolves according to the transition kernel $P(\cdot \mid s_t, a_t)$.
We can rewrite the update as

$$Q_{t+1}(s,a) = \bigl(1 - \alpha_t(s,a)\bigr)\, Q_t(s,a) + \alpha_t(s,a)\Bigl[\, r(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} Q_t(s_{t+1}, a') \Bigr],$$

essentially thinking of the update as a weighted average between the current $Q$-values and the new information.
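To make the update concrete, here is a minimal tabular sketch in Python. The environment interface (`env.reset`, `env.step`), the $\varepsilon$-greedy behaviour policy, and the constant step size are illustrative assumptions of mine, not part of the setup above; note in particular that a constant `alpha` is a practical simplification, while the convergence result below requires a decaying schedule satisfying the conditions listed next.

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.9, alpha=0.1,
               epsilon=0.1, n_steps=10_000, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    `env` is assumed (hypothetically) to expose `reset() -> s` and
    `step(a) -> (s_next, reward, done)` with integer-indexed states.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))   # user-defined initial condition Q_0
    s = env.reset()
    for _ in range(n_steps):
        # Behaviour policy: arbitrary; here epsilon-greedy w.r.t. current Q.
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # Weighted-average form of the update: only the visited pair (s, a)
        # changes, matching the "defined only at current time" condition.
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = env.reset() if done else s_next
    return Q
```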
We usually make the following assumptions.
In our Q-Learning problem, we assume that our learning rates $\alpha_t$ satisfy:

1. Unit Length: $\alpha_t(s,a) \in [0,1]$.
2. Defined only at current time: $\alpha_t(s,a) = 0$ unless $(s,a) = (s_t, a_t)$.
3. Strictly Causal: $\alpha_t$ is a deterministic function of the history $(s_0, a_0, \dots, s_t, a_t)$.
4. Not L1: $\sum_{t \ge 0} \alpha_t(s,a) = \infty$ almost surely, for every pair $(s,a)$.
5. Yes L2: $\sum_{t \ge 0} \alpha_t(s,a)^2 \le C < \infty$ almost surely, for some deterministic constant $C$.
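As an example (my own, not from the text above), the standard visit-count schedule satisfies all five conditions. Let $n_t(s,a)$ be the number of visits to $(s,a)$ up to and including time $t$, and set

$$\alpha_t(s,a) = \begin{cases} \dfrac{1}{n_t(s,a)} & \text{if } (s,a) = (s_t, a_t), \\[4pt] 0 & \text{otherwise}. \end{cases}$$

If every pair $(s,a)$ is visited infinitely often, the nonzero step sizes along the visits are $1, \tfrac12, \tfrac13, \dots$, so $\sum_t \alpha_t(s,a) = \sum_{n \ge 1} \tfrac1n = \infty$ (Not L1) while $\sum_t \alpha_t(s,a)^2 = \sum_{n \ge 1} \tfrac1{n^2} = \pi^2/6 < \infty$ (Yes L2), and the first three conditions hold by construction.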
This brings us to the key result:
1. Under assumptions 1-5 above, the sequence of $Q$-factors $\{Q_t\}$ converges a.s. to the optimal $Q$-factor matrix $Q^*$.
2. A stationary policy $\pi^*$ which satisfies $\pi^*(s) \in \arg\max_{a \in \mathcal{A}} Q^*(s,a)$ for all $s \in \mathcal{S}$ is an optimal policy.
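The second part is what we use in practice: once the $Q$-factors have (approximately) converged, the greedy policy read off the matrix is (approximately) optimal. A one-line sketch, assuming the `Q` array produced by the earlier snippet:

```python
import numpy as np

def greedy_policy(Q):
    """Stationary deterministic policy: pi(s) = argmax_a Q(s, a)."""
    return np.argmax(Q, axis=1)   # one action index per state

# Usage: if Q has (approximately) converged to Q*, this policy is
# (approximately) optimal.
# pi_star = greedy_policy(Q)
```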
So pretty much, applying Q-learning allows us to recover the optimal policy from the Q-matrix!