After taking a step, the value of the previous step is re-evaluated using the new experience.
- Game overview
```
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
```
**R, G, Y, B** are the possible pickup and destination locations. The **blue** letter represents the current passenger pick-up location, and the **purple** letter is the current destination.
The filled square represents the taxi, which is **yellow** without a passenger and **green** with a passenger.
The pipe (`|`) represents a wall which the taxi cannot cross.
Action numbers:
0 = south
1 = north
2 = east
3 = west
4 = pickup
5 = dropoff
- Q-Learning code
```python
import gym
import numpy as np
import random
from IPython.display import clear_output

env = gym.make("Taxi-v3").env  # the Taxi environment shown above: 500 states, 6 actions

q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:  # with probability 0.1, act randomly
            action = env.action_space.sample()  # Explore action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]  # current estimate for this state-action pair
        next_max = np.max(q_table[next_state])
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)  # Q-learning update rule
        q_table[state, action] = new_value  # re-score the previous step after seeing the next state

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")
```
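The same training loop can be exercised without gym on a toy problem. Below is a minimal sketch using a hypothetical 1-D corridor environment of my own (not part of the tutorial): the agent starts at state 0 and must walk right to the goal at state 4, with the same per-step penalty idea as Taxi. After training, the greedy policy should move right from every non-goal state.

```python
import numpy as np
import random

# Toy stand-in for Taxi-v3 (hypothetical): states 0..4, goal at 4.
# Actions: 0 = left, 1 = right. Reward -1 per step, +20 on reaching the goal.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    reward = 20 if done else -1
    return next_state, reward, done

q_table = np.zeros([N_STATES, N_ACTIONS])
alpha, gamma, epsilon = 0.1, 0.6, 0.1
random.seed(0)

for _ in range(500):
    state, done = 0, False
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = random.randrange(N_ACTIONS)      # explore
        else:
            action = int(np.argmax(q_table[state]))   # exploit
        next_state, reward, done = step(state, action)
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        state = next_state

# Greedy action per non-goal state after training: should all be 1 (right).
greedy_policy = [int(np.argmax(q_table[s])) for s in range(GOAL)]
```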
Code walkthrough
- Initialize the Q-table
```python
import numpy as np

q_table = np.zeros([env.observation_space.n, env.action_space.n])
```
The Q-table has one row per state, 500 in total: a 5×5 grid of taxi positions, 5 passenger locations (the 4 letters plus "in the taxi"), and 4 destinations, so 5 × 5 × 5 × 4 = 500.
The Q-table has one column per action available in each state; this game has 6 possible actions.
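That 500-state count can be sanity-checked with a small encoder. The helper below (the name `encode_state` is my own) packs the four components row-major, mirroring how the Taxi environment lays out its discrete state index:

```python
def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack (taxi_row, taxi_col, passenger_loc, destination) into one index.

    taxi_row, taxi_col in 0..4, passenger_loc in 0..4 (4 = in the taxi),
    destination in 0..3 -- giving 5 * 5 * 5 * 4 = 500 distinct states.
    """
    i = taxi_row
    i = i * 5 + taxi_col
    i = i * 5 + passenger_loc
    i = i * 4 + destination
    return i

n_states = 5 * 5 * 5 * 4  # 500, matching env.observation_space.n
first = encode_state(0, 0, 0, 0)   # 0
last = encode_state(4, 4, 4, 3)    # 499
```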
- Set hyperparameters and choose actions
```python
import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

for i in range(1, 100001):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:  # with probability 0.1, act randomly
            action = env.action_space.sample()  # Explore action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values
```
Hyperparameter notes
**alpha** (the learning rate): should decrease as you continue to gain a larger and larger knowledge base.
**gamma**: determines how much importance we give to future rewards. A high discount factor (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes the agent consider only immediate reward, hence making it greedy. (As you get closer and closer to the deadline, your preference for near-term reward should increase, as you won't be around long enough to get the long-term reward, so gamma should decrease.)
**epsilon**: as we develop our strategy, we have less need of exploration and more exploitation to get more utility from our policy, so as trials increase, epsilon should decrease.
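The code above keeps alpha and epsilon fixed at 0.1, but the notes suggest decaying them over time. One common way (illustrative only, not part of the tutorial code) is exponential decay toward a floor:

```python
def decayed_epsilon(episode, start=1.0, floor=0.05, decay=0.999):
    """Exponentially decay epsilon per episode, never going below `floor`."""
    return max(floor, start * decay ** episode)

# Early episodes explore heavily; by the end, epsilon sits at the floor.
eps_early = decayed_epsilon(0)       # 1.0
eps_late = decayed_epsilon(100000)   # clipped to the 0.05 floor
```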
Because every entry in the Q-table starts at 0, `np.argmax` breaks the tie by returning the first index, so the agent's greedy choice is always action 0 (i.e. south) until some values have been updated.
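Because the Q-table starts all-zero, `np.argmax` breaks ties toward the first index, so the initial greedy action is 0 (south, per the action table above). A tiny check:

```python
import numpy as np

fresh_row = np.zeros(6)           # Q-values for one state before any update
first_choice = int(np.argmax(fresh_row))  # 0: ties break toward `south`

# As soon as another action gains value, the greedy choice changes.
fresh_row[2] = 0.5                # e.g. `east` receives a positive update
later_choice = int(np.argmax(fresh_row))  # 2
```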
- Update the Q-table
```python
        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]  # current estimate for this state-action pair
        next_max = np.max(q_table[next_state])
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)  # Q-learning update rule
        q_table[state, action] = new_value  # re-score the previous step after seeing the next state
```
After moving to the new state, the previous step is re-scored with `new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)`, and the entry for the previous state-action pair is overwritten with this value.
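The update rule is easy to work through by hand (the helper name `q_update` is mine): with `old_value = 0`, the usual per-step reward of `-1`, and `next_max = 0`, the new value is `0.9 * 0 + 0.1 * (-1 + 0.6 * 0) = -0.1`.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """One Q-learning update: blend the old estimate with the new target."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

# First visit to a state-action pair under the per-step reward of -1:
v1 = q_update(old_value=0.0, reward=-1, next_max=0.0)    # -0.1
# If the next state already promises value 10, the target pulls upward:
v2 = q_update(old_value=-0.1, reward=-1, next_max=10.0)  # 0.9*(-0.1) + 0.1*(-1 + 6) = 0.41
```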
- Repeat until the agent reliably finds the correct route: the loop runs 100,000 episodes to tune the Q-table.