After taking a step, the value of the previous step is re-evaluated using the new experience.
- Game overview
```
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
```
**R, G, Y, B** are the possible pickup and destination locations. The **blue** letter represents the current passenger pick-up location, and the **purple** letter is the current destination.
The filled square represents the taxi, which is **yellow** without a passenger and **green** with a passenger.
The pipe (`|`) represents a wall which the taxi cannot cross.
Action numbers:
0 = south
1 = north
2 = east
3 = west
4 = pickup
5 = dropoff
- Q-Learning code
```python
import gym
import numpy as np
import random
from IPython.display import clear_output

env = gym.make("Taxi-v3").env  # the Taxi environment shown above: 500 states, 6 actions

q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:  # with probability 0.1, act randomly
            action = env.action_space.sample()  # Explore action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]  # current estimate for this state-action pair
        next_max = np.max(q_table[next_state])
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)  # Q-learning update rule
        q_table[state, action] = new_value  # re-score the previous step after seeing the next state

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")
```
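The same training loop can be exercised without gym on a toy problem. Below is a minimal sketch using a hypothetical 1-D corridor environment of my own (not part of the tutorial): the agent starts at state 0 and must walk right to the goal at state 4, with the same per-step penalty idea as Taxi. After training, the greedy policy should move right from every non-goal state.

```python
import numpy as np
import random

# Toy stand-in for Taxi-v3 (hypothetical): states 0..4, goal at 4.
# Actions: 0 = left, 1 = right. Reward -1 per step, +20 on reaching the goal.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    reward = 20 if done else -1
    return next_state, reward, done

q_table = np.zeros([N_STATES, N_ACTIONS])
alpha, gamma, epsilon = 0.1, 0.6, 0.1
random.seed(0)

for _ in range(500):
    state, done = 0, False
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = random.randrange(N_ACTIONS)      # explore
        else:
            action = int(np.argmax(q_table[state]))   # exploit
        next_state, reward, done = step(state, action)
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        state = next_state

# Greedy action per non-goal state after training: should all be 1 (right).
greedy_policy = [int(np.argmax(q_table[s])) for s in range(GOAL)]
```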
Code walkthrough
- Initialize the Q-table
```python
import numpy as np

q_table = np.zeros([env.observation_space.n, env.action_space.n])
```
The Q-table has one row per state, 500 in total: a 5×5 grid of taxi positions, 5 passenger locations (the 4 letters plus "in the taxi"), and 4 destinations, so 5 × 5 × 5 × 4 = 500.
The Q-table has one column per action available in each state; this game has 6 possible actions.
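That 500-state count can be sanity-checked with a small encoder. The helper below (the name `encode_state` is my own) packs the four components row-major, mirroring how the Taxi environment lays out its discrete state index:

```python
def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack (taxi_row, taxi_col, passenger_loc, destination) into one index.

    taxi_row, taxi_col in 0..4, passenger_loc in 0..4 (4 = in the taxi),
    destination in 0..3 -- giving 5 * 5 * 5 * 4 = 500 distinct states.
    """
    i = taxi_row
    i = i * 5 + taxi_col
    i = i * 5 + passenger_loc
    i = i * 4 + destination
    return i

n_states = 5 * 5 * 5 * 4  # 500, matching env.observation_space.n
first = encode_state(0, 0, 0, 0)   # 0
last = encode_state(4, 4, 4, 3)    # 499
```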
- Set hyperparameters and choose actions
```python
import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

for i in range(1, 100001):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:  # with probability 0.1, act randomly
            action = env.action_space.sample()  # Explore action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values
```
Hyperparameter notes
**alpha** (the learning rate): should decrease as you continue to gain a larger and larger knowledge base.
**gamma**: determines how much importance we give to future rewards. A high discount factor (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes the agent consider only immediate reward, hence making it greedy. (As you get closer and closer to the deadline, your preference for near-term reward should increase, as you won't be around long enough to get the long-term reward, so gamma should decrease.)
**epsilon**: as we develop our strategy, we have less need of exploration and more exploitation to get more utility from our policy, so as trials increase, epsilon should decrease.
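The code above keeps alpha and epsilon fixed at 0.1, but the notes suggest decaying them over time. One common way (illustrative only, not part of the tutorial code) is exponential decay toward a floor:

```python
def decayed_epsilon(episode, start=1.0, floor=0.05, decay=0.999):
    """Exponentially decay epsilon per episode, never going below `floor`."""
    return max(floor, start * decay ** episode)

# Early episodes explore heavily; by the end, epsilon sits at the floor.
eps_early = decayed_epsilon(0)       # 1.0
eps_late = decayed_epsilon(100000)   # clipped to the 0.05 floor
```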
Because every entry in the Q-table starts at 0, `np.argmax` breaks the tie by returning the first index, so the agent's greedy choice is always action 0 (i.e. south) until some values have been updated.
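Because the Q-table starts all-zero, `np.argmax` breaks ties toward the first index, so the initial greedy action is 0 (south, per the action table above). A tiny check:

```python
import numpy as np

fresh_row = np.zeros(6)           # Q-values for one state before any update
first_choice = int(np.argmax(fresh_row))  # 0: ties break toward `south`

# As soon as another action gains value, the greedy choice changes.
fresh_row[2] = 0.5                # e.g. `east` receives a positive update
later_choice = int(np.argmax(fresh_row))  # 2
```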
- Update the Q-table
```python
        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]  # current estimate for this state-action pair
        next_max = np.max(q_table[next_state])
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)  # Q-learning update rule
        q_table[state, action] = new_value  # re-score the previous step after seeing the next state
```
After moving to the new state, the previous step is re-scored with `new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)`, and the entry for the previous state-action pair is overwritten with this value.
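The update rule is easy to work through by hand (the helper name `q_update` is mine): with `old_value = 0`, the usual per-step reward of `-1`, and `next_max = 0`, the new value is `0.9 * 0 + 0.1 * (-1 + 0.6 * 0) = -0.1`.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """One Q-learning update: blend the old estimate with the new target."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

# First visit to a state-action pair under the per-step reward of -1:
v1 = q_update(old_value=0.0, reward=-1, next_max=0.0)    # -0.1
# If the next state already promises value 10, the target pulls upward:
v2 = q_update(old_value=-0.1, reward=-1, next_max=10.0)  # 0.9*(-0.1) + 0.1*(-1 + 6) = 0.41
```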
- Repeat until the agent reliably finds the correct route: the loop runs 100,000 episodes to tune the Q-table.