Reinforcement Learning for Game Development

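Table of contents

Overview
Probability-based methods
Value-based methods
- Q-Learning (off-policy)
  - Summary
  - Implementation
- Sarsa (on-policy)
  - Summary
  - Implementation
- Sarsa(λ) (on-policy)
  - Introduction
  - Implementation
Combining the two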

Overview

TODO

Probability-based methods

TODO

Value-based methods

Q-Learning (off-policy)

See also: 莫烦Python, Q-Learning principles tutorial

Summary

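In a given environment (Env), the player cannot judge on its own whether the Action it takes in its current State is a good one; that feedback has to come from the environment. Every decision the player makes is answered with a reward from Env, and the player uses those rewards to keep updating the weight of each Action in each State. Repeated over many steps, this is how it learns.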

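The target value (the "real" reward) is computed as q_target = r + self.gamma * self.q_table.loc[s_, :].max(), that is, the immediate reward plus the discounted weight of the highest-weighted action in the next state.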

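The update self.q_table.loc[s, a] += self.lr * (q_target - q_predict) moves the stored estimate a small step toward the target, in proportion to the error, which resembles the error-driven weight updates of the backpropagation (BP) algorithm. The resemblance stops there: the table only memorizes one value per state-action pair and does not generalize across states. This exposes an obvious weakness: the Q-table needs one row per State, so the space it requires can blow up when there are many states.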
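To make the update rule concrete, here is a minimal sketch of a single Q-learning step on a hand-made table. The dictionary layout, the state and action names, and all the numbers are illustrative assumptions; only the two formulas mirror the ones quoted above.

```python
# Hypothetical two-state Q-table: state -> {action: value}. All numbers are made up.
q_table = {
    "s0": {"left": 0.0, "right": 0.5},
    "s1": {"left": 0.2, "right": 1.0},
}
lr = 0.1      # learning rate
gamma = 0.9   # discount factor


def q_learning_update(q, s, a, r, s_next, terminal=False):
    """Apply one tabular Q-learning update in place and return the error."""
    q_predict = q[s][a]
    # Terminal steps have no next state to bootstrap from.
    q_target = r if terminal else r + gamma * max(q[s_next].values())
    error = q_target - q_predict
    q[s][a] += lr * error
    return error


error = q_learning_update(q_table, "s0", "right", r=0.0, s_next="s1")
print(round(error, 3), round(q_table["s0"]["right"], 3))
```

With these numbers the target is 0.9, the error is 0.4, and the stored value moves from 0.5 to 0.54: one tenth of the way toward the target, as set by the learning rate.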

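Because the number of states has to stay small, a tabular Q-Learning agent ends up tightly bound to the environment it was trained in. The table is indexed by that environment's states, so it does not carry over when the environment changes.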
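A rough sketch of how quickly the table grows, assuming a hypothetical grid game in which a state is the player's cell plus the on/off status of a number of switches. The game, the function name and the parameters are made up for illustration.

```python
def q_table_entries(width, height, n_switches, n_actions):
    """Number of Q-table entries for the hypothetical grid game.

    A state is (player cell, on/off status of every switch), so the state
    count doubles with each extra switch.
    """
    n_states = width * height * 2 ** n_switches
    return n_states * n_actions


for n_switches in (0, 10, 20):
    print(n_switches, q_table_entries(10, 10, n_switches, 4))
```

On a 10 x 10 grid with 4 actions, the table grows from 400 entries with no switches to more than 400 million with 20, which is why the tabular approach is only practical for small state spaces.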

Implementation

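import numpy as np
import pandas as pd


class QLearningTable:
	"""
	Tabular Q-learning.
	Reference: https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/tabular-q1/
	Idea: act greedily most of the time.
		1. The Q-table stores a weight for every action of every state; the action with the largest weight is chosen.
		2. The Q-table is updated with error = (target reward - predicted reward).
	"""

	def __init__(self, actions, learning_rate = 0.01, reward_decay = 0.9, e_greedy = 0.9):
		"""Initialize the table and the hyperparameters."""
		# Available actions
		self.actions = actions
		# Learning rate
		self.lr = learning_rate
		# Discount factor: weight given to the next state's Q value
		self.gamma = reward_decay
		# Probability of acting greedily (explore with probability 1 - epsilon)
		self.epsilon = e_greedy
		# Decision table: rows are states, columns are actions
		self.q_table = pd.DataFrame(columns = self.actions, dtype = np.float64)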

	def check_state_exit(self, state):
		"""Add a row for `state` to the Q-table if it is not there yet."""
		if state not in self.q_table.index:
			# DataFrame.append was removed in pandas 2.0; assigning to a new label adds the row
			self.q_table.loc[state] = [0.0] * len(self.actions)

	def choose_action(self, observation):
		"""Pick an action for the current state with an epsilon-greedy policy."""
		# Make sure the state has a row in the table
		self.check_state_exit(observation)
		if np.random.uniform() < self.epsilon:
			# Exploit: read the action weights of this state
			state_action = self.q_table.loc[observation, :]
			# Choose among the actions with the largest weight, breaking ties at random
			action = np.random.choice(state_action[state_action == np.max(state_action)].index)
		else:
			# Explore: pick any action at random
			action = np.random.choice(self.actions)
		return action

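	def learn(self, s, a, r, s_):
		"""
		Update Q(s, a) from one transition.
		s: current state
		a: action
		r: reward
		s_: next state
		"""
		# Make sure the next state has a row in the table
		self.check_state_exit(s_)
		# Current estimate
		q_predict = self.q_table.loc[s, a]
		if s_ != 'terminal':
			# Non-terminal: reward plus the discounted best value of the next state
			q_target = r + self.gamma * self.q_table.loc[s_, :].max()
		else:
			# Terminal: there is no next state to bootstrap from
			q_target = r
		# Move the estimate toward the target
		self.q_table.loc[s, a] += self.lr * (q_target - q_predict)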

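'''
	################################
	  The code below is pseudocode:
	  Env is not defined here
	################################
'''
def main():
	# Create the environment
	env = Env()
	# env.actions lists the actions available in the environment
	RL = QLearningTable(env.actions)
	# Run 100 episodes
	for i in range(100):
		# Reset the environment
		env.reset()
		while True:
			# Read the player's current state
			s = env.player.state()
			# Choose the action to take in that state
			a = RL.choose_action(s)
			# Take one step; learn() expects s_ to be 'terminal' when the episode ends
			s_, r, done = env.step(a)
			# Learn from the transition
			RL.learn(s, a, r, s_)
			# Stop when the episode ends
			if done:
				break

if __name__ == "__main__":
	main()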
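The loop above relies on an Env class that the article never defines. As a stand-in, the sketch below is a hypothetical one-dimensional corridor exposing the same interface the pseudocode uses (actions, player.state(), reset(), step()). The names CorridorEnv, Player and length, the string state labels and the reward of 1.0 are all assumptions made for illustration.

```python
class Player:
    """Holds the player's position in the corridor."""

    def __init__(self):
        self.position = 0

    def state(self):
        # States are plain strings so they can serve as Q-table row labels.
        return "cell_%d" % self.position


class CorridorEnv:
    """Hypothetical 1-D corridor: start in cell 0, the goal is the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.actions = ["left", "right"]
        self.player = Player()

    def reset(self):
        """Put the player back at the start and return the start state."""
        self.player.position = 0
        return self.player.state()

    def step(self, action):
        """Move one cell and return (next_state, reward, done)."""
        if action == "right":
            self.player.position += 1
        else:
            # Walking left from cell 0 leaves the player where it is.
            self.player.position = max(0, self.player.position - 1)
        if self.player.position == self.length - 1:
            # Reaching the goal ends the episode; 'terminal' matches the check in learn().
            return "terminal", 1.0, True
        return self.player.state(), 0.0, False


env = CorridorEnv(length=4)
env.reset()
print(env.step("right"))
```

Using CorridorEnv() in place of Env() should be enough to exercise the training loop, since every rewarded step reports 'terminal' as the next state.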

Sarsa (on-policy)

See also: 莫烦Python, Sarsa principles tutorial

Summary

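Sarsa is similar to Q-Learning. In the current State it commits to an Action, and it has also already decided which Action it will take in the next State; it keeps updating the table from those pairs. Q-Learning differs in that it chooses an Action for the current State and then uses the largest action value of the next State as the target. That is speculative: the agent may not actually take that best action next, so the maximum is an assumed reward, treated as if it had already been obtained.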

Main differences: (A) Policy: Sarsa learns from the Action it actually takes at every step. (B) Learning: because the next Action a_ has already been chosen, the target is q_target = r + self.gamma * self.q_table.loc[s_, a_].
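The difference between the two targets can be seen on a single transition. The sketch below uses made-up action values for the next state and assumes the epsilon-greedy policy happened to explore and pick the weaker action as a_. The dictionary and the variable names are illustrative only.

```python
gamma = 0.9
reward = 0.0

# Hypothetical action values of the next state s_.
q_next = {"left": 0.2, "right": 1.0}
# Assume the policy explored and chose "left" as the next action a_.
a_next = "left"

# Q-Learning bootstraps from the best next action, whether or not it is taken.
q_learning_target = reward + gamma * max(q_next.values())
# Sarsa bootstraps from the action that will actually be taken.
sarsa_target = reward + gamma * q_next[a_next]

print(round(q_learning_target, 2), round(sarsa_target, 2))
```

Here Q-Learning's target is 0.9 while Sarsa's is 0.18. The two targets coincide whenever the chosen a_ is also the highest-valued action.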

Implementation

import numpy as np
import pandas as pd
from QLearningTable import QLearningTable

class SarsaTable(QLearningTable):

	def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
		super(SarsaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)

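	def learn(self, s, a, r, s_, a_, done):
		"""
		Update Q(s, a) from one transition.
		s: current state
		a: action
		r: reward
		s_: next state
		a_: next action
		done: whether the episode has ended
		"""
		# Make sure the next state has a row in the table
		self.check_state_exit(s_)
		# Current estimate
		q_predict = self.q_table.loc[s, a]
		if not done:
			# Non-terminal: reward plus the discounted value of the action actually chosen next
			q_target = r + self.gamma * self.q_table.loc[s_, a_]
		else:
			# Terminal: there is no next state to bootstrap from
			q_target = r
		# Move the estimate toward the target
		self.q_table.loc[s, a] += self.lr * (q_target - q_predict)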


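'''
	################################
	  The code below is pseudocode:
	  Env is not defined here
	################################
'''
def main():
	# Create the environment
	env = Env()
	# env.actions lists the actions available in the environment
	RL = SarsaTable(env.actions)
	# Run 100 episodes
	for i in range(100):
		# Reset the environment
		env.reset()
		# Read the player's starting state
		s = env.player.state()
		# Choose the first action before entering the loop
		a = RL.choose_action(s)
		while True:
			# Take one step
			s_, r, done = env.step(a)
			# Choose the action that will be taken in the next state
			a_ = RL.choose_action(s_)
			# Learn from (s, a, r, s_, a_)
			RL.learn(s, a, r, s_, a_, done)
			# The next state and action become the current ones
			s = s_
			a = a_
			# Stop when the episode ends
			if done:
				break

if __name__ == "__main__":
	main()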

Sarsa(λ) (on-policy)

See also: 莫烦Python, Sarsa(λ) principles tutorial

Introduction

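Plain Sarsa only updates the most recent state-action pair. Sarsa(λ) additionally uses the lambda value to update the weights of the pairs along the path that led to the reward, with pairs visited longer ago receiving a smaller share. This usually helps the Q-table converge faster.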
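The effect of the trace can be followed on a tiny example. The sketch below replays a hypothetical three-step episode in which only the last step is rewarded; the path, the hyperparameter values and the dictionary-based tables are assumptions, and with one action per state the trace handling is simplified to setting the visited pair to 1.

```python
gamma, lambda_, lr = 0.9, 0.9, 0.1

# Hypothetical episode: three state-action pairs visited in order.
path = [("s0", "right"), ("s1", "right"), ("s2", "right")]
q = {pair: 0.0 for pair in path}
trace = {pair: 0.0 for pair in path}

for step, (s, a) in enumerate(path):
    last = step == len(path) - 1
    reward = 1.0 if last else 0.0
    # Sarsa target: the reward alone at the end, otherwise bootstrap from the next pair.
    target = reward if last else reward + gamma * q[path[step + 1]]
    error = target - q[(s, a)]
    # Mark the pair that was just visited.
    trace[(s, a)] = 1.0
    for pair in q:
        # Every pair is updated in proportion to its trace, then the trace fades.
        q[pair] += lr * error * trace[pair]
        trace[pair] *= gamma * lambda_

print({pair[0]: round(value, 4) for pair, value in q.items()})
```

The rewarded last pair receives the full update of 0.1, while the two earlier pairs receive about 0.081 and 0.066. Plain Sarsa would have changed only the last pair in this episode.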

Implementation

import numpy as np
import pandas as pd
# The Sarsa section defines the class as SarsaTable; the module name is assumed to match
from SarsaTable import SarsaTable

class SarasLambdaTable(SarsaTable):

	def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):
		super().__init__(actions, learning_rate, reward_decay, e_greedy)
		# Decay rate (lambda) of the eligibility trace used for the backward view
		self.lambda_ = trace_decay
		# Empty eligibility trace table with the same layout as the Q-table
		self.eligibility_trace = self.q_table.copy()

	def check_state_exit(self, state):
		"""Add a row for `state` to both tables if it is not there yet."""
		if state not in self.q_table.index:
			# DataFrame.append was removed in pandas 2.0; assigning to a new label adds the row
			zeros = [0.0] * len(self.actions)
			self.q_table.loc[state] = zeros
			self.eligibility_trace.loc[state] = zeros

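	def learn(self, s, a, r, s_, a_, done):
		"""
		Update the whole Q-table from one transition.
		s: current state
		a: action
		r: reward
		s_: next state
		a_: next action
		done: whether the episode has ended
		"""
		# Make sure the next state has a row in both tables
		self.check_state_exit(s_)
		# Current estimate
		q_predict = self.q_table.loc[s, a]
		if not done:
			# Non-terminal: reward plus the discounted value of the action actually chosen next
			q_target = r + self.gamma * self.q_table.loc[s_, a_]
		else:
			# Terminal: there is no next state to bootstrap from
			q_target = r
		# Error between target and estimate
		error = q_target - q_predict
		# Set the visited state-action pair to 1 (and the state's other actions to 0):
		# it is an indispensable link on the way to the reward
		self.eligibility_trace.loc[s, :] *= 0
		self.eligibility_trace.loc[s, a] = 1
		# Unlike before, every state-action pair is updated, weighted by its trace
		self.q_table += self.lr * self.eligibility_trace * error
		# Decay the traces over time: the further a step is from the reward, the less it counts
		self.eligibility_trace *= self.gamma * self.lambda_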


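'''
	################################
	  The code below is pseudocode:
	  Env is not defined here
	################################
'''
def main():
	# Create the environment
	env = Env()
	# env.actions lists the actions available in the environment
	RL = SarasLambdaTable(env.actions)
	# Run 100 episodes
	for i in range(100):
		# Reset the environment
		env.reset()
		# Clear the traces so the previous episode does not leak into this one
		RL.eligibility_trace *= 0
		# Read the player's starting state
		s = env.player.state()
		# Choose the first action before entering the loop
		a = RL.choose_action(s)
		while True:
			# Take one step
			s_, r, done = env.step(a)
			# Choose the action that will be taken in the next state
			a_ = RL.choose_action(s_)
			# Learn from (s, a, r, s_, a_)
			RL.learn(s, a, r, s_, a_, done)
			# The next state and action become the current ones
			s = s_
			a = a_
			# Stop when the episode ends
			if done:
				break

if __name__ == "__main__":
	main()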

Combining the two

TODO
