Blackjack, also known as 21, originated in France and has a long history. It is popular in casinos all over the world and is played with the 52 cards of a standard deck (jokers removed). The player's goal is to make the point total of the cards in hand as large as possible without exceeding 21. Card points are counted as follows: cards 2 through 9 count as their face value; 10, J, Q and K all count as 10 points (often written T, for "ten"); an Ace can count as either 1 or 11, at the player's discretion.
Game rules (variants differ slightly; below is a simple two-player version with one dealer and one player):
- At the start, the dealer and the player each receive two cards; the dealer has one card face up and one face down. If either side's initial hand totals 21, that side wins; if both total 21, it is a draw. Otherwise play continues. The player acts first and may choose to hit (take a card) or stick (stop). If, while drawing, the player's total exceeds 21, the player busts and the dealer wins. If the player does not bust and decides to stop taking cards (stick), play passes to the dealer. The dealer reveals the face-down card and keeps hitting until the total is at least 17 (an Ace is counted as high as possible without busting). If the dealer busts, the player wins; otherwise the higher total wins, and equal totals are a draw.
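To make the scoring rule concrete, here is a minimal sketch of scoring a hand. The helper hand_value is hypothetical and not used by the code later in this post; it only illustrates how face cards and the Ace are counted:

def hand_value(cards):
    # cards holds ranks 1-13 (1 = Ace, 11/12/13 = J/Q/K)
    total = sum(min(c, 10) for c in cards)   # 10/J/Q/K all count as 10
    # count one Ace as 11 (i.e. add 10) only when it does not bust the hand
    if 1 in cards and total + 10 <= 21:
        return total + 10, True              # "usable" Ace
    return total, False

print(hand_value([1, 6]))     # (17, True)  : soft 17
print(hand_value([1, 6, 9]))  # (16, False) : the Ace must count as 1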
The blackjack problem can be viewed as a finite Markov decision process (MDP):
- $s$: state (player's side): whether the player holds a usable Ace, the player's card total, and the dealer's face-up card.
- $a$: action: hit (1) or stick (0).
- $r$: reward: $\{-1, 0, 1\}$ for a loss, a draw, and a win respectively.
- $\gamma = 1$

Solving the problem
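As a concrete illustration of this formulation (the names states, actions and rewards below are only illustrative and not used by the later code), the state space used in the rest of the post can be enumerated as tuples (usable_ace, player_sum, dealer_showing), where player sums below 12 are ignored because hitting can never bust there:

from itertools import product

# state: (usable_ace, player_sum, dealer_showing_card)
states = list(product([False, True], range(12, 22), range(1, 11)))
actions = [0, 1]        # 0: stick, 1: hit
rewards = [-1, 0, 1]    # lose, draw, win

print(len(states))      # 200 states in total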
For blackjack, without a complete model of the environment, dynamic programming (DP) cannot be used to find the optimal policy. Monte Carlo (MC) methods, however, can: they only need enough state, action and reward sequences collected from the environment to estimate the value function, and from that the optimal policy can be found.
Suppose the player's policy $\pi_p$ is: stick when the hand totals 20 or 21, otherwise hit. We use the MC method to estimate the value function $v_\pi(s)$ of policy $\pi_p$. The dealer follows a fixed policy $\pi_d$: keep hitting while the total is below 17, then stick.
Simulate games and record the states, actions and rewards.
import warnings
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import namedtuple
from tqdm.notebook import tqdm
warnings.filterwarnings('ignore')
# player policy
def player_policy(usable_ace_player, cards_num, dealer_card1):
    # 1: hit, 0: stick
    if cards_num < 20:
        return 1
    else:
        return 0

# dealer policy
def dealer_policy(cards_num):
    if cards_num < 17:
        return 1
    else:
        return 0
def play_blackjack(policy_player, policy_dealer, initial_state=None, initial_action=None):
    '''
    policy_player : function, maps a state to an action
    policy_dealer : function, maps the dealer's card total to an action
    initial_state : optional (usable_ace_player, player_sum, dealer_card1) to start from
    initial_action: optional first action of the player
    return -> reward, trajectory
    '''
    def card_value(card):
        return 11 if card == 1 else card

    # player
    player_sum = 0
    # dealer
    dealer_card1 = 0
    dealer_card2 = 0
    # trajectory
    player_trajectory = []
    player_transition = namedtuple('Transition', ['state', 'action'])
    # False: Ace counted as 1, True: Ace counted as 11
    usable_ace_player = False
    usable_ace_dealer = False
    if initial_state is None:
        # keep drawing while the total is below 12 (hitting can never bust here)
        while player_sum < 12:
            card = min(np.random.randint(1, 14), 10)
            # below 12, an Ace is counted as 11
            player_sum += card_value(card)
            # total above 21: the last card must have been an Ace, count it as 1
            if player_sum > 21:
                player_sum -= 10
            else:
                usable_ace_player |= (1 == card)
        # deal the dealer's cards; the first one is face up
        dealer_card1 = min(np.random.randint(1, 14), 10)
        dealer_card2 = min(np.random.randint(1, 14), 10)
    else:
        # start from a given state
        usable_ace_player, player_sum, dealer_card1 = initial_state
        dealer_card2 = min(np.random.randint(1, 14), 10)
    dealer_sum = card_value(dealer_card1) + card_value(dealer_card2)
    usable_ace_dealer = 1 in (dealer_card1, dealer_card2)
    if dealer_sum > 21:
        # count the Ace as 1
        dealer_sum -= 10
    # player's turn
    while True:
        if initial_action is not None:
            player_action = initial_action
            initial_action = None
        else:
            player_action = policy_player(usable_ace_player, player_sum, dealer_card1)
        # record (state, action)
        player_sa = player_transition((usable_ace_player, player_sum, dealer_card1), player_action)
        player_trajectory.append(player_sa)
        if player_action == 0:
            break
        # hit; a new Ace is counted as 11 by default
        card = min(np.random.randint(1, 14), 10)
        # keep track of the usable-ace count
        ace_count = int(usable_ace_player)
        if card == 1:
            ace_count += 1
        player_sum += card_value(card)
        # to avoid busting, count Aces as 1
        while player_sum > 21 and ace_count:
            player_sum -= 10
            ace_count -= 1
        if player_sum > 21:
            return -1, player_trajectory
        usable_ace_player = (ace_count == 1)
    # dealer's turn
    while True:
        dealer_action = policy_dealer(dealer_sum)
        if dealer_action == 0:
            break
        # hit; a new Ace is counted as 11 by default
        new_card = min(np.random.randint(1, 14), 10)
        ace_count = int(usable_ace_dealer)
        if new_card == 1:
            ace_count += 1
        dealer_sum += card_value(new_card)
        # to avoid busting, count Aces as 1
        while dealer_sum > 21 and ace_count:
            dealer_sum -= 10
            ace_count -= 1
        if dealer_sum > 21:
            return 1, player_trajectory
        usable_ace_dealer = (ace_count == 1)
    # compare totals
    if player_sum > dealer_sum:
        return 1, player_trajectory
    elif player_sum == dealer_sum:
        return 0, player_trajectory
    else:
        return -1, player_trajectory
player_reward, player_traj = play_blackjack(player_policy, dealer_policy)
player_reward
-1
player_traj
[Transition(state=(False, 12, 10), action=1), Transition(state=(False, 15, 10), action=1)]

State-value function estimation $v_\pi(s)$
First-visit MC: only the first occurrence of a state $s$ in an episode is considered. Every-visit MC: every occurrence of $s$ in an episode is considered. The two methods have the same convergence behavior.

First-visit MC
Input: a policy $\pi$ to be evaluated
Initialize: $V(s) \in \mathbb{R}$, arbitrarily, for all $s \in \mathcal{S}$; $Returns(s) \leftarrow$ an empty list, for all $s \in \mathcal{S}$
Loop forever (for each episode):
    Generate an episode following $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{T-1}, A_{T-1}, R_T$
    $G \leftarrow 0$
    Loop for each step of episode, $t = T-1, T-2, \dots, 0$:
        $G \leftarrow \gamma G + R_{t+1}$
        Unless $S_t$ appears in $S_0, S_1, \dots, S_{t-1}$:
            Append $G$ to $Returns(S_t)$
            $V(S_t) \leftarrow \mathrm{average}(Returns(S_t))$
With $\gamma = 1$, the value of every state in an episode equals the reward at the end of the game.
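This follows from the definition of the return: within one episode all intermediate rewards are 0 and only the final reward $R_T$ can be nonzero, so for every step $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T = 0 + 0 + \cdots + R_T = R_T$$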
def monte_carlo_state_value_estimate(episodes, gamma=1.0):
    # accumulated returns for states with / without a usable Ace,
    # indexed by (player_sum - 12, dealer_card - 1)
    states_usable_ace = np.zeros((10, 10))
    states_usable_ace_count = np.ones((10, 10))
    states_no_usable_ace = np.zeros((10, 10))
    states_no_usable_ace_count = np.ones((10, 10))
    for i in tqdm(range(0, episodes)):
        # generate an episode with the fixed player policy
        player_reward, player_traj = play_blackjack(player_policy, dealer_policy)
        player_states = [t.state for t in player_traj]
        player_actions = [t.action for t in player_traj]
        # only the final step yields a non-zero reward
        player_rewards = [0] * len(player_states)
        player_rewards[-1] = player_reward
        # compute the return G for every step, working backwards
        R = 0
        Gs = []
        for r in player_rewards[::-1]:
            R = r + gamma * R
            Gs.insert(0, R)
        # accumulate the returns per state
        for player_state, G in zip(player_states, Gs):
            usable_ace_player, player_sum, dealer_card = player_state
            player_sum -= 12
            dealer_card -= 1
            if usable_ace_player:
                states_usable_ace_count[player_sum, dealer_card] += 1
                states_usable_ace[player_sum, dealer_card] += G
            else:
                states_no_usable_ace_count[player_sum, dealer_card] += 1
                states_no_usable_ace[player_sum, dealer_card] += G
    # average return per state
    return states_usable_ace / states_usable_ace_count, states_no_usable_ace / states_no_usable_ace_count
Visualizing the value function
The value function $v_\pi(s)$ in the two cases: usable Ace (Ace = 11) and no usable Ace (Ace = 1).
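The plots themselves are not reproduced in the text. Below is a minimal sketch of how they could be drawn with the seaborn/matplotlib imports above; the episode count and plotting choices are illustrative:

values_usable_ace, values_no_usable_ace = monte_carlo_state_value_estimate(500000)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, values, title in zip(axes,
                             [values_usable_ace, values_no_usable_ace],
                             ['Usable Ace', 'No Usable Ace']):
    # rows: player sum 12-21 (flipped so 21 is at the top), columns: dealer showing 1-10
    sns.heatmap(np.flipud(values), ax=ax, cmap='YlGnBu',
                xticklabels=list(range(1, 11)),
                yticklabels=list(reversed(range(12, 22))))
    ax.set_title(title)
    ax.set_xlabel('Dealer showing')
    ax.set_ylabel('Player sum')
plt.show()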
Without a model of the environment, estimating state values alone is not enough: even knowing the expected return of every state, we do not know how to reach a given state (which action to take). In the earlier MDP problems a known environment model was available (i.e. $P(s', r \mid s, \pi(s))$), so the expected value of each action and its successor state could be computed and the best action chosen. When the environment model is unknown, we therefore need to estimate action values (state-action values, $q_\pi(s,a)$); the action-value function expresses the policy directly. Monte Carlo estimation of action values works the same way as for state values: average the observed returns.
Action-value function estimation $q(s,a)$

State-value estimation only covers the state space $s \in \mathcal{S}$; action-value estimation covers $\mathcal{S} \times \mathcal{A}$, $a \in \mathcal{A}$. If the state and action spaces are large, the space of $(s,a)$ pairs is larger still, so to estimate the action values accurately we need to visit as much of the $s$-$a$ space as possible while playing. An unlimited number of games would certainly achieve this, but it is not practical, so some strategy is needed to balance the amount of play:
- Exploring starts: begin each game from a randomly chosen state (and a random first action).
Initialize:
    $\pi(s) \in \mathcal{A}(s)$ (arbitrarily), for all $s \in \mathcal{S}$
    $Q(s,a) \in \mathbb{R}$ (arbitrarily), for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
    $Returns(s,a) \leftarrow$ empty list, for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
Loop forever (for each episode):
    Choose $S_0 \in \mathcal{S}, A_0 \in \mathcal{A}(S_0)$ randomly such that all pairs have probability > 0
    Generate an episode from $S_0, A_0$ following $\pi$: $S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T$
    $G \leftarrow 0$
    Loop for each step of episode, $t = T-1, T-2, \dots, 0$:
        $G \leftarrow \gamma G + R_{t+1}$
        Unless the pair $S_t, A_t$ appears in $S_0, A_0, S_1, A_1, \dots, S_{t-1}, A_{t-1}$:
            Append $G$ to $Returns(S_t, A_t)$
            $Q(S_t, A_t) \leftarrow \mathrm{average}(Returns(S_t, A_t))$
            $\pi(S_t) \leftarrow \mathrm{argmax}_a Q(S_t, a)$

In the implementation, $q(s,a)$ is updated incrementally as a running average:

$$q(s,a) \leftarrow \frac{q(s,a) \times (s,a)_{count} + q(s,a)_{new}}{(s,a)_{count} + 1}$$
def monte_carlo_es(episodes, gamma=1.0):
    sa_history = []
    # Initialize: q(s,a) indexed by (player_sum - 12, dealer_card - 1, usable_ace, action)
    state_action_values = np.zeros((10, 10, 2, 2))
    state_action_pair_count = np.ones((10, 10, 2, 2))

    # greedy policy: argmax_a q(s,a)
    def greedy_policy(usable_ace, player_sum, dealer_card):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :]
        # break ties randomly (e.g. values_ = [0, 0] at the start)
        return np.random.choice([action_ for action_, value_ in enumerate(values_)
                                 if value_ == np.max(values_)])

    # Loop for each episode
    for episode in tqdm(range(episodes)):
        # exploring starts: random initial state and action
        initial_state = [bool(np.random.randint(0, 2)),
                         np.random.randint(12, 22),
                         np.random.randint(1, 11)]
        initial_action = np.random.randint(0, 2)
        # Generate an episode following the greedy policy
        player_reward, player_traj = play_blackjack(greedy_policy, dealer_policy,
                                                    initial_state, initial_action)
        player_states = [t.state for t in player_traj]
        player_actions = [t.action for t in player_traj]
        # only the final step yields a non-zero reward
        player_rewards = [0] * len(player_states)
        player_rewards[-1] = player_reward
        # compute the return G for every step, working backwards
        R = 0
        Gs = []
        for r in player_rewards[::-1]:
            R = r + gamma * R
            Gs.insert(0, R)
        for player_state, action, G in zip(player_states, player_actions, Gs):
            usable_ace_player, player_sum, dealer_card = player_state
            usable_ace = int(usable_ace_player)
            player_sum -= 12
            dealer_card -= 1
            # incremental update of the state-action value (running average)
            old_val = state_action_values[player_sum, dealer_card, usable_ace, action]
            sa_count = state_action_pair_count[player_sum, dealer_card, usable_ace, action]
            new_val = (old_val * sa_count + G) / (sa_count + 1)
            state_action_values[player_sum, dealer_card, usable_ace, action] = new_val
            state_action_pair_count[player_sum, dealer_card, usable_ace, action] += 1
        # keep a snapshot per episode for the convergence animation
        sa_history.append(state_action_values.copy())
    return state_action_values, sa_history
Visualizing the action values
Because of the dimensional limitation, different colors are used to represent the differences in action value.
(Heatmaps of the action values for the usable-Ace and no-usable-Ace cases; the darker the color, the larger the value.)
The $q(s,a)$ matrix is the concrete content of the policy; the optimal policy is $\pi^{*}(s) = \mathrm{argmax}_a\, q(s,a)$.
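As a hedged sketch of how the greedy policy could be read out of the estimated $q(s,a)$ and plotted (the episode count is illustrative; note that sa_history stores one snapshot per episode, so memory grows with the number of episodes):

state_action_values, sa_history = monte_carlo_es(100000)

# greedy policy pi*(s) = argmax_a q(s,a); array axes are
# (player_sum - 12, dealer_card - 1, usable_ace, action)
policy_usable_ace = np.argmax(state_action_values[:, :, 1, :], axis=-1)
policy_no_usable_ace = np.argmax(state_action_values[:, :, 0, :], axis=-1)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, policy, title in zip(axes,
                             [policy_usable_ace, policy_no_usable_ace],
                             ['Usable Ace', 'No Usable Ace']):
    sns.heatmap(np.flipud(policy), ax=ax, cmap='Greys', cbar=False,
                xticklabels=list(range(1, 11)),
                yticklabels=list(reversed(range(12, 22))))
    ax.set_title(f'{title} (dark = hit, light = stick)')
    ax.set_xlabel('Dealer showing')
    ax.set_ylabel('Player sum')
plt.show()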
Animation of the convergence of the action-value function
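The animation itself is not reproduced here. Below is a minimal sketch of how it could be built from the sa_history snapshots recorded above with matplotlib.animation; the subsampling step, the slice being shown, and the pillow GIF writer are illustrative choices (Pillow must be installed for the GIF output):

from matplotlib import animation

frames = sa_history[::1000]   # subsample the per-episode snapshots
fig, ax = plt.subplots(figsize=(6, 5))

def draw_frame(i):
    ax.clear()
    # q(s, a=hit) for states without a usable Ace, at snapshot i
    sns.heatmap(np.flipud(frames[i][:, :, 0, 1]), ax=ax, cmap='YlGnBu', cbar=False)
    ax.set_title(f'q(s, hit), no usable Ace - frame {i}')

anim = animation.FuncAnimation(fig, draw_frame, frames=len(frames), interval=100)
anim.save('q_convergence.gif', writer='pillow')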


