
Markov Decision Process (MDP): The Blackjack Problem (MC-ES)

Problem description (Blackjack):

Blackjack, also known as 21, originated in France and has a long history. It is popular in casinos all over the world and is played with a 52-card deck (no jokers). The player's goal is a hand whose point total is as high as possible without exceeding 21. Card values: 2 through 9 count at face value; 10, J, Q and K all count as 10 (often written T, for "ten"); an Ace counts as either 1 or 11, at the player's choice.
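As an illustration of these scoring rules, here is a small hypothetical helper (`hand_value` is my own name, not used by the code later in the article, which tracks the running sum incrementally instead):

```python
def hand_value(cards):
    # cards are numeric ranks: 1 for an Ace, 2-9 at face value, 10 for T/J/Q/K
    total = sum(11 if c == 1 else c for c in cards)
    aces = cards.count(1)
    # downgrade Aces from 11 to 1 one at a time while the hand would bust
    while total > 21 and aces:
        total -= 10
        aces -= 1
    return total

print(hand_value([1, 10]))      # → 21 (blackjack)
print(hand_value([1, 1, 9]))    # → 21 (one Ace as 11, one as 1)
print(hand_value([10, 10, 5]))  # → 25 (bust)
```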

Game rules (variants differ; below is a simple two-player version, one dealer and one player):

    At the start, the dealer and the player each receive two cards; one of the dealer's cards is face up, the other face down. If either side has exactly 21, that side wins immediately; if both have 21, it is a draw. Otherwise play continues. The player may hit (take a card) or stick (stop). If the player's total exceeds 21 while hitting, the player busts and the dealer wins. If the player sticks without busting, it is the dealer's turn: the dealer reveals the hidden card and keeps hitting until the total is at least 17 (Aces count as 11 whenever that does not bust the hand). If the dealer busts, the player wins; otherwise the higher total wins, and equal totals are a draw.
Problem abstraction

The game of 21 can be modeled as a finite Markov decision process:

    s : state (player's side) - whether the player holds a usable Ace, the player's card total, and the dealer's showing card.
    a : action - hit (1) or stick (0).
    r : reward - +1 for a win, 0 for a draw, -1 for a loss.
    γ = 1.

Solving the problem

For blackjack, without a complete model of the environment we cannot use dynamic programming (DP) to find the optimal policy. Monte Carlo (MC) methods, however, can: MC only needs to collect enough state, action and reward sequences from the environment to estimate the value function, and from there find the optimal policy.
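To see why MC sampling is feasible here, a quick count of the decision-relevant state space under the abstraction above (the enumeration is only illustrative; the later code indexes the same states as array offsets):

```python
from itertools import product

# the player only faces a real decision at sums 12..21: below 12,
# hitting can never bust, so those sums need no policy
player_sums = range(12, 22)    # 10 values
dealer_cards = range(1, 11)    # dealer's showing card, Ace encoded as 1
usable_ace = (False, True)

states = list(product(usable_ace, player_sums, dealer_cards))
print(len(states))  # → 200 states, each with 2 actions (hit/stick)
```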

Suppose the player's policy π_p is: stick when the card total is 20 or 21, otherwise hit. We use MC to compute the value function v_π(s) of π_p. The dealer follows the fixed policy π_d: hit while the total is below 17, then stick.

Simulating the game

Record the states, actions and rewards.

import warnings
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import namedtuple
from tqdm.notebook import tqdm

warnings.filterwarnings('ignore')
# player policy: stick on 20 or 21, otherwise hit
def player_policy(usable_ace_player, cards_num, dealer_card1):
    # 1: hit, 0: stick
    if cards_num < 20:
        return 1
    else:
        return 0
# dealer policy: hit below 17, otherwise stick
def dealer_policy(cards_num):
    if cards_num < 17:
        return 1
    else:
        return 0
def play_blackjack(policy_player, policy_dealer, initial_state=None, initial_action=None):
    '''
    policy_player : function (usable_ace, player_sum, dealer_card1) -> action
    policy_dealer : function (dealer_sum) -> action
    initial_state : optional (usable_ace, player_sum, dealer_card1) to start from
    initial_action : optional forced first action for the player
    return -> (reward, player_trajectory)
    '''
    def card_value(card):
        return 11 if card == 1 else card
    # player
    player_sum = 0
    # dealer
    dealer_card1 = 0
    dealer_card2 = 0
    # trajectory
    player_trajectory = []
    player_transition = namedtuple('Transition', ['state', 'action'])
    # False: Ace counts as 1; True: Ace counts as 11
    usable_ace_player = False
    usable_ace_dealer = False
    if initial_state is None:
        while player_sum < 12:
            # keep hitting while the total is below 12 (no risk of busting)
            card = min(np.random.randint(1, 14), 10)
            # below 12, a new Ace always counts as 11
            player_sum += card_value(card)
            # the total can only exceed 21 here with two Aces (11 + 11)
            if player_sum > 21:
                # count one Ace as 1
                player_sum -= 10
            else:
                usable_ace_player |= (1 == card)
        # deal the dealer's cards; the first one is face up
        dealer_card1 = min(np.random.randint(1, 14), 10)
        dealer_card2 = min(np.random.randint(1, 14), 10)
    else:
        # start from the given state (used for exploring starts)
        usable_ace_player, player_sum, dealer_card1 = initial_state
        dealer_card2 = min(np.random.randint(1, 14), 10)
    
    dealer_sum = card_value(dealer_card1) + card_value(dealer_card2)
    usable_ace_dealer = 1 in (dealer_card1, dealer_card2)
    if dealer_sum > 21:
        # use Ace = 1
        dealer_sum -= 10
    # the player moves first
    while True:
        if initial_action is not None:
            player_action = initial_action
            initial_action = None
        else:
            player_action = policy_player(usable_ace_player, player_sum, dealer_card1)
        # record the (state, action) pair
        player_sa = player_transition((usable_ace_player, player_sum, dealer_card1), player_action)
        player_trajectory.append(player_sa)
        if player_action == 0:
            break
        # hit: a new Ace counts as 11 by default
        card = min(np.random.randint(1, 14), 10)
        # Keep track of the ace count
        ace_count = int(usable_ace_player)
        if card == 1:
            ace_count += 1
        player_sum += card_value(card)
        # avoid busting by counting an Ace as 1
        while player_sum > 21 and ace_count:
            player_sum -= 10
            ace_count -= 1
        if player_sum > 21:
            return -1 , player_trajectory
        usable_ace_player = (ace_count == 1)
    # the dealer's turn
    while True:
        dealer_action = policy_dealer(dealer_sum)
        if dealer_action == 0:
            break
        # hit: a new Ace counts as 11 by default
        new_card = min(np.random.randint(1, 14), 10)
        ace_count = int(usable_ace_dealer)
        if new_card == 1:
            ace_count += 1
        dealer_sum += card_value(new_card)
        # avoid busting by counting an Ace as 1
        while dealer_sum > 21 and ace_count:
            dealer_sum -= 10
            ace_count -= 1
        if dealer_sum > 21:
            return 1 , player_trajectory
        usable_ace_dealer = (ace_count == 1)
    if player_sum > dealer_sum:
        return 1 , player_trajectory
    elif player_sum == dealer_sum:
        return 0 , player_trajectory
    else:
        return -1 , player_trajectory
player_reward, player_traj = play_blackjack(player_policy, dealer_policy)
# result of one simulated game
player_reward
-1
player_traj
[Transition(state=(False, 12, 10), action=1),
 Transition(state=(False, 15, 10), action=1)]
Estimating the state-value function v_π(s)

First-visit MC considers only the first occurrence of a state s within an episode; every-visit MC averages over every occurrence of s in the episode. Both methods converge to the same values.

First-visit MC prediction:

Input: a policy π to be evaluated
Initialize: V(s) ∈ ℝ arbitrarily, for all s ∈ S; Returns(s) ← an empty list, for all s ∈ S
Loop forever (for each episode):
    Generate an episode following π: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
    G ← 0
    Loop for each step of the episode, t = T-1, T-2, ..., 0:
        G ← γ·G + R_{t+1}
        Unless S_t appears in S_0, S_1, ..., S_{t-1}:
            Append G to Returns(S_t)
            V(S_t) ← average(Returns(S_t))

With γ = 1, the return of every state in an episode equals the reward at the end of the game.
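A quick check of that claim with the same backward accumulation the estimator below uses (the reward sequence here is made up: two intermediate steps, then a loss):

```python
gamma = 1.0
# hypothetical episode: two intermediate steps, then the game is lost
player_rewards = [0, 0, -1]

R = 0
Gs = []
for r in reversed(player_rewards):
    R = r + gamma * R
    Gs.insert(0, R)
print(Gs)  # → [-1, -1, -1]: every state's return equals the final reward
```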

def monte_carlo_state_value_estimate(episodes, gamma=1.0):
    # return sums and visit counts, indexed by (player_sum - 12, dealer_card - 1);
    # counts start at one to avoid division by zero for unvisited states
    states_usable_ace = np.zeros((10, 10))
    states_usable_ace_count = np.ones((10, 10))
    states_no_usable_ace = np.zeros((10, 10))
    states_no_usable_ace_count = np.ones((10, 10))
    for i in tqdm(range(0, episodes)):
        player_reward, player_traj = play_blackjack(player_policy, dealer_policy)
        player_states = [t.state for t in player_traj]
        player_actions = [t.action for t in player_traj]
        player_rewards = [0]*len(player_states)
        player_rewards[-1] = player_reward
        R = 0
        Gs = []
        for r in player_rewards[::-1]:
            R = r + gamma * R
            Gs.insert(0, R)
        for player_state, G in zip(player_states, Gs):
            usable_ace_player, player_sum, dealer_card = player_state
            player_sum -= 12
            dealer_card -= 1
            if usable_ace_player:
                states_usable_ace_count[player_sum, dealer_card] += 1
                states_usable_ace[player_sum, dealer_card] += G
            else:
                states_no_usable_ace_count[player_sum, dealer_card] += 1
                states_no_usable_ace[player_sum, dealer_card] += G
    return states_usable_ace / states_usable_ace_count, states_no_usable_ace / states_no_usable_ace_count
Visualizing the value function

v_π(s) under the two conditions: usable Ace (Ace = 11) and no usable Ace (Ace = 1).
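A minimal plotting sketch for those two heat maps; the random arrays are stand-ins for the output of `monte_carlo_state_value_estimate` above, and the labels and colormap are my own choices:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# stand-ins: replace with monte_carlo_state_value_estimate(500_000)
v_usable = np.random.uniform(-1, 1, (10, 10))
v_no_usable = np.random.uniform(-1, 1, (10, 10))

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, values, title in zip(axes, (v_usable, v_no_usable),
                             ('Usable Ace', 'No usable Ace')):
    # flip so the largest player sum (21) sits at the top of the map
    sns.heatmap(np.flipud(values), ax=ax, cmap='YlGnBu',
                xticklabels=list(range(1, 11)),
                yticklabels=list(range(21, 11, -1)))
    ax.set_xlabel('Dealer showing')
    ax.set_ylabel('Player sum')
    ax.set_title(title)
plt.tight_layout()
plt.show()
```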

Without an environment model, estimating state values alone is not enough: even if we know each state's expected return, we do not know which action leads to it. In the earlier MDP problems there was a known environment model, P(s', r | s, π(s)), from which the expected value of each action and successor state could be computed and the best action chosen. When the model is unknown, we must instead estimate action values (state-action values, q_π(s, a)); the action-value function directly determines a policy. MC estimation of action values works exactly like the state-value case: average the observed returns.

Estimating the action-value function q(s, a)

State-value estimation only covers the state space s ∈ S; action-value estimation covers S × A with a ∈ A. When the state and action spaces are large, the space of (s, a) pairs is larger still, and estimating action values accurately requires visiting as much of it as possible during play. An unlimited number of games would certainly suffice, but that is impractical, so we need a strategy that trades exploration against playing time:

    Exploring starts: begin each game from a randomly chosen state (and action).
Monte Carlo ES (Exploring Starts)

Initialize:
    π(s) ∈ A(s) arbitrarily, for all s ∈ S
    Q(s, a) ∈ ℝ arbitrarily, for all s ∈ S, a ∈ A(s)
    Returns(s, a) ← empty list, for all s ∈ S, a ∈ A(s)
Loop forever (for each episode):
    Choose S_0 ∈ S, A_0 ∈ A(S_0) randomly, such that all pairs have probability > 0
    Generate an episode from S_0, A_0 following π: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
    G ← 0
    Loop for each step of the episode, t = T-1, T-2, ..., 0:
        G ← γ·G + R_{t+1}
        Unless the pair (S_t, A_t) appears in S_0, A_0, S_1, A_1, ..., S_{t-1}, A_{t-1}:
            Append G to Returns(S_t, A_t)
            Q(S_t, A_t) ← average(Returns(S_t, A_t))
            π(S_t) ← argmax_a Q(S_t, a)

Instead of storing the full Returns lists, the implementation below updates q(s, a) incrementally with each new return G:

q(s, a) ← (q(s, a) · count(s, a) + G) / (count(s, a) + 1)
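A quick sanity check that this incremental form reproduces the plain mean of the returns (starting from zero samples; note that the code below instead starts the counts at 1, a small zero-valued pseudo-observation that also avoids dividing by zero):

```python
returns = [1, -1, 1, 1]  # made-up returns for a single (s, a) pair

q, count = 0.0, 0
for G in returns:
    q = (q * count + G) / (count + 1)
    count += 1

print(q)  # → 0.5, the ordinary mean of the four returns
```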

def monte_carlo_es(episodes, gamma=1.0):
    sa_history = []
    # Initialize
    state_action_values = np.zeros((10, 10, 2, 2))
    state_action_pair_count = np.ones((10, 10, 2, 2))
    # argmax_a(sa)
    def greedy_policy(usable_ace, player_sum, dealer_card):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :]
        # break ties randomly (np.argmax alone would always pick action 0
        # when values_ == [0, 0])
        return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
        
    # Loop for each episode
    for episode in tqdm(range(episodes)):
        # exploring starts: random initial state and action
        initial_state = [bool(np.random.randint(0, 2)), np.random.randint(12, 22), np.random.randint(1, 11)]
        initial_action = np.random.randint(0, 2)
        # Generate an episode under the current greedy policy
        player_reward, player_traj = play_blackjack(greedy_policy, dealer_policy, initial_state, initial_action)
        player_states = [t.state for t in player_traj]
        player_actions = [t.action for t in player_traj]
        player_rewards = [0]*len(player_states)
        player_rewards[-1] = player_reward
        R = 0
        Gs = []
        for r in player_rewards[::-1]:
            R = r + gamma * R
            Gs.insert(0, R)
        for player_state,action, G in zip(player_states, player_actions, Gs):
            usable_ace_player, player_sum, dealer_card = player_state
            usable_ace = int(usable_ace_player)
            player_sum -= 12
            dealer_card -= 1
            # Update values of state-action
            old_val = state_action_values[player_sum, dealer_card, usable_ace, action]
            sa_count = state_action_pair_count[player_sum,dealer_card, usable_ace, action]
            new_val = (old_val * sa_count + G)/(sa_count + 1)
            state_action_values[player_sum, dealer_card, usable_ace, action] = new_val
            state_action_pair_count[player_sum, dealer_card, usable_ace, action] += 1
        # snapshot once per episode (not per update) for the convergence animation
        sa_history.append(state_action_values.copy())
    return state_action_values, sa_history
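After `monte_carlo_es` returns, the learned greedy policy is just an argmax over the action axis of the q table. A sketch with a random stand-in array of the same shape (replace it with the real `state_action_values`):

```python
import numpy as np

rng = np.random.default_rng(0)
# same layout as above: (player_sum - 12, dealer_card - 1, usable_ace, action)
state_action_values = rng.uniform(-1, 1, (10, 10, 2, 2))

# best action per state: 0 = stick, 1 = hit
optimal_policy = np.argmax(state_action_values, axis=-1)
print(optimal_policy.shape)  # → (10, 10, 2)

# e.g. the chosen action with player sum 16, dealer showing 10, no usable Ace
print(optimal_policy[16 - 12, 10 - 1, 0])
```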
Visualizing the action values

Because of the number of dimensions, differences in action value are shown with color: the darker the color, the larger the value.

[Figures: action-value heat maps, one panel for "usable Ace" and one for "no usable Ace".]

Finally:

The q(s, a) table is the policy itself: the optimal policy is π*(s) = argmax_a q(s, a).

[Animation: convergence of the action-value function.]

Source: www.mshxw.com
Original URL: https://www.mshxw.com/it/706762.html