
Artificial Intelligence in Games Assignment 2

Published: 2022-01-11


Assignment 2

Artificial Intelligence in Games

In this assignment, you will implement a variety of reinforcement learning algorithms to find policies for the frozen lake environment. Please read this entire document before you start working on the assignment.

1 Environment

The frozen lake environment has two main variants: the small frozen lake (Fig. 1) and the big frozen lake (Fig. 2). In both cases, each tile in a square grid corresponds to a state. There is also an additional absorbing state, which will be introduced soon. There are four types of tiles: start (grey), frozen lake (light blue), hole (dark blue), and goal (white). The agent has four actions, which correspond to moving one tile up, left, down, or right. However, with probability 0.1, the environment ignores the desired direction and the agent slips (moves one tile in a random direction, which may be the desired direction). An action that would cause the agent to move outside the grid leaves the state unchanged.
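As a rough illustration of the slipping rule, the sketch below computes the probability of actually moving in each direction for a given desired action. It assumes that a slip picks one of the four directions uniformly at random, which the text does not state explicitly. The function name direction_probabilities is hypothetical.

```python
import numpy as np

def direction_probabilities(action, slip=0.1, n_actions=4):
    """Probability of actually moving in each direction when `action` is desired."""
    # Assumption: a slip chooses uniformly among all directions.
    probs = np.full(n_actions, slip / n_actions)
    # With probability 1 - slip the desired direction is used.
    probs[action] += 1.0 - slip
    return probs

probs = direction_probabilities(action=0)
```

Under this assumption, the desired direction is taken with probability 0.9 + 0.1/4 = 0.925 and each other direction with probability 0.025.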

Figure 1: Small frozen lake

Figure 2: Big frozen lake

The agent receives reward 1 upon taking an action at the goal. In every other case, the agent receives zero reward. Note that the agent does not receive a reward upon moving into the goal (nor a negative reward upon moving into a hole). Upon taking an action at the goal or in a hole, the agent moves into the absorbing state. Every action taken at the absorbing state leads to the absorbing state, which also does not provide rewards. Assume a discount factor of γ = 0.9.
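To make the reward timing concrete, the following sketch computes the discounted return of a hypothetical episode in which the agent needs six moves to reach the goal and then takes one more action there. It assumes the usual convention in which the reward for the first action is undiscounted. The helper discounted_return is not part of the assignment interface.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward for the t-th action (t = 0, 1, ...) is weighted by gamma**t."""
    return sum(gamma ** t * reward for t, reward in enumerate(rewards))

# Six moves without reward, one action taken at the goal (reward 1),
# then two actions in the absorbing state (reward 0).
rewards = [0, 0, 0, 0, 0, 0, 1, 0, 0]
episode_return = discounted_return(rewards)
```

In this example the return equals 0.9 to the power of 6, since only the action taken at the goal is rewarded.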

For the purposes of model-free reinforcement learning (or interactive testing), the agent is able to interact with the frozen lake for a number of time steps that is equal to the number of tiles.

Your first task is to implement the frozen lake environment. Using either Python or Java, try to mimic the Python interface presented in Listing 1.

Listing 1: Frozen lake environment.

import numpy as np
import contextlib

# Configures numpy print options
@contextlib.contextmanager
def _printoptions(*args, **kwargs):
    original = np.get_printoptions()
    np.set_printoptions(*args, **kwargs)
    try:
        yield
    finally:
        np.set_printoptions(**original)

class EnvironmentModel:
    def __init__(self, n_states, n_actions, seed=None):
        self.n_states = n_states
        self.n_actions = n_actions

        self.random_state = np.random.RandomState(seed)

    def p(self, next_state, state, action):
        raise NotImplementedError()

    def r(self, next_state, state, action):
        raise NotImplementedError()

    def draw(self, state, action):
        p = [self.p(ns, state, action) for ns in range(self.n_states)]
        next_state = self.random_state.choice(self.n_states, p=p)
        reward = self.r(next_state, state, action)

        return next_state, reward

class Environment(EnvironmentModel):
    def __init__(self, n_states, n_actions, max_steps, pi, seed=None):
        EnvironmentModel.__init__(self, n_states, n_actions, seed)

        self.max_steps = max_steps

        self.pi = pi
        if self.pi is None:
            self.pi = np.full(n_states, 1./n_states)

    def reset(self):
        self.n_steps = 0
        self.state = self.random_state.choice(self.n_states, p=self.pi)

        return self.state

    def step(self, action):
        if action < 0 or action >= self.n_actions:
            raise Exception('Invalid action.')

        self.n_steps += 1
        done = (self.n_steps >= self.max_steps)

        self.state, reward = self.draw(self.state, action)

        return self.state, reward, done

    def render(self, policy=None, value=None):
        raise NotImplementedError()

class FrozenLake(Environment):
    def __init__(self, lake, slip, max_steps, seed=None):
        """
        lake: A matrix that represents the lake. For example:
            lake = [['&', '.', '.', '.'],
                    ['.', '#', '.', '#'],
                    ['.', '.', '.', '#'],
                    ['#', '.', '.', '$']]
        slip: The probability that the agent will slip
        max_steps: The maximum number of time steps in an episode
        seed: A seed to control the random number generator (optional)
        """
        # start (&), frozen (.), hole (#), goal ($)
        self.lake = np.array(lake)
        self.lake_flat = self.lake.reshape(-1)

        self.slip = slip

        n_states = self.lake.size + 1
        n_actions = 4

        pi = np.zeros(n_states, dtype=float)
        pi[np.where(self.lake_flat == '&')[0]] = 1.0

        self.absorbing_state = n_states - 1

        # TODO:

    def step(self, action):
        state, reward, done = Environment.step(self, action)

        done = (state == self.absorbing_state) or done

        return state, reward, done

    def p(self, next_state, state, action):
        # TODO:

    def r(self, next_state, state, action):
        # TODO:

    def render(self, policy=None, value=None):
        if policy is None:
            lake = np.array(self.lake_flat)

            if self.state < self.absorbing_state:
                lake[self.state] = '@'

            print(lake.reshape(self.lake.shape))
        else:
            # UTF-8 arrows look nicer, but cannot be used in LaTeX
            # https://www.w3schools.com/charsets/ref_utf_arrows.asp
            actions = ['^', '<', '_', '>']

            print('Lake:')
            print(self.lake)

            print('Policy:')
            policy = np.array([actions[a] for a in policy[:-1]])
            print(policy.reshape(self.lake.shape))

            print('Value:')
            with _printoptions(precision=3, suppress=True):
                print(value[:-1].reshape(self.lake.shape))

def play(env):
    actions = ['w', 'a', 's', 'd']

    state = env.reset()
    env.render()

    done = False
    while not done:
        c = input('\nMove: ')
        if c not in actions:
            raise Exception('Invalid action')

        state, r, done = env.step(actions.index(c))

        env.render()
        print('Reward: {0}.'.format(r))

The class EnvironmentModel represents a model of an environment. The constructor of this class receives a number of states, a number of actions, and a seed that controls the pseudorandom number generator. Its subclasses must implement two methods: p and r. The method p returns the probability of transitioning from state to next_state given action. The method r returns the expected reward for having transitioned from state to next_state given action. The method draw receives a state and an action and returns a next state drawn according to p together with the corresponding expected reward. Note that states and actions are represented by integers starting at zero. We highly recommend that you follow the same convention, since this will greatly facilitate the implementation of reinforcement learning algorithms. You can use a Python dictionary (or equivalent data structure) to map between integers and a more convenient representation when necessary. Note that, in general, agents may receive rewards drawn probabilistically by an environment, which is not supported in this simplified implementation.
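To illustrate how p, r, and draw fit together, here is a minimal sketch of a subclass for a toy problem that is unrelated to the frozen lake. The base class is a condensed copy of the one in Listing 1. TwoStateModel, its transition probabilities, and its rewards are hypothetical.

```python
import numpy as np

class EnvironmentModel:
    """Condensed copy of the base class from Listing 1."""
    def __init__(self, n_states, n_actions, seed=None):
        self.n_states = n_states
        self.n_actions = n_actions
        self.random_state = np.random.RandomState(seed)

    def draw(self, state, action):
        p = [self.p(ns, state, action) for ns in range(self.n_states)]
        next_state = self.random_state.choice(self.n_states, p=p)
        return next_state, self.r(next_state, state, action)

class TwoStateModel(EnvironmentModel):
    """Hypothetical toy model: state 0 is a start state, state 1 is absorbing.
    Action 0 waits; action 1 tries to leave and succeeds with probability 0.8."""
    def __init__(self, seed=None):
        EnvironmentModel.__init__(self, n_states=2, n_actions=2, seed=seed)

    def p(self, next_state, state, action):
        if state == 1 or action == 0:
            return 1.0 if next_state == state else 0.0
        return 0.8 if next_state == 1 else 0.2

    def r(self, next_state, state, action):
        # Reward 1 only for the transition from state 0 into state 1.
        return 1.0 if (state == 0 and next_state == 1) else 0.0

model = TwoStateModel(seed=0)
next_state, reward = model.draw(state=0, action=1)
```

For every pair of state and action, the values returned by p over all next states should sum to one; otherwise the call to choice inside draw fails.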

The class Environment represents an interactive environment and inherits from EnvironmentModel. The constructor of this class receives a number of states, a number of actions, a maximum number of steps for interaction, a probability distribution over initial states, and a seed that controls the pseudorandom number generator. Its subclasses must implement two methods: p and r, which were already explained above. This class has two new methods: reset and step. The method reset restarts the interaction between the agent and the environment by setting the number of time steps to zero and drawing a state according to the probability distribution over initial states. This state is stored by the class. The method step receives an action and returns a next state drawn according to p, the corresponding expected reward, and a flag variable. The new state is stored by the class. This method also keeps track of how many steps have been taken. Once the number of steps matches or exceeds the pre-defined maximum number of steps, the flag variable indicates that the interaction should end.
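The reset and step methods are typically used in a loop like the one sketched below. CountingEnvironment is a hypothetical stand-in that only mimics the interface described above (it is not the frozen lake), and run_episode is an illustrative helper rather than part of the assignment.

```python
class CountingEnvironment:
    """Hypothetical stand-in with the reset/step interface described above.
    The state is the number of steps taken and every step yields reward 0."""
    def __init__(self, max_steps):
        self.max_steps = max_steps

    def reset(self):
        self.n_steps = 0
        self.state = 0
        return self.state

    def step(self, action):
        self.n_steps += 1
        self.state = self.n_steps
        done = (self.n_steps >= self.max_steps)
        return self.state, 0.0, done

def run_episode(env, choose_action):
    """Interact with env until it signals that the interaction should end."""
    state = env.reset()
    done = False
    total_reward, n_steps = 0.0, 0
    while not done:
        state, reward, done = env.step(choose_action(state))
        total_reward += reward
        n_steps += 1
    return total_reward, n_steps

# 16 steps would correspond to a lake with 16 tiles.
total_reward, n_steps = run_episode(CountingEnvironment(max_steps=16), lambda state: 0)
```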

The class FrozenLake represents the frozen lake environment. Your task is to implement the methods p and r for this class. The constructor of this class receives a matrix that represents a lake, a probability that the agent will slip at any given time step, a maximum number of steps for interaction, and a seed that controls the pseudorandom number generator. This class overrides the method step to indicate that the interaction should also end when the absorbing state is reached. The method render is capable of rendering the state of the environment or a pair of policy and value function. Your implementation must similarly be capable of receiving any matrix that represents a lake.

The function play can be used to interactively test your implementation of the environment.

Important: Implementing the frozen lake environment is deceptively simple. The file p.npy contains a numpy.array with the probability to be returned by the method p for each combination of next state, state, and action in the small frozen lake. You should load this file using numpy.load to check your implementation. The tiles are numbered in row-major order, with the absorbing state coming last. The actions are numbered in the following order: up, left, down, and right.
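The numbering conventions above can be sketched as follows. The helper move converts between a row-major tile index and grid coordinates, and max_difference compares a model against a reference array. Both names are hypothetical. The (row, column) offsets and the axis order [next_state, state, action] of the reference array are assumptions that should be checked against p.npy.

```python
import numpy as np

# Action order from the text: up, left, down, right.
# The (row, column) offsets are an assumed encoding of those moves.
OFFSETS = [(-1, 0), (0, -1), (1, 0), (0, 1)]

def move(state, action, shape):
    """Tile index reached from tile `state` by moving one tile in direction `action`."""
    row, col = np.unravel_index(state, shape)  # row-major numbering
    new_row, new_col = row + OFFSETS[action][0], col + OFFSETS[action][1]
    if 0 <= new_row < shape[0] and 0 <= new_col < shape[1]:
        return int(np.ravel_multi_index((new_row, new_col), shape))
    return int(state)  # a move that would leave the grid keeps the state unchanged

def max_difference(env, expected):
    """Largest absolute gap between env.p and expected[next_state, state, action]."""
    actual = np.array([[[env.p(ns, s, a) for a in range(env.n_actions)]
                        for s in range(env.n_states)]
                       for ns in range(env.n_states)])
    return float(np.max(np.abs(actual - expected)))
```

With a finished environment env for the small frozen lake, max_difference(env, np.load('p.npy')) should be close to zero if the assumed axis order matches the file.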

2 Tabular model-based reinforcement learning

Your next task is to implement policy evaluation, policy improvement, policy iteration, and value iteration. You may follow the interface suggested in Listing 2.

Listing 2: Tabular model-based algorithms.

def policy_evaluation(env, policy, gamma, theta, max_iterations):
    # The builtin float replaces np.float, which recent NumPy versions no longer provide
    value = np.zeros(env.n_states, dtype=float)

    # TODO:

    return value

def policy_improvement(env, value, gamma):
    policy = np.zeros(env.n_states, dtype=int)

    # TODO:

    return policy

def policy_iteration(env, gamma, theta, max_iterations, policy=None):
    if policy is None:
        policy = np.zeros(env.n_states, dtype=int)
    else:
        policy = np.array(policy, dtype=int)

    # TODO:

    return policy, value

def value_iteration(env, gamma, theta, max_iterations, value=None):
    if value is None:
        value = np.zeros(env.n_states)
    else:
        value = np.array(value, dtype=float)

    # TODO:

    return policy, value

The function policy_evaluation receives an environment model, a deterministic policy, a discount factor, a tolerance parameter, and a maximum number of iterations. A deterministic policy may be represented by an array that contains the action prescribed for each state.
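One possible shape for this function is sketched below as iterative policy evaluation with in-place sweeps. Stopping when the largest change in a sweep drops below theta, or after max_iterations sweeps, is an assumption about how those two parameters are meant to be used. ChainModel is a hypothetical three-state toy model used only to exercise the code.

```python
import numpy as np

class ChainModel:
    """Hypothetical toy model: 0 = start, 1 = goal, 2 = absorbing.
    Action 0 stays in the start state, action 1 advances to the goal."""
    n_states, n_actions = 3, 2

    def _successor(self, state, action):
        if state == 0:
            return 1 if action == 1 else 0
        return 2  # the goal and the absorbing state both lead to the absorbing state

    def p(self, next_state, state, action):
        return 1.0 if next_state == self._successor(state, action) else 0.0

    def r(self, next_state, state, action):
        return 1.0 if state == 1 else 0.0  # reward for acting at the goal

def policy_evaluation(env, policy, gamma, theta, max_iterations):
    value = np.zeros(env.n_states, dtype=float)
    for _ in range(max_iterations):
        delta = 0.0
        for s in range(env.n_states):
            v = value[s]
            # Expected one-step return of the action prescribed by the policy.
            value[s] = sum(env.p(ns, s, policy[s]) * (env.r(ns, s, policy[s]) + gamma * value[ns])
                           for ns in range(env.n_states))
            delta = max(delta, abs(v - value[s]))
        if delta < theta:
            break
    return value

value = policy_evaluation(ChainModel(), policy=[1, 0, 0], gamma=0.9,
                          theta=1e-6, max_iterations=100)
```

For the policy that advances from the start state, the toy model gives values 0.9, 1, and 0 for the start, goal, and absorbing states.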

The function policy_improvement receives an environment model, the value function for a policy to be improved, and a discount factor.
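A minimal sketch of greedy policy improvement under the same interface follows. It assumes ties between actions are broken in favour of the lowest action index, which is simply what numpy.argmax does. ChainModel is a hypothetical toy model, not part of the assignment.

```python
import numpy as np

class ChainModel:
    """Hypothetical toy model: 0 = start, 1 = goal, 2 = absorbing.
    Action 0 stays in the start state, action 1 advances to the goal."""
    n_states, n_actions = 3, 2

    def _successor(self, state, action):
        if state == 0:
            return 1 if action == 1 else 0
        return 2

    def p(self, next_state, state, action):
        return 1.0 if next_state == self._successor(state, action) else 0.0

    def r(self, next_state, state, action):
        return 1.0 if state == 1 else 0.0

def policy_improvement(env, value, gamma):
    policy = np.zeros(env.n_states, dtype=int)
    for s in range(env.n_states):
        # Expected one-step return of each action under the given value function.
        q = [sum(env.p(ns, s, a) * (env.r(ns, s, a) + gamma * value[ns])
                 for ns in range(env.n_states))
             for a in range(env.n_actions)]
        policy[s] = int(np.argmax(q))
    return policy

# Value function of the policy that always stays in the start state.
policy = policy_improvement(ChainModel(), value=np.array([0.0, 1.0, 0.0]), gamma=0.9)
```

Here the improved policy switches to advancing in the start state, since that action has expected return 0.9 rather than 0.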

The function policy_iteration receives an environment model, a discount factor, a tolerance parameter, a maximum number of iterations, and (optionally) the initial policy.
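The sketch below alternates evaluation and greedy improvement until the policy stops changing. Using max_iterations as the bound for both the outer loop and the evaluation sweeps, and reusing the previous value function as the starting point of each evaluation, are assumptions rather than requirements of the assignment. ChainModel and backup are hypothetical helpers.

```python
import numpy as np

class ChainModel:
    """Hypothetical toy model: 0 = start, 1 = goal, 2 = absorbing.
    Action 0 stays in the start state, action 1 advances to the goal."""
    n_states, n_actions = 3, 2

    def _successor(self, state, action):
        if state == 0:
            return 1 if action == 1 else 0
        return 2

    def p(self, next_state, state, action):
        return 1.0 if next_state == self._successor(state, action) else 0.0

    def r(self, next_state, state, action):
        return 1.0 if state == 1 else 0.0

def backup(env, value, state, action, gamma):
    """Expected one-step return of taking `action` in `state` under `value`."""
    return sum(env.p(ns, state, action) * (env.r(ns, state, action) + gamma * value[ns])
               for ns in range(env.n_states))

def policy_iteration(env, gamma, theta, max_iterations):
    policy = np.zeros(env.n_states, dtype=int)
    value = np.zeros(env.n_states, dtype=float)
    for _ in range(max_iterations):
        # Policy evaluation: sweep until the largest change drops below theta.
        for _ in range(max_iterations):
            delta = 0.0
            for s in range(env.n_states):
                v = value[s]
                value[s] = backup(env, value, s, policy[s], gamma)
                delta = max(delta, abs(v - value[s]))
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        new_policy = np.array([np.argmax([backup(env, value, s, a, gamma)
                                          for a in range(env.n_actions)])
                               for s in range(env.n_states)], dtype=int)
        if np.array_equal(new_policy, policy):
            break  # the policy is stable
        policy = new_policy
    return policy, value

policy, value = policy_iteration(ChainModel(), gamma=0.9, theta=1e-6, max_iterations=100)
```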

The function value_iteration receives an environment model, a discount factor, a tolerance parameter, a maximum number of iterations, and (optionally) the initial value function.
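A minimal sketch of value iteration under this interface is given below. It performs in-place sweeps with the maximum over actions and then extracts a greedy policy. The stopping rule based on theta and max_iterations is an assumption, and ChainModel and backup are hypothetical helpers used only for illustration.

```python
import numpy as np

class ChainModel:
    """Hypothetical toy model: 0 = start, 1 = goal, 2 = absorbing.
    Action 0 stays in the start state, action 1 advances to the goal."""
    n_states, n_actions = 3, 2

    def _successor(self, state, action):
        if state == 0:
            return 1 if action == 1 else 0
        return 2

    def p(self, next_state, state, action):
        return 1.0 if next_state == self._successor(state, action) else 0.0

    def r(self, next_state, state, action):
        return 1.0 if state == 1 else 0.0

def backup(env, value, state, action, gamma):
    """Expected one-step return of taking `action` in `state` under `value`."""
    return sum(env.p(ns, state, action) * (env.r(ns, state, action) + gamma * value[ns])
               for ns in range(env.n_states))

def value_iteration(env, gamma, theta, max_iterations):
    value = np.zeros(env.n_states, dtype=float)
    for _ in range(max_iterations):
        delta = 0.0
        for s in range(env.n_states):
            v = value[s]
            # Best expected one-step return over all actions.
            value[s] = max(backup(env, value, s, a, gamma) for a in range(env.n_actions))
            delta = max(delta, abs(v - value[s]))
        if delta < theta:
            break
    # Greedy policy with respect to the final value function.
    policy = np.array([np.argmax([backup(env, value, s, a, gamma)
                                  for a in range(env.n_actions)])
                       for s in range(env.n_states)], dtype=int)
    return policy, value

policy, value = value_iteration(ChainModel(), gamma=0.9, theta=1e-6, max_iterations=100)
```

On the toy model this reaches the same policy and values as the policy that advances from the start state: 0.9, 1, and 0.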