smartstart.algorithms package
Submodules
smartstart.algorithms.counter module
Counter module
Describes the Counter base class for TD-learning algorithms.
class Counter(env)
Bases: object

Base class for visitation counts.

Base class for keeping track of obs-action-obs_tp1 visitation counts in discrete-state, discrete-action reinforcement learning algorithms.

Parameters:
- env (Environment) – environment

Attributes:
- env (Environment) – environment
- count_map (collections.defaultdict, nested) – visitation counts for each obs-action-obs_tp1
- total (int) – total number of visitation counts
get_count(obs, action=None, obs_tp1=None)
Returns the visitation count.

The visitation count for obs, obs-action or obs-action-obs_tp1 is returned. The user can leave action and/or obs_tp1 empty to return a higher-level count.

Note: When action is None and obs_tp1 is not None, the count for just obs is returned and obs_tp1 is not taken into account.

Parameters:
- obs (list of int or np.ndarray) – observation
- action (int) – action (Default value = None)
- obs_tp1 (list of int or np.ndarray) – next observation (Default value = None)

Returns: Visitation count
Return type: int
get_count_map()
Returns the state count map for the environment.

The count map is a numpy array matching the width (w) and height (h) of the environment. Each entry (state) holds the count associated with that state.

Note: Only works for 2D environments that have w and h attributes.

Returns: Count map
Return type: np.ndarray
get_density(obs, action=None, obs_tp1=None)
Returns the density for obs, obs-action or obs-action-obs_tp1.

The density is calculated by dividing the count by the total count.

Parameters:
- obs (list of int or np.ndarray) – observation
- action (int) – action (Default value = None)
- obs_tp1 (list of int or np.ndarray) – next observation (Default value = None)

Returns: Density
Return type: float
get_density_map()
Returns the state density map for the environment.

The density map is a numpy array matching the width (w) and height (h) of the environment. Each entry (state) holds the density associated with that state.

Note: Only works for 2D environments that have w and h attributes.

Returns: Density map
Return type: np.ndarray
increment(obs, action, obs_tp1)
Increments the count for the obs-action-obs_tp1 transition.

Parameters:
- obs (list of int or np.ndarray) – current observation
- action (int) – current action
- obs_tp1 (list of int or np.ndarray) – next observation
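The counting scheme above is compact enough to illustrate with a standalone sketch. The snippet below mirrors the documented behaviour (a nested count_map, a running total, higher-level counts when action or obs_tp1 is omitted, density = count / total) but is not the library implementation; tuples are used here as hashable stand-ins for observations.

```python
from collections import defaultdict

# Standalone sketch of the documented counting scheme (not the library class):
# count_map[obs][action][obs_tp1] holds transition counts, total is their sum.
count_map = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
total = 0

def increment(obs, action, obs_tp1):
    global total
    count_map[obs][action][obs_tp1] += 1
    total += 1

def get_count(obs, action=None, obs_tp1=None):
    # Leaving action and/or obs_tp1 as None returns a higher-level count.
    if action is None:
        return sum(n for per_action in count_map[obs].values()
                   for n in per_action.values())
    if obs_tp1 is None:
        return sum(count_map[obs][action].values())
    return count_map[obs][action][obs_tp1]

def get_density(obs, action=None, obs_tp1=None):
    # Density is the count divided by the total count.
    return get_count(obs, action, obs_tp1) / total if total else 0.0

increment((0, 0), 1, (0, 1))
increment((0, 0), 1, (0, 1))
increment((0, 0), 0, (1, 0))
print(get_count((0, 0)))        # 3 (all transitions leaving obs)
print(get_count((0, 0), 1))     # 2 (obs-action count)
print(get_density((0, 0), 1))   # 0.666... (count / total)
```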
smartstart.algorithms.qlearning module
Q-Learning module
Module defining classes for Q-Learning and Q(lambda).
See 'Reinforcement Learning: An Introduction' by Richard S. Sutton and Andrew G. Barto for more information.
class QLearning(env, *args, **kwargs)
Bases: smartstart.algorithms.tdlearning.TDLearning

Parameters:
- env (Environment) – environment
- *args – see parent class TDLearning
- **kwargs – see parent class TDLearning
get_next_q_action(obs_tp1, done)
Off-policy action selection.

Parameters:
- obs_tp1 (list of int or np.ndarray) – next observation
- done (bool) – True for a terminal state

Returns:
- float – Q-value for obs_tp1
- int – action_tp1
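Off-policy here means the update target uses the greedy Q-value of obs_tp1, independent of the action the behaviour policy will actually execute. The sketch below illustrates that selection rule with a plain dictionary as a hypothetical stand-in for the agent's Q-function; returning 0.0 for terminal states is an assumption consistent with standard Q-Learning, not a detail taken from this page.

```python
import random

def greedy_next_q_action(q_values, obs_tp1, done):
    """Q-Learning style (off-policy) target: max over actions in obs_tp1."""
    if done:
        return 0.0, None                      # terminal state: no future value (assumed)
    actions = list(q_values[obs_tp1].keys())
    best = max(q_values[obs_tp1][a] for a in actions)
    # Break ties randomly among the greedy actions.
    action_tp1 = random.choice([a for a in actions if q_values[obs_tp1][a] == best])
    return best, action_tp1

q = {(0, 1): {0: 0.2, 1: 0.7, 2: 0.7}}
print(greedy_next_q_action(q, (0, 1), done=False))   # (0.7, 1) or (0.7, 2)
```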
class QLearningLambda(env, *args, **kwargs)
Bases: smartstart.algorithms.tdlearning.TDLearningLambda

Note: Does not work properly; because Q-Learning is off-policy, standard eligibility traces might fail.

Parameters:
- env (Environment) – environment
- *args – see parent class TDLearningLambda
- **kwargs – see parent class TDLearningLambda
get_next_q_action(obs_tp1, done)
Off-policy action selection.

Parameters:
- obs_tp1 (list of int or np.ndarray) – next observation
- done (bool) – True for a terminal state

Returns:
- float – Q-value for obs_tp1
- int – action_tp1
smartstart.algorithms.sarsa module
SARSA module
Module defining classes for SARSA and SARSA(lambda).
See 'Reinforcement Learning: An Introduction' by Richard S. Sutton and Andrew G. Barto for more information.
class SARSA(env, *args, **kwargs)
Bases: smartstart.algorithms.tdlearning.TDLearning

Parameters:
- env (Environment) – environment
- *args – see parent class TDLearning
- **kwargs – see parent class TDLearning
get_next_q_action(obs_tp1, done)
On-policy action selection.

Parameters:
- obs_tp1 (list of int or np.ndarray) – next observation
- done (bool) – True for a terminal state

Returns:
- float – Q-value for obs_tp1
- int – action_tp1
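On-policy means SARSA's target uses the Q-value of the action its own exploration policy will actually take in obs_tp1, rather than the greedy maximum used by QLearning. The sketch below illustrates the difference with an epsilon-greedy stand-in policy; as in the Q-Learning sketch, the dictionary Q-function and the 0.0 terminal value are assumptions.

```python
import random

def epsilon_greedy_next_q_action(q_values, obs_tp1, done, epsilon=0.1):
    """SARSA style (on-policy) target: Q-value of the action the policy will take."""
    if done:
        return 0.0, None                      # terminal state: no future value (assumed)
    actions = list(q_values[obs_tp1].keys())
    if random.random() < epsilon:
        action_tp1 = random.choice(actions)   # explore
    else:
        best = max(q_values[obs_tp1][a] for a in actions)
        action_tp1 = random.choice([a for a in actions
                                    if q_values[obs_tp1][a] == best])  # exploit
    return q_values[obs_tp1][action_tp1], action_tp1

q = {(0, 1): {0: 0.2, 1: 0.7}}
print(epsilon_greedy_next_q_action(q, (0, 1), done=False))
```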
class SARSALambda(env, *args, **kwargs)
Bases: smartstart.algorithms.tdlearning.TDLearningLambda

Parameters:
- env (Environment) – environment
- *args – see parent class TDLearningLambda
- **kwargs – see parent class TDLearningLambda
get_next_q_action(obs_tp1, done)
On-policy action selection.

Parameters:
- obs_tp1 (list of int or np.ndarray) – next observation
- done (bool) – True for a terminal state

Returns:
- float – Q-value for obs_tp1
- int – action_tp1
smartstart.algorithms.tdlearning module
Temporal-Difference module
Describes the TDLearning and TDLearningLambda base classes for temporal-difference learning without and with eligibility traces.
See 'Reinforcement Learning: An Introduction' by Richard S. Sutton and Andrew G. Barto for more information.
class TDLearning(env, num_episodes=1000, max_steps=1000, alpha=0.1, gamma=0.99, init_q_value=0.0, exploration='E-Greedy', epsilon=0.1, temp=0.5, beta=1.0)
Bases: smartstart.algorithms.counter.Counter

Base class for temporal-difference methods.

Base class for the temporal-difference methods Q-Learning, SARSA, SARSA(lambda) and Q(lambda). Implements all common methods; methods specific to each algorithm have to be implemented in the child class.

All exploration methods are defined in this class. A new one can be added by writing a method that describes the exploration strategy, registering it in the class attributes below, and inserting it in the get_action method.

Currently five exploration methods are implemented: no exploration, epsilon-greedy, Boltzmann, count-based and UCB. Another option is optimism in the face of uncertainty, which can be used by setting init_q_value > 0.

Parameters:
- env (Environment) – environment
- num_episodes (int) – number of episodes
- max_steps (int) – maximum number of steps per episode
- alpha (float) – learning step-size
- gamma (float) – discount factor
- init_q_value (float) – initial Q-value
- exploration (str) – exploration strategy, see class attributes for available options
- epsilon (float or Scheduler) – epsilon-greedy parameter
- temp (float) – temperature parameter for Boltzmann exploration
- beta (float) – count-based exploration parameter
Attributes:
- num_episodes (int) – number of episodes
- max_steps (int) – maximum number of steps per episode
- alpha (float) – learning step-size
- gamma (float) – discount factor
- init_q_value (float) – initial Q-value
- Q (np.ndarray) – numpy ndarray holding the Q-values for all state-action pairs
- exploration (str) – exploration strategy, see class attributes for available options
- epsilon (float or Scheduler) – epsilon-greedy parameter
- temp (float) – temperature parameter for Boltzmann exploration
- beta (float) – count-based exploration parameter
Class attributes (exploration strategies):
- BOLTZMANN = 'Boltzmann'
- COUNT_BASED = 'Count-Based'
- E_GREEDY = 'E-Greedy'
- NONE = 'None'
- UCB = 'UCB'
get_action(obs)
Returns the action for obs.

The action is chosen according to the exploration strategy of the TDLearning object. When a new exploration method is added, make sure it is registered in the class attributes and handled here.

Parameters:
- obs (list of int or np.ndarray) – observation

Returns: next action
Return type: int

Raises: NotImplementedError – please choose from the available exploration methods, see class attributes.
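As a rough illustration of two of the strategies listed above, the sketch below shows epsilon-greedy and Boltzmann selection over a vector of Q-values. It is a standalone sketch of the selection rules, not the library's get_action code, and the function names are hypothetical.

```python
import numpy as np

def e_greedy(q_values, epsilon=0.1):
    """Epsilon-greedy: random action with probability epsilon, greedy otherwise."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temp=0.5):
    """Boltzmann: sample an action with probability proportional to exp(Q / temp)."""
    prefs = np.exp((q_values - np.max(q_values)) / temp)  # subtract max for stability
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(q_values), p=probs))

q = np.array([0.1, 0.5, 0.4])
print(e_greedy(q, epsilon=0.1), boltzmann(q, temp=0.5))
```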
get_next_q_action(obs_tp1, done)
Returns the next Q-value and next action.

Note: Has to be implemented in a child class.

Parameters:
- obs_tp1 (list of int or np.ndarray) – next observation
- done (bool) – True when obs_tp1 is terminal

Raises: NotImplementedError – use a subclass of TDLearning like QLearning or SARSA.
get_q_map()
Returns the value map for the environment.

The value map is a numpy array matching the width and height of the environment. Each entry (state) holds the maximum action-value associated with that state.

Returns: value map
Return type: np.ndarray
get_q_value(obs, action)
Returns the Q-value for the obs-action pair.

Parameters:
- obs (list of int or np.ndarray) – observation
- action (int) – action

Returns: Q-value
Return type: float
get_q_values(obs)
Returns the Q-values and actions for observation obs.

Parameters:
- obs (list of int or np.ndarray) – observation

Returns:
- list of float – Q-values
- list of int – actions associated with each Q-value in q_values
reset()
Resets the Q-function.

The Q-function is set to the initial Q-value for every state-action pair.
take_step(obs, action, episode, render=False)
Takes a step and updates.

Action action is executed and the response is observed. The response is then used to update the value function. Data is stored in the Episode object.

Parameters:
- obs (list of int or np.ndarray) – observation
- action (int) – action
- episode (Episode) – data container for all the episode data
- render (bool) – True when rendering every time-step (Default value = False)

Returns:
- list of int or np.ndarray – next observation
- int – next action
- bool – done, True when obs_tp1 is a terminal state
- bool – render, True when rendering must continue
train(render=False, render_episode=False, print_results=True)
Runs a training experiment.

The training experiment runs for self.num_episodes and each episode takes a maximum of self.max_steps.

Parameters:
- render (bool) – True when rendering every time-step (Default value = False)
- render_episode (bool) – True when rendering every episode (Default value = False)
- print_results (bool) – True when printing results to console (Default value = True)

Returns: Summary object containing the training data
Return type: Summary
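A typical training call, based only on the constructor and train() signatures documented on this page, might look like the sketch below. The environment object and its construction are assumptions; it would come from the smartstart.environments package and is not built here, so this is a usage sketch rather than a runnable script.

```python
# Usage sketch using the documented QLearning constructor (TDLearning kwargs)
# and train() signature. `env` is a placeholder for a discrete smartstart
# Environment instance (assumption, not constructed here).
from smartstart.algorithms.qlearning import QLearning

env = ...  # any discrete smartstart Environment (assumed)

agent = QLearning(env,
                  num_episodes=500,               # documented TDLearning kwargs
                  max_steps=1000,
                  alpha=0.1,
                  gamma=0.99,
                  exploration=QLearning.E_GREEDY,  # inherited class attribute
                  epsilon=0.05)

summary = agent.train(render=False, render_episode=False, print_results=True)
# `summary` is the documented Summary object containing the training data.
```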
update_q_value(obs, action, reward, obs_tp1, done)
Updates the Q-value for the obs-action pair.

Updates the Q-value according to the Bellman equation.

Parameters:
- obs (list of int or np.ndarray) – observation
- action (int) – action
- reward (float) – reward
- obs_tp1 (list of int or np.ndarray) – next observation
- done (bool) – True when obs_tp1 is terminal

Returns: updated Q-value and next action
Return type: float
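The backup itself is the standard temporal-difference form Q(s, a) <- Q(s, a) + alpha * (r + gamma * Q_next - Q(s, a)), where Q_next is whatever get_next_q_action returns for obs_tp1 (the greedy maximum for Q-Learning, the on-policy value for SARSA). The standalone sketch below shows one such update; treating terminal states as contributing no future value is an assumption consistent with standard TD-learning, not a detail quoted from this page.

```python
def td_update(q, obs, action, reward, next_q, alpha=0.1, gamma=0.99, done=False):
    """One temporal-difference backup: Q(s, a) <- Q(s, a) + alpha * td_error."""
    target = reward if done else reward + gamma * next_q
    td_error = target - q[(obs, action)]
    q[(obs, action)] += alpha * td_error
    return q[(obs, action)]

q = {((0, 0), 1): 0.0}
# next_q would come from get_next_q_action: max_a Q(s', a) for Q-Learning,
# or Q(s', a') for the action SARSA actually selects.
print(td_update(q, (0, 0), 1, reward=1.0, next_q=0.5))   # ~0.1495
```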
class TDLearningLambda(env, lamb=0.75, threshold_traces=0.001, *args, **kwargs)
Bases: smartstart.algorithms.tdlearning.TDLearning

Base class for temporal-difference methods using eligibility traces.

Child class of TDLearning; the update_q_value and train methods are modified to use eligibility traces.

Parameters:
- env (Environment) – environment
- lamb (float) – eligibility traces decay parameter
- threshold_traces (float) – threshold for activation of a trace
- *args – see parent class TDLearning
- **kwargs – see parent class TDLearning
Attributes:
- lamb (float) – eligibility traces decay parameter
- threshold_traces (float) – threshold for activation of a trace
- traces (np.ndarray) – numpy ndarray holding the traces for each state-action pair
get_next_q_action(obs_tp1, done)
Returns the next Q-value and action.

Note: Has to be implemented in a child class.

Parameters:
- obs_tp1 (list of int or np.ndarray) – next observation
- done (bool) – True when obs_tp1 is terminal

Raises: NotImplementedError – use a subclass of TDLearningLambda like QLearningLambda or SARSALambda.
train(render=False, render_episode=False, print_results=True)
Runs a training experiment.

The training experiment runs for self.num_episodes and each episode takes a maximum of self.max_steps.

Parameters:
- render (bool) – True when rendering every time-step (Default value = False)
- render_episode (bool) – True when rendering every episode (Default value = False)
- print_results (bool) – True when printing results to console (Default value = True)

Returns: Summary object containing the training data
Return type: Summary
update_q_value(obs, action, reward, obs_tp1, done)
Updates the Q-value for the obs-action pair.

Updates the Q-value according to the Bellman equation, with eligibility traces included.

Parameters:
- obs (list of int or np.ndarray) – observation
- action (int) – action
- reward (float) – reward
- obs_tp1 (list of int or np.ndarray) – next observation
- done (bool) – True when obs_tp1 is terminal

Returns: updated Q-value and next action
Return type: float
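With eligibility traces, a single TD error is propagated to every state-action pair whose trace is still above threshold_traces, and all traces then decay by gamma * lamb. The standalone sketch below uses accumulating traces; whether the library uses accumulating or replacing traces is not stated on this page, so that choice is an assumption.

```python
def td_lambda_update(q, traces, obs, action, reward, next_q,
                     alpha=0.1, gamma=0.99, lamb=0.75,
                     threshold_traces=1e-3, done=False):
    """TD backup with accumulating eligibility traces (standalone sketch)."""
    target = reward if done else reward + gamma * next_q
    td_error = target - q.get((obs, action), 0.0)

    traces[(obs, action)] = traces.get((obs, action), 0.0) + 1.0  # bump current pair
    for sa, trace in list(traces.items()):
        if trace < threshold_traces:        # drop negligible traces
            del traces[sa]
            continue
        q[sa] = q.get(sa, 0.0) + alpha * td_error * trace  # share of the update
        traces[sa] = gamma * lamb * trace                  # decay trace
    return td_error

q, traces = {}, {}
td_lambda_update(q, traces, (0, 0), 1, reward=0.0, next_q=0.0)
td_lambda_update(q, traces, (0, 1), 0, reward=1.0, next_q=0.0)
print(q)   # the earlier (0, 0)-1 pair also received credit via its decayed trace
```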
smartstart.algorithms.valueiteration module
Value Iteration module
Describes the ValueIteration class.
See 'Reinforcement Learning: An Introduction' by Richard S. Sutton and Andrew G. Barto for more information.
class ValueIteration(env, gamma=0.99, min_error=1e-05, max_itr=1000)
Bases: object

Value Iteration method.

Value iteration is a dynamic programming method. It requires full knowledge of the environment, i.e. the transition model and reward function.

Note: This implementation only works with one goal (terminal) state.

Parameters:
- env (Environment) – environment
- gamma (float) – discount factor
- min_error (float) – minimum error for convergence of value iteration
- max_itr (int) – maximum number of iterations of value iteration
Attributes:
- env (Environment) – environment
- gamma (float) – discount factor
- min_error (float) – minimum error for convergence of value iteration
- max_itr (int) – maximum number of iterations of value iteration
- V (collections.defaultdict) – value function
- T (collections.defaultdict) – transition model
- R (collections.defaultdict) – reward function
- obses (set) – visited states
- goal (tuple) – goal state (terminal state)
add_obs(obs)
Adds the observation to the obses set.

Parameters:
- obs (list of int or np.ndarray) – observation
get_action(obs)
Returns the policy for a certain observation.

Chooses the action with the highest value. When multiple actions have the same value, a random action is chosen among them.

Parameters:
- obs (list of int or np.ndarray) – observation

Returns: action
Return type: int
get_transition(obs, action, obs_tp1=None)
Returns the transition probability of obs-action-obs_tp1.

Parameters:
- obs (list of int or np.ndarray) – observation
- action (int) – action
- obs_tp1 (list of int or np.ndarray) – next observation (Default value = None)

Returns: Transition probability
Return type: float
get_value_map()
Returns the value map for the environment.

The value map is a numpy array matching the width (w) and height (h) of the environment. Each entry (state) holds the value function associated with that state.

Returns: value map
Return type: numpy.ndarray
optimize()
Runs the value iteration method.

Runs value iteration until it has converged or the maximum number of iterations is reached. The method iterates through all visited states (self.obses).
reset()
Resets the internal state.

The following attributes are cleared:
- self.V – value function
- self.T – transition model
- self.R – reward function
- self.obses – visited observations
- self.goal – goal state
set_goal(obs)
Sets the goal state.

The goal state is added to the obses set and the value function for the goal state is set to zero.

Parameters:
- obs (list of int or np.ndarray) – observation
set_reward(obs, action, obs_tp1, value)
Sets the reward for obs-action-obs_tp1.

Parameters:
- obs (list of int or np.ndarray) – observation
- action (int) – action
- obs_tp1 (list of int or np.ndarray) – next observation
- value (float) – reward
set_transition(obs, action, obs_tp1, value)
Sets the transition probability of obs-action-obs_tp1.

Note: All transitions for an obs-action pair should add up to 1. This is not checked!

Parameters:
- obs (list of int or np.ndarray) – observation
- action (int) – action
- obs_tp1 (list of int or np.ndarray) – next observation
- value (float) – transition probability
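The whole procedure can be illustrated with a standalone sketch of the sweep optimize() is described to run: repeatedly back up each visited state from the transition model T and reward function R until the largest change drops below min_error or max_itr is reached, keeping the goal state's value at zero. The flat defaultdict key layout and the two-state chain below are illustrative assumptions, not the library's internal data layout.

```python
from collections import defaultdict

# Standalone sketch of the documented value-iteration sweep (not the library class).
gamma, min_error, max_itr = 0.99, 1e-5, 1000

V = defaultdict(float)     # value function
T = defaultdict(float)     # T[(obs, action, obs_tp1)] = transition probability (assumed layout)
R = defaultdict(float)     # R[(obs, action, obs_tp1)] = reward (assumed layout)
obses = {(0,), (1,), (2,)} # visited states; (2,) is the single goal (terminal) state
goal = (2,)

# Deterministic two-step chain: action 1 moves right, action 0 stays put.
for s, s_right in [((0,), (1,)), ((1,), (2,))]:
    T[(s, 1, s_right)] = 1.0
    T[(s, 0, s)] = 1.0
R[((1,), 1, (2,))] = 1.0   # reward for reaching the goal

for _ in range(max_itr):
    delta = 0.0
    for obs in obses:
        if obs == goal:
            continue       # the goal state's value stays zero, as set_goal describes
        best = max(
            sum(T[(obs, a, obs_tp1)] * (R[(obs, a, obs_tp1)] + gamma * V[obs_tp1])
                for obs_tp1 in obses)
            for a in (0, 1)
        )
        delta = max(delta, abs(best - V[obs]))
        V[obs] = best
    if delta < min_error:  # converged
        break

print(dict(V))   # converges to (0,): 0.99, (1,): 1.0, (2,): 0.0
```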