smartstart.algorithms package

Submodules

smartstart.algorithms.counter module

Counter module

Describes Counter base class for TD-learning algorithms

class Counter(env)

Bases: object

Base class for visitation counts.

Base class for keeping track of obs-action-obs_tp1 visitation counts in discrete-state, discrete-action reinforcement learning algorithms.

Parameters:env (Environment) – environment
env

Environment – environment

count_map

collections.defaultdict (nested) – visitation counts for each obs-action-obs_tp1

total

int – total number of visitation counts

get_count(obs, action=None, obs_tp1=None)

Returns visitation count

The visitation count for obs, obs-action or obs-action-obs_tp1 is returned. The user can leave action and/or obs_tp1 set to None to return a higher-level count.

Note

When action is None and obs_tp1 is not None, the count for just obs will be returned and obs_tp1 will not be taken into account.

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action (Default value = None)
  • obs_tp1 (list of int or np.ndarray) – next observation (Default value = None)
Returns:

Visitation count

Return type:

int
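
A minimal, self-contained sketch of the nested bookkeeping described above may help; it mimics the documented defaultdict layout and the three levels of get_count, but is not the package's actual implementation (observations are keyed as tuples here for hashability, which is an assumption):

    from collections import defaultdict

    # Nested count_map as described above: count_map[obs][action][obs_tp1] -> int.
    count_map = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    total = 0

    def increment(obs, action, obs_tp1):
        """Add one visit to the obs-action-obs_tp1 transition."""
        global total
        count_map[obs][action][obs_tp1] += 1
        total += 1

    def get_count(obs, action=None, obs_tp1=None):
        """Return the count at the requested level of detail."""
        if action is None:
            return sum(c for a in count_map[obs].values() for c in a.values())
        if obs_tp1 is None:
            return sum(count_map[obs][action].values())
        return count_map[obs][action][obs_tp1]

    increment((0, 0), 1, (0, 1))
    increment((0, 0), 1, (0, 1))
    increment((0, 0), 0, (1, 0))
    print(get_count((0, 0)))             # 3: all visits to obs
    print(get_count((0, 0), 1))          # 2: visits to the obs-action pair
    print(get_count((0, 0), 1, (0, 1)))  # 2: visits to the full transition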

get_count_map()

Returns state count map for environment.

The count-map will be a numpy array with shape equal to the width (w) and height (h) of the environment. Each entry (state) holds the count associated with that state.

Note

Only works for 2D-environments that have a w and h attribute.

Returns:Count map
Return type:np.ndarray
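
The sketch below illustrates how such a map could be assembled for a hypothetical 2D environment; the (x, y) state layout and the example counts are assumptions, while the w and h attributes are the ones mentioned in the note above:

    import numpy as np
    from collections import defaultdict

    # Sketch: build a w-by-h state count map, assuming a 2D environment whose
    # states are (x, y) tuples and whose size is given by the w and h attributes.
    w, h = 4, 3                       # stand-ins for env.w and env.h
    state_counts = defaultdict(int)   # per-state visit counts (assumed already filled)
    state_counts[(0, 0)] = 5
    state_counts[(2, 1)] = 2

    count_map = np.zeros((w, h), dtype=int)
    for (x, y), count in state_counts.items():
        count_map[x, y] = count
    print(count_map)
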
get_density(obs, action=None, obs_tp1=None)

Density for obs, obs-action or obs-action-obs_tp1 is returned.

The density is calculated by dividing the count by the total count.

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action (Default value = None)
  • obs_tp1 (list of int or np.ndarray) – next observation (Default value = None)
Returns:

Density

Return type:

float

get_density_map()

Returns state density map for environment

The density-map will be a numpy array with shape equal to the width (w) and height (h) of the environment. Each entry (state) holds the density associated with that state.

Note

Only works for 2D-environments that have a w and h attribute.

Returns:Density map
Return type:np.ndarray
increment(obs, action, obs_tp1)

Increment count for obs-action-obs_tp1 transition.

Parameters:
  • obs (list of int or np.ndarray) – current observation
  • action (int) – current action
  • obs_tp1 (list of int or np.ndarray) – next observation

smartstart.algorithms.qlearning module

Q-Learning module

Module defining classes for Q-Learning and Q(lambda).

See ‘Reinforcement Learning: An Introduction’ by Richard S. Sutton and Andrew G. Barto for more information.

class QLearning(env, *args, **kwargs)

Bases: smartstart.algorithms.tdlearning.TDLearning

Parameters:
  • env (Environment) – environment
  • *args – see parent class TDLearning
  • **kwargs – see parent class TDLearning
get_next_q_action(obs_tp1, done)

Off-policy action selection

Parameters:
  • obs_tp1 (list of int or np.ndarray) – Next observation
  • done (bool) – Boolean is True for terminal state
Returns:

  • float – Q-value for obs_tp1
  • int – action_tp1
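
A hedged sketch of what off-policy bootstrapping looks like for Q-Learning: the greedy action's value at obs_tp1 is used, and the value is taken to be 0 at a terminal state (the usual convention; the actual return values may differ). The q_values argument stands for the list returned by get_q_values(obs_tp1):

    import numpy as np

    def off_policy_next_q(q_values, done):
        """Off-policy (Q-Learning) bootstrapping sketch: use the greedy
        action's value at the next observation, or 0 at a terminal state."""
        if done:
            return 0.0, None                  # no bootstrapping past a terminal state
        action_tp1 = int(np.argmax(q_values))
        return q_values[action_tp1], action_tp1

    print(off_policy_next_q([0.1, 0.5, 0.3], done=False))  # (0.5, 1)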

class QLearningLambda(env, *args, **kwargs)

Bases: smartstart.algorithms.tdlearning.TDLearningLambda

Note

Does not work properly: because Q-Learning is off-policy, standard eligibility traces might fail.

Parameters:
  • env (Environment) – environment
  • *args – see parent TDLearningLambda
  • **kwargs – see parent TDLearningLambda
get_next_q_action(obs_tp1, done)

Off-policy action selection

Parameters:
  • obs_tp1 (list of int or np.ndarray) – Next observation
  • done (bool) – Boolean is True for terminal state
Returns:

  • float – Q-value for obs_tp1
  • int – action_tp1

smartstart.algorithms.sarsa module

SARSA module

Module defining classes for SARSA and SARSA(lambda).

See ‘Reinforcement Learning: An Introduction’ by Richard S. Sutton and Andrew G. Barto for more information.

class SARSA(env, *args, **kwargs)

Bases: smartstart.algorithms.tdlearning.TDLearning

Parameters:
  • env (Environment) – environment
  • *args – see parent class TDLearning
  • **kwargs – see parent class TDLearning
get_next_q_action(obs_tp1, done)

On-policy action selection

Parameters:
  • obs_tp1 (list of int or np.ndarray) – Next observation
  • done (bool) – Boolean is True for terminal state
Returns:

  • float – Q-value for obs_tp1
  • int – action_tp1
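
For contrast, an on-policy sketch: SARSA selects the next action with the same behaviour policy it acts with (epsilon-greedy here as an illustrative assumption) and bootstraps on that action's value:

    import random

    def on_policy_next_q(q_values, done, epsilon=0.1):
        """On-policy (SARSA) bootstrapping sketch: choose action_tp1 with the
        behaviour policy and return its Q-value."""
        if done:
            return 0.0, None
        if random.random() < epsilon:
            action_tp1 = random.randrange(len(q_values))                      # explore
        else:
            action_tp1 = max(range(len(q_values)), key=q_values.__getitem__)  # exploit
        return q_values[action_tp1], action_tp1

    print(on_policy_next_q([0.1, 0.5, 0.3], done=False))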

class SARSALambda(env, *args, **kwargs)

Bases: smartstart.algorithms.tdlearning.TDLearningLambda

Parameters:
  • env (Environment) – environment
  • *args – see parent TDLearningLambda
  • **kwargs – see parent TDLearningLambda
get_next_q_action(obs_tp1, done)

On-policy action selection

Parameters:
  • obs_tp1 (list of int or np.ndarray) – Next observation
  • done (bool) – Boolean is True for terminal state
Returns:

  • float – Q-value for obs_tp1
  • int – action_tp1

smartstart.algorithms.tdlearning module

Temporal-Difference module

Describes TDLearning and TDLearningLambda base classes for temporal difference learning without and with eligibility traces.

See ‘Reinforcement Learning: An Introduction’ by Richard S. Sutton and Andrew G. Barto for more information.

class TDLearning(env, num_episodes=1000, max_steps=1000, alpha=0.1, gamma=0.99, init_q_value=0.0, exploration='E-Greedy', epsilon=0.1, temp=0.5, beta=1.0)

Bases: smartstart.algorithms.counter.Counter

Base class for temporal-difference methods

Base class for the temporal-difference methods Q-Learning, SARSA, SARSA(lambda) and Q(lambda). It implements all common methods; methods specific to each algorithm have to be implemented in a child class.

All exploration methods are defined in this class. A new one can be added by writing a method that implements the exploration strategy, registering it in the class attributes below and dispatching to it in the get_action method.

Currently five exploration methods are implemented: no exploration, epsilon-greedy, Boltzmann, count-based and UCB. Optimism in the face of uncertainty can also be used by setting init_q_value > 0.

Parameters:
  • env (Environment) – environment
  • num_episodes (int) – number of episodes
  • max_steps (int) – maximum number of steps per episode
  • alpha (float) – learning step-size
  • gamma (float) – discount factor
  • init_q_value (float) – initial q-value
  • exploration (str) – exploration strategy, see class attributes for available options
  • epsilon (float or Scheduler) – epsilon-greedy parameter
  • temp (float) – temperature parameter for Boltzmann exploration
  • beta (float) – count-based exploration parameter
num_episodes

int – number of episodes

max_steps

int – maximum number of steps per episode

alpha

float – learning step-size

gamma

float – discount factor

init_q_value

float – initial q-value

Q

np.ndarray – Numpy ndarray holding the q-values for all state-action pairs

exploration

str – exploration strategy, see class attributes for available options

epsilon

float or Scheduler – epsilon-greedy parameter

temp

float – temperature parameter for Boltzmann exploration

beta

float – count-based exploration parameter

BOLTZMANN = 'Boltzmann'
COUNT_BASED = 'Count-Based'
E_GREEDY = 'E-Greedy'
NONE = 'None'
UCB = 'UCB'
get_action(obs)

Returns action for obs

Returns an action according to the exploration strategy of the TDLearning object.

When a new exploration method is added, make sure it is registered in the class attributes and handled in this method.

Parameters:obs (list of int or np.ndarray) – observation
Returns:next action
Return type:int
Raises:NotImplementedError – Please choose from the available exploration methods, see class attributes.
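
As an illustration of the dispatch described above, here is a hedged, self-contained sketch covering two of the documented strategies (E-Greedy and Boltzmann); the count-based and UCB branches, and the package's actual tie-breaking, are omitted:

    import numpy as np

    rng = np.random.default_rng()

    def choose_action(q_values, exploration='E-Greedy', epsilon=0.1, temp=0.5):
        """Exploration dispatch sketch; q_values holds the action values for
        the current observation."""
        q_values = np.asarray(q_values, dtype=float)
        if exploration == 'None':
            return int(np.argmax(q_values))              # pure greedy
        if exploration == 'E-Greedy':
            if rng.random() < epsilon:
                return int(rng.integers(len(q_values)))  # random exploration
            return int(np.argmax(q_values))
        if exploration == 'Boltzmann':
            logits = q_values / temp
            probs = np.exp(logits - logits.max())        # softmax over q-values
            probs /= probs.sum()
            return int(rng.choice(len(q_values), p=probs))
        raise NotImplementedError(
            "Please choose from the available exploration methods, see class attributes.")

    print(choose_action([0.1, 0.5, 0.3], exploration='Boltzmann'))
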
get_next_q_action(obs_tp1, done)

Returns next Q-Value and next action

Note

Has to be implemented in child class.

Parameters:
  • obs_tp1 (list of int or np.ndarray) – next observation
  • done (bool) – True when obs_tp1 is terminal
Raises:

NotImplementedError – use a subclass of TDLearning like QLearning or SARSA.

get_q_map()

Returns value map for environment

The value-map will be a numpy array with shape equal to the width (w) and height (h) of the environment. Each entry (state) holds the maximum action-value associated with that state.

Returns:value map
Return type:np.ndarray
get_q_value(obs, action)

Returns Q-value for obs-action pair

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
Returns:

Q-Value

Return type:

float

get_q_values(obs)

Returns Q-values and actions for observation obs

Parameters:obs (list of int or np.ndarray) – observation
Returns:
  • list of float – Q-values
  • list of int – actions associated with each q-value in q_values
reset()

Resets Q-function

The Q-function is set to the initial q-value for every state-action pair.

take_step(obs, action, episode, render=False)

Takes a step and updates the value function

The given action is executed and the response is observed. The response is then used to update the value function. Data is stored in the Episode object.

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • episode (Episode) – Data container for all the episode data
  • render (bool) – True when rendering every time-step (Default value = False)
Returns:

  • list of int or np.ndarray – next observation
  • int – next action
  • bool – done, True when obs_tp1 is terminal state
  • bool – render, True when rendering must continue

train(render=False, render_episode=False, print_results=True)

Runs a training experiment

The training experiment runs for self.num_episodes episodes and each episode takes at most self.max_steps steps.

Parameters:
  • render (bool) – True when rendering every time-step (Default value = False)
  • render_episode (bool) – True when rendering every episode (Default value = False)
  • print_results (bool) – True when printing results to console (Default value = True)
Returns:

Summary Object containing the training data

Return type:

Summary
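
The shape of that loop, sketched at pseudocode level under several assumptions: agent stands for any TDLearning subclass, env.reset() is assumed to exist with a gym-like interface, and a plain list stands in for the package's Episode and Summary containers:

    def train_sketch(agent, render=False):
        """Schematic of the training loop described above; not the package's code."""
        all_episodes = []
        for _ in range(agent.num_episodes):
            obs = agent.env.reset()          # assumed gym-like reset (not documented here)
            action = agent.get_action(obs)
            episode = []                     # stand-in for the Episode data container
            for _ in range(agent.max_steps):
                obs, action, done, render = agent.take_step(obs, action, episode, render)
                if done:
                    break
            all_episodes.append(episode)
        return all_episodes                  # stand-in for the Summary object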

update_q_value(obs, action, reward, obs_tp1, done)

Update Q-value for obs-action pair

Updates Q-value according to the Bellman equation.

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • reward (float) – reward
  • obs_tp1 (list of int or np.ndarray) – next observation
  • done (bool) – True when obs_tp1 is terminal
Returns:

updated Q-value and next action

Return type:

float
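
A worked one-step tabular TD update matching the description above; treating the target as just the reward at a terminal state is the usual convention and an assumption here:

    def td_update(q, alpha, reward, gamma, next_q_value, done):
        """One tabular TD update: move q toward reward + gamma * next_q_value."""
        target = reward if done else reward + gamma * next_q_value
        return q + alpha * (target - q)

    # 0 + 0.1 * (1.0 + 0.99 * 0.5 - 0) = 0.1495
    print(td_update(q=0.0, alpha=0.1, reward=1.0, gamma=0.99, next_q_value=0.5, done=False))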

class TDLearningLambda(env, lamb=0.75, threshold_traces=0.001, *args, **kwargs)

Bases: smartstart.algorithms.tdlearning.TDLearning

Base class for temporal difference methods using eligibility traces

Child class of TDLearning, update_q_value and train method are modified for using eligibility traces.

Parameters:
  • env (Environment) – environment
  • lamb (float) – eligibility traces decay parameter
  • threshold_traces (float) – threshold for activation of trace
  • *args – see parent class TDLearning
  • **kwargs – see parent class TDLearning
lamb

float – eligibility traces decay parameter

threshold_traces

float – threshold for activation of trace

traces

np.ndarray – numpy ndarray holding the traces for each state-action pair

get_next_q_action(obs_tp1, done)

Returns next Q-Value and action

Note

Has to be implemented in child class.

Parameters:
  • obs_tp1 (list of int or np.ndarray) – next observation
  • done (bool) – True when obs_tp1 is terminal
Raises:

NotImplementedError – use a subclass of TDLearning like QLearningLambda or SARSALambda.

train(render=False, render_episode=False, print_results=True)

Runs a training experiment

The training experiment runs for self.num_episodes episodes and each episode takes at most self.max_steps steps.

Parameters:
  • render (bool) – True when rendering every time-step (Default value = False)
  • render_episode (bool) – True when rendering every episode (Default value = False)
  • print_results (bool) – True when printing results to console (Default value = True)
Returns:

Summary Object containing the training data

Return type:

Summary

update_q_value(obs, action, reward, obs_tp1, done)

Update Q-value for obs-action pair

Updates Q-value according to the Bellman equation with eligibility traces included.

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • reward (float) – reward
  • obs_tp1 (list of int or np.ndarray) – next observation
  • done (bool) – True when obs_tp1 is terminal
Returns:

updated Q-value and next action

Return type:

float
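
A hedged sketch of a tabular update with eligibility traces, using replacing traces and the documented threshold_traces cutoff; the package may use accumulating traces or a different pruning rule:

    import numpy as np

    def td_lambda_update(Q, traces, idx, td_error, alpha, gamma, lamb,
                         threshold_traces=0.001):
        """Update all state-action pairs whose trace exceeds the threshold,
        then decay the traces. idx indexes the pair that was just visited."""
        traces[idx] = 1.0                     # replacing trace for the visited pair
        active = traces >= threshold_traces   # traces below the threshold are ignored
        Q[active] += alpha * td_error * traces[active]
        traces *= gamma * lamb                # decay all traces
        return Q, traces

    Q = np.zeros(6)
    traces = np.zeros(6)
    Q, traces = td_lambda_update(Q, traces, idx=2, td_error=1.0,
                                 alpha=0.1, gamma=0.99, lamb=0.75)
    print(Q, traces)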

smartstart.algorithms.valueiteration module

Value Iteration module

Describes ValueIteration class.

See ‘Reinforcement Learning: An Introduction’ by Richard S. Sutton and Andrew G. Barto for more information.

class ValueIteration(env, gamma=0.99, min_error=1e-05, max_itr=1000)

Bases: object

Value Iteration method

Value iteration is a dynamic programming method. It requires full knowledge of the environment, i.e. the transition model and the reward function.

Note

This implementation only works with a single goal (terminal) state.

Parameters:
  • env (Environment) – environment
  • gamma (float) – discount factor
  • min_error (float) – minimum error for convergence of value iteration
  • max_itr (int) – maximum number of iterations of value iteration
env

Environment – environment

gamma

float – discount factor

min_error

float – minimum error for convergence of value iteration

max_itr

int – maximum number of iterations of value iteration

V

collections.defaultdict – value function

T

collections.defaultdict – transition model

R

collections.defaultdict – reward function

obses

set – visited states

goal

tuple – goal state (terminal state)

add_obs(obs)

Adds observation to the obses set

Parameters:obs (list of int or np.ndarray) – observation
get_action(obs)

Returns the policy for a certain observation

Chooses the action with the highest value. When multiple actions have the same value, a random action is chosen among them.

Parameters:obs (list of int or np.ndarray) – observation
Returns:action
Return type:int
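
A small sketch of the tie-breaking rule described above:

    import random

    def greedy_action_with_ties(action_values):
        """Pick uniformly at random among all actions whose value equals the maximum."""
        best = max(action_values)
        best_actions = [a for a, v in enumerate(action_values) if v == best]
        return random.choice(best_actions)

    print(greedy_action_with_ties([1.0, 2.5, 2.5, 0.0]))  # prints 1 or 2
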
get_transition(obs, action, obs_tp1=None)

Returns transition probability of obs-action-obs_tp1

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • obs_tp1 (list of int or np.ndarray) – next observation (Default value = None)
Returns:

Transition probability

Return type:

float

get_value_map()

Returns value map for environment

The value-map will be a numpy array with shape equal to the width (w) and height (h) of the environment. Each entry (state) holds the value function associated with that state.

Returns:value map
Return type:numpy.ndarray
optimize()

Run value iteration method

Runs value iteration until convergence or until the maximum number of iterations is reached. The method iterates over all visited states (self.obses).
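
A self-contained sketch of such a sweep under assumed nested-dict containers (T[s][a] maps next states to probabilities, R[s][a][s2] is a reward); it is not the package's implementation, and it keeps the goal state fixed at value 0 as described for set_goal:

    def value_iteration_sketch(obses, actions, T, R, V, goal=None,
                               gamma=0.99, min_error=1e-5, max_itr=1000):
        """Run Bellman backups over the visited states until the largest change
        drops below min_error or max_itr sweeps have been done."""
        for _ in range(max_itr):
            max_delta = 0.0
            for s in obses:
                if s == goal:
                    continue                          # goal (terminal) state keeps value 0
                backups = []
                for a in actions:
                    transitions = T[s].get(a, {})
                    if transitions:
                        backups.append(sum(p * (R[s][a][s2] + gamma * V[s2])
                                           for s2, p in transitions.items()))
                if backups:
                    new_v = max(backups)
                    max_delta = max(max_delta, abs(new_v - V[s]))
                    V[s] = new_v
            if max_delta < min_error:                 # converged
                break
        return V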

reset()

Reset internal state

The following attributes are cleared:
  • self.V: Value function
  • self.T: Transition model
  • self.R: Reward function
  • self.obses: Visited observations
  • self.goal: Goal state

set_goal(obs)

Set goal state

The goal state is added to the obses set and its value function is set to zero.

Parameters:obs (list of int or np.ndarray) – observation
set_reward(obs, action, obs_tp1, value)

Set reward for obs-action-obs_tp1

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • obs_tp1 (list of int or np.ndarray) – next observation
  • value (float) – reward
set_transition(obs, action, obs_tp1, value)

Set transition of obs-action-obs_tp1

Note

All transitions for an obs-action pair should add up to 1. This is not checked!

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • obs_tp1 (list of int or np.ndarray) – next observation
  • value (float) – transition probability
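
Putting the documented ValueIteration API together, a hypothetical end-to-end usage on a tiny two-state chain; env stands for any environment instance accepted by the constructor and is not constructed here:

    from smartstart.algorithms.valueiteration import ValueIteration

    def solve_two_state_chain(env):
        """Hypothetical usage sketch of the documented API."""
        vi = ValueIteration(env, gamma=0.99, min_error=1e-05, max_itr=1000)
        s0, s1, a = (0,), (1,), 0
        vi.add_obs(s0)
        vi.set_goal(s1)                     # terminal state, value fixed at 0
        vi.set_transition(s0, a, s1, 1.0)   # transitions for (s0, a) must sum to 1
        vi.set_reward(s0, a, s1, 1.0)
        vi.optimize()                       # run value iteration until convergence
        return vi.get_action(s0)            # greedy action in s0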

Module contents