smartstart.algorithms package

Submodules

smartstart.algorithms.counter module

Counter module

Describes Counter base class for TD-learning algorithms

class Counter(env)

Bases: object

Base class for visitation counts.

Base class for keeping track of obs-action-obs_tp1 visitation counts in discrete-state, discrete-action reinforcement learning algorithms.

Parameters:env (Environment) – environment
env

Environment – environment

count_map

collections.defaultdict (nested) – visitation counts for each obs-action-obs_tp1

total

int – total number of visitation counts

get_count(obs, action=None, obs_tp1=None)

Returns visitation count

The visitation count for obs, obs-action or obs-action-obs_tp1 is returned. The user can leave action and/or obs_tp1 set to None to return a higher-level count.

Note

When action is None and obs_tp1 is not None, the count for just obs will be returned and obs_tp1 will not be taken into account.

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action (Default value = None)
  • obs_tp1 (list of int or np.ndarray) – next observation (Default value = None)
Returns:

Visitation count

Return type:

int
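
A minimal, self-contained sketch of the nested bookkeeping described above may help; it mimics the documented defaultdict layout and the three levels of get_count, but is not the package's actual implementation (observations are keyed as tuples here for hashability, which is an assumption):

    from collections import defaultdict

    # Nested count_map as described above: count_map[obs][action][obs_tp1] -> int.
    count_map = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    total = 0

    def increment(obs, action, obs_tp1):
        """Add one visit to the obs-action-obs_tp1 transition."""
        global total
        count_map[obs][action][obs_tp1] += 1
        total += 1

    def get_count(obs, action=None, obs_tp1=None):
        """Return the count at the requested level of detail."""
        if action is None:
            return sum(c for a in count_map[obs].values() for c in a.values())
        if obs_tp1 is None:
            return sum(count_map[obs][action].values())
        return count_map[obs][action][obs_tp1]

    increment((0, 0), 1, (0, 1))
    increment((0, 0), 1, (0, 1))
    increment((0, 0), 0, (1, 0))
    print(get_count((0, 0)))             # 3: all visits to obs
    print(get_count((0, 0), 1))          # 2: visits to the obs-action pair
    print(get_count((0, 0), 1, (0, 1)))  # 2: visits to the full transition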

get_count_map()

Returns state count map for environment.

The count-map will be a numpy array with shape equal to the width (w) and height (h) of the environment. Each entry (state) holds the count associated with that state.

Note

Only works for 2D-environments that have a w and h attribute.

Returns:Count map
Return type:np.ndarray
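
The sketch below illustrates how such a map could be assembled for a hypothetical 2D environment; the (x, y) state layout and the example counts are assumptions, while the w and h attributes are the ones mentioned in the note above:

    import numpy as np
    from collections import defaultdict

    # Sketch: build a w-by-h state count map, assuming a 2D environment whose
    # states are (x, y) tuples and whose size is given by the w and h attributes.
    w, h = 4, 3                       # stand-ins for env.w and env.h
    state_counts = defaultdict(int)   # per-state visit counts (assumed already filled)
    state_counts[(0, 0)] = 5
    state_counts[(2, 1)] = 2

    count_map = np.zeros((w, h), dtype=int)
    for (x, y), count in state_counts.items():
        count_map[x, y] = count
    print(count_map)
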
get_density(obs, action=None, obs_tp1=None)

Density for obs, obs-action or obs-action-obs_tp1 is returned.

The density is calculated by dividing the count by the total count.

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action (Default value = None)
  • obs_tp1 (list of int or np.ndarray) – next observation (Default value = None)
Returns:

Density

Return type:

float

get_density_map()

Returns state density map for environment

The density-map will be a numpy array with shape equal to the width (w) and height (h) of the environment. Each entry (state) holds the density associated with that state.

Note

Only works for 2D-environments that have a w and h attribute.

Returns:Density map
Return type:np.ndarray
increment(obs, action, obs_tp1)

Increment count for obs-action-obs_tp1 transition.

Parameters:
  • obs (list of int or np.ndarray) – current observation
  • action (int) – current action
  • obs_tp1 (list of int or np.ndarray) – next observation

smartstart.algorithms.qlearning module

Q-Learning module

Module defining classes for Q-Learning and Q(lambda).

See ‘Reinforcement Learning: An Introduction’ by Richard S. Sutton and Andrew G. Barto for more information.

class QLearning(env, *args, **kwargs)

Bases: smartstart.algorithms.tdlearning.TDLearning

Parameters:
  • env (Environment) – environment
  • *args – see parent class TDLearning
  • **kwargs – see parent class TDLearning
get_next_q_action(obs_tp1, done)

Off-policy action selection

Parameters:
  • obs_tp1 (list of int or np.ndarray) – Next observation
  • done (bool) – Boolean is True for terminal state
Returns:

  • float – Q-value for obs_tp1
  • int – action_tp1
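
A hedged sketch of what off-policy bootstrapping looks like for Q-Learning: the greedy action's value at obs_tp1 is used, and the value is taken to be 0 at a terminal state (the usual convention; the actual return values may differ). The q_values argument stands for the list returned by get_q_values(obs_tp1):

    import numpy as np

    def off_policy_next_q(q_values, done):
        """Off-policy (Q-Learning) bootstrapping sketch: use the greedy
        action's value at the next observation, or 0 at a terminal state."""
        if done:
            return 0.0, None                  # no bootstrapping past a terminal state
        action_tp1 = int(np.argmax(q_values))
        return q_values[action_tp1], action_tp1

    print(off_policy_next_q([0.1, 0.5, 0.3], done=False))  # (0.5, 1)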

class QLearningLambda(env, *args, **kwargs)

Bases: smartstart.algorithms.tdlearning.TDLearningLambda

Note

Does not work properly: because Q-Learning is off-policy, standard eligibility traces might fail.

Parameters:
  • env (Environment) – environment
  • *args – see parent TDLearningLambda
  • **kwargs – see parent TDLearningLambda
get_next_q_action(obs_tp1, done)

Off-policy action selection

Parameters:
  • obs_tp1 (list of int or np.ndarray) – Next observation
  • done (bool) – Boolean is True for terminal state
Returns:

  • float – Q-value for obs_tp1
  • int – action_tp1

smartstart.algorithms.sarsa module

SARSA module

Module defining classes for SARSA and SARSA(lambda).

See ‘Reinforcement Learning: An Introduction’ by Richard S. Sutton and Andrew G. Barto for more information.

class SARSA(env, *args, **kwargs)

Bases: smartstart.algorithms.tdlearning.TDLearning

Parameters:
  • env (Environment) – environment
  • *args – see parent class TDLearning
  • **kwargs – see parent class TDLearning
get_next_q_action(obs_tp1, done)

On-policy action selection

Parameters:
  • obs_tp1 (list of int or np.ndarray) – Next observation
  • done (bool) – Boolean is True for terminal state
Returns:

  • float – Q-value for obs_tp1
  • int – action_tp1
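
For contrast, an on-policy sketch: SARSA selects the next action with the same behaviour policy it acts with (epsilon-greedy here as an illustrative assumption) and bootstraps on that action's value:

    import random

    def on_policy_next_q(q_values, done, epsilon=0.1):
        """On-policy (SARSA) bootstrapping sketch: choose action_tp1 with the
        behaviour policy and return its Q-value."""
        if done:
            return 0.0, None
        if random.random() < epsilon:
            action_tp1 = random.randrange(len(q_values))                      # explore
        else:
            action_tp1 = max(range(len(q_values)), key=q_values.__getitem__)  # exploit
        return q_values[action_tp1], action_tp1

    print(on_policy_next_q([0.1, 0.5, 0.3], done=False))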

class SARSALambda(env, *args, **kwargs)

Bases: smartstart.algorithms.tdlearning.TDLearningLambda

Parameters:
  • env (Environment) – environment
  • *args – see parent TDLearningLambda
  • **kwargs – see parent TDLearningLambda
get_next_q_action(obs_tp1, done)

On-policy action selection

Parameters:
  • obs_tp1 (list of int or np.ndarray) – Next observation
  • done (bool) – Boolean is True for terminal state
Returns:

  • float – Q-value for obs_tp1
  • int – action_tp1

smartstart.algorithms.tdlearning module

Temporal-Difference module

Describes TDLearning and TDLearningLambda base classes for temporal difference learning without and with eligibility traces.

See ‘Reinforcement Learning: An Introduction’ by Richard S. Sutton and Andrew G. Barto for more information.

class TDLearning(env, num_episodes=1000, max_steps=1000, alpha=0.1, gamma=0.99, init_q_value=0.0, exploration='E-Greedy', epsilon=0.1, temp=0.5, beta=1.0)

Bases: smartstart.algorithms.counter.Counter

Base class for temporal-difference methods

Base class for the temporal-difference methods Q-Learning, SARSA, SARSA(lambda) and Q(lambda). It implements all common methods; methods specific to each algorithm have to be implemented in a child class.

All exploration methods are defined in this class. A new one can be added by writing a method that implements the exploration strategy, registering it in the class attributes below and dispatching to it in the get_action method.

Currently five exploration methods are implemented: no exploration, epsilon-greedy, Boltzmann, count-based and UCB. Optimism in the face of uncertainty can also be used by setting init_q_value > 0.

Parameters:
  • env (Environment) – environment
  • num_episodes (int) – number of episodes
  • max_steps (int) – maximum number of steps per episode
  • alpha (float) – learning step-size
  • gamma (float) – discount factor
  • init_q_value (float) – initial q-value
  • exploration (str) – exploration strategy, see class attributes for available options
  • epsilon (float or Scheduler) – epsilon-greedy parameter
  • temp (float) – temperature parameter for Boltzmann exploration
  • beta (float) – count-based exploration parameter
num_episodes

int – number of episodes

max_steps

int – maximum number of steps per episode

alpha

float – learning step-size

gamma

float – discount factor

init_q_value

float – initial q-value

Q

np.ndarray – Numpy ndarray holding the q-values for all state-action pairs

exploration

str – exploration strategy, see class attributes for available options

epsilon

float or Scheduler – epsilon-greedy parameter

temp

float – temperature parameter for Boltzmann exploration

beta

float – count-based exploration parameter

BOLTZMANN = 'Boltzmann'
COUNT_BASED = 'Count-Based'
E_GREEDY = 'E-Greedy'
NONE = 'None'
UCB = 'UCB'
get_action(obs)

Returns action for obs

Returns an action according to the exploration strategy of the TDLearning object.

When a new exploration method is added, make sure it is registered in the class attributes and handled in this method.

Parameters:obs (list of int or np.ndarray) – observation
Returns:next action
Return type:int
Raises:NotImplementedError – Please choose from the available exploration methods, see class attributes.
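
As an illustration of the dispatch described above, here is a hedged, self-contained sketch covering two of the documented strategies (E-Greedy and Boltzmann); the count-based and UCB branches, and the package's actual tie-breaking, are omitted:

    import numpy as np

    rng = np.random.default_rng()

    def choose_action(q_values, exploration='E-Greedy', epsilon=0.1, temp=0.5):
        """Exploration dispatch sketch; q_values holds the action values for
        the current observation."""
        q_values = np.asarray(q_values, dtype=float)
        if exploration == 'None':
            return int(np.argmax(q_values))              # pure greedy
        if exploration == 'E-Greedy':
            if rng.random() < epsilon:
                return int(rng.integers(len(q_values)))  # random exploration
            return int(np.argmax(q_values))
        if exploration == 'Boltzmann':
            logits = q_values / temp
            probs = np.exp(logits - logits.max())        # softmax over q-values
            probs /= probs.sum()
            return int(rng.choice(len(q_values), p=probs))
        raise NotImplementedError(
            "Please choose from the available exploration methods, see class attributes.")

    print(choose_action([0.1, 0.5, 0.3], exploration='Boltzmann'))
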
get_next_q_action(obs_tp1, done)

Returns next Q-Value and next action

Note

Has to be implemented in child class.

Parameters:
  • obs_tp1 (list of int or np.ndarray) – next observation
  • done (bool) – True when obs_tp1 is terminal
Raises:

NotImplementedError – use a subclass of TDLearning like QLearning or SARSA.

get_q_map()

Returns value map for environment

The value-map will be a numpy array with shape equal to the width (w) and height (h) of the environment. Each entry (state) holds the maximum action-value associated with that state.

Returns:value map
Return type:np.ndarray
get_q_value(obs, action)

Returns Q-value for obs-action pair

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
Returns:

Q-Value

Return type:

float

get_q_values(obs)

Returns Q-values and actions for observation obs

Parameters:obs (list of int or np.ndarray) – observation
Returns:
  • list of float – Q-values
  • list of int – actions associated with each q-value in q_values
reset()

Resets Q-function

The Q-function is set to the initial q-value for every state-action pair.

take_step(obs, action, episode, render=False)

Takes a step and updates the value function

The given action is executed and the response is observed. The response is then used to update the value function. Data is stored in the Episode object.

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • episode (Episode) – Data container for all the episode data
  • render (bool) – True when rendering every time-step (Default value = False)
Returns:

  • list of int or np.ndarray – next observation
  • int – next action
  • bool – done, True when obs_tp1 is terminal state
  • bool – render, True when rendering must continue

train(render=False, render_episode=False, print_results=True)

Runs a training experiment

The training experiment runs for self.num_episodes episodes and each episode takes at most self.max_steps steps.

Parameters:
  • render (bool) – True when rendering every time-step (Default value = False)
  • render_episode (bool) – True when rendering every episode (Default value = False)
  • print_results (bool) – True when printing results to console (Default value = True)
Returns:

Summary Object containing the training data

Return type:

Summary
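
The shape of that loop, sketched at pseudocode level under several assumptions: agent stands for any TDLearning subclass, env.reset() is assumed to exist with a gym-like interface, and a plain list stands in for the package's Episode and Summary containers:

    def train_sketch(agent, render=False):
        """Schematic of the training loop described above; not the package's code."""
        all_episodes = []
        for _ in range(agent.num_episodes):
            obs = agent.env.reset()          # assumed gym-like reset (not documented here)
            action = agent.get_action(obs)
            episode = []                     # stand-in for the Episode data container
            for _ in range(agent.max_steps):
                obs, action, done, render = agent.take_step(obs, action, episode, render)
                if done:
                    break
            all_episodes.append(episode)
        return all_episodes                  # stand-in for the Summary object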

update_q_value(obs, action, reward, obs_tp1, done)

Update Q-value for obs-action pair

Updates Q-value according to the Bellman equation.

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • reward (float) – reward
  • obs_tp1 (list of int or np.ndarray) – next observation
  • done (bool) – True when obs_tp1 is terminal
Returns:

updated Q-value and next action

Return type:

float
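
A worked one-step tabular TD update matching the description above; treating the target as just the reward at a terminal state is the usual convention and an assumption here:

    def td_update(q, alpha, reward, gamma, next_q_value, done):
        """One tabular TD update: move q toward reward + gamma * next_q_value."""
        target = reward if done else reward + gamma * next_q_value
        return q + alpha * (target - q)

    # 0 + 0.1 * (1.0 + 0.99 * 0.5 - 0) = 0.1495
    print(td_update(q=0.0, alpha=0.1, reward=1.0, gamma=0.99, next_q_value=0.5, done=False))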

class TDLearningLambda(env, lamb=0.75, threshold_traces=0.001, *args, **kwargs)

Bases: smartstart.algorithms.tdlearning.TDLearning

Base class for temporal difference methods using eligibility traces

Child class of TDLearning, update_q_value and train method are modified for using eligibility traces.

Parameters:
  • env (Environment) – environment
  • lamb (float) – eligibility traces decay parameter
  • threshold_traces (float) – threshold for activation of trace
  • *args – see parent class TDLearning
  • **kwargs – see parent class TDLearning
lamb

float – eligibility traces decay parameter

threshold_traces

float – threshold for activation of trace

traces

np.ndarray – numpy ndarray holding the traces for each state-action pair

get_next_q_action(obs_tp1, done)

Returns next Q-Value and action

Note

Has to be implemented in child class.

Parameters:
  • obs_tp1 (list of int or np.ndarray) – next observation
  • done (bool) – True when obs_tp1 is terminal
Raises:

NotImplementedError – use a subclass of TDLearning like QLearningLambda or SARSALambda.

train(render=False, render_episode=False, print_results=True)

Runs a training experiment

The training experiment runs for self.num_episodes episodes and each episode takes at most self.max_steps steps.

Parameters:
  • render (bool) – True when rendering every time-step (Default value = False)
  • render_episode (bool) – True when rendering every episode (Default value = False)
  • print_results (bool) – True when printing results to console (Default value = True)
Returns:

Summary Object containing the training data

Return type:

Summary

update_q_value(obs, action, reward, obs_tp1, done)

Update Q-value for obs-action pair

Updates Q-value according to the Bellman equation with eligibility traces included.

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • reward (float) – reward
  • obs_tp1 (list of int or np.ndarray) – next observation
  • done (bool) – True when obs_tp1 is terminal
Returns:

updated Q-value and next action

Return type:

float
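
A hedged sketch of a tabular update with eligibility traces, using replacing traces and the documented threshold_traces cutoff; the package may use accumulating traces or a different pruning rule:

    import numpy as np

    def td_lambda_update(Q, traces, idx, td_error, alpha, gamma, lamb,
                         threshold_traces=0.001):
        """Update all state-action pairs whose trace exceeds the threshold,
        then decay the traces. idx indexes the pair that was just visited."""
        traces[idx] = 1.0                     # replacing trace for the visited pair
        active = traces >= threshold_traces   # traces below the threshold are ignored
        Q[active] += alpha * td_error * traces[active]
        traces *= gamma * lamb                # decay all traces
        return Q, traces

    Q = np.zeros(6)
    traces = np.zeros(6)
    Q, traces = td_lambda_update(Q, traces, idx=2, td_error=1.0,
                                 alpha=0.1, gamma=0.99, lamb=0.75)
    print(Q, traces)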

smartstart.algorithms.valueiteration module

Value Iteration module

Describes ValueIteration class.

See ‘Reinforcement Learning: An Introduction’ by Richard S. Sutton and Andrew G. Barto for more information.

class ValueIteration(env, gamma=0.99, min_error=1e-05, max_itr=1000)

Bases: object

Value Iteration method

Value iteration is a dynamic programming method. It requires full knowledge of the environment, i.e. the transition model and the reward function.

Note

This implementation only works with a single goal (terminal) state.

Parameters:
  • env (Environment) – environment
  • gamma (float) – discount factor
  • min_error (float) – minimum error for convergence of value iteration
  • max_itr (int) – maximum number of iterations of value iteration
env

Environment – environment

gamma

float – discount factor

min_error

float – minimum error for convergence of value iteration

max_itr

int – maximum number of iterations of value iteration

V

collections.defaultdict – value function

T

collections.defaultdict – transition model

R

collections.defaultdict – reward function

obses

set – visited states

goal

tuple – goal state (terminal state)

add_obs(obs)

Adds observation to the obses set

Parameters:obs (list of int or np.ndarray) – observation
get_action(obs)

Returns the policy for a certain observation

Chooses the action with the highest value. When multiple actions have the same value, a random action is chosen among them.

Parameters:obs (list of int or np.ndarray) – observation
Returns:action
Return type:int
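
A small sketch of the tie-breaking rule described above:

    import random

    def greedy_action_with_ties(action_values):
        """Pick uniformly at random among all actions whose value equals the maximum."""
        best = max(action_values)
        best_actions = [a for a, v in enumerate(action_values) if v == best]
        return random.choice(best_actions)

    print(greedy_action_with_ties([1.0, 2.5, 2.5, 0.0]))  # prints 1 or 2
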
get_transition(obs, action, obs_tp1=None)

Returns transition probability of obs-action-obs_tp1

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • obs_tp1 (list of int or np.ndarray) – next observation (Default value = None)
Returns:

Transition probability

Return type:

float

get_value_map()

Returns value map for environment

The value-map will be a numpy array with shape equal to the width (w) and height (h) of the environment. Each entry (state) holds the value function associated with that state.

Returns:value map
Return type:numpy.ndarray
optimize()

Run value iteration method

Runs value iteration until convergence or until the maximum number of iterations is reached. The method iterates over all visited states (self.obses).
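
A self-contained sketch of such a sweep under assumed nested-dict containers (T[s][a] maps next states to probabilities, R[s][a][s2] is a reward); it is not the package's implementation, and it keeps the goal state fixed at value 0 as described for set_goal:

    def value_iteration_sketch(obses, actions, T, R, V, goal=None,
                               gamma=0.99, min_error=1e-5, max_itr=1000):
        """Run Bellman backups over the visited states until the largest change
        drops below min_error or max_itr sweeps have been done."""
        for _ in range(max_itr):
            max_delta = 0.0
            for s in obses:
                if s == goal:
                    continue                          # goal (terminal) state keeps value 0
                backups = []
                for a in actions:
                    transitions = T[s].get(a, {})
                    if transitions:
                        backups.append(sum(p * (R[s][a][s2] + gamma * V[s2])
                                           for s2, p in transitions.items()))
                if backups:
                    new_v = max(backups)
                    max_delta = max(max_delta, abs(new_v - V[s]))
                    V[s] = new_v
            if max_delta < min_error:                 # converged
                break
        return V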

reset()

Reset internal state

The following attributes are cleared:
  • self.V: Value function
  • self.T: Transition model
  • self.R: Reward function
  • self.obses: Visited observations
  • self.goal: Goal state

set_goal(obs)

Set goal state

The goal state is added to the obses set and its value function is set to zero.

Parameters:obs (list of int or np.ndarray) – observation
set_reward(obs, action, obs_tp1, value)

Set reward for obs-action-obs_tp1

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • obs_tp1 (list of int or np.ndarray) – next observation
  • value (float) – reward
set_transition(obs, action, obs_tp1, value)

Set transition of obs-action-obs_tp1

Note

All transitions for an obs-action pair should add up to 1. This is not checked!

Parameters:
  • obs (list of int or np.ndarray) – observation
  • action (int) – action
  • obs_tp1 (list of int or np.ndarray) – next observation
  • value (float) – transition probability
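
Putting the documented ValueIteration API together, a hypothetical end-to-end usage on a tiny two-state chain; env stands for any environment instance accepted by the constructor and is not constructed here:

    from smartstart.algorithms.valueiteration import ValueIteration

    def solve_two_state_chain(env):
        """Hypothetical usage sketch of the documented API."""
        vi = ValueIteration(env, gamma=0.99, min_error=1e-05, max_itr=1000)
        s0, s1, a = (0,), (1,), 0
        vi.add_obs(s0)
        vi.set_goal(s1)                     # terminal state, value fixed at 0
        vi.set_transition(s0, a, s1, 1.0)   # transitions for (s0, a) must sum to 1
        vi.set_reward(s0, a, s1, 1.0)
        vi.optimize()                       # run value iteration until convergence
        return vi.get_action(s0)            # greedy action in s0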

Module contents